Browsed by
Category: Sports

Modeling Hit Rates Between Minor League Levels

Modeling Hit Rates Between Minor League Levels

Working on figuring out the hit rates for minor leaguer batters between levels. I’d like to take the hit rates(i.e. singles(1B/PA), doubles(2B/PA), triples(3B/PA) and HRs(HR/PA) ) a player had at their previous minor league level and use that data to predict how a player will do at the following level. Similar data has been used as in the previous articles on walk rates and strike out rates. This data set covered 2011-2015 and players with a minimum of 200 PA’s were included in the resulting model. Below are the graphs for each level, models and some thoughts.

A to A+

A to A+ Hit Rates

A theme throughout the graphs will show that the correlation numbers for singles and home runs are high but very low for doubles and triples. These same low correlation numbers for doubles and triples were found in previous research by Matt Klassen at Fangraphs.

Linear models:

  •  A+ Single Rate = (A single rate)*0.53520 + 0.07452
  • A+ Double Rate = (A double rate)*.36379 + .02929
  • A+ Triple Rate = (A triple rate)*.403826 + .004743
  • A+ HR Rate = (A HR rate)*.633131 + .0006235

A+ to AA

A+ to AA Hit Rates

Linear models:

  • AA Single Rate = (A+ Single rate)*.48235 + .07969
  • AA Double Rate = (A+ Double rate)*.22680 + .03389
  • AA Triple Rate = (A+ Triple rate)*.377505 + .003751
  • AA HR Rate = (A+ HR rate)*.534897 + .007925


AA to AAA Hit Rate

Linear models:

  • AAA Single Rate = (AA Single Rate)*.52767 + .07912
  • AAA Double Rate = (AA Double Rate)*.248769 + .03645
  • AAA Triple Rate = (AA Triple Rate)*.355865 + .003757
  • AAA HR Rate = (AA HR Rate)*.58037 + .00881

Whats Next:

  • Perform some validation on the above models
  • Combine the models you’ve generated to predict OBP/SLG/OPS
  • Make models that skip levels
  • Make code more efficient so you can do this faster



Modeling Strikeout Rate between minor league levels

Modeling Strikeout Rate between minor league levels

In this post I’ll go over my results for predicting strikeout rates between minor league levels. This article will cover the following:


Data Wrangling

Graphs and Correlation

Model and Evaluation


This time around I’ve change my approach up so I can do some cross-validation. The article will cover data from 2004-2015 but I’ll be training my model on data from 2004-2013 and evaluating it using the 2014-2015 data. The data itself consists of 39,349 data points and came from Baseball Reference . The data points represent minor league data from Short Season(SS-A) to AAA ball. I end up removing the SS-A data because currently I’m only modeling data between the full season leagues(A-AAA). Also, players data points were only included if they had >=200 plate appearances.

Data Wrangling

In order to model the data between minor league levels I need to do some data wrangling to get the dataframe in the format I need. The original data has each players season as a different entry.

Snippet from original dataframe. Each entry represents a year and minor league level the stats are for.

In order for me to graph and get correlation values between minor league levels I need all this data on one row with the stats for each level represented by a column. Below you can see a snippet of the dataframe I use for my analysis:

Snippet of correlation dataframe.

Notice how in the dataframe above all the stats I need for each level have been merged into one row.

Graphs and Correlation

Graphs showing the scatter plot and regression lines for the levels of minor league data I modeled.

As you can see from the graphs above a positive linear relationship exists for strike out rate between the minor league levels(A to A+, A+ to AA, AA to AAA) I’ve analyzed. Here are the correlation values for each level:

  • A to A+ :  0.7532319
  • A+ to AA : 0.7717004
  • AA to AAA : 0.7666475

From the numbers above and graphs you can see a ‘strong’ positive correlation exists for the strikeout rate between levels.

Model and Evaluation

The models for the regression line in the graphs above are:

  • A to A+ : A+ SO Rate = .7598*(A SO Rate) + .04591
  • A+ to AA: AA SO Rate = .83204*(A+ SO Rate) + .03608
  • AA to AAA: AAA SO Rate = .80664*(AA SO Rate) + .04147

The ‘Doing Data Science‘ book suggests using R-squared, p-values, and cross-validation to validate linear models. For this article I’ll be using R-squared and cross-validation:

  • A to A+: .5674
  • A+ to AA: .5955
  • AA to AAA: .5877

To do cross validation I’m going to use the data  from 2014-2015. This dataset consists of  of 8198 points. I performed the same steps I described above in the data wrangling section and that bought the dataframe I do my analysis on down to 427 points. The correlation numbers remained strong per level:

  • A to A+: 0.7366793
  • A+ to AA: 0.729288
  • AA to AAA: 0.7794951

Here is a graph showing the regression line against the 2014-2015 data:


To tell how often I’m correct or not I once again used the classification provided by fangraphs in this chart:

Picture retrieved from

This time using the average difference between classifications of K% and got that to be .0291667. So if my model is more than ~.03 off the actual error rate then I say it’s wrong for that data point. Here are my results for each level:

A to A+:

  • Incorrect: 48
  • Correct: 66
  • Percentage Correct: 57.89%

A+ to AA:

  • Incorrect:78
  • Correct: 93
  • Percentage Correct: 54.39%

AA to AAA:

  • Incorrect: 52
  • Correct: 74
  • Percentage Correct: 58.73


Modeling Walk Rate between minor league levels

Modeling Walk Rate between minor league levels

After reading through Projecting X by Mike Podhorzer I decided to try and predict some rate statistics between minor league levels. Mike states in his book “Projecting rates makes it dramatically easier to adjust a forecast if necessary.”; therefore if a player is injured or will only have a certain number of plate appearances that year I can still attempt to project performance. The first rate statistic I’m going to attempt project is Walk Rate between minor league levels. This article will cover the following:

Raw Data

Data Cleaning

Correlation and Graphs

Model and Results


Raw Data

For my model I used data from Baseball Reference and am using the last 7 years of minor league data(2009-2015). Accounting for the Short Season A(SS-A) to AAA affiliates I ended up with over 28,316 data points for my analysis.

Data Cleaning

I’m using R and the original dataframe I had put all the data from each year in different rows. In order to do the calculations I wanted to do I needed to move each players career minor league data to the same row. Also I noticed I needed to filter on plate appearances during a season to make sure I’m getting rid of noise. For example, a player on a rehab assignment in the minor leagues or a player who ended up getting injured for most of the year so they only had 50-100 plate appearances. The min plate appearances I ended up settling on was 200 for a player to be factored into the model. Another thing I’m doing to remove noise is only attempting to model player performance between full season leagues(A, A+, AA, AAA). Once the cleaning of the data was done I had the following data points for each level:

  • A to A+ : 1129
  • A+ to A: 1023
  • AA to AAA: 705

Correlation and Graphs

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

  • A to A+ : .6301594
  • A+ to AA: .6141332
  • AA to AAA: .620662

Here’s the graphs for each level:




Model and Results

The linear models for each level are:

  • A to A+: A+ BB% = .63184*(A BB%) + .02882
  • A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
  • AA to AAA: AAA BB% = .5682(AA BB%) + .0342

In order to interpret the success or failure of my results I compared how close I was to getting the actual walk rate. Fangraphs has a great rating scale for walk rate at the Major League level:

Image from Fangraphs 

The image above gives a classification for multiple levels of walk rates. While based on major league data it’s a good starting point for me to decide a margin of error for my model. The mean difference between each level in the Fangraphs table is .0183333, I ended up rounding and made my margin for error .02. So if my predicted value for a players walk rate was within .02 of being correct I counted counted the model as correct for the player and if my error was greater than that it was wrong. Here are the models results for each level:

  • A to A+
    • Incorrect: 450
    • Correct: 679
    • Percentage Correct: ~.6014
  • A+ to A
    • Incorrect: 445
    • Correct: 578
    • Percentage Correct: ~.565
  • AA to AAA
    • Incorrect: 278
    • Correct: 427
    • Percentage Correct: ~.6056

When I moved the cutoff up a percentage to .03 the models results drastically improve:

  • A to A+
    • Incorrect: 228
    • Correct: 901
    • Percentage Correct: ~.798
  • A+ to AA
    • Incorrect: 246
    • Correct: 777
    • Percentage Correct: ~.7595
  • AA to AAA
    • Incorrect: 144
    • Correct: 561
    • Percentage Correct: ~.7957


Numbers are cool but where are the actual examples. Ok, lets start off with my worst prediction. The largest error I had between levels was A to A+ and the error was >10%(~.1105). The player in this case was Joey Gallo a quick glance at the player page will show his A walk rate was only .1076 and his A+ walk rate was .2073 which is a 10% improvement between levels. So why did this happen and why didn’t my model do a better job of predicting this. Currently the model is only accounting for the previous seasons walk rate but what if the player is getting a lot of hits at one level and stops swinging as much on the next. In Gallo’s case he only had a .245 BA his year at A ball so that wasn’t the case. More investigation is required to see how the model can get closer on edge cases like this.

Gallo Dataframe Snippet

The lowest I was able to set the error too and still come back with results was ~.00004417. That very close prediction belongs too Erik Gonzalez. Don’t know Erik Gonzalez so I continued to look for results setting the min error to .0002 brought back Stephen Lombardozzi as one of my six results. Lombo’s interesting to hard core Nats fans(like myself) but wanted to continue to look for a more notable name. Finally after upping the number to .003 for A to A+ data I was able to see that the model successfully predicted Houston Astro’s multi-time All Star 2B walk rate for Jose Altuve walk rate within a .003 margin of error.

Altuve Dataframe snippet


Whats Next:

  • Improve model to get a lower max error
  • Predict Strike out rate between levels
  • Predicting more advanced statistics like woba/ops/wrc


Correlation between Salary Cap and Winning?

Correlation between Salary Cap and Winning?

After doing my initial blog looking at how much each team is spending per position group. I wanted to take a look to see if there was any correlation between how much teams are spending on a position group and winning. To do this I needed to merge the cap data from spotrac  and season summary data from pro-football-reference . I merged these datasets over the last 5 years but it’d be interesting to try and find data since the salary cap was in place(1994). Here’s a graph of my yearly findings for the last 5 years(2011-2015).


Quick review on correlation from Pearson:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  •  .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”

As you can see from the graph the correlation numbers aren’t exactly high. I believe that’s because the best players aren’t necessarily getting paid the most money. For example, before last year Russel Wilson was on his rookie contract and the Seahawks were making the playoffs year after year and only paying Russel $749,176  a year. Now he know doubt had a lot to do with the Seahawks winning before and going forward but his Salary before his new contract wouldn’t have correlated much to winning. Examples like this can be found at each position. This is why it’s necessary to have a good Front Office to continually bring in young talent that can contribute at a lower price. Looking at actual on the field stats and trying to correlate would be a much better exercise than trying to merge cap data with correlation.

Position Correlation
DB 0.05356608
DL -0.10102064
LB 0.14434313
OL -0.0783075
QB 0.08256776
RB -0.0917516
ST -0.05109986
TE -0.04013766
WR 0.07491824

Over the last 5 years there’s not one correlation that’s greater than ‘very weak’. But the positions that do have positive correlations DB, LB, QB, and WR are the positions that GM’s over the last 5 years have been willing to pay. Left Tackle is another position that has been getting paid very well in the league because Left Tackle’s are usually the one’s protecting a QB’s blind spot.

This data emphasizes the importance of drafting well because spending money on a particular position does not correlate with your team winning. Also interesting to note the positions that do have positive correlations must be that way because those players have made it to their second/third contracts and are now getting the big contracts. So it makes me wonder if DB, LB, QB and WR are the positions that have the longest careers in the NFL.




Where is your favorite NFL teams cap space going…Part2(AFC)

Where is your favorite NFL teams cap space going…Part2(AFC)

Part 2 takes a look at what teams in the AFC are doing with their cap space.

AFC East

AFC North

AFC South

AFC West

AFC East

Biggest thing standing out to me below is Miami is spending a lot of money on their DL, 2 times more than league average and almost 30% of their cap space.


Buffalo Bills Miami Dolphins New England Patriots New York Jets NFL
DB 29,479,496 27,616,499 17,018,794 38,402,025 26,240,056
DL 33,279,980 43,671,818 26,648,167 23,376,055 21,559,330
LB 16,794,623 10,078,824 18,192,447 14,642,518 21,687,341
OL 23,881,580 32,341,251 28,246,427 31,934,742 25,224,340
QB 6,056,322 13,124,470 15,120,305 15,107,897 17,635,072
RB 10,860,734 4,127,886 5,546,772 7,035,895 7,603,650
ST 3,648,334 2,021,473 5,900,000 4,296,447 4,598,119
TE 7,461,559 11,253,250 14,119,263 2,457,483 7,136,520
WR 13,762,349 7,990,195 18,808,154 23,807,352 17,016,041
Total 145,224,977 152,225,666 149,600,329 161,060,414 148,700,468


team_lg cap
Buffalo Bills 145,224,977.00
Miami Dolphins 152,225,666.00
New England Patriots 149,600,329.00
New York Jets 161,060,414.00
NFL 148,700,467.53

AFC North


Baltimore Ravens Cincinnati Bengals Cleveland Browns Pittsburgh Steelers NFL
DB 32,608,731 30,917,521 36,129,121 21,018,906 26,240,056
DL 7,940,348 31,438,474 15,739,186 14,649,537 21,559,330
LB 23,899,629 17,117,432 24,706,482 29,272,432 21,687,341
OL 17,769,960 25,827,523 20,441,486 31,713,791 25,224,340
QB 24,053,334 13,750,413 16,804,891 25,414,805 17,635,072
RB 6,578,244 8,165,445 2,229,419 5,333,109 7,603,650
ST 7,540,000 5,521,667 2,833,000 2,833,000 4,598,119
TE 7,916,918 3,396,211 2,987,732 6,936,562 7,136,520
WR 12,972,115 17,898,407 13,766,584 16,140,272 17,016,041
Total 141,279,279 154,033,093 135,637,901 153,312,414 148,700,468


team_lg cap
Baltimore Ravens 141,279,279.00
Cincinnati Bengals 154,033,093.00
Cleveland Browns 135,637,901.00
Pittsburgh Steelers 153,312,414.00
NFL 148,700,467.53

AFC South


Houston Texans Indianapolis Colts Jacksonville Jaguars Tennessee Titans NFL
DB 27,192,446 21,622,749 28,048,803 31,983,674 26,240,056
DL 22,806,964 10,970,469 35,629,261 20,257,121 21,559,330
LB 27,071,210 26,746,775 17,325,094 23,470,780 21,687,341
OL 29,927,612 25,770,042 23,051,277 21,491,313 25,224,340
QB 14,412,646 19,906,667 10,414,198 7,565,734 17,635,072
RB 10,585,079 5,811,457 6,899,223 12,402,648 7,603,650
ST 3,355,000 6,750,000 3,500,000 6,235,666 4,598,119
TE 2,310,164 11,855,251 11,413,527 11,355,208 7,136,520
WR 8,709,817 17,229,506 13,006,178 18,252,871 17,016,041
Total 146,370,938 146,662,916 149,287,561 153,015,015 148,700,468


team_lg cap
Houston Texans 146,370,938.00
Indianapolis Colts 146,662,916.00
Jacksonville Jaguars 149,287,561.00
Tennessee Titans 153,015,015.00
NFL 148,700,467.53

AFC West


Denver Broncos Kansas City Chiefs Oakland Raiders San Diego Chargers NFL
DB 33,253,262 20,927,181 34,799,467 26,196,974 26,240,056
DL 15,847,747 17,431,156 13,215,815 17,830,448 21,559,330
LB 27,946,958 38,406,775 30,579,411 23,274,907 21,687,341
OL 15,654,114 21,758,228 37,579,810 25,404,464 25,224,340
QB 9,261,158 19,131,035 4,631,343 22,280,000 17,635,072
RB 9,412,971 10,055,670 9,640,686 7,144,037 7,603,650
ST 5,175,002 5,057,334 10,187,500 2,437,622 4,598,119
TE 6,886,868 3,948,789 5,235,006 7,944,722 7,136,520
WR 25,047,116 20,132,805 18,191,836 17,032,784 17,016,041
Total 148,485,196 156,848,973 164,060,874 149,545,958 148,700,468


team_lg cap
Denver Broncos 148,485,196.00
Kansas City Chiefs 156,848,973.00
Oakland Raiders 164,060,874.00
San Diego Chargers 149,545,958.00
NFL 148,700,467.53
Where is your favorite NFL teams cap space going…Part1(NFC)

Where is your favorite NFL teams cap space going…Part1(NFC)

Was interested in getting a look at how each team is spending their cap by position group. In order to get the data I went to spotrac, which is a great site for interactively managing your teams salary cap. The position groups I split that data into are:

QB: Quarterback

RB: Running back which includes Full backs as well

TE: Tight End

WR: Wide Receiver

OL: Offensive line which includes center, gaurd and tackles

ST: Special Teams which includes punter, long snapper, kicker

DL: Defensive line which includes nose tackles, defensive tacklets and defensive end

LB: Linebacker which includes ILB, OLB

DB Defensive backs which includes CB, S, FS, SS

Below are graphs and tables of each teams cap space by division.

NFC East

NFC North

NFC South

NFC West

Technical Details

NFC East


Dallas Cowboys New York Giants Philadelphia Eagles Washington Redskins NFL
DB 32,517,674 24,210,851 22,215,313 20,417,799 26,240,056
DL 10,995,953 33,863,231 14,542,212 21,341,639 21,559,330
LB 19,166,315 16,750,859 25,526,573 21,093,256 21,687,341
OL 21,659,580 19,946,109 36,196,468 28,012,731 25,224,340
QB 21,697,516 24,988,400 24,084,500 21,787,333 17,635,072
RB 9,168,992 10,994,037 13,267,311 1,392,909 7,603,650
ST 5,920,000 3,725,000 3,847,168 2,612,500 4,598,119
TE 11,742,142 1,710,047 5,326,897 8,952,278 7,136,520
WR 20,135,362 12,750,655 9,632,507 25,216,781 17,016,041
Total 153,003,534 148,939,189 154,638,949 150,827,226 148,700,468



team_lg cap
Dallas Cowboys 153,003,534.00
New York Giants 148,939,189.00
Philadelphia Eagles 154,638,949.00
Washington Redskins 150,827,226.00
NFL 148,700,467.53


NFC North


Chicago Bears Detroit Lions Green Bay Packers Minnesota Vikings NFL
DB 11,933,993 17,976,563 26,425,315 31,075,280 26,240,056
DL 10,802,270 17,658,801 14,155,762 28,816,518 21,559,330
LB 33,848,070 16,110,645 36,248,769 11,125,803 21,687,341
OL 18,892,048 16,635,996 24,802,901 38,342,094 25,224,340
QB 19,606,667 23,385,434 19,830,908 5,121,379 17,635,072
RB 2,113,756 3,389,020 3,888,337 15,089,750 7,603,650
ST 5,330,327 4,600,164 3,940,000 3,196,048 4,598,119
TE 5,299,257 9,369,926 4,639,959 9,842,136 7,136,520
WR 26,532,299 31,995,217 20,514,889 7,375,500 17,016,041
Total 134,358,687 141,121,766 154,446,840 149,984,508 148,700,468


team_lg cap
Chicago Bears 134358687
Detroit Lions 141121766
Green Bay Packers 154446840
Minnesota Vikings 149984508
NFL 148700467.5

NFC South


Atlanta Falcons Carolina Panthers New Orleans Saints Tampa Bay Buccaneers NFL
DB 12,996,775 9,366,346 30,666,335 29,760,125 26,240,056
DL 21,786,004 20,013,543 14,311,219 32,639,717 21,559,330
LB 13,193,402 16,946,538 25,871,529 14,394,757 21,687,341
OL 32,510,564 23,545,400 24,026,456 24,820,006 25,224,340
QB 26,786,474 22,310,000 32,022,864 8,792,029 17,635,072
RB 2,372,750 14,064,730 10,031,191 9,625,988 7,603,650
ST 6,743,333 5,475,667 5,220,000 3,086,830 4,598,119
TE 4,975,878 7,689,497 5,671,266 5,549,829 7,136,520
WR 26,015,153 10,490,331 5,920,718 19,669,753 17,016,041
Total 147,380,333 129,902,052 153,741,578 148,339,034 148,700,468


team_lg cap
Atlanta Falcons 147,380,333.00
Carolina Panthers 129,902,052.00
New Orleans Saints 153,741,578.00
Tampa Bay Buccaneers 148,339,034.00
NFL 148,700,467.53

NFC West


Arizona Cardinals Los Angeles Rams San Francisco 49ers Seattle Seahawks NFL
DB 27,549,850 25,316,530 19,046,652 40,990,742 26,240,056
DL 24,358,032 34,440,531 16,521,411 22,919,173 21,559,330
LB 18,545,398 16,014,777 24,957,641 14,674,279 21,687,341
OL 25,449,085 22,096,357 27,191,734 10,257,720 25,224,340
QB 21,297,712 18,203,311 18,825,554 18,547,000 17,635,072
RB 5,233,268 8,486,987 5,216,786 7,151,703 7,603,650
ST 1,450,500 4,720,995 4,652,550 5,326,667 4,598,119
TE 5,184,876 8,339,327 3,991,964 12,604,853 7,136,520
WR 26,263,175 12,138,843 12,967,161 14,146,617 17,016,041
Total 155,331,896 149,757,658 133,371,453 146,618,754 148,700,468



team_lg cap
Arizona Cardinals 155,331,896.00
Los Angeles Rams 149,757,658.00
San Francisco 49ers 133,371,453.00
Seattle Seahawks 146,618,754.00
NFL 148,700,467.53

Technical Details

The functions I needed to write to get this done are here:

[code language=”r”]

getDivisionTable <- function(divisionDf){
teamNames <- as.character(unique(divisionDf["team_lg"])[,1])
positionGroups <- as.character(unique(divisionDf["positionGroup"])[,1])

tableData <- c()
for(position in positionGroups){
positionDf <- divisionDf[divisionDf["positionGroup"] == position,]
for(i in 1:length(teamNames)){
teamPositionDf <- positionDf[positionDf["team_lg"] == teamNames[i],]
tableData <- c(tableData, prettyNum(as.numeric(teamPositionDf["capHit"]), big.mark = ","))

#Add Total To table
for(a in 1:length(teamNames)){
teamOnly <- divisionDf[divisionDf["team_lg"] == teamNames[a],]
tableData <- c(tableData, prettyNum(sum(teamOnly["capHit"]), big.mark = ","))

table <- matrix(tableData, ncol=length(teamNames), byrow=TRUE)
colnames(table) <- teamNames
rownames(table) <- c(positionGroups, "Total")


#Class to return

#Get graph’s and tables by division
generateCapOutputPerDivision <- function(allDf, division){
#Strip out each entry with a percentage and get rid of any entry left over
#that doesn’t have a cap hit
all_hitCap <- all_tm_caps[!$percentage),]
all_hitCap <- all_hitCap[!$capHit),]

#Summarise data by teamName, division, positionGroup
all_tms_pos_group <- ddply(all_hitCap, .(teamName,division,positionGroup), summarise, capNum=sum(capHit))

#Add an NFL Constant to help summarise league averages
league <- rep("NFL", nrow(all_tms_pos_group))

#Bind the league constant
all_tms_pos_group <- cbind(all_tms_pos_group, league)

#Summarize all the data by league
league_summary <- ddply(all_tms_pos_group, .(league,positionGroup), summarise, cap=mean(capNum))

#Filter by division
divisionDf <- all_tms_pos_group[all_tms_pos_group$division == division,]

#Add League summary back in here:
divisionDf <- subset(divisionDf, select = -c(division, league))
colnames(divisionDf) <- c("team_lg", "positionGroup", "capHit")
colnames(league_summary) <- c("team_lg", "positionGroup", "capHit")

#Combine date
divisionDf <- rbind(divisionDf, league_summary)

#Total by division
totalDf <- ddply(divisionDf, .(team_lg), summarise, cap=sum(capHit))

divisionTable <- getDivisionTable(divisionDf)
#divisionTable <- xtable(divisionTable)

#Put data in class

With those functions in place I was able to create the necessary graphs and tables with the following code:

[code language=”r”]
#Read in data
all_tm_caps <- getDfFromDir("data/")

#Create filter
dataFilter <- c("NFCE", "NFCN", "NFCS", "NFCW", "AFCE", "AFCN", "AFCS", "AFCW")

for(division in dataFilter){
print(paste("Division", division))
divisionData <- generateCapOutputPerDivision(all_tm_caps, division)

#Graph Division Data
ggplot(data=divisionData@divisionCapPerPositionGroup, aes(x=team_lg, capHit, fill=positionGroup )) + geom_bar(stat="identity", position="dodge") + ggtitle(paste(divisionData@division, "Cap Per Team")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Write PNG
ggsave(filename=paste(division, "PerPosition.png", sep="-"))

#Position Group
write.csv(divisionData@divisonTable, paste(division, "TablePerPositionGroup.csv"))

#Write CSV
ggplot(data=divisionData@divisionCapTotalPerTeam, aes(x=team_lg, cap)) + geom_bar(stat="identity") + ggtitle(paste(divisionData@division, " Total Cap Per Team"))

#Write PNG
ggsave(filename=paste(division, "CapTotal.png", sep="-"))

write.csv(divisionData@divisionCapTotalPerTeam, paste(division, "TableTotal.csv"))

Enriching Datasets with R

Enriching Datasets with R

If you have a simple data set and have some additional statistics you’d like to add to that dataset you can easily do that with R. Going to add fip, woba, wrc, and wraa to a couple of baseball datasets as an example of this.

To calculate FIP I first needed the following R functions:

[code language=”r”]
#Calculate FIP Constant
fip_constant_calc <- function(pitching_data){
#FIP Constant = lgERA – (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
era = sum(pitching_data["ER"])/sum(pitching_data["IP"]) * 9
lgHR = sum(pitching_data["HR"])
lgBB = sum(pitching_data["BB"])
lgHBP = sum(pitching_data["HBP"])
lgK = sum(pitching_data["SO"])
lgIP = sum(pitching_data["IP"])

fipConstant = era – (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
#print(paste("ERA", era))

#Calculate FIP
fip_calc <- function(pitching_data, fipConstant){
#FIP = ((13*HR)+(3*(BB+HBP))-(2*K))/IP + constant
hr = as.numeric(pitching_data["HR"])
bb = as.numeric(pitching_data["BB"])
hbp = as.numeric(pitching_data["HBP"])
k = as.numeric(pitching_data["SO"])
ip = as.numeric(pitching_data["IP"])

fip = ((13*hr)+(3*(bb+hbp))-2*(k))/ip + fipConstant


Once you have the necessary functions to calculate the stat you can use apply in R to apply the function to your dataframe.


[code language=”r”]
#Get Data
fg_pitching_data = read.csv("data/2007/FG_MLB_Pitching_Std_All.csv")
constant <- fip_constant_calc(fg_pitching_data)
fg_pitching_data$fip <- apply(fg_pitching_data, 1, fip_calc, fipConstant=constant)

Here are some R functions to calculate wOBA, wrc, and wraa.

[code language=”r”]
#Calculate wOBA based on weights using FG formula
woba_calc_weights_fg <- function(row, weights){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits – triples – doubles – hr
ab <- as.numeric(row["AB"])
ibb <- as.numeric(row["IBB"])
sf <- as.numeric(row["SF"])

numerator <- as.numeric(weights["wobaBB"])*(bb-ibb)+as.numeric(weights["wobaHB"])*hbp+as.numeric(weights["woba1B"])*singles+as.numeric(weights["woba2B"])*doubles+as.numeric(weights["woba3B"])*triples+as.numeric(weights["wobaHR"])*hr
denom <- ab + bb – ibb + sf + hbp


wraa_calc_fg <- function(row, weights){
numerator <- as.numeric(row["woba"]) – as.numeric(weights["woba"])
denom <- as.numeric(weights["wobaScale"])

result = (numerator/denom) * (as.numeric(row["PA"]))


wrc_calc <- function(row, weights){
woba = as.numeric(row["woba"])
lgWOBA = as.numeric(weights["woba"])
wobaScale = as.numeric(weights["wobaScale"])
lgR = as.numeric(weights["lgR"])
lgPA = as.numeric(weights["lgPA"])
pa = as.numeric(row["PA"])
wrc = (((woba-lgWOBA)/wobaScale)+(lgR/lgPA))*pa


Once you have the functions necessary in place you can add the statistics to your dataset using this code:

[code language=”r”]
fg_batting_data = read.csv(&amp;quot;data/2007/FG_MLB_Batting_Std_All.csv&amp;quot;)
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
fg_batting_data$woba = apply(fg_batting_data, 1, woba_calc_weights_fg, weights=wobaWeights)
fg_batting_data$wraa = apply(fg_batting_data, 1, wraa_calc_fg, weights=wobaWeights)
fg_batting_data$wrc = apply(fg_batting_data, 1, wrc_calc, weights=wobaWeights)

Note: ‘get_woba_weights’ was addressed in the Learning wOBA blog post.

Why Reynaldo?

Why Reynaldo?

The Nationals sent Lucas Giolito back down to the minors and have called up Reynaldo Lopez for his Major League debut tomorrow. So I decided to take a look at possible reasons for that decision. Giolito did ok in his first rain shortened start giving up only 1 hit in 4 innings but the 2 BB’s were a little concerning. Especially since this season he’s had a pattern of walking people to a tune of 4.3 BB/9 in the Eastern League this year. That BB/9 currently ranks in the bottom 10 in the Eastern League for pitchers with >50 IP. Giolito’s control problem reared it’s ugly head again in Giolito’s second start when he gave up 4 BB in 3.2 innings of work and gave up 4 ER. The Major Leagues isn’t the place for a young pitcher to workout their control issues. So the Nationals made a smart decision and sent Giolito down to figure out his control issues. Last year in A+ Giolito was only walking 2.58 batters per 9 innings if he can get his walk rate back down to those levels I’m sure he’ll be back up in no time. With that out of the way lets lets look at the three best Nationals pitching prospects(Austin Voth, Reynaldo Lopez and Giolito) to figure out ‘Why Reynaldo’:


As you can see in the table above they all have pretty similar ERA’s but Reynaldo is outpacing his two counter parts in K/9, K/BB, and FIP. Voth and Reynaldo have comparable WHIPs and Voth is doing the best at stranding runners on base. Then again when your striking out as many people as Reynaldo you probably don’t have too many runners on. Here’s a look at all three players ERA’s over their last 7 minor league starts:

So why Reynaldo? Well the answer seems clear to me. He’s been the best performing Nationals minor league pitcher thus far this season and has earned the start. I’m looking forward to seeing Reynaldo’s debut tomorrow hopefully he keeps striking people out and keeps the BBs at a manageable rate.

Learning wOBA

Learning wOBA

As I continue to learn R and go down the road of becoming a data scientist. I need to learn how to use and compute advanced statistics. The first advanced analytic I’m going to learn how to compute is weighted on-base average(wOBA). Weighted on-base average combines all the parts of a players offensive game and gives them all appropriate weights for their impact on the game. For example, a HR is given more weight than a BB or a Single because a HR is guarantees a team atleast one run while a BB/Single only scores a run a much lower percentage of the time. General statistics like BA/SLG/OBP only look at a piece of a players offensive game which is why wOBA is a better tool for looking at a players offensive contribution. Fangraphs does a great job of describing the details in the following links:

The links above give a lot of insight on why you should use wOBA. Tom Tango in The Book describes the formula. He also has the standard formula for wOBA that includes and doesn’t include speed on his site . I used the standard formula from his site for my initial calculations with R.

The following R code has the functions I use to calculate speed and non speed wOBA:


[code language=”r”]
#Link for standard WOBA formula
woba_calc_basic <- function(row){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits – triples – doubles – hr
pa <- as.numeric(row["PA"])
ibb <- as.numeric(row["IBB"])
sh <- as.numeric(row["SH"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr
denominator <- pa – ibb – sh


woba_calc_speed <- function(row){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits – triples – doubles – hr
pa <- as.numeric(row["PA"])
ibb <- as.numeric(row["IBB"])
sh <- as.numeric(row["SH"])
sb <- as.numeric(row["SB"])
cs <- as.numeric(row["CS"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr + .25*sb + -.5*cs
denominator <- pa – ibb – sh


Once I had those two formula’s in place I retrieved some data from Nationals data from baseball-reference . Then used the code below to apply the two functions to a dataframe.


[code language=”r”]
data = read.csv("data/FG_MLB_Batting_Std_All.csv")

#Apply basic woba
data$woba_basic = apply(data, 1, woba_calc_basic)

#Apply speed woba
data$woba_speed = apply(data, 1, woba_calc_speed)


Once I had that in place I noticed that Fangraphs kept mentioning ‘wOBA scale’ and how it and wOBA could be used to generate another stat called Weighted Runs Above Average . After doing some digging I found an old article written by Tom Tango on how to compute yearly wOBA from scratch . Above I’ve just used some standard weights to compute a standard wOBA. But standard wOBA doesn’t take into account the yearly offensive environment that players are playing in. Basically some years an offensive action is worth more than others based upon how the whole league is performing. Tom Tango’s article on computing yearly wOBA teaches you how to get the yearly weights and calculate wOBA scale. Continuing down this R, wOBA rabbit hole I transcribe his SQL script into R functions.

To compute the offensive environment(Step 1 and Step 2) in Tom Tangos script I use this function on pitching data:


[code language=”r”]
get_runvalues <- function(pitching_data){
#Calculate runs per out
stat <- sum(pitching_data["R"])/(sum(pitching_data["IP"])*3)
rPerOut <- c(stat)

#Calculate runs per bb
stat <- rPerOut + .14
rPerBB <- c(stat)

#Runs per HB
stat <- rPerBB + .025
rPerHB <- c(stat)

#Runs Per Single
stat <- rPerBB + .155
rPer1B <- c(stat)

#Runs Per Double
stat <- rPer1B + .3
rPer2B <- c(stat)

#Runs Per Triple
stat <- rPer2B + .27
rPer3B <- c(stat)

#Runs per HR
rHR <- 1.4

rSB <- .2

rCS <- 2*rPerOut + .075

df <- data.frame(rPerOut, rPerBB, rPerHB, rPer1B, rPer2B, rPer3B, rHR, rSB, rCS)



Step 3 in the SQL script pulls uses the run values to generate the necessary weights to calculate wOBA. The following R methods accomplish this:

[code language=”r”]
get_woba_runsMinus <- function(batting, runValues){
bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) – sum(batting["IBB"]))
hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

singles <- sum(batting["H"]) – sum(batting["X2B"]) – sum(batting["X3B"]) – sum(batting["HR"])
singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator – csNumerator

#print(paste("numerator is, ", numerator))

denom <- sum(batting["AB"]) – sum(batting["H"]) + sum(batting["SF"])
#print(paste("denominator is, ", denom))


get_woba_runsPlus <- function(batting, runValues){
#Same as runMinus should probable combine and output df
bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) – sum(batting["IBB"]))
hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

singles <- sum(batting["H"]) – sum(batting["X2B"]) – sum(batting["X3B"]) – sum(batting["HR"])
singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator – csNumerator
denom <- (sum(batting["BB"]) – sum(batting["IBB"])) + sum(batting["HBP"]) + sum(batting["H"])


woba_calc_lgAvgSpeed <- function(batting, runValues){
numerator <- sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"]) + sum(batting["SB"]) – sum(batting["CS"])
denom <- sum(batting["AB"]) + sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])


woba_calc_lgAvgReg <- function(batting, runValues){
numerator <- sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"])
denom <- sum(batting["AB"]) + sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])


get_woba_weights <- function(pitching, batting){
#Get the run values
run_values <- get_runvalues(pitching)

#Compute runs minus
runs_minus <- get_woba_runsMinus(batting, run_values)

runs_plus <- get_woba_runsPlus(batting, run_values)

#woba <- woba_calc_lgAvgSpeed(batting, run_values)
woba <- woba_calc_lgAvgReg(batting, run_values)

wobaScale <- 1/(runs_plus + runs_minus)

wobaBB <- (as.numeric(run_values["rPerBB"]) + runs_minus)*wobaScale
wobaHB <- (as.numeric(run_values["rPerHB"]) + runs_minus)*wobaScale
woba1B <- (as.numeric(run_values["rPer1B"]) + runs_minus)*wobaScale
woba2B <- (as.numeric(run_values["rPer2B"]) + runs_minus)*wobaScale
woba3B <- (as.numeric(run_values["rPer3B"]) + runs_minus)*wobaScale
wobaHR <- (as.numeric(run_values["rHR"]) + runs_minus)*wobaScale
wobaSB <- as.numeric(run_values["rSB"])*wobaScale
wobaCS <- as.numeric(run_values["rCS"])*wobaScale

df <- data.frame(run_values, runs_minus, runs_plus, woba, wobaScale, wobaBB, wobaHB, woba1B, woba3B, wobaHR, wobaSB, wobaCS)


With all this in place I now needed to test my functions against the table he has here . The data I ended up using for the test is from Fangraphs . The code below calculates the wOBA weights:

[code language=”r”]
fg_batting_data = read.csv("data/FG_MLB_Batting_Std_All.csv")
fg_pitching_data = read.csv("data/FG_MLB_Pitching_Std_All.csv")
min_pa = subset(fg_batting_data, PA > 80)
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
wobaWeightsFiltered = get_woba_weights(fg_pitching_data, min_pa)

Here’s the output for the wOBA weights for 2007:

rPerOut rPerBB rPerHB rPer1B rPer2B rPer3B rHR rSB rCS
1 0.1796024 0.3196024 0.3446024 0.4746024 0.7746024 1.044602 1.4 0.2 0.4342048
runs_minus runs_plus woba wobaScale wobaBB wobaHB woba1B woba2B
1 0.2792543 0.5642809 0.3310523 1.185487 0.7099368 0.739574 0.8936873 1.249333
woba3B wobaHR wobaSB wobaCS
1 1.569415 1.990734 0.2370974 0.5147441

These numbers line up with the numbers from Tom Tangos table for 2007 I linked above. Here’s a look at what the side by side numbers:

label original calc my calc
year 2007 2007
rPerOut .179 0.1796024
rPerBB .32 .3196024
rPerHB .34 .3446024
rPer1B .47 .4746024
rPer2B .77 .7746024
rPer3B 1.04 1.044602
rHR 1.40 1.4
rSB .2 .2
rCS .43 .4342048
rMinus .28 .2792543
rPlus .56 .5642809
wOBA .331 .3310523
wobaScale 1.19 1.185487
wobaBB .71 .7099368
wobaHB .74 .739574
woba1B .89 .8936873
woba2B 1.25 1.249333
woba3B 1.57 1.569415
wobaHR 1.99 1.990734
wobaSB .24 .2370974
wobaCS .51 .5147441

With some simplification the weights I produced with the R code above would line up with Tom Tango’s original chart for 2007. Since I’m now able to calculate wOBA weights from scratch I’ll be able to calculate wRAA and apply wOBA in other offensive environments.

Exploratory Data Analysis using Minor League Batting Statistics

Exploratory Data Analysis using Minor League Batting Statistics

Similar to graphically looking at Nationals minor league pitching stats I wanted to do the same with their minor league hitting stats per team. I decided to look at how the Nationals minor league team OPS is doing relative to their league and level. OPS is a players OBP added to their SLG measure how good a player is doing offensively when those two metrics are taken into account.

Since pitchers also bat I needed to do some data cleaning or the numbers wouldn’t make sense. To clean the data I removed all players from data set that didn’t have more than 20 Plate Appearances this season. The original data set 3255 data points. After adding that filter I got down to 2384 data points. Here’s the layout by level:

Level Data Points
SS-A 273
A 514
A+ 507
AA 518
AAA 572

For each League at each level I wanted to get the average team ERA and compare it to how the Nationals affiliates are doing. In the below table you can see those numbers.


League/Team Level OPS
New York Penn League SS-A .628
Northwest League SS-A .653
Auburn Doubledays SS-A .599
Midwest A .645
South Atlantic A .675
Hagerstown Suns A .736
California A+ .702
Carolina A+ .672
Florida State A+ .653
Potomac Nationals A+ .679
Eastern AA .699
Texas AA .669
Southern AA .678
Harrisburg AA .685
Pacific Coast AAA .727
International AAA .677
Syracuse AAA .631

Note: Data covers the season up to 7/1/2016

Here’s a look at the data graphically:


Auburn is a small sample size so I wouldn’t pay to much attention to the short season portion of the graph just yet. Hagerstown is our best performing offensive team based on OPS. Their team OPS is better than the average OPS for the two leagues at their level(South Atlantic League and California League). Overall Hagerstown(.735) has the second best team OPS in their League(1st is Asheville .756) and third best OPS for their league(1st is Bowling Green at .764). Harrisburgs and Potomac are performing at a little over League average each. On the other end of the spectrum from Hagerstown, Syracuse has a bottom 5 team OPS.

Here’s a look at only the Nationals Minor League affiliates OPS:


In my next blog post I’m going to look at two of the catalyst of the Hagerstown offense Max Schrock and Victor Robles.

Part 2 Source Code:

[code language=”r”]

minors_batting &lt;- getDfFromDir(&quot;dataDir&quot;)
#Only use cases with more than 20PA’s
minors_batting &lt;- filter(minors_batting, PA&gt;20)
minors_batting &lt;- minors_batting[complete.cases(minors_batting),]

minors_tm_ops &lt;- ddply(minors_batting, .(Tm,Lg,Lvl,Franchise), summarise, ops=mean(OPS))
minors_lg_ops &lt;- ddply(minors_batting, .(Lg,Lvl), summarise, ops=mean(OPS))
aff_ops &lt;- ddply(minors_batting[minors_batting$Franchise==&quot;Washington Nationals&quot;,],.(Tm,Lvl), summarise, ops=mean(OPS))

lg_melt_data &lt;- melt(minors_lg_ops)
aff_melt_data &lt;- melt(aff_ops)

#Rename vars so you can bind
colnames(lg_melt_data) &lt;- c(&quot;lg_tm&quot;, &quot;lvl&quot;, &quot;variable&quot;, &quot;value&quot;)
colnames(aff_melt_data) &lt;- c(&quot;lg_tm&quot;, &quot;lvl&quot;, &quot;variable&quot;, &quot;value&quot;)
total_melt_data &lt;- rbind(lg_melt_data,aff_melt_data)

ops_graph &lt;- ggplot(data=total_melt_data, aes(x=lvl, value, fill=lg_tm)) + geom_bar(stat=&quot;identity&quot;, position=&quot;dodge&quot;) + ggtitle(&quot;WSH Minors OPS Per Level&quot;)

#Graph of only the nationals
nats_ops_graph &lt;- ggplot(data=aff_melt_data, aes(x=lvl, value, fill=lg_tm)) + geom_bar(stat=&quot;identity&quot;, position=&quot;dodge&quot;) + ggtitle(&quot;WSH Minors OPS&quot;)