
Modeling Strikeout Rate between minor league levels


In this post I’ll go over my results for predicting strikeout rates between minor league levels. This article will cover the following:

Data

Data Wrangling

Graphs and Correlation

Model and Evaluation

Data

This time around I've changed my approach so I can do some cross-validation. The article covers data from 2004-2015, but I'll be training my model on data from 2004-2013 and evaluating it on the 2014-2015 data. The dataset consists of 39,349 data points and came from Baseball Reference. The data points represent minor league data from Short Season (SS-A) to AAA ball. I removed the SS-A data because I'm currently only modeling data between the full-season leagues (A-AAA). Also, a player's season was only included if he had >=200 plate appearances.

Data Wrangling

In order to model the data between minor league levels I need to do some data wrangling to get the dataframe into the format I need. The original data has each player's season as a separate entry.

Snippet from the original dataframe. Each entry represents a year and minor league level the stats are for.

In order for me to graph and get correlation values between minor league levels I need all this data on one row with the stats for each level represented by a column. Below you can see a snippet of the dataframe I use for my analysis:

Snippet of the correlation dataframe.

Notice how in the dataframe above all the stats I need for each level have been merged into one row.
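The merge described above can be sketched in base R. This is a minimal sketch on toy data; the column names (`Name`, `Lev`, `SOrate`) are assumptions standing in for whatever the real Baseball Reference dataframe uses:

```r
# Toy data in the original shape: one row per player/level (column names assumed)
seasons <- data.frame(
  Name   = c("W. Ramos", "W. Ramos", "J. Smith"),
  Lev    = c("A", "A+", "A"),
  SOrate = c(0.18, 0.16, 0.22)
)

# Merge each level's stats into one row per player; suffixes label the level
a_rows  <- subset(seasons, Lev == "A",  select = c(Name, SOrate))
ap_rows <- subset(seasons, Lev == "A+", select = c(Name, SOrate))
one_row <- merge(a_rows, ap_rows, by = "Name", suffixes = c(".A", ".Aplus"))
one_row   # one row for W. Ramos: SOrate.A = 0.18, SOrate.Aplus = 0.16
```

Because `merge` defaults to an inner join, players who never appeared at both levels (like the hypothetical J. Smith above) drop out automatically.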

Graphs and Correlation

Graphs showing the scatter plots and regression lines for the levels of minor league data I modeled.

As you can see from the graphs above, a positive linear relationship exists for strikeout rate between the minor league levels (A to A+, A+ to AA, AA to AAA) I've analyzed. Here are the correlation values for each level:

  • A to A+ :  0.7532319
  • A+ to AA : 0.7717004
  • AA to AAA : 0.7666475

From the numbers and graphs above you can see a 'strong' positive correlation exists for strikeout rate between levels.
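Correlation values like these come straight from R's `cor()`, which defaults to Pearson correlation. A sketch with made-up paired strikeout rates (the real vectors would be the merged per-level columns):

```r
# Hypothetical paired strikeout rates for players who appeared at both levels
so_a     <- c(0.20, 0.15, 0.25, 0.18, 0.22)
so_aplus <- c(0.19, 0.14, 0.27, 0.17, 0.24)

cor(so_a, so_aplus)   # Pearson correlation coefficient, the default method
```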

Model and Evaluation

The models for the regression line in the graphs above are:

  • A to A+ : A+ SO Rate = .7598*(A SO Rate) + .04591
  • A+ to AA: AA SO Rate = .83204*(A+ SO Rate) + .03608
  • AA to AAA: AAA SO Rate = .80664*(AA SO Rate) + .04147

The ‘Doing Data Science’ book suggests using R-squared, p-values, and cross-validation to validate linear models. For this article I'll be using R-squared and cross-validation. Here are the R-squared values for each model:

  • A to A+: .5674
  • A+ to AA: .5955
  • AA to AAA: .5877
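Coefficients and R-squared values like the ones above are what `lm()` and `summary()` report. A minimal sketch on simulated data; the variable names are assumptions, and the data is generated to resemble the A to A+ model rather than being the original dataset:

```r
set.seed(1)
# Simulate A-level strikeout rates and a noisy A+ response shaped like the fitted model
so_a     <- runif(200, 0.10, 0.30)
so_aplus <- 0.7598 * so_a + 0.04591 + rnorm(200, sd = 0.02)

fit <- lm(so_aplus ~ so_a)
coef(fit)                 # intercept and slope of the regression line
summary(fit)$r.squared    # the R-squared statistic reported for each model
```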

To do cross-validation I'm going to use the data from 2014-2015. This dataset consists of 8,198 points. I performed the same steps described in the data wrangling section above, which brought the dataframe I do my analysis on down to 427 points. The correlation numbers remained strong per level:

  • A to A+: 0.7366793
  • A+ to AA: 0.729288
  • AA to AAA: 0.7794951

Here is a graph showing the regression line against the 2014-2015 data:


To judge how often the model is correct, I once again used the classification provided by Fangraphs in this chart:

Picture retrieved from http://www.fangraphs.com/library/offense/rate-stats/

This time I used the average difference between K% classifications, which came out to .0291667. So if my model is more than ~.03 off the actual strikeout rate, I count it as wrong for that data point. Here are my results for each level:
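The correct/incorrect tallies below come from comparing each prediction's absolute error to that margin. A sketch with made-up predicted and actual rates:

```r
margin <- 0.0291667   # average gap between Fangraphs K% classifications

predicted <- c(0.21, 0.18, 0.25)   # model output (hypothetical values)
actual    <- c(0.20, 0.23, 0.26)   # observed next-level K% (hypothetical values)

# A prediction counts as correct when its absolute error is within the margin
correct <- abs(predicted - actual) <= margin
sum(correct)          # number of correct predictions (2 of 3 here)
mean(correct) * 100   # percentage correct
```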

A to A+:

  • Incorrect: 48
  • Correct: 66
  • Percentage Correct: 57.89%

A+ to AA:

  • Incorrect:78
  • Correct: 93
  • Percentage Correct: 54.39%

AA to AAA:

  • Incorrect: 52
  • Correct: 74
  • Percentage Correct: 58.73%

 

Modeling Walk Rate between minor league levels


After reading through Projecting X by Mike Podhorzer I decided to try to predict some rate statistics between minor league levels. Mike states in his book, “Projecting rates makes it dramatically easier to adjust a forecast if necessary.” Therefore, if a player is injured or will only have a certain number of plate appearances that year, I can still attempt to project performance. The first rate statistic I'm going to attempt to project is walk rate between minor league levels. This article will cover the following:

Raw Data

Data Cleaning

Correlation and Graphs

Model and Results

Examples

Raw Data

For my model I used data from Baseball Reference, covering the last 7 years of minor league data (2009-2015). Accounting for the Short Season A (SS-A) to AAA affiliates, I ended up with 28,316 data points for my analysis.

Data Cleaning

I'm using R, and the original dataframe had each year of a player's data in a different row. In order to do the calculations I wanted, I needed to move each player's career minor league data to the same row. I also noticed I needed to filter on plate appearances during a season to get rid of noise: for example, a player on a rehab assignment in the minor leagues, or a player who was injured for most of the year and only had 50-100 plate appearances. The minimum plate appearances I settled on was 200 for a player to be factored into the model. Another thing I did to remove noise was to only model player performance between the full-season leagues (A, A+, AA, AAA). Once the cleaning was done I had the following data points for each level:

  • A to A+ : 1129
  • A+ to AA: 1023
  • AA to AAA: 705
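The two filters described above can be sketched with `subset()`. The column names (`Lev`, `PA`) and the toy rows are assumptions for illustration:

```r
seasons <- data.frame(
  Name = c("Player A", "Player B", "Player C"),
  Lev  = c("A", "SS-A", "AA"),
  PA   = c(450, 300, 150)
)

# Keep full-season levels only, and require at least 200 plate appearances
full_season <- c("A", "A+", "AA", "AAA")
cleaned <- subset(seasons, PA >= 200 & Lev %in% full_season)
cleaned   # only Player A survives both filters
```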

Correlation and Graphs

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

  • A to A+ : .6301594
  • A+ to AA: .6141332
  • AA to AAA: .620662

Here are the graphs for each level:


Model and Results

The linear models for each level are:

  • A to A+: A+ BB% = .63184*(A BB%) + .02882
  • A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
  • AA to AAA: AAA BB% = .5682*(AA BB%) + .0342
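Applied as a plain R function, each fitted line turns a current-level walk rate into a projection for the next level. A sketch using the A to A+ coefficients above (the function name is my own, not from the original code):

```r
# A -> A+ walk-rate projection using the fitted coefficients above
predict_bb_aplus <- function(bb_a) 0.63184 * bb_a + 0.02882

predict_bb_aplus(0.10)   # a 10% walk rate at A projects to ~9.2% at A+
```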

In order to interpret the success or failure of my results I compared how close I was to getting the actual walk rate. Fangraphs has a great rating scale for walk rate at the Major League level:

Image from Fangraphs

The image above gives a classification for multiple levels of walk rates. While it's based on major league data, it's a good starting point for deciding a margin of error for my model. The mean difference between each level in the Fangraphs table is .0183333, which I rounded up to a margin of error of .02. So if my predicted value for a player's walk rate was within .02 of the actual value, I counted the model as correct for that player; if the error was greater than that, it was wrong. Here are the model's results for each level:

  • A to A+
    • Incorrect: 450
    • Correct: 679
    • Percentage Correct: ~.6014
  • A+ to AA
    • Incorrect: 445
    • Correct: 578
    • Percentage Correct: ~.565
  • AA to AAA
    • Incorrect: 278
    • Correct: 427
    • Percentage Correct: ~.6056

When I moved the cutoff up a percentage point to .03, the model's results drastically improve:

  • A to A+
    • Incorrect: 228
    • Correct: 901
    • Percentage Correct: ~.798
  • A+ to AA
    • Incorrect: 246
    • Correct: 777
    • Percentage Correct: ~.7595
  • AA to AAA
    • Incorrect: 144
    • Correct: 561
    • Percentage Correct: ~.7957

Examples

Numbers are cool, but where are the actual examples? OK, let's start with my worst prediction. The largest error I had was between A and A+, and the error was >10% (~.1105). The player in this case was Joey Gallo; a quick glance at his player page shows his A walk rate was only .1076 while his A+ walk rate was .2073, a 10% improvement between levels. So why did this happen, and why didn't my model do a better job of predicting it? Currently the model only accounts for the previous season's walk rate, but what if a player is getting a lot of hits at one level and stops swinging as much at the next? In Gallo's case, he hit only .245 in his year at A ball, so that wasn't it. More investigation is required to see how the model can get closer on edge cases like this.

Gallo dataframe snippet.

The lowest I was able to set the error threshold and still get results back was ~.00004417. That very close prediction belongs to Erik Gonzalez. I don't know much about Erik Gonzalez, so I kept looking; setting the minimum error to .0002 brought back Stephen Lombardozzi as one of six results. Lombo's interesting to hardcore Nats fans (like myself), but I wanted a more notable name. Finally, after upping the threshold to .003 for the A to A+ data, I saw that the model successfully predicted the walk rate of the Houston Astros' multi-time All-Star 2B Jose Altuve within a .003 margin of error.

Altuve dataframe snippet.

 

What's Next:

  • Improve the model to get a lower max error
  • Predict strikeout rate between levels
  • Predict more advanced statistics like wOBA/OPS/wRC

 

Correlation between Salary Cap and Winning?


After my initial blog post looking at how much each team spends per position group, I wanted to see whether there is any correlation between how much teams spend on a position group and winning. To do this I needed to merge the cap data from Spotrac and season summary data from Pro-Football-Reference. I merged these datasets over the last 5 years, but it'd be interesting to find data going back to when the salary cap was put in place (1994). Here's a graph of my yearly findings for the last 5 years (2011-2015).
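The merge itself is a standard `merge()` on team and year. The column names and values here are assumptions standing in for the Spotrac and Pro-Football-Reference fields:

```r
# Hypothetical cap spending and season results, keyed by team and year
cap  <- data.frame(team = c("DAL", "NE", "SEA", "CLE"),
                   year = 2015,
                   qb_cap = c(20.0e6, 15.0e6, 0.75e6, 9.0e6))
wins <- data.frame(team = c("DAL", "NE", "SEA", "CLE"),
                   year = 2015,
                   W = c(4, 12, 10, 3))

# Join spending to season results, then correlate spend with wins
merged <- merge(cap, wins, by = c("team", "year"))
cor(merged$qb_cap, merged$W)
```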


Quick review on correlation from Pearson:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  •  .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”

As you can see from the graph, the correlation numbers aren't exactly high. I believe that's because the best players aren't necessarily getting paid the most money. For example, before last year Russell Wilson was on his rookie contract, and the Seahawks were making the playoffs year after year while only paying Russell $749,176 a year. He no doubt had a lot to do with the Seahawks winning, both then and going forward, but his salary before his new contract wouldn't have correlated much with winning. Examples like this can be found at every position, which is why it's necessary to have a good front office that continually brings in young talent that can contribute at a lower price. Correlating actual on-field stats with winning would be a much better exercise than merging cap data and looking for correlation.

Position Correlation
DB 0.05356608
DL -0.10102064
LB 0.14434313
OL -0.0783075
QB 0.08256776
RB -0.0917516
ST -0.05109986
TE -0.04013766
WR 0.07491824
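Per-position numbers like the table above come from correlating each position group's spend column with wins. An `sapply` sketch over hypothetical columns (the spend figures, in $M, and the two positions shown are made up for illustration):

```r
# Hypothetical team-season rows: spend per position group (in $M) plus wins
spend <- data.frame(DB = c(26, 30, 22, 28, 24),
                    QB = c(17, 12, 20, 9, 14),
                    W  = c(9, 7, 11, 5, 8))

# Correlate every position column with wins in one pass
pos_cor <- sapply(c("DB", "QB"), function(pos) cor(spend[[pos]], spend$W))
pos_cor
```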

Over the last 5 years there's not one correlation greater than 'very weak'. But the positions that do have positive correlations (DB, LB, QB, and WR) are the positions that GMs over the last 5 years have been willing to pay. Left tackle is another position that has been getting paid very well, because left tackles are usually the ones protecting a QB's blind side.

This data emphasizes the importance of drafting well, because spending money on a particular position does not correlate with your team winning. It's also interesting to note that the positions with positive correlations may be that way because those players have made it to their second or third contracts and are now getting the big deals. So it makes me wonder if DB, LB, QB, and WR are the positions with the longest careers in the NFL.


Where is your favorite NFL teams cap space going…Part2(AFC)


Part 2 takes a look at what teams in the AFC are doing with their cap space.

AFC East

AFC North

AFC South

AFC West

AFC East

The biggest thing standing out to me below is that Miami is spending a lot of money on their DL: 2 times the league average and almost 30% of their cap space.


Buffalo Bills Miami Dolphins New England Patriots New York Jets NFL
DB 29,479,496 27,616,499 17,018,794 38,402,025 26,240,056
DL 33,279,980 43,671,818 26,648,167 23,376,055 21,559,330
LB 16,794,623 10,078,824 18,192,447 14,642,518 21,687,341
OL 23,881,580 32,341,251 28,246,427 31,934,742 25,224,340
QB 6,056,322 13,124,470 15,120,305 15,107,897 17,635,072
RB 10,860,734 4,127,886 5,546,772 7,035,895 7,603,650
ST 3,648,334 2,021,473 5,900,000 4,296,447 4,598,119
TE 7,461,559 11,253,250 14,119,263 2,457,483 7,136,520
WR 13,762,349 7,990,195 18,808,154 23,807,352 17,016,041
Total 145,224,977 152,225,666 149,600,329 161,060,414 148,700,468


team_lg cap
Buffalo Bills 145,224,977.00
Miami Dolphins 152,225,666.00
New England Patriots 149,600,329.00
New York Jets 161,060,414.00
NFL 148,700,467.53

AFC North


Baltimore Ravens Cincinnati Bengals Cleveland Browns Pittsburgh Steelers NFL
DB 32,608,731 30,917,521 36,129,121 21,018,906 26,240,056
DL 7,940,348 31,438,474 15,739,186 14,649,537 21,559,330
LB 23,899,629 17,117,432 24,706,482 29,272,432 21,687,341
OL 17,769,960 25,827,523 20,441,486 31,713,791 25,224,340
QB 24,053,334 13,750,413 16,804,891 25,414,805 17,635,072
RB 6,578,244 8,165,445 2,229,419 5,333,109 7,603,650
ST 7,540,000 5,521,667 2,833,000 2,833,000 4,598,119
TE 7,916,918 3,396,211 2,987,732 6,936,562 7,136,520
WR 12,972,115 17,898,407 13,766,584 16,140,272 17,016,041
Total 141,279,279 154,033,093 135,637,901 153,312,414 148,700,468


team_lg cap
Baltimore Ravens 141,279,279.00
Cincinnati Bengals 154,033,093.00
Cleveland Browns 135,637,901.00
Pittsburgh Steelers 153,312,414.00
NFL 148,700,467.53

AFC South


Houston Texans Indianapolis Colts Jacksonville Jaguars Tennessee Titans NFL
DB 27,192,446 21,622,749 28,048,803 31,983,674 26,240,056
DL 22,806,964 10,970,469 35,629,261 20,257,121 21,559,330
LB 27,071,210 26,746,775 17,325,094 23,470,780 21,687,341
OL 29,927,612 25,770,042 23,051,277 21,491,313 25,224,340
QB 14,412,646 19,906,667 10,414,198 7,565,734 17,635,072
RB 10,585,079 5,811,457 6,899,223 12,402,648 7,603,650
ST 3,355,000 6,750,000 3,500,000 6,235,666 4,598,119
TE 2,310,164 11,855,251 11,413,527 11,355,208 7,136,520
WR 8,709,817 17,229,506 13,006,178 18,252,871 17,016,041
Total 146,370,938 146,662,916 149,287,561 153,015,015 148,700,468


team_lg cap
Houston Texans 146,370,938.00
Indianapolis Colts 146,662,916.00
Jacksonville Jaguars 149,287,561.00
Tennessee Titans 153,015,015.00
NFL 148,700,467.53

AFC West


Denver Broncos Kansas City Chiefs Oakland Raiders San Diego Chargers NFL
DB 33,253,262 20,927,181 34,799,467 26,196,974 26,240,056
DL 15,847,747 17,431,156 13,215,815 17,830,448 21,559,330
LB 27,946,958 38,406,775 30,579,411 23,274,907 21,687,341
OL 15,654,114 21,758,228 37,579,810 25,404,464 25,224,340
QB 9,261,158 19,131,035 4,631,343 22,280,000 17,635,072
RB 9,412,971 10,055,670 9,640,686 7,144,037 7,603,650
ST 5,175,002 5,057,334 10,187,500 2,437,622 4,598,119
TE 6,886,868 3,948,789 5,235,006 7,944,722 7,136,520
WR 25,047,116 20,132,805 18,191,836 17,032,784 17,016,041
Total 148,485,196 156,848,973 164,060,874 149,545,958 148,700,468


team_lg cap
Denver Broncos 148,485,196.00
Kansas City Chiefs 156,848,973.00
Oakland Raiders 164,060,874.00
San Diego Chargers 149,545,958.00
NFL 148,700,467.53
Enriching Datasets with R


If you have a simple dataset and some additional statistics you'd like to add to it, you can easily do that with R. As an example, I'm going to add FIP, wOBA, wRC, and wRAA to a couple of baseball datasets.

To calculate FIP I first needed the following R functions:

[code language=”r”]
#Calculate FIP Constant
fip_constant_calc <- function(pitching_data){
#FIP Constant = lgERA - (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
era = sum(pitching_data["ER"])/sum(pitching_data["IP"]) * 9
lgHR = sum(pitching_data["HR"])
lgBB = sum(pitching_data["BB"])
lgHBP = sum(pitching_data["HBP"])
lgK = sum(pitching_data["SO"])
lgIP = sum(pitching_data["IP"])

fipConstant = era - (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
#print(paste("ERA", era))
return(fipConstant)
}

#Calculate FIP
fip_calc <- function(pitching_data, fipConstant){
#FIP = ((13*HR)+(3*(BB+HBP))-(2*K))/IP + constant
hr = as.numeric(pitching_data["HR"])
bb = as.numeric(pitching_data["BB"])
hbp = as.numeric(pitching_data["HBP"])
k = as.numeric(pitching_data["SO"])
ip = as.numeric(pitching_data["IP"])

fip = ((13*hr)+(3*(bb+hbp))-2*(k))/ip + fipConstant

return(fip)
}
[/code]

Once you have the necessary functions to calculate the stat you can use apply in R to apply the function to your dataframe.

 

[code language=”r”]
#Get Data
fg_pitching_data = read.csv("data/2007/FG_MLB_Pitching_Std_All.csv")
constant <- fip_constant_calc(fg_pitching_data)
fg_pitching_data$fip <- apply(fg_pitching_data, 1, fip_calc, fipConstant=constant)
[/code]

Here are some R functions to calculate wOBA, wRC, and wRAA.

[code language=”r”]
#Calculate wOBA based on weights using FG formula
woba_calc_weights_fg <- function(row, weights){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits - triples - doubles - hr
ab <- as.numeric(row["AB"])
ibb <- as.numeric(row["IBB"])
sf <- as.numeric(row["SF"])

numerator <- as.numeric(weights["wobaBB"])*(bb-ibb)+as.numeric(weights["wobaHB"])*hbp+as.numeric(weights["woba1B"])*singles+as.numeric(weights["woba2B"])*doubles+as.numeric(weights["woba3B"])*triples+as.numeric(weights["wobaHR"])*hr
denom <- ab + bb - ibb + sf + hbp

return(numerator/denom)
}

#http://www.fangraphs.com/library/offense/wraa/
wraa_calc_fg <- function(row, weights){
numerator <- as.numeric(row["woba"]) - as.numeric(weights["woba"])
denom <- as.numeric(weights["wobaScale"])

result = (numerator/denom) * (as.numeric(row["PA"]))

return(result)
}

wrc_calc <- function(row, weights){
woba = as.numeric(row["woba"])
lgWOBA = as.numeric(weights["woba"])
wobaScale = as.numeric(weights["wobaScale"])
lgR = as.numeric(weights["lgR"])
lgPA = as.numeric(weights["lgPA"])
pa = as.numeric(row["PA"])
wrc = (((woba-lgWOBA)/wobaScale)+(lgR/lgPA))*pa

return(wrc)
}
[/code]

Once you have the functions necessary in place you can add the statistics to your dataset using this code:

[code language=”r”]
fg_batting_data = read.csv("data/2007/FG_MLB_Batting_Std_All.csv")
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
fg_batting_data$woba = apply(fg_batting_data, 1, woba_calc_weights_fg, weights=wobaWeights)
fg_batting_data$wraa = apply(fg_batting_data, 1, wraa_calc_fg, weights=wobaWeights)
fg_batting_data$wrc = apply(fg_batting_data, 1, wrc_calc, weights=wobaWeights)
[/code]

Note: ‘get_woba_weights’ was addressed in the Learning wOBA blog post.

Learning wOBA


As I continue to learn R and go down the road of becoming a data scientist, I need to learn how to use and compute advanced statistics. The first advanced statistic I'm going to learn to compute is weighted on-base average (wOBA). Weighted on-base average combines all the parts of a player's offensive game and gives each the appropriate weight for its impact on the game. For example, a HR is weighted more than a BB or a single because a HR guarantees a team at least one run, while a BB or single leads to a run a much lower percentage of the time. General statistics like BA/SLG/OBP only look at a piece of a player's offensive game, which is why wOBA is a better tool for evaluating a player's offensive contribution. Fangraphs does a great job of describing the details in the following links:

The links above give a lot of insight on why you should use wOBA. Tom Tango describes the formula in The Book. He also has the standard formula for wOBA, with and without speed, on his site. I used the standard formula from his site for my initial calculations with R.

The following R code has the functions I use to calculate speed and non speed wOBA:

 

[code language=”r”]
#Link for standard WOBA formula
#http://tangotiger.com/index.php/site/article/standard-woba
woba_calc_basic <- function(row){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits - triples - doubles - hr
pa <- as.numeric(row["PA"])
ibb <- as.numeric(row["IBB"])
sh <- as.numeric(row["SH"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr
denominator <- pa - ibb - sh

return(numerator/denominator)
}

woba_calc_speed <- function(row){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits - triples - doubles - hr
pa <- as.numeric(row["PA"])
ibb <- as.numeric(row["IBB"])
sh <- as.numeric(row["SH"])
sb <- as.numeric(row["SB"])
cs <- as.numeric(row["CS"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr + .25*sb + -.5*cs
denominator <- pa - ibb - sh

return(numerator/denominator)
}
[/code]

Once I had those two formulas in place, I retrieved some Nationals data from Baseball-Reference, then used the code below to apply the two functions to a dataframe.

 

[code language=”r”]
data = read.csv("data/FG_MLB_Batting_Std_All.csv")

#Apply basic woba
data$woba_basic = apply(data, 1, woba_calc_basic)

#Apply speed woba
data$woba_speed = apply(data, 1, woba_calc_speed)

[/code]

Once I had that in place I noticed that Fangraphs kept mentioning 'wOBA scale' and how it and wOBA could be used to generate another stat called Weighted Runs Above Average. After some digging I found an old article by Tom Tango on how to compute yearly wOBA from scratch. Above I've just used some standard weights to compute a standard wOBA, but standard wOBA doesn't take into account the yearly offensive environment players are playing in: some years an offensive action is worth more than others, based on how the whole league is performing. Tom Tango's article on computing yearly wOBA teaches you how to get the yearly weights and calculate the wOBA scale. Continuing down this R/wOBA rabbit hole, I transcribed his SQL script into R functions.

To compute the offensive environment (Steps 1 and 2 in Tom Tango's script) I use this function on pitching data:

 

[code language=”r”]
get_runvalues <- function(pitching_data){
#Calculate runs per out
stat <- sum(pitching_data["R"])/(sum(pitching_data["IP"])*3)
rPerOut <- c(stat)

#Calculate runs per bb
stat <- rPerOut + .14
rPerBB <- c(stat)

#Runs per HB
stat <- rPerBB + .025
rPerHB <- c(stat)

#Runs Per Single
stat <- rPerBB + .155
rPer1B <- c(stat)

#Runs Per Double
stat <- rPer1B + .3
rPer2B <- c(stat)

#Runs Per Triple
stat <- rPer2B + .27
rPer3B <- c(stat)

#Runs per HR
rHR <- 1.4

rSB <- .2

rCS <- 2*rPerOut + .075

df <- data.frame(rPerOut, rPerBB, rPerHB, rPer1B, rPer2B, rPer3B, rHR, rSB, rCS)

return(df)
}

[/code]

Step 3 in the SQL script uses the run values to generate the weights needed to calculate wOBA. The following R functions accomplish this:

[code language=”r”]
get_woba_runsMinus <- function(batting, runValues){
bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) - sum(batting["IBB"]))
hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

singles <- sum(batting["H"]) - sum(batting["X2B"]) - sum(batting["X3B"]) - sum(batting["HR"])
singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator - csNumerator

#print(paste("numerator is, ", numerator))

denom <- sum(batting["AB"]) - sum(batting["H"]) + sum(batting["SF"])
#print(paste("denominator is, ", denom))

return(numerator/denom)
}

get_woba_runsPlus <- function(batting, runValues){
#Same numerator as runsMinus; should probably combine them and output a df
bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) - sum(batting["IBB"]))
hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

singles <- sum(batting["H"]) - sum(batting["X2B"]) - sum(batting["X3B"]) - sum(batting["HR"])
singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator - csNumerator
denom <- (sum(batting["BB"]) - sum(batting["IBB"])) + sum(batting["HBP"]) + sum(batting["H"])

return(numerator/denom)
}

woba_calc_lgAvgSpeed <- function(batting, runValues){
numerator <- sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"]) + sum(batting["SB"]) - sum(batting["CS"])
denom <- sum(batting["AB"]) + sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

return(numerator/denom)
}

woba_calc_lgAvgReg <- function(batting, runValues){
numerator <- sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"])
denom <- sum(batting["AB"]) + sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

return(numerator/denom)
}

get_woba_weights <- function(pitching, batting){
#Get the run values
run_values <- get_runvalues(pitching)

#Compute runs minus
runs_minus <- get_woba_runsMinus(batting, run_values)

runs_plus <- get_woba_runsPlus(batting, run_values)

#woba <- woba_calc_lgAvgSpeed(batting, run_values)
woba <- woba_calc_lgAvgReg(batting, run_values)

wobaScale <- 1/(runs_plus + runs_minus)

wobaBB <- (as.numeric(run_values["rPerBB"]) + runs_minus)*wobaScale
wobaHB <- (as.numeric(run_values["rPerHB"]) + runs_minus)*wobaScale
woba1B <- (as.numeric(run_values["rPer1B"]) + runs_minus)*wobaScale
woba2B <- (as.numeric(run_values["rPer2B"]) + runs_minus)*wobaScale
woba3B <- (as.numeric(run_values["rPer3B"]) + runs_minus)*wobaScale
wobaHR <- (as.numeric(run_values["rHR"]) + runs_minus)*wobaScale
wobaSB <- as.numeric(run_values["rSB"])*wobaScale
wobaCS <- as.numeric(run_values["rCS"])*wobaScale

df <- data.frame(run_values, runs_minus, runs_plus, woba, wobaScale, wobaBB, wobaHB, woba1B, woba2B, woba3B, wobaHR, wobaSB, wobaCS)

return(df)
}
[/code]

With all this in place, I needed to test my functions against the table he has here. The data I used for the test is from Fangraphs. The code below calculates the wOBA weights:

[code language=”r”]
fg_batting_data = read.csv("data/FG_MLB_Batting_Std_All.csv")
fg_pitching_data = read.csv("data/FG_MLB_Pitching_Std_All.csv")
min_pa = subset(fg_batting_data, PA > 80)
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
wobaWeightsFiltered = get_woba_weights(fg_pitching_data, min_pa)
wobaWeights
[/code]

Here’s the output for the wOBA weights for 2007:

[code]
rPerOut rPerBB rPerHB rPer1B rPer2B rPer3B rHR rSB rCS
1 0.1796024 0.3196024 0.3446024 0.4746024 0.7746024 1.044602 1.4 0.2 0.4342048
runs_minus runs_plus woba wobaScale wobaBB wobaHB woba1B woba2B
1 0.2792543 0.5642809 0.3310523 1.185487 0.7099368 0.739574 0.8936873 1.249333
woba3B wobaHR wobaSB wobaCS
1 1.569415 1.990734 0.2370974 0.5147441
[/code]

These numbers line up with the numbers from Tom Tango's table for 2007 that I linked above. Here's a side-by-side look at the numbers:

label original calc my calc
year 2007 2007
rPerOut .179 0.1796024
rPerBB .32 .3196024
rPerHB .34 .3446024
rPer1B .47 .4746024
rPer2B .77 .7746024
rPer3B 1.04 1.044602
rHR 1.40 1.4
rSB .2 .2
rCS .43 .4342048
rMinus .28 .2792543
rPlus .56 .5642809
wOBA .331 .3310523
wobaScale 1.19 1.185487
wobaBB .71 .7099368
wobaHB .74 .739574
woba1B .89 .8936873
woba2B 1.25 1.249333
woba3B 1.57 1.569415
wobaHR 1.99 1.990734
wobaSB .24 .2370974
wobaCS .51 .5147441

With some rounding, the weights I produced with the R code above line up with Tom Tango's original chart for 2007. Since I'm now able to calculate wOBA weights from scratch, I'll be able to calculate wRAA and apply wOBA in other offensive environments.

Exploratory Data Analysis of Nationals Minor League Pitching Stats


After attending SSAC this year I decided one of the skills I need to pick up is R. Now that I've finally finished grad school, I have the time. The best way for me to learn is to get some data I'm actually interested in. Daily I look up Nationals minor league statistics to see how the upcoming Nationals are doing, so minor league data made a lot of sense for me to collect, and Exploratory Data Analysis (seeing the data) is a necessary step before going further in data science. I was interested in seeing how the Nationals affiliates were doing in comparison to their leagues. For pitching I decided to compare team ERA with the rest of the league. ERA measures earned runs allowed per nine innings: a low ERA is a good thing, and the higher the ERA, the more runs a team is giving up per nine innings.

Note: Data covers the season up to 7/1/2016

So how many data points do I actually have per level?

Level Data Points
SS-A 396
A 629
A+ 737
AA 684
AAA 800

For each League at each level I wanted to get the average team ERA and compare it to how the Nationals affiliates are doing. In the below table you can see those numbers.

League/Team Level ERA
New York Penn League SS-A 3.46
Northwest League SS-A 3.71
Auburn Doubledays SS-A 3.57
Midwest A 3.55
South Atlantic A 3.79
Hagerstown Suns A 3.73
California A+ 4.15
Carolina A+ 3.80
Florida State A+ 3.49
Potomac Nationals A+ 3.61
Eastern AA 3.91
Texas AA 3.69
Southern AA 3.76
Harrisburg AA 3.72
Pacific Coast AAA 4.50
International AAA 3.65
Syracuse AAA 3.84

To see that table in another way I also put this data into a graph using R.


And here’s what the teams look like compared side by side in a graph:


When I originally wrote this post a couple of weeks ago, Harrisburg had the best-performing Nationals staff relative to its league, which came as no surprise since the staff was headlined by two of the Nationals' top pitching prospects, Lucas Giolito and Reynaldo Lopez.

Now looking at the data up to 7/1, Potomac and Auburn (small sample size) have the best team ERAs overall. Auburn's season just started, so it's probably best to wait a little longer before making any statements; currently their staff ERA is not outperforming the league average. Harrisburg and Potomac are both outperforming their respective leagues, and Hagerstown is about even with league average. Syracuse's ERA is a bit high compared to how other staffs in the International League are performing.

Part 2. Source Code

The data itself was retrieved in CSV format from Baseball-Reference. Each team's data was put into a folder and then read into a dataframe using the following function:

[code language=”r”]
#Read files from a directory into a dataframe
#http://www.r-bloggers.com/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
getDfFromDir <- function(csvDir){
fileList <- list.files(csvDir)

outputFromDfDir = NULL
for(file in fileList){
fullPath <- paste(csvDir, "/", file, sep="")
#print("File name is")
#print(fullPath)
if(is.null(outputFromDfDir)){
#print("Hit null")
outputFromDfDir <- read.csv(fullPath)
}else{
tmp_ds <- read.csv(fullPath)
outputFromDfDir <- rbind(outputFromDfDir, tmp_ds)
rm(tmp_ds)
#print("Hit else")
}
}

return(outputFromDfDir)
}
[/code]

Then to do the graphs and get the data I needed I used the following code:

[code language=”r”]
#install.packages(c("ggplot2", "plyr", "reshape2"))
library(ggplot2)
library(plyr)
library(reshape2)

#Read in all the minors pitching data from directory
minors_pitching <- getDfFromDir("sport_data/product/all/baseball-reference/2016/csv/pitching/")
summary(minors_pitching$Lvl)

#See if you can compute staff era for each team
minors_tm_era <- ddply(minors_pitching,.(Tm,Lg,Lvl,Franchise), summarise, era=(sum(ER)/sum(IP))*9)
#Remove NA rows
minors_tm_era <- minors_tm_era[complete.cases(minors_tm_era),]
minors_tm_era

#Level Data
minors_lvl_era <- ddply(minors_pitching,.(Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_lvl_era <- minors_lvl_era[complete.cases(minors_lvl_era),]
minors_lvl_era

#League Data
minors_Lg_era <- ddply(minors_pitching,.(Lg,Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_Lg_era <- minors_Lg_era[complete.cases(minors_Lg_era),]
minors_Lg_era

wsh_tm_era <- ddply(minors_pitching[minors_pitching$Franchise=="Washington Nationals",],.(Tm,Lvl), summarise, era=(sum(ER)/sum(IP))*9)
wsh_tm_era

lg_melt_data <- melt(minors_Lg_era)
team_melt_data <- melt(wsh_tm_era)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(team_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data,team_melt_data)
total_melt_data

#Graph for era
era_graph <- ggplot(data=total_melt_data, aes(x=lvl, value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA Per Level")
era_graph

#Graph of only the nationals
nats_era_graph <- ggplot(data=team_melt_data, aes(x=lvl, value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA")
nats_era_graph
[/code]