Modeling Hit Rates Between Minor League Levels

I’m working on modeling hit rates for minor league batters between levels. I’d like to take the hit rates (singles (1B/PA), doubles (2B/PA), triples (3B/PA), and home runs (HR/PA)) a player had at their previous minor league level and use them to predict how the player will do at the next level. The data is similar to what I used in the previous articles on walk rates and strikeout rates: it covers 2011-2015, and only players with a minimum of 200 PA were included in the resulting models. Below are the graphs for each level, the models, and some thoughts.

A to A+

[Figure: A to A+ Hit Rates]

A theme throughout the graphs is that the correlation numbers for singles and home runs are high, while those for doubles and triples are very low. The same low correlations for doubles and triples were found in previous research by Matt Klassen at Fangraphs.

Linear models:

  • A+ Single Rate = (A Single Rate)*.53520 + .07452
  • A+ Double Rate = (A Double Rate)*.36379 + .02929
  • A+ Triple Rate = (A Triple Rate)*.403826 + .004743
  • A+ HR Rate = (A HR Rate)*.633131 + .0006235

A+ to AA

[Figure: A+ to AA Hit Rates]

Linear models:

  • AA Single Rate = (A+ Single rate)*.48235 + .07969
  • AA Double Rate = (A+ Double rate)*.22680 + .03389
  • AA Triple Rate = (A+ Triple rate)*.377505 + .003751
  • AA HR Rate = (A+ HR rate)*.534897 + .007925

AA to AAA

[Figure: AA to AAA Hit Rates]

Linear models:

  • AAA Single Rate = (AA Single Rate)*.52767 + .07912
  • AAA Double Rate = (AA Double Rate)*.248769 + .03645
  • AAA Triple Rate = (AA Triple Rate)*.355865 + .003757
  • AAA HR Rate = (AA HR Rate)*.58037 + .00881
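
As a rough illustration of how the three sets of models above could be applied, here is a minimal R sketch. The coefficients are copied from the lists in this post; the hit_rate_models list, the predict_hit_rates helper, and the sample A-ball rates are hypothetical names and numbers made up for the example.

```r
# Slope and intercept for each stat, copied from the linear models above.
# The list layout and helper name are illustrative, not the original code.
hit_rate_models <- list(
  "A to A+" = list(single = c(0.53520, 0.07452), double = c(0.36379, 0.02929),
                   triple = c(0.403826, 0.004743), hr = c(0.633131, 0.0006235)),
  "A+ to AA" = list(single = c(0.48235, 0.07969), double = c(0.22680, 0.03389),
                    triple = c(0.377505, 0.003751), hr = c(0.534897, 0.007925)),
  "AA to AAA" = list(single = c(0.52767, 0.07912), double = c(0.248769, 0.03645),
                     triple = c(0.355865, 0.003757), hr = c(0.58037, 0.00881))
)

# Predict next-level rates: slope * previous rate + intercept, per stat
predict_hit_rates <- function(prev_rates, model) {
  sapply(names(model), function(stat) {
    model[[stat]][1] * prev_rates[[stat]] + model[[stat]][2]
  })
}

# Made-up per-PA rates for a hitter in A ball
a_rates <- c(single = 0.160, double = 0.045, triple = 0.005, hr = 0.020)
aplus_pred <- predict_hit_rates(a_rates, hit_rate_models[["A to A+"]])
print(round(aplus_pred, 4))
```

Because every slope is well below 1, each model pulls unusually high or low rates back toward the middle of the pack at the next level.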

What’s Next:

  • Perform some validation on the above models
  • Combine the models I’ve generated to predict OBP/SLG/OPS
  • Make models that skip levels
  • Make the code more efficient so I can do this faster

 

 

Modeling Strikeout Rate between minor league levels

In this post I’ll go over my results for predicting strikeout rates between minor league levels. This article will cover the following:

  • Data
  • Data Wrangling
  • Graphs and Correlation
  • Model and Evaluation

Data

This time around I’ve changed my approach so I can do some cross-validation, in the form of a simple holdout test. The article covers data from 2004-2015, but I train my model on the 2004-2013 data and evaluate it on the 2014-2015 data. The data consists of 39,349 data points and came from Baseball Reference. The data points represent minor league seasons from Short Season A (SS-A) to AAA ball. I removed the SS-A data because I’m currently only modeling the full-season leagues (A through AAA). Also, a player’s data points were only included if they had at least 200 plate appearances.

Data Wrangling

In order to model the data between minor league levels I need to do some data wrangling to get the dataframe into the format I need. The original data has each player’s season as a separate entry.

[Figure: Snippet from the original dataframe. Each entry represents a year and the minor league level the stats are for.]

In order for me to graph and get correlation values between minor league levels I need all this data on one row with the stats for each level represented by a column. Below you can see a snippet of the dataframe I use for my analysis:

[Figure: Snippet of correlation dataframe.]

Notice how in the dataframe above all the stats I need for each level have been merged into one row.
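
A minimal sketch of that wrangling step in base R is below. The column names (player, level, PA, SO) and the toy rows are assumptions made up for illustration; the real Baseball Reference columns may differ.

```r
# Toy data: one row per player-season-level (hypothetical column names)
seasons <- data.frame(
  player = c("p1", "p1", "p2", "p2", "p3"),
  level  = c("A", "A+", "A", "A+", "A"),
  PA     = c(450, 410, 380, 150, 500),
  SO     = c(90, 95, 60, 40, 110)
)
seasons$so_rate <- seasons$SO / seasons$PA

# Keep only seasons with at least 200 plate appearances
qualified <- subset(seasons, PA >= 200)

# Put a player's rate at the lower and upper level on a single row
pair_levels <- function(df, from, to) {
  lower <- df[df$level == from, c("player", "so_rate")]
  upper <- df[df$level == to, c("player", "so_rate")]
  merge(lower, upper, by = "player", suffixes = c("_from", "_to"))
}

a_to_aplus <- pair_levels(qualified, "A", "A+")
print(a_to_aplus)
```

Only players who qualify at both levels survive the merge. A player with several qualifying seasons at the same level would produce several rows, so those would need to be combined or filtered first.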

Graphs and Correlation

[Figure: Scatter plots and regression lines for the levels of minor league data I modeled.]

As you can see from the graphs above, a positive linear relationship exists for strikeout rate between the minor league levels (A to A+, A+ to AA, AA to AAA) I’ve analyzed. Here are the correlation values for each level:

  • A to A+ :  0.7532319
  • A+ to AA : 0.7717004
  • AA to AAA : 0.7666475

From the numbers above and graphs you can see a ‘strong’ positive correlation exists for the strikeout rate between levels.

Model and Evaluation

The models for the regression line in the graphs above are:

  • A to A+ : A+ SO Rate = .7598*(A SO Rate) + .04591
  • A+ to AA: AA SO Rate = .83204*(A+ SO Rate) + .03608
  • AA to AAA: AAA SO Rate = .80664*(AA SO Rate) + .04147

The ‘Doing Data Science’ book suggests using R-squared, p-values, and cross-validation to validate linear models. For this article I’ll be using R-squared and cross-validation. The R-squared values for each model are:

  • A to A+: .5674
  • A+ to AA: .5955
  • AA to AAA: .5877
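
For a one-variable linear model, R-squared is just the square of the correlation, which is why the values above match the squared correlations from the previous section. Here is a minimal sketch with simulated data (the rates below are randomly generated, not the Baseball Reference data):

```r
set.seed(42)

# Simulated strikeout rates, for illustration only
a_so_rate <- runif(300, min = 0.10, max = 0.35)
aplus_so_rate <- 0.76 * a_so_rate + 0.046 + rnorm(300, sd = 0.03)

fit <- lm(aplus_so_rate ~ a_so_rate)
r <- cor(a_so_rate, aplus_so_rate)
r_squared <- summary(fit)$r.squared

print(coef(fit))                  # intercept and slope of the regression line
print(all.equal(r^2, r_squared))  # TRUE for simple linear regression

# The same relationship holds for the correlations reported in this post
reported_r2 <- round(c(0.7532319, 0.7717004, 0.7666475)^2, 4)
print(reported_r2)
```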

To do cross-validation I’m going to use the data from 2014-2015. This dataset consists of 8198 points. I performed the same steps I described above in the data wrangling section, and that brought the dataframe I do my analysis on down to 427 points. The correlation numbers remained strong per level:

  • A to A+: 0.7366793
  • A+ to AA: 0.729288
  • AA to AAA: 0.7794951

Here is a graph showing the regression line against the 2014-2015 data:

[Figure: regression lines plotted against the 2014-2015 data]

To tell how often I’m correct, I once again used the classification provided by Fangraphs in this chart:

[Figure: Fangraphs rate-stat classification chart. Picture retrieved from http://www.fangraphs.com/library/offense/rate-stats/]

This time I used the average difference between the K% classifications, which comes out to .0291667. So if my model’s prediction is more than ~.03 off the actual strikeout rate, I count it as wrong for that data point. Here are my results for each level:

A to A+:

  • Incorrect: 48
  • Correct: 66
  • Percentage Correct: 57.89%

A+ to AA:

  • Incorrect: 78
  • Correct: 93
  • Percentage Correct: 54.39%

AA to AAA:

  • Incorrect: 52
  • Correct: 74
  • Percentage Correct: 58.73%
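
The scoring rule above can be sketched as follows. The slope and intercept are the A to A+ model from this post and the .03 tolerance is the rounded K% gap; the three-row holdout data frame and the helper names are made up for illustration.

```r
# A to A+ strikeout model from this post
predict_aplus_so <- function(a_so_rate) 0.7598 * a_so_rate + 0.04591

# A prediction counts as correct when it lands within tol of the actual rate
is_correct <- function(predicted, actual, tol = 0.03) abs(predicted - actual) <= tol

# Made-up holdout rows: strikeout rate at A and the following A+ rate
holdout <- data.frame(a_so_rate = c(0.18, 0.25, 0.15),
                      aplus_so_rate = c(0.19, 0.30, 0.11))
holdout$predicted <- predict_aplus_so(holdout$a_so_rate)
holdout$correct <- is_correct(holdout$predicted, holdout$aplus_so_rate)

print(table(holdout$correct))
pct_correct <- 100 * mean(holdout$correct)

# Percentage correct recomputed from the A to A+ counts reported above
reported_pct <- round(100 * 66 / (66 + 48), 2)
print(reported_pct)
```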

 

Modeling Walk Rate between minor league levels

After reading through Projecting X by Mike Podhorzer I decided to try to predict some rate statistics between minor league levels. Mike states in his book, “Projecting rates makes it dramatically easier to adjust a forecast if necessary.” So if a player is injured or will only have a certain number of plate appearances that year, I can still attempt to project performance. The first rate statistic I’m going to attempt to project is walk rate between minor league levels. This article will cover the following:

  • Raw Data
  • Data Cleaning
  • Correlation and Graphs
  • Model and Results
  • Examples

Raw Data

For my model I used the last 7 years of minor league data (2009-2015) from Baseball Reference. Accounting for the Short Season A (SS-A) through AAA affiliates, I ended up with 28,316 data points for my analysis.

Data Cleaning

I’m using R, and the original dataframe had each year of a player’s data in a different row. To do the calculations I wanted, I needed to move each player’s career minor league data onto the same row. I also needed to filter on plate appearances during a season to get rid of noise: for example, a player on a rehab assignment in the minor leagues, or a player who was injured for most of the year and only had 50-100 plate appearances. The minimum I settled on was 200 plate appearances for a player to be factored into the model. To remove more noise, I only model player performance between full-season leagues (A, A+, AA, AAA). Once the data was cleaned, I had the following number of data points for each level:

  • A to A+ : 1129
  • A+ to AA: 1023
  • AA to AAA: 705

Correlation and Graphs

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

  • A to A+ : .6301594
  • A+ to AA: .6141332
  • AA to AAA: .620662

Here are the graphs for each level:

[Figures: walk rate graphs for A to A+, A+ to AA, and AA to AAA]

Model and Results

The linear models for each level are:

  • A to A+: A+ BB% = .63184*(A BB%) + .02882
  • A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
  • AA to AAA: AAA BB% = .5682*(AA BB%) + .0342

In order to interpret the success or failure of my results I compared how close I was to getting the actual walk rate. Fangraphs has a great rating scale for walk rate at the Major League level:

[Figure: Fangraphs walk rate rating scale. Image from Fangraphs]

The image above gives a classification for multiple levels of walk rate. While it is based on major league data, it’s a good starting point for deciding on a margin of error for my model. The mean difference between adjacent levels in the Fangraphs table is .0183333, which I rounded to a margin of error of .02. So if my predicted value for a player’s walk rate was within .02 of the actual value, I counted the model as correct for that player, and if the error was greater than that, I counted it as wrong. Here are the model’s results for each level:

  • A to A+
    • Incorrect: 450
    • Correct: 679
    • Percentage Correct: ~.6014
  • A+ to AA
    • Incorrect: 445
    • Correct: 578
    • Percentage Correct: ~.565
  • AA to AAA
    • Incorrect: 278
    • Correct: 427
    • Percentage Correct: ~.6056
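
The .02 margin used for these results can be reproduced from the rating scale. The cutoffs below are assumed values for the Fangraphs walk rate table (Excellent through Awful), since the table itself only appears as an image here; they do reproduce the .0183333 figure quoted above.

```r
# Assumed Fangraphs BB% cutoffs, from Excellent down to Awful
bb_cutoffs <- c(excellent = 0.150, great = 0.125, above_average = 0.100,
                average = 0.080, below_average = 0.070, poor = 0.055,
                awful = 0.040)

# Gap between each pair of adjacent classifications
gaps <- abs(diff(bb_cutoffs))
mean_gap <- mean(gaps)

# Rounded to two decimals to get the margin of error
margin <- round(mean_gap, 2)
print(c(mean_gap = mean_gap, margin = margin))
```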

When I moved the cutoff up a percentage point to .03, the model’s results improved drastically:

  • A to A+
    • Incorrect: 228
    • Correct: 901
    • Percentage Correct: ~.798
  • A+ to AA
    • Incorrect: 246
    • Correct: 777
    • Percentage Correct: ~.7595
  • AA to AAA
    • Incorrect: 144
    • Correct: 561
    • Percentage Correct: ~.7957

Examples

Numbers are cool, but where are the actual examples? OK, let’s start with my worst prediction. The largest error I had was for A to A+, and the error was more than 10 percentage points (~.1105). The player in this case was Joey Gallo. A quick glance at his player page shows his A walk rate was only .1076 and his A+ walk rate was .2073, an improvement of about 10 percentage points between levels. So why did this happen, and why didn’t my model do a better job of predicting it? Currently the model only accounts for the previous season’s walk rate, but what if a player gets a lot of hits at one level and stops swinging as much at the next? Gallo only had a .245 BA in his year at A ball, so that wasn’t the case here. More investigation is required to see how the model can get closer on edge cases like this.

[Figure: Gallo dataframe snippet]
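
The Gallo miss can be worked through by hand with the A to A+ model from this post. Only the helper and variable names below are made up; the rates and coefficients are the ones quoted above.

```r
# A to A+ walk rate model from this post
predict_aplus_bb <- function(a_bb_rate) 0.63184 * a_bb_rate + 0.02882

gallo_a_bb <- 0.1076      # walk rate at A
gallo_aplus_bb <- 0.2073  # walk rate at A+

predicted <- predict_aplus_bb(gallo_a_bb)
error <- gallo_aplus_bb - predicted
print(round(c(predicted = predicted, error = error), 4))
```

The model expects a walk rate of about .0968 at A+, so the actual .2073 is roughly .1105 higher than predicted.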

The lowest I was able to set the error to and still get results back was ~.00004417. That very close prediction belongs to Erik Gonzalez. I don’t know Erik Gonzalez, so I kept looking: setting the error threshold to .0002 brought back Stephen Lombardozzi as one of six results. Lombo is interesting to hard-core Nats fans (like myself), but I wanted a more notable name. Finally, after upping the threshold to .003 for the A to A+ data, I saw that the model predicted the walk rate of Jose Altuve, the Houston Astros’ multi-time All-Star 2B, within a .003 margin of error.

[Figure: Altuve dataframe snippet]

 

What’s Next:

  • Improve the model to get a lower max error
  • Predict strikeout rate between levels
  • Predict more advanced statistics like wOBA/OPS/wRC

 

Enriching Datasets with R

If you have a simple dataset and some additional statistics you’d like to add to it, you can easily do that with R. As an example, I’m going to add FIP, wOBA, wRC, and wRAA to a couple of baseball datasets.

To calculate FIP I first needed the following R functions:

[code language=”r”]
#Calculate FIP Constant
fip_constant_calc <- function(pitching_data){
  #FIP Constant = lgERA - (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
  era = sum(pitching_data["ER"])/sum(pitching_data["IP"]) * 9
  lgHR = sum(pitching_data["HR"])
  lgBB = sum(pitching_data["BB"])
  lgHBP = sum(pitching_data["HBP"])
  lgK = sum(pitching_data["SO"])
  lgIP = sum(pitching_data["IP"])

  fipConstant = era - (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
  #print(paste("ERA", era))
  return(fipConstant)
}

#Calculate FIP for a single row (one pitcher)
fip_calc <- function(pitching_data, fipConstant){
  #FIP = ((13*HR)+(3*(BB+HBP))-(2*K))/IP + constant
  hr = as.numeric(pitching_data["HR"])
  bb = as.numeric(pitching_data["BB"])
  hbp = as.numeric(pitching_data["HBP"])
  k = as.numeric(pitching_data["SO"])
  ip = as.numeric(pitching_data["IP"])

  fip = ((13*hr)+(3*(bb+hbp))-2*(k))/ip + fipConstant

  return(fip)
}
[/code]

Once you have the necessary functions to calculate the stat you can use apply in R to apply the function to your dataframe.

 

[code language=”r”]
#Get Data
fg_pitching_data = read.csv("data/2007/FG_MLB_Pitching_Std_All.csv")
constant <- fip_constant_calc(fg_pitching_data)
fg_pitching_data$fip <- apply(fg_pitching_data, 1, fip_calc, fipConstant=constant)
[/code]
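
Because the FIP formula only uses arithmetic on columns, it can also be computed on whole columns at once instead of row by row. This sidesteps the as.numeric calls, which the row-wise version needs because apply converts the data frame to a matrix (a character matrix if any column is non-numeric). Here is a minimal sketch with a made-up two-pitcher data frame; fip_vectorized is a hypothetical name, not part of the original code.

```r
fip_vectorized <- function(pitching_data) {
  # Shared FIP component: (13*HR + 3*(BB+HBP) - 2*K) / IP
  component <- function(hr, bb, hbp, so, ip) (13 * hr + 3 * (bb + hbp) - 2 * so) / ip

  # League ERA minus the league-wide component gives the FIP constant
  lg_era <- 9 * sum(pitching_data$ER) / sum(pitching_data$IP)
  constant <- lg_era - with(pitching_data,
                            component(sum(HR), sum(BB), sum(HBP), sum(SO), sum(IP)))

  # Per-pitcher FIP, computed for every row in one pass
  with(pitching_data, component(HR, BB, HBP, SO, IP)) + constant
}

# Made-up pitching lines for illustration
pitching <- data.frame(Name = c("Pitcher A", "Pitcher B"),
                       ER = c(30, 50), IP = c(90, 100), HR = c(10, 15),
                       BB = c(30, 40), HBP = c(3, 5), SO = c(90, 80))
pitching$fip <- fip_vectorized(pitching)
print(pitching[, c("Name", "fip")])
```

A quick sanity check on the constant: the innings-weighted average of FIP across the data equals the league ERA.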

Here are some R functions to calculate wOBA, wrc, and wraa.

[code language=”r”]
#Calculate wOBA based on weights using FG formula
woba_calc_weights_fg <- function(row, weights){
  bb <- as.numeric(row["BB"])
  hbp <- as.numeric(row["HBP"])
  doubles <- as.numeric(row["X2B"])
  triples <- as.numeric(row["X3B"])
  hr <- as.numeric(row["HR"])
  hits <- as.numeric(row["H"])
  singles <- hits - triples - doubles - hr
  ab <- as.numeric(row["AB"])
  ibb <- as.numeric(row["IBB"])
  sf <- as.numeric(row["SF"])

  #Weighted sum of each offensive event (intentional walks excluded)
  numerator <- as.numeric(weights["wobaBB"])*(bb-ibb) +
    as.numeric(weights["wobaHB"])*hbp +
    as.numeric(weights["woba1B"])*singles +
    as.numeric(weights["woba2B"])*doubles +
    as.numeric(weights["woba3B"])*triples +
    as.numeric(weights["wobaHR"])*hr
  denom <- ab + bb - ibb + sf + hbp

  return(numerator/denom)
}

#http://www.fangraphs.com/library/offense/wraa/
wraa_calc_fg <- function(row, weights){
numerator <- as.numeric(row["woba"]) - as.numeric(weights["woba"])
denom <- as.numeric(weights["wobaScale"])

result = (numerator/denom) * (as.numeric(row["PA"]))

return(result)
}

wrc_calc <- function(row, weights){
woba = as.numeric(row["woba"])
lgWOBA = as.numeric(weights["woba"])
wobaScale = as.numeric(weights["wobaScale"])
lgR = as.numeric(weights["lgR"])
lgPA = as.numeric(weights["lgPA"])
pa = as.numeric(row["PA"])
wrc = (((woba-lgWOBA)/wobaScale)+(lgR/lgPA))*pa

return(wrc)
}
[/code]

Once you have the functions necessary in place you can add the statistics to your dataset using this code:

[code language=”r”]
fg_batting_data = read.csv("data/2007/FG_MLB_Batting_Std_All.csv")
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
fg_batting_data$woba = apply(fg_batting_data, 1, woba_calc_weights_fg, weights=wobaWeights)
fg_batting_data$wraa = apply(fg_batting_data, 1, wraa_calc_fg, weights=wobaWeights)
fg_batting_data$wrc = apply(fg_batting_data, 1, wrc_calc, weights=wobaWeights)
[/code]

Note: ‘get_woba_weights’ was addressed in the Learning wOBA blog post.

Why Reynaldo?

The Nationals sent Lucas Giolito back down to the minors and have called up Reynaldo Lopez for his Major League debut tomorrow, so I decided to take a look at possible reasons for that decision. Giolito did OK in his first, rain-shortened start, giving up only 1 hit in 4 innings, but the 2 BB were a little concerning, especially since he has had a pattern of walking people this season, to the tune of 4.3 BB/9 in the Eastern League. That BB/9 currently ranks in the bottom 10 in the Eastern League among pitchers with more than 50 IP. Giolito’s control problem reared its ugly head again in his second start, when he gave up 4 BB and 4 ER in 3.2 innings of work. The Major Leagues aren’t the place for a young pitcher to work out control issues, so the Nationals made a smart decision and sent Giolito down to figure them out. Last year in A+ Giolito was only walking 2.58 batters per 9 innings; if he can get his walk rate back down to those levels I’m sure he’ll be back up in no time. With that out of the way, let’s look at the three best Nationals pitching prospects (Austin Voth, Reynaldo Lopez and Giolito) to figure out ‘Why Reynaldo’:

[Table: pitching comparison for Voth, Lopez, and Giolito]

As you can see in the table above, they all have pretty similar ERAs, but Reynaldo is outpacing his two counterparts in K/9, K/BB, and FIP. Voth and Reynaldo have comparable WHIPs, and Voth is doing the best at stranding runners on base. Then again, when you’re striking out as many people as Reynaldo, you probably don’t have too many runners on. Here’s a look at all three players’ ERAs over their last 7 minor league starts:

So why Reynaldo? Well, the answer seems clear to me. He’s been the best-performing Nationals minor league pitcher thus far this season and has earned the start. I’m looking forward to seeing Reynaldo’s debut tomorrow; hopefully he keeps striking people out and keeps the BBs at a manageable rate.

Learning wOBA

As I continue to learn R and go down the road of becoming a data scientist, I need to learn how to use and compute advanced statistics. The first one I’m going to learn how to compute is weighted on-base average (wOBA). wOBA combines all the parts of a player’s offensive game and weights each by its impact on the game. For example, a HR is given more weight than a BB or a single because a HR guarantees a team at least one run, while a BB or single leads to a run a much lower percentage of the time. Traditional statistics like BA/SLG/OBP each look at only a piece of a player’s offensive game, which is why wOBA is a better tool for looking at a player’s offensive contribution. Fangraphs does a great job of describing the details in the following links:

The links above give a lot of insight on why you should use wOBA. Tom Tango describes the formula in The Book. His site also has standard wOBA formulas both with and without speed. I used the standard formulas from his site for my initial calculations in R.

The following R code has the functions I use to calculate speed and non speed wOBA:

 

[code language=”r”]
#Link for standard WOBA formula
#http://tangotiger.com/index.php/site/article/standard-woba
woba_calc_basic <- function(row){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits - triples - doubles - hr
pa <- as.numeric(row["PA"])
ibb <- as.numeric(row["IBB"])
sh <- as.numeric(row["SH"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr
denominator <- pa - ibb - sh

return(numerator/denominator)
}

woba_calc_speed <- function(row){
bb <- as.numeric(row["BB"])
hbp <- as.numeric(row["HBP"])
doubles <- as.numeric(row["X2B"])
triples <- as.numeric(row["X3B"])
hr <- as.numeric(row["HR"])
hits <- as.numeric(row["H"])
singles <- hits - triples - doubles - hr
pa <- as.numeric(row["PA"])
ibb <- as.numeric(row["IBB"])
sh <- as.numeric(row["SH"])
sb <- as.numeric(row["SB"])
cs <- as.numeric(row["CS"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr + .25*sb + -.5*cs
denominator <- pa - ibb - sh

return(numerator/denominator)
}
[/code]

Once I had those two formulas in place, I retrieved some Nationals data from baseball-reference. Then I used the code below to apply the two functions to a dataframe.

 

[code language=”r”]
data = read.csv("data/FG_MLB_Batting_Std_All.csv")

#Apply basic woba
data$woba_basic = apply(data, 1, woba_calc_basic)

#Apply speed woba
data$woba_speed = apply(data, 1, woba_calc_speed)

[/code]

Once I had that in place, I noticed that Fangraphs kept mentioning ‘wOBA scale’ and how it and wOBA could be used to generate another stat called Weighted Runs Above Average. After doing some digging I found an old article written by Tom Tango on how to compute yearly wOBA from scratch. Above I’ve just used some standard weights to compute a standard wOBA, but standard wOBA doesn’t take into account the yearly offensive environment that players are playing in. Basically, in some years an offensive action is worth more than in others, based on how the whole league is performing. Tom Tango’s article teaches you how to get the yearly weights and calculate wOBA scale. Continuing down this R and wOBA rabbit hole, I transcribed his SQL script into R functions.

To compute the offensive environment (Step 1 and Step 2 in Tom Tango’s script), I use this function on pitching data:

 

[code language=”r”]
get_runvalues <- function(pitching_data){
#Calculate runs per out
stat <- sum(pitching_data["R"])/(sum(pitching_data["IP"])*3)
rPerOut <- c(stat)

#Calculate runs per bb
stat <- rPerOut + .14
rPerBB <- c(stat)

#Runs per HB
stat <- rPerBB + .025
rPerHB <- c(stat)

#Runs Per Single
stat <- rPerBB + .155
rPer1B <- c(stat)

#Runs Per Double
stat <- rPer1B + .3
rPer2B <- c(stat)

#Runs Per Triple
stat <- rPer2B + .27
rPer3B <- c(stat)

#Runs per HR
rHR <- 1.4

rSB <- .2

rCS <- 2*rPerOut + .075

df <- data.frame(rPerOut, rPerBB, rPerHB, rPer1B, rPer2B, rPer3B, rHR, rSB, rCS)

return(df)
}

[/code]

Step 3 in the SQL script uses the run values to generate the weights needed to calculate wOBA. The following R functions accomplish this:

[code language=”r”]
get_woba_runsMinus <- function(batting, runValues){
  #Run value of each event type, summed over the league
  bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) - sum(batting["IBB"]))
  hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

  singles <- sum(batting["H"]) - sum(batting["X2B"]) - sum(batting["X3B"]) - sum(batting["HR"])
  singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
  doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
  tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
  hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
  sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
  csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

  numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator - csNumerator

  #print(paste("numerator is, ", numerator))

  #Outs: AB - H + SF
  denom <- sum(batting["AB"]) - sum(batting["H"]) + sum(batting["SF"])
  #print(paste("denominator is, ", denom))

  return(numerator/denom)
}

get_woba_runsPlus <- function(batting, runValues){
  #Same numerator as runsMinus; should probably combine and output a df
  bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) - sum(batting["IBB"]))
  hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

  singles <- sum(batting["H"]) - sum(batting["X2B"]) - sum(batting["X3B"]) - sum(batting["HR"])
  singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
  doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
  tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
  hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
  sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
  csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

  numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator - csNumerator
  #Times on base: unintentional BB + HBP + H
  denom <- (sum(batting["BB"]) - sum(batting["IBB"])) + sum(batting["HBP"]) + sum(batting["H"])

  return(numerator/denom)
}

woba_calc_lgAvgSpeed <- function(batting, runValues){
  numerator <- sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"]) + sum(batting["SB"]) - sum(batting["CS"])
  denom <- sum(batting["AB"]) + sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

  return(numerator/denom)
}

woba_calc_lgAvgReg <- function(batting, runValues){
  numerator <- sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"])
  denom <- sum(batting["AB"]) + sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

  return(numerator/denom)
}

get_woba_weights <- function(pitching, batting){
#Get the run values
run_values <- get_runvalues(pitching)

#Compute runs minus
runs_minus <- get_woba_runsMinus(batting, run_values)

runs_plus <- get_woba_runsPlus(batting, run_values)

#woba <- woba_calc_lgAvgSpeed(batting, run_values)
woba <- woba_calc_lgAvgReg(batting, run_values)

wobaScale <- 1/(runs_plus + runs_minus)

wobaBB <- (as.numeric(run_values["rPerBB"]) + runs_minus)*wobaScale
wobaHB <- (as.numeric(run_values["rPerHB"]) + runs_minus)*wobaScale
woba1B <- (as.numeric(run_values["rPer1B"]) + runs_minus)*wobaScale
woba2B <- (as.numeric(run_values["rPer2B"]) + runs_minus)*wobaScale
woba3B <- (as.numeric(run_values["rPer3B"]) + runs_minus)*wobaScale
wobaHR <- (as.numeric(run_values["rHR"]) + runs_minus)*wobaScale
wobaSB <- as.numeric(run_values["rSB"])*wobaScale
wobaCS <- as.numeric(run_values["rCS"])*wobaScale

df <- data.frame(run_values, runs_minus, runs_plus, woba, wobaScale, wobaBB, wobaHB, woba1B, woba2B, woba3B, wobaHR, wobaSB, wobaCS)

return(df)
}
[/code]

With all this in place I now needed to test my functions against the table he has here . The data I ended up using for the test is from Fangraphs . The code below calculates the wOBA weights:

[code language=”r”]
fg_batting_data = read.csv("data/FG_MLB_Batting_Std_All.csv")
fg_pitching_data = read.csv("data/FG_MLB_Pitching_Std_All.csv")
min_pa = subset(fg_batting_data, PA > 80)
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
wobaWeightsFiltered = get_woba_weights(fg_pitching_data, min_pa)
wobaWeights
[/code]

Here’s the output for the wOBA weights for 2007:

[code]
rPerOut rPerBB rPerHB rPer1B rPer2B rPer3B rHR rSB rCS
1 0.1796024 0.3196024 0.3446024 0.4746024 0.7746024 1.044602 1.4 0.2 0.4342048
runs_minus runs_plus woba wobaScale wobaBB wobaHB woba1B woba2B
1 0.2792543 0.5642809 0.3310523 1.185487 0.7099368 0.739574 0.8936873 1.249333
woba3B wobaHR wobaSB wobaCS
1 1.569415 1.990734 0.2370974 0.5147441
[/code]

These numbers line up with the numbers from Tom Tango’s table for 2007 that I linked above. Here’s a side-by-side look at the numbers:

label | original calc | my calc
year | 2007 | 2007
rPerOut | .179 | 0.1796024
rPerBB | .32 | .3196024
rPerHB | .34 | .3446024
rPer1B | .47 | .4746024
rPer2B | .77 | .7746024
rPer3B | 1.04 | 1.044602
rHR | 1.40 | 1.4
rSB | .2 | .2
rCS | .43 | .4342048
rMinus | .28 | .2792543
rPlus | .56 | .5642809
wOBA | .331 | .3310523
wobaScale | 1.19 | 1.185487
wobaBB | .71 | .7099368
wobaHB | .74 | .739574
woba1B | .89 | .8936873
woba2B | 1.25 | 1.249333
woba3B | 1.57 | 1.569415
wobaHR | 1.99 | 1.990734
wobaSB | .24 | .2370974
wobaCS | .51 | .5147441

After rounding, the weights I produced with the R code above line up with Tom Tango’s original chart for 2007. Since I’m now able to calculate wOBA weights from scratch, I’ll be able to calculate wRAA and apply wOBA in other offensive environments.
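
As a small worked example of where this leads, wRAA can be computed directly from the league wOBA and wOBA scale in the 2007 output above. The formula is the Fangraphs one, ((wOBA - league wOBA) / wOBA scale) * PA; the wraa helper name and the three hitters are hypothetical.

```r
# League wOBA and wOBA scale from the 2007 output above
lg_woba <- 0.3310523
woba_scale <- 1.185487

# wRAA = ((wOBA - league wOBA) / wOBA scale) * PA
wraa <- function(woba, pa, lg_woba, woba_scale) {
  (woba - lg_woba) / woba_scale * pa
}

# Hypothetical hitters with 600 PA each: above average, average, below average
runs <- wraa(c(0.400, lg_woba, 0.300), pa = 600, lg_woba, woba_scale)
print(round(runs, 1))
```

A league-average hitter comes out at exactly 0 runs, and hitters above or below the league wOBA come out positive or negative in proportion to their playing time.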

Exploratory Data Analysis using Minor League Batting Statistics

As I did with the Nationals’ minor league pitching stats, I wanted to look graphically at their minor league hitting stats per team. I decided to look at how each Nationals minor league team’s OPS compares to its league and level. OPS is a player’s OBP added to their SLG; it measures how well a player is doing offensively when those two metrics are taken together.

Since pitchers also bat, I needed to do some data cleaning or the numbers wouldn’t make sense. To clean the data I removed all players who didn’t have more than 20 plate appearances this season. The original dataset had 3255 data points; after adding that filter I got down to 2384. Here’s the layout by level:

Level | Data Points
SS-A | 273
A | 514
A+ | 507
AA | 518
AAA | 572

For each league at each level I wanted to get the average OPS and compare it to how the Nationals affiliates are doing. In the table below you can see those numbers.

 

League/Team | Level | OPS
New York Penn League | SS-A | .628
Northwest League | SS-A | .653
Auburn Doubledays | SS-A | .599
Midwest | A | .645
South Atlantic | A | .675
Hagerstown Suns | A | .736
California | A+ | .702
Carolina | A+ | .672
Florida State | A+ | .653
Potomac Nationals | A+ | .679
Eastern | AA | .699
Texas | AA | .669
Southern | AA | .678
Harrisburg | AA | .685
Pacific Coast | AAA | .727
International | AAA | .677
Syracuse | AAA | .631

Note: Data covers the season up to 7/1/2016

Here’s a look at the data graphically:

[Figure: OPS by level for each league and the Nationals affiliates, 2016]

Auburn is a small sample size, so I wouldn’t pay too much attention to the short season portion of the graph just yet. Hagerstown is our best-performing offensive team based on OPS. Their team OPS is better than the average OPS for the two leagues at their level (South Atlantic League and Midwest League). Overall Hagerstown (.735) has the second-best team OPS in their league (1st is Asheville at .756) and the third-best OPS for their level (1st is Bowling Green at .764). Harrisburg and Potomac are each performing at a little over league average. On the other end of the spectrum from Hagerstown, Syracuse has a bottom-5 team OPS.

Here’s a look at only the Nationals Minor League affiliates OPS:

[Figure: OPS for the Nationals minor league affiliates only]

In my next blog post I’m going to look at two of the catalysts of the Hagerstown offense, Max Schrock and Victor Robles.

Part 2 Source Code:

[code language=”r”]

#Assumes ggplot2, plyr and reshape2 are loaded and getDfFromDir is defined
#(see the source code in the pitching stats post)

#Summarize
minors_batting <- getDfFromDir("dataDir")
#Only use cases with more than 20 PAs (base subset, so dplyr is not required)
minors_batting <- subset(minors_batting, PA > 20)
minors_batting <- minors_batting[complete.cases(minors_batting),]
summary(minors_batting$Lvl)

#Average OPS per team, per league and per Nationals affiliate
minors_tm_ops <- ddply(minors_batting, .(Tm,Lg,Lvl,Franchise), summarise, ops=mean(OPS))
minors_lg_ops <- ddply(minors_batting, .(Lg,Lvl), summarise, ops=mean(OPS))
aff_ops <- ddply(minors_batting[minors_batting$Franchise=="Washington Nationals",],.(Tm,Lvl), summarise, ops=mean(OPS))

lg_melt_data <- melt(minors_lg_ops)
aff_melt_data <- melt(aff_ops)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(aff_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data,aff_melt_data)
total_melt_data

ops_graph <- ggplot(data=total_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors OPS Per Level")
ops_graph

#Graph of only the nationals
nats_ops_graph <- ggplot(data=aff_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors OPS")
nats_ops_graph
[/code]

Exploratory Data Analysis of Nationals Minor League Pitching Stats

After attending SSAC this year I decided one of the skills I need to pick up is R. Now that I’ve finally finished grad school, I have the time. The best way for me to learn is to get some data I’m interested in. I look up Nationals minor league statistics daily to see how the upcoming Nationals are doing, so minor league data made a lot of sense for me to collect, and doing Exploratory Data Analysis (seeing the data) is necessary before taking the next steps in data science. I was interested in seeing how the Nationals affiliates were doing in comparison to their leagues. For pitching I decided to compare team ERA with the rest of the league. ERA measures how many earned runs a pitcher gives up per nine innings. A low ERA is a good thing: the higher the ERA, the more runs a team is giving up per nine innings.
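
As a small sketch of the ERA calculation: one wrinkle is that innings pitched are usually written in box-score notation, where 3.2 means 3 innings and 2 outs rather than 3.2 innings. The helper below assumes the IP column follows that convention and converts it before dividing; both function names and the three-pitcher staff are made up for illustration.

```r
# Convert box-score innings (e.g. 3.2 = 3 innings and 2 outs) to true innings
ip_to_innings <- function(ip) {
  whole <- floor(ip)
  outs <- round((ip - whole) * 10)  # digit after the decimal point = outs recorded
  whole + outs / 3
}

# ERA = 9 * earned runs / innings pitched
era <- function(er, ip) 9 * sum(er) / sum(ip_to_innings(ip))

# A single outing of 4 ER in 3.2 IP
print(era(4, 3.2))

# Made-up staff of three pitchers
staff <- data.frame(ER = c(20, 12, 5), IP = c(45.1, 30.2, 12.0))
staff_era <- era(staff$ER, staff$IP)
print(staff_era)
```

Summing the raw IP column without converting slightly overstates or understates innings, so team ERAs computed that way will be close but not exact.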

Note: Data covers the season up to 7/1/2016

So how many data points do I actually have per level?

Level | Data Points
SS-A | 396
A | 629
A+ | 737
AA | 684
AAA | 800

For each League at each level I wanted to get the average team ERA and compare it to how the Nationals affiliates are doing. In the below table you can see those numbers.

League/Team | Level | ERA
New York Penn League | SS-A | 3.46
Northwest League | SS-A | 3.71
Auburn Doubledays | SS-A | 3.57
Midwest | A | 3.55
South Atlantic | A | 3.79
Hagerstown Suns | A | 3.73
California | A+ | 4.15
Carolina | A+ | 3.80
Florida State | A+ | 3.49
Potomac Nationals | A+ | 3.61
Eastern | AA | 3.91
Texas | AA | 3.69
Southern | AA | 3.76
Harrisburg | AA | 3.72
Pacific Coast | AAA | 4.50
International | AAA | 3.65
Syracuse | AAA | 3.84

To see that table in another way I also put this data into a graph using R.

[Figure: ERA by level for each league and the Nationals affiliates]

 

And here’s what the teams look like compared side by side in a graph:

[Figure: ERA for the Nationals affiliates only]

When I originally wrote this post a couple of weeks ago, Harrisburg had the best-performing Nationals staff in comparison to their league. That came as no surprise, since the staff was headlined by two of the Nationals’ top pitching prospects, Lucas Giolito and Reynaldo Lopez.

Now, looking at the data up to 7/1, Potomac and Auburn (small sample size) have the best team ERAs overall. For Auburn I probably need to wait a little further into the season before making any statements, since the season just started, but currently their staff ERA is not outperforming league average. Harrisburg and Potomac are both outperforming their respective leagues, and Hagerstown is about even with league average. Syracuse’s ERA is a bit high compared to how other staffs in the International League are performing.

Part 2. Source Code

The data itself was retrieved in CSV format from Baseball-Reference. Each team’s data was put into a folder and then read into a dataframe using the following function:

[code language=”r”]
#Read files from a directory into a dataframe
#http://www.r-bloggers.com/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
getDfFromDir <- function(csvDir){
fileList <- list.files(csvDir)

outputFromDfDir = NULL
for(file in fileList){
fullPath <- paste(csvDir, "/", file, sep="")
#print("File name is")
#print(fullPath)
if(is.null(outputFromDfDir)){
#print("Hit null")
outputFromDfDir <- read.csv(fullPath)
}else{
tmp_ds <- read.csv(fullPath)
outputFromDfDir <- rbind(outputFromDfDir, tmp_ds)
rm(tmp_ds)
#print("Hit else")
}
}

return(outputFromDfDir)
}
[/code]
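
A more compact way to do the same thing is to read every file with lapply and bind the results once at the end. This is only a sketch of an alternative, not the function used for this post; read_csv_dir is a hypothetical name, and the demo writes two tiny made-up CSV files to a temporary folder so it can run on its own.

```r
# Read every .csv in a folder and stack the rows into one data frame
read_csv_dir <- function(csv_dir) {
  paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(paths, read.csv))
}

# Demo with two small made-up files in a temporary folder
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Tm = "Team A", ER = 1), file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Tm = "Team B", ER = 2), file.path(demo_dir, "b.csv"), row.names = FALSE)

combined <- read_csv_dir(demo_dir)
print(combined)
```

Like the loop version, this assumes every file has the same columns. Binding once at the end avoids growing the data frame inside the loop, which tends to get slow as the number of files increases.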

Then to do the graphs and get the data I needed I used the following code:

[code language=”r”]
#install.packages("ggplot2")
#install.packages("plyr")
#install.packages("reshape2")
library(ggplot2)
library(plyr)      #ddply, summarise
library(reshape2)  #melt

#Read in all the minors pitching data from directory
minors_pitching <- getDfFromDir("sport_data/product/all/baseball-reference/2016/csv/pitching/")
summary(minors_pitching$Lvl)

#See if you can compute staff era for each team
minors_tm_era <- ddply(minors_pitching,.(Tm,Lg,Lvl,Franchise), summarise, era=(sum(ER)/sum(IP))*9)
#Remove NA rows
minors_tm_era <- minors_tm_era[complete.cases(minors_tm_era),]
minors_tm_era

#Level Data
minors_lvl_era <- ddply(minors_pitching,.(Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_lvl_era <- minors_lvl_era[complete.cases(minors_lvl_era),]
minors_lvl_era

#League Data
minors_Lg_era <- ddply(minors_pitching,.(Lg,Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_Lg_era <- minors_Lg_era[complete.cases(minors_Lg_era),]
minors_Lg_era

wsh_tm_era <- ddply(minors_pitching[minors_pitching$Franchise=="Washington Nationals",],.(Tm,Lvl), summarise, era=(sum(ER)/sum(IP))*9)
wsh_tm_era

lg_melt_data <- melt(minors_Lg_era)
team_melt_data <- melt(wsh_tm_era)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(team_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data,team_melt_data)
total_melt_data

#Graph for era
era_graph <- ggplot(data=total_melt_data, aes(x=lvl, value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA Per Level")
era_graph

#Graph of only the nationals
nats_era_graph <- ggplot(data=team_melt_data, aes(x=lvl, value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA")
nats_era_graph
[/code]