Modeling Hit Rates Between Minor League Levels

I’m working on modeling hit rates for minor league batters between levels. I’d like to take the hit rates (singles (1B/PA), doubles (2B/PA), triples (3B/PA) and home runs (HR/PA)) a player had at their previous minor league level and use them to predict how the player will do at the next level. The data is similar to what I used in the previous articles on walk rates and strikeout rates. This data set covers 2011-2015, and only players with a minimum of 200 PAs were included in the models. Below are the graphs for each level, the models and some thoughts.

A to A+

A to A+ Hit Rates

A theme throughout the graphs is that the correlations for singles and home runs are high while those for doubles and triples are very low. The same low correlations for doubles and triples were found in previous research by Matt Klaassen at FanGraphs.

Linear models:

  • A+ Single Rate = (A Single Rate)*.53520 + .07452
  • A+ Double Rate = (A Double Rate)*.36379 + .02929
  • A+ Triple Rate = (A Triple Rate)*.403826 + .004743
  • A+ HR Rate = (A HR Rate)*.633131 + .0006235

A+ to AA

A+ to AA Hit Rates

Linear models:

  • AA Single Rate = (A+ Single Rate)*.48235 + .07969
  • AA Double Rate = (A+ Double Rate)*.22680 + .03389
  • AA Triple Rate = (A+ Triple Rate)*.377505 + .003751
  • AA HR Rate = (A+ HR Rate)*.534897 + .007925


AA to AAA

AA to AAA Hit Rates

Linear models:

  • AAA Single Rate = (AA Single Rate)*.52767 + .07912
  • AAA Double Rate = (AA Double Rate)*.248769 + .03645
  • AAA Triple Rate = (AA Triple Rate)*.355865 + .003757
  • AAA HR Rate = (AA HR Rate)*.58037 + .00881
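
As a quick sketch of how these models can be applied, the R snippet below stores the slopes and intercepts listed above and uses them to project a made-up A-ball stat line to A+. The `hit_models` list, the `predict_hit_rates` function and the example rates are illustrative names and numbers only; they are not from the original analysis code.

```r
# Slopes and intercepts copied from the linear models listed in this post.
# The list layout and the transition names are hypothetical.
stats <- c("1B", "2B", "3B", "HR")
hit_models <- list(
  A_Aplus  = data.frame(stat = stats,
                        slope = c(0.53520, 0.36379, 0.403826, 0.633131),
                        intercept = c(0.07452, 0.02929, 0.004743, 0.0006235),
                        stringsAsFactors = FALSE),
  Aplus_AA = data.frame(stat = stats,
                        slope = c(0.48235, 0.22680, 0.377505, 0.534897),
                        intercept = c(0.07969, 0.03389, 0.003751, 0.007925),
                        stringsAsFactors = FALSE),
  AA_AAA   = data.frame(stat = stats,
                        slope = c(0.52767, 0.248769, 0.355865, 0.58037),
                        intercept = c(0.07912, 0.03645, 0.003757, 0.00881),
                        stringsAsFactors = FALSE)
)

# Predict next-level per-PA hit rates from previous-level rates.
# `rates` is a named numeric vector with names 1B, 2B, 3B and HR.
predict_hit_rates <- function(rates, transition) {
  m <- hit_models[[transition]]
  pred <- m$slope * unname(rates[m$stat]) + m$intercept
  setNames(pred, m$stat)
}

# Made-up A-ball line: 15% singles, 5% doubles, 1% triples, 2% HR per PA.
a_rates <- c("1B" = 0.150, "2B" = 0.050, "3B" = 0.010, "HR" = 0.020)
aplus_pred <- predict_hit_rates(a_rates, "A_Aplus")
print(round(aplus_pred, 4))
```

Keeping the coefficients in one table per transition means the same function works for A+ to AA and AA to AAA by changing only the `transition` argument.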

What’s Next:

  • Perform some validation on the above models
  • Combine the models you’ve generated to predict OBP/SLG/OPS
  • Make models that skip levels
  • Make code more efficient so you can do this faster



Modeling Walk Rate between minor league levels

After reading through Projecting X by Mike Podhorzer I decided to try to predict some rate statistics between minor league levels. Mike states in his book, “Projecting rates makes it dramatically easier to adjust a forecast if necessary.” So if a player is injured or will only have a certain number of plate appearances that year, I can still attempt to project performance. The first rate statistic I’m going to attempt to project is walk rate between minor league levels. This article will cover the following:

Raw Data

Data Cleaning

Correlation and Graphs

Model and Results


Raw Data

For my model I used the last 7 years of minor league data (2009-2015) from Baseball Reference. Counting the Short Season A (SS-A) through AAA affiliates, I ended up with 28,316 data points for my analysis.

Data Cleaning

I’m using R, and the original dataframe had each year of a player’s data in a different row. To do the calculations I wanted, I needed to move each player’s career minor league data onto the same row. I also needed to filter on plate appearances in a season to get rid of noise: for example, a player on a rehab assignment in the minor leagues, or a player who was injured for most of the year and only had 50-100 plate appearances. The minimum I settled on was 200 plate appearances for a player to be factored into the model. Another thing I’m doing to remove noise is only modeling player performance between full season leagues (A, A+, AA, AAA). Once the data was cleaned I had the following number of data points for each level:

  • A to A+ : 1129
  • A+ to AA: 1023
  • AA to AAA: 705
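
Here’s a minimal sketch of that cleaning step in base R, assuming a long dataframe with one row per player and level. The column names (`player`, `lvl`, `PA`, `BB`) and the toy rows are hypothetical, and pairing levels with `merge` is one simple way to get a player’s two levels onto the same row; it is not necessarily how the original dataframe was reshaped.

```r
# Toy stand-in for the season-level data: one row per player and level.
seasons <- data.frame(
  player = c("p1", "p1", "p2", "p2", "p3", "p3"),
  lvl    = c("A", "A+", "A", "A+", "A", "A+"),
  PA     = c(450, 380, 60, 410, 300, 250),
  BB     = c(45, 30, 9, 41, 24, 25),
  stringsAsFactors = FALSE
)

# Drop short seasons (rehab stints, injuries) before computing rates.
min_pa <- 200
qualified <- seasons[seasons$PA >= min_pa, ]
qualified$bb_rate <- qualified$BB / qualified$PA

# Put each player's A and A+ walk rates on the same row.
lower <- qualified[qualified$lvl == "A",  c("player", "bb_rate")]
upper <- qualified[qualified$lvl == "A+", c("player", "bb_rate")]
pairs <- merge(lower, upper, by = "player", suffixes = c("_A", "_Aplus"))
print(pairs)
```

In this toy data p2 only had 60 PA at A, so that season is filtered out and p2 drops out of the A to A+ pairs.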

Correlation and Graphs

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

  • A to A+ : .6301594
  • A+ to AA: .6141332
  • AA to AAA: .620662
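
For reference, a correlation like the ones above can be computed with base R’s `cor`, which uses the Pearson correlation by default (I’m assuming that is the measure reported here). The walk rates below are made-up values just to show the call.

```r
# Hypothetical paired walk rates: same players at consecutive levels.
bb_A     <- c(0.060, 0.085, 0.110, 0.072, 0.130, 0.095)
bb_Aplus <- c(0.070, 0.080, 0.105, 0.065, 0.120, 0.100)

# Pearson correlation between the lower-level and higher-level rates.
r <- cor(bb_A, bb_Aplus)
print(r)

# Share of the variance in the A+ rate explained by a straight-line fit.
print(r^2)
```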

Here are the graphs for each level:




Model and Results

The linear models for each level are:

  • A to A+: A+ BB% = .63184*(A BB%) + .02882
  • A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
  • AA to AAA: AAA BB% = .5682*(AA BB%) + .0342
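
As a sketch of how a model like this can be fit and used, the snippet below runs `lm` on synthetic data generated around the A to A+ line and then predicts the A+ walk rate for a player with a 10% walk rate in A. The data is simulated purely for illustration, so the fitted coefficients will only be close to, not equal to, the ones listed above.

```r
set.seed(42)

# Synthetic pairs scattered around the A to A+ line from this post.
bb_A     <- runif(300, min = 0.04, max = 0.16)
bb_Aplus <- 0.63184 * bb_A + 0.02882 + rnorm(300, sd = 0.02)

# Fit the simple linear model: A+ BB% as a function of A BB%.
fit <- lm(bb_Aplus ~ bb_A)
print(coef(fit))

# Prediction from the fitted model for a player with a 10% walk rate in A.
new_player  <- data.frame(bb_A = 0.10)
fitted_pred <- predict(fit, newdata = new_player)

# The same prediction using the coefficients listed in the post.
published_pred <- 0.63184 * 0.10 + 0.02882
print(published_pred)
```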

To judge the success or failure of my results I looked at how close each prediction was to the actual walk rate. FanGraphs has a great rating scale for walk rate at the Major League level:

Image from Fangraphs 

The image above classifies walk rates into several tiers. While it is based on major league data, it’s a good starting point for deciding on a margin of error for my model. The mean difference between adjacent tiers in the FanGraphs table is .0183333, which I rounded to a margin of error of .02. So if my predicted value for a player’s walk rate was within .02 of the actual value I counted the model as correct for that player, and if the error was greater than that it was wrong. Here are the model’s results for each level:

  • A to A+
    • Incorrect: 450
    • Correct: 679
    • Percentage Correct: ~.6014
  • A+ to AA
    • Incorrect: 445
    • Correct: 578
    • Percentage Correct: ~.565
  • AA to AAA
    • Incorrect: 278
    • Correct: 427
    • Percentage Correct: ~.6056
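
The scoring rule above can be sketched in a few lines of R. The `within_margin` function and the toy vectors are hypothetical, and I’m assuming “within .02” means an absolute error of at most .02.

```r
# Count predictions whose absolute error is at most `margin`.
within_margin <- function(predicted, actual, margin = 0.02) {
  hit <- abs(predicted - actual) <= margin
  c(correct = sum(hit), incorrect = sum(!hit), pct_correct = mean(hit))
}

# Made-up actual and predicted walk rates for five players.
actual    <- c(0.080, 0.120, 0.050, 0.150, 0.095)
predicted <- c(0.085, 0.095, 0.060, 0.110, 0.100)

res02 <- within_margin(predicted, actual, margin = 0.02)
res03 <- within_margin(predicted, actual, margin = 0.03)
print(res02)
print(res03)

# The percentages in the post follow from the counts, e.g. A to A+:
a_to_aplus_pct <- 679 / (679 + 450)
print(round(a_to_aplus_pct, 4))
```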

When I moved the cutoff up a percentage point to .03 the model’s results improved drastically:

  • A to A+
    • Incorrect: 228
    • Correct: 901
    • Percentage Correct: ~.798
  • A+ to AA
    • Incorrect: 246
    • Correct: 777
    • Percentage Correct: ~.7595
  • AA to AAA
    • Incorrect: 144
    • Correct: 561
    • Percentage Correct: ~.7957


Numbers are cool, but where are the actual examples? OK, let’s start with my worst prediction. The largest error I had between levels was A to A+, and the error was more than 10 percentage points (~.1105). The player in this case was Joey Gallo. A quick glance at his player page shows his A walk rate was only .1076 and his A+ walk rate was .2073, an improvement of about 10 percentage points between levels. So why did this happen, and why didn’t my model do a better job of predicting it? Currently the model only accounts for the previous season’s walk rate, but what if a player is getting a lot of hits at one level and stops swinging as much at the next? In Gallo’s case he only had a .245 BA in his year at A ball, so that wasn’t the case. More investigation is required to see how the model can get closer on edge cases like this.
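
To make the Gallo miss concrete, here is the arithmetic as a short R snippet. It plugs the walk rates quoted above into the A to A+ model from this post; the variable names are just for illustration.

```r
# Walk rates quoted in the text for Joey Gallo.
a_bb     <- 0.1076  # A
aplus_bb <- 0.2073  # A+

# A to A+ model from this post: A+ BB% = .63184*(A BB%) + .02882
predicted <- 0.63184 * a_bb + 0.02882
error     <- aplus_bb - predicted

print(round(predicted, 4))
print(round(error, 4))
```

The model projects a walk rate just under 10% while the actual rate was over 20%, which is where the ~.1105 error comes from.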

Gallo Dataframe Snippet

The lowest I was able to set the error threshold to and still get results back was ~.00004417. That very close prediction belongs to Erik Gonzalez. I don’t know Erik Gonzalez, so I kept looking; setting the threshold to .0002 brought back Stephen Lombardozzi as one of six results. Lombo’s interesting to hard core Nats fans (like myself), but I wanted to find a more notable name. Finally, after raising the threshold to .003 for the A to A+ data, I saw that the model predicted the walk rate of Jose Altuve, the Houston Astros’ multi-time All-Star 2B, within a .003 margin of error.

Altuve Dataframe snippet


What’s Next:

  • Improve model to get a lower max error
  • Predict strikeout rate between levels
  • Predict more advanced statistics like wOBA/OPS/wRC


Why Reynaldo?

The Nationals sent Lucas Giolito back down to the minors and have called up Reynaldo Lopez for his Major League debut tomorrow, so I decided to take a look at possible reasons for that decision. Giolito did OK in his first, rain-shortened start, giving up only 1 hit in 4 innings, but the 2 BBs were a little concerning, especially since he has had a pattern of walking people this season, to the tune of 4.3 BB/9 in the Eastern League. That BB/9 currently ranks in the bottom 10 in the Eastern League among pitchers with >50 IP. Giolito’s control problem reared its ugly head again in his second start, when he gave up 4 BB and 4 ER in 3.2 innings of work. The Major Leagues aren’t the place for a young pitcher to work out control issues, so the Nationals made a smart decision and sent Giolito down to figure them out. Last year in A+ Giolito was only walking 2.58 batters per 9 innings; if he can get his walk rate back down to that level I’m sure he’ll be back up in no time. With that out of the way, let’s look at the three best Nationals pitching prospects (Austin Voth, Reynaldo Lopez and Giolito) to figure out ‘Why Reynaldo’:


As you can see in the table above, they all have pretty similar ERAs, but Reynaldo is outpacing his two counterparts in K/9, K/BB and FIP. Voth and Reynaldo have comparable WHIPs, and Voth is doing the best at stranding runners on base. Then again, when you’re striking out as many people as Reynaldo you probably don’t have too many runners on. Here’s a look at all three players’ ERAs over their last 7 minor league starts:

So why Reynaldo? Well, the answer seems clear to me. He’s been the best performing Nationals minor league pitcher thus far this season and has earned the start. I’m looking forward to seeing Reynaldo’s debut tomorrow; hopefully he keeps striking people out and keeps the BBs at a manageable rate.

Exploratory Data Analysis using Minor League Batting Statistics

Similar to graphically looking at Nationals minor league pitching stats, I wanted to do the same with their minor league hitting stats per team. I decided to look at how each Nationals minor league team’s OPS compares to its league and level. OPS is a player’s OBP added to their SLG; it measures how well a player is doing offensively when those two metrics are taken into account.

Since pitchers also bat, I needed to do some data cleaning or the numbers wouldn’t make sense. To clean the data I removed all players from the data set who didn’t have more than 20 plate appearances this season. The original data set had 3255 data points. After adding that filter I got down to 2384 data points. Here’s the layout by level:

Level   Data Points
SS-A    273
A       514
A+      507
AA      518
AAA     572

For each league at each level I wanted to get the average team OPS and compare it to how the Nationals affiliates are doing. In the table below you can see those numbers.


League/Team             Level   OPS
New York Penn League    SS-A    .628
Northwest League        SS-A    .653
Auburn Doubledays       SS-A    .599
Midwest                 A       .645
South Atlantic          A       .675
Hagerstown Suns         A       .736
California              A+      .702
Carolina                A+      .672
Florida State           A+      .653
Potomac Nationals       A+      .679
Eastern                 AA      .699
Texas                   AA      .669
Southern                AA      .678
Harrisburg              AA      .685
Pacific Coast           AAA     .727
International           AAA     .677
Syracuse                AAA     .631

Note: Data covers the season up to 7/1/2016

Here’s a look at the data graphically:


Auburn is a small sample size so I wouldn’t pay too much attention to the short season portion of the graph just yet. Hagerstown is our best performing offensive team based on OPS. Their team OPS is better than the average OPS for the two leagues at their level (South Atlantic League and Midwest League). Overall Hagerstown (.735) has the second best team OPS in their league (1st is Asheville at .756) and the third best OPS at their level (1st is Bowling Green at .764). Harrisburg and Potomac are each performing at a little over league average. On the other end of the spectrum from Hagerstown, Syracuse has a bottom 5 team OPS.

Here’s a look at only the Nationals Minor League affiliates OPS:


In my next blog post I’m going to look at two of the catalysts of the Hagerstown offense, Max Schrock and Victor Robles.

Part 2 Source Code:

[code language="r"]
#Packages used below: plyr (ddply), reshape2 (melt), ggplot2 (ggplot)
#dplyr is only needed for filter, so it is called with its namespace
library(plyr)
library(reshape2)
library(ggplot2)

#getDfFromDir reads every csv in a directory into one dataframe
#(the function is listed in the pitching stats post)
minors_batting <- getDfFromDir("dataDir")
#Only use cases with more than 20 PAs
minors_batting <- dplyr::filter(minors_batting, PA>20)
minors_batting <- minors_batting[complete.cases(minors_batting),]

#Average player OPS per team, per league and per Nationals affiliate
minors_tm_ops <- ddply(minors_batting, .(Tm,Lg,Lvl,Franchise), summarise, ops=mean(OPS))
minors_lg_ops <- ddply(minors_batting, .(Lg,Lvl), summarise, ops=mean(OPS))
aff_ops <- ddply(minors_batting[minors_batting$Franchise=="Washington Nationals",],.(Tm,Lvl), summarise, ops=mean(OPS))

lg_melt_data <- melt(minors_lg_ops)
aff_melt_data <- melt(aff_ops)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(aff_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data,aff_melt_data)

#Graph of leagues and affiliates side by side at each level
ops_graph <- ggplot(data=total_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors OPS Per Level")
print(ops_graph)

#Graph of only the nationals
nats_ops_graph <- ggplot(data=aff_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors OPS")
print(nats_ops_graph)
[/code]

Exploratory Data Analysis of Nationals Minor League Pitching Stats

After attending SSAC this year I decided one of the skills I need to pick up is R. Well, after finally finishing grad school I finally have time. The best way for me to learn is to work with data I’m interested in. I look up Nationals minor league statistics daily to see how the upcoming Nationals are doing, so minor league data made a lot of sense for me to collect, and exploratory data analysis (seeing the data) is necessary before taking the next steps in data science. I was interested in seeing how the Nationals affiliates were doing in comparison to their leagues. For pitching I decided to compare team ERA with the rest of the league. ERA is the number of earned runs a pitcher gives up per nine innings. A low ERA is a good thing: the higher the ERA, the more runs a team is giving up per nine innings.
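
As a small worked example of the ERA formula (9 times earned runs divided by innings pitched), here is a sketch in R. One caveat I’m assuming about the data: box-score style innings such as 30.1 and 60.2 mean 30 1/3 and 60 2/3 innings, so the hypothetical `ip_to_decimal` helper converts them before summing. The pitcher lines below are made up.

```r
# Convert box-score innings (5.1 = 5 1/3, 5.2 = 5 2/3) to decimal innings.
ip_to_decimal <- function(ip) {
  whole <- floor(ip)
  outs  <- round((ip - whole) * 10)  # 0, 1 or 2 extra outs
  whole + outs / 3
}

# ERA = 9 * earned runs / innings pitched.
era <- function(er, ip) {
  9 * sum(er) / sum(ip_to_decimal(ip))
}

# Made-up staff of three pitchers.
er <- c(10, 25, 4)
ip <- c(30.1, 60.2, 12.0)

staff_era <- era(er, ip)
print(round(staff_era, 2))
```

Summing earned runs and innings across the staff before dividing weights each pitcher by how much he actually pitched, which is different from averaging the individual ERAs.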

Note: Data covers the season up to 7/1/2016

So how many data points do I actually have per level?

Level   Data Points
SS-A    396
A       629
A+      737
AA      684
AAA     800

For each League at each level I wanted to get the average team ERA and compare it to how the Nationals affiliates are doing. In the below table you can see those numbers.

League/Team             Level   ERA
New York Penn League    SS-A    3.46
Northwest League        SS-A    3.71
Auburn Doubledays       SS-A    3.57
Midwest                 A       3.55
South Atlantic          A       3.79
Hagerstown Suns         A       3.73
California              A+      4.15
Carolina                A+      3.80
Florida State           A+      3.49
Potomac Nationals       A+      3.61
Eastern                 AA      3.91
Texas                   AA      3.69
Southern                AA      3.76
Harrisburg              AA      3.72
Pacific Coast           AAA     4.50
International           AAA     3.65
Syracuse                AAA     3.84

To see that table in another way I also put this data into a graph using R.



And here’s what the teams look like compared side by side in a graph:


When I originally wrote this post a couple of weeks ago, Harrisburg had the best performing Nationals staff in comparison to their league. That came as no surprise, since the staff was headlined by two of the Nationals’ top pitching prospects, Lucas Giolito and Reynaldo Lopez.

Now, looking at the data up to 7/1, Potomac and Auburn (small sample size) have the best team ERAs overall. With Auburn we probably need to wait until a little later in the season before making any statements, since their season just started, but currently their staff ERA is not outperforming league average. Harrisburg and Potomac are both outperforming their respective leagues and Hagerstown is about even with league average. Syracuse’s ERA is a bit high compared to how other staffs in the International League are performing.

Part 2. Source Code

The data itself was retrieved in CSV format from Baseball-Reference. Each team’s data was put into a folder and then read into a dataframe using the following function:

[code language="r"]
#Read files from a directory into a dataframe
getDfFromDir <- function(csvDir){
  fileList <- list.files(csvDir)

  outputFromDfDir <- NULL
  for(file in fileList){
    fullPath <- paste(csvDir, "/", file, sep="")
    tmp_ds <- read.csv(fullPath)
    if(is.null(outputFromDfDir)){
      #First file: start the output dataframe
      outputFromDfDir <- tmp_ds
    } else {
      #Later files: append rows to what has been read so far
      outputFromDfDir <- rbind(outputFromDfDir, tmp_ds)
    }
  }
  #Return the combined dataframe (NULL if the directory is empty)
  outputFromDfDir
}
[/code]


Then to do the graphs and get the data I needed I used the following code:

[code language="r"]
#Packages used below: plyr (ddply), reshape2 (melt), ggplot2 (ggplot)
library(plyr)
library(reshape2)
library(ggplot2)

#Read in all the minors pitching data from directory
minors_pitching <- getDfFromDir("sport_data/product/all/baseball-reference/2016/csv/pitching/")

#See if you can compute staff era for each team
minors_tm_era <- ddply(minors_pitching,.(Tm,Lg,Lvl,Franchise), summarise, era=(sum(ER)/sum(IP))*9)
#Remove NA rows
minors_tm_era <- minors_tm_era[complete.cases(minors_tm_era),]

#Level Data
minors_lvl_era <- ddply(minors_pitching,.(Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_lvl_era <- minors_lvl_era[complete.cases(minors_lvl_era),]

#League Data
minors_Lg_era <- ddply(minors_pitching,.(Lg,Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_Lg_era <- minors_Lg_era[complete.cases(minors_Lg_era),]

#Nationals affiliates only
wsh_tm_era <- ddply(minors_pitching[minors_pitching$Franchise=="Washington Nationals",],.(Tm,Lvl), summarise, era=(sum(ER)/sum(IP))*9)

lg_melt_data <- melt(minors_Lg_era)
team_melt_data <- melt(wsh_tm_era)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(team_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data,team_melt_data)

#Graph for era
era_graph <- ggplot(data=total_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA Per Level")
print(era_graph)

#Graph of only the nationals
nats_era_graph <- ggplot(data=team_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA")
print(nats_era_graph)
[/code]