Browsed by
Month: September 2016

Modeling Walk Rate between minor league levels

After reading Projecting X by Mike Podhorzer, I decided to try to predict some rate statistics between minor league levels. Mike states in his book, “Projecting rates makes it dramatically easier to adjust a forecast if necessary.” So if a player is injured or will only get a certain number of plate appearances in a year, I can still attempt to project his performance. The first rate statistic I’m going to attempt to project is walk rate (BB%) between minor league levels. This article will cover the following:

Raw Data

Data Cleaning

Correlation and Graphs

Model and Results


Raw Data

For my model I used the last 7 years of minor league data (2009-2015) from Baseball Reference. Counting the affiliates from Short Season A (SS-A) up to AAA, I ended up with 28,316 data points for my analysis.

Data Cleaning

I’m using R, and my original dataframe had each year of a player’s data in a different row. To do the calculations I wanted, I needed to move each player’s career minor league data into the same row. I also needed to filter on plate appearances in a season to get rid of noise: for example, a player on a rehab assignment in the minor leagues, or a player who was injured for most of the year and only had 50-100 plate appearances. I settled on a minimum of 200 plate appearances in a season for a player to be factored into the model. To remove more noise, I’m only modeling player performance between full-season leagues (A, A+, AA, AAA). Once the data was cleaned, I had the following number of data points for each pair of levels:

  • A to A+ : 1129
  • A+ to AA: 1023
  • AA to AAA: 705
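
Here’s a rough sketch of what that cleaning step looks like. It’s written in Python rather than R, the rows are made up, and the names (seasons, walk_rates_by_player, level_pairs) are hypothetical ones chosen for illustration, not the actual code behind the counts above.

```python
# Hypothetical season rows: (player, year, level, plate_appearances, walks).
# Every value here is invented for illustration; none comes from the real dataset.
seasons = [
    ("player_a", 2012, "A", 450, 40),
    ("player_a", 2013, "A+", 500, 55),
    ("player_b", 2012, "A", 80, 10),      # short stint, below the PA cutoff
    ("player_b", 2013, "A+", 400, 30),
    ("player_c", 2013, "SS-A", 250, 20),  # short-season level, excluded
    ("player_c", 2014, "A", 420, 35),
]

FULL_SEASON_LEVELS = ("A", "A+", "AA", "AAA")
MIN_PA = 200  # minimum plate appearances in a season, as described in the post


def walk_rates_by_player(rows, min_pa=MIN_PA):
    """Collapse one-row-per-season data into one dict per player: level -> BB%.

    Assumption: if a player has several qualifying seasons at the same level,
    the last row seen wins. The post does not say how repeats were handled.
    """
    players = {}
    for player, _year, level, pa, walks in rows:
        if level not in FULL_SEASON_LEVELS or pa < min_pa:
            continue  # drop short-season levels and low-PA seasons
        players.setdefault(player, {})[level] = walks / pa
    return players


def level_pairs(players, lower, upper):
    """Return (lower BB%, upper BB%) for players who qualified at both levels."""
    return [
        (rates[lower], rates[upper])
        for rates in players.values()
        if lower in rates and upper in rates
    ]


wide = walk_rates_by_player(seasons)
pairs = level_pairs(wide, "A", "A+")
print(len(pairs))  # only player_a qualified at both A and A+
```

Each entry in pairs is one data point for the A to A+ comparison, which is what the counts in the list above are counting.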

Correlation and Graphs

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

  • A to A+ : .6301594
  • A+ to AA: .6141332
  • AA to AAA: .620662
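
For anyone who wants to see what’s behind numbers like these, below is a small sketch of the Pearson correlation calculation in plain Python. The two lists of walk rates are made up, so the result is not one of the correlations above, and pearson_r is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and the sums of squared deviations for each variable.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented walk rates at a lower and a higher level for five players.
lower_level = [0.06, 0.08, 0.10, 0.12, 0.14]
upper_level = [0.07, 0.07, 0.10, 0.09, 0.12]

r = pearson_r(lower_level, upper_level)
print(round(r, 3))
```

A value near 1 means players who walk a lot at the lower level also tend to walk a lot at the next one.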

Here are the graphs for each level:




Model and Results

The linear models for each level are:

  • A to A+: A+ BB% = .63184*(A BB%) + .02882
  • A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
  • AA to AAA: AAA BB% = .5682*(AA BB%) + .0342
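
The three equations above are easy to apply in code. This is a minimal Python sketch that uses the slopes and intercepts exactly as listed; the MODELS table and predict_bb_rate function are hypothetical names chosen for illustration.

```python
# (from_level, to_level) -> (slope, intercept), copied from the equations above.
MODELS = {
    ("A", "A+"): (0.63184, 0.02882),
    ("A+", "AA"): (0.6182, 0.0343),
    ("AA", "AAA"): (0.5682, 0.0342),
}


def predict_bb_rate(from_level, to_level, bb_rate):
    """Predict walk rate at to_level from the walk rate at from_level."""
    slope, intercept = MODELS[(from_level, to_level)]
    return slope * bb_rate + intercept


# A hitter who walked in 10% of plate appearances in A ball.
print(round(predict_bb_rate("A", "A+", 0.10), 4))  # 0.092
```

Because every slope is below 1, the models pull predictions toward the middle: a hitter with a very high walk rate is projected to lose some of it at the next level, and a hitter with a very low one is projected to gain a little.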

To judge the success or failure of my results, I looked at how close each prediction was to the actual walk rate. Fangraphs has a great rating scale for walk rate at the Major League level:

Image from Fangraphs 

The image above gives a classification for several tiers of walk rate. While it’s based on major league data, it’s a good starting point for deciding on a margin of error for my model. The mean difference between adjacent tiers in the Fangraphs table is .0183333, which I rounded to a margin of error of .02. So if my predicted value for a player’s walk rate was within .02 of the actual value, I counted the prediction as correct, and if the error was greater than that I counted it as wrong. Here are the model’s results for each level:

  • A to A+
    • Incorrect: 450
    • Correct: 679
    • Percentage Correct: ~.6014
  • A+ to AA
    • Incorrect: 445
    • Correct: 578
    • Percentage Correct: ~.565
  • AA to AAA
    • Incorrect: 278
    • Correct: 427
    • Percentage Correct: ~.6056
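
The correct/incorrect scoring can be sketched in a few lines of Python. The predicted and actual values below are invented, score_predictions is a hypothetical name, and treating an error of exactly the tolerance as correct is my assumption about what "within" means here.

```python
def score_predictions(predicted, actual, tolerance=0.02):
    """Count predictions within `tolerance` of the actual walk rate.

    Returns (correct, incorrect, share_correct).
    Assumption: an absolute error equal to the tolerance counts as correct.
    """
    correct = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    incorrect = len(predicted) - correct
    return correct, incorrect, correct / len(predicted)


# Invented predicted and actual walk rates for four players.
predicted = [0.08, 0.09, 0.10, 0.07]
actual = [0.085, 0.115, 0.095, 0.11]

print(score_predictions(predicted, actual))                  # (2, 2, 0.5)
print(score_predictions(predicted, actual, tolerance=0.03))  # (3, 1, 0.75)

# The reported share for A to A+ follows directly from the counts in the list.
print(round(679 / (679 + 450), 4))  # 0.6014
```

The tolerance is a parameter so the same function can be rerun with a looser or tighter cutoff.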

When I raised the cutoff by one percentage point to .03, the model’s results improved dramatically:

  • A to A+
    • Incorrect: 228
    • Correct: 901
    • Percentage Correct: ~.798
  • A+ to AA
    • Incorrect: 246
    • Correct: 777
    • Percentage Correct: ~.7595
  • AA to AAA
    • Incorrect: 144
    • Correct: 561
    • Percentage Correct: ~.7957


Numbers are cool, but where are the actual examples? OK, let’s start with my worst prediction. The largest error I had was for A to A+, and it was more than 10 percentage points (~.1105). The player in this case was Joey Gallo. A quick glance at his player page shows his walk rate in A was only .1076 and his walk rate in A+ was .2073, an improvement of about 10 percentage points between levels. So why did this happen, and why didn’t my model do a better job of predicting it? Currently the model only accounts for the previous season’s walk rate, but what if a player is getting a lot of hits at one level and stops swinging as much at the next? In Gallo’s case he only had a .245 BA in his year at A ball, so that wasn’t the case. More investigation is required to see how the model can get closer on edge cases like this.

Gallo Dataframe Snippet
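
To make the Gallo miss concrete, the arithmetic can be worked through with the A to A+ equation and the two walk rates quoted above. This is just a Python sketch of that calculation; the variable names are mine.

```python
# A to A+ model from the post: A+ BB% = .63184 * (A BB%) + .02882
SLOPE, INTERCEPT = 0.63184, 0.02882

gallo_a = 0.1076       # walk rate in A, as quoted above
gallo_a_plus = 0.2073  # walk rate in A+, as quoted above

predicted = SLOPE * gallo_a + INTERCEPT
error = gallo_a_plus - predicted

print(f"predicted A+ BB%: {predicted:.4f}")  # predicted A+ BB%: 0.0968
print(f"error: {error:.4f}")                 # error: 0.1105
```

The model expected his walk rate to dip slightly in A+, and instead it nearly doubled, which is where the ~.1105 error comes from.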

The lowest I was able to set the error cutoff and still get a result back was ~.00004417. That very close prediction belongs to Erik Gonzalez. I don’t know Erik Gonzalez, so I kept looking: setting the cutoff to .0002 brought back Stephen Lombardozzi as one of six results. Lombo’s interesting to hardcore Nats fans (like myself), but I wanted to find a more notable name. Finally, after raising the cutoff to .003 for the A to A+ data, I could see that the model predicted the walk rate of Jose Altuve, the Houston Astros’ multi-time All-Star 2B, to within .003.

Altuve Dataframe snippet


What’s Next:

  • Improve the model to get a lower max error
  • Predict strikeout rate between levels
  • Predict more advanced statistics like wOBA/OPS/wRC


Correlation between Salary Cap and Winning?

After my initial blog post looking at how much each team spends per position group, I wanted to see whether there was any correlation between how much teams spend on a position group and winning. To do this I needed to merge the cap data from spotrac with the season summary data from pro-football-reference. I merged these datasets over the last 5 years, but it’d be interesting to try to find data going back to when the salary cap was put in place (1994). Here’s a graph of my yearly findings for the last 5 years (2011-2015).


Quick review of the usual labels for the strength of a Pearson correlation:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”

As you can see from the graph, the correlation numbers aren’t exactly high. I believe that’s because the best players aren’t necessarily getting paid the most money. For example, before last year Russell Wilson was on his rookie contract, and the Seahawks were making the playoffs year after year while only paying him $749,176 a year. He no doubt had a lot to do with the Seahawks winning, then and going forward, but his salary before his new contract wouldn’t have correlated much with winning. Examples like this can be found at each position. This is why it’s necessary to have a good front office that continually brings in young talent that can contribute at a lower price. Looking at actual on-the-field stats and trying to correlate those with winning would be a much better exercise than trying to correlate cap data with winning.

Position Correlation
DB 0.05356608
DL -0.10102064
LB 0.14434313
OL -0.0783075
QB 0.08256776
RB -0.0917516
ST -0.05109986
TE -0.04013766
WR 0.07491824
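
As a quick check, the table can be run through the strength scale listed earlier. This Python sketch copies the correlations from the table; the strength function is a hypothetical helper, and applying the bands to the absolute value (so negative correlations get a label too) is my assumption.

```python
# Correlations copied from the table above.
CORRELATIONS = {
    "DB": 0.05356608,
    "DL": -0.10102064,
    "LB": 0.14434313,
    "OL": -0.0783075,
    "QB": 0.08256776,
    "RB": -0.0917516,
    "ST": -0.05109986,
    "TE": -0.04013766,
    "WR": 0.07491824,
}


def strength(r):
    """Label a correlation by the size of its absolute value (assumed convention)."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


labels = {position: strength(r) for position, r in CORRELATIONS.items()}
positive = sorted(position for position, r in CORRELATIONS.items() if r > 0)

print(set(labels.values()))  # {'very weak'}
print(positive)              # ['DB', 'LB', 'QB', 'WR']
```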

Over the last 5 years, not one correlation rates higher than ‘very weak’. But the positions that do have positive correlations (DB, LB, QB, and WR) are the positions that GMs have been willing to pay over that span. Left tackle is another position that has been getting paid very well in the league, because left tackles are usually the ones protecting a QB’s blind side.

This data emphasizes the importance of drafting well, because spending money on a particular position does not correlate with your team winning. It’s also interesting that the positions with positive correlations are presumably that way because those players have made it to their second or third contracts and are now getting the big money. That makes me wonder whether DB, LB, QB, and WR are the positions with the longest careers in the NFL.