## Modeling Walk Rate Between Minor League Levels

After reading through Projecting X by Mike Podhorzer, I decided to try to predict some rate statistics between minor league levels. Mike states in his book, “Projecting rates makes it dramatically easier to adjust a forecast if necessary.” So if a player is injured or will only get a certain number of plate appearances that year, I can still attempt to project his performance. The first rate statistic I’m going to attempt to project is walk rate (BB%) between minor league levels. This article covers the raw data, data cleaning, correlations and graphs, the model and its results, some examples, and what’s next.

**Raw Data**

For my model I used data from Baseball Reference covering the last seven years of minor league seasons (2009-2015). Accounting for the Short Season A (SS-A) through AAA affiliates, I ended up with 28,316 data points for my analysis.

**Data Cleaning**

I’m using R, and my original data frame had each year of a player’s data in a different row. To do the calculations I wanted, I needed to move each player’s career minor league data into a single row. I also needed to filter on plate appearances in a season to get rid of noise, for example a player on a rehab assignment in the minor leagues, or a player who was injured for most of the year and only had 50-100 plate appearances. I settled on a minimum of 200 plate appearances for a season to be factored into the model. To remove further noise, I only attempt to model player performance between full-season leagues (A, A+, AA, AAA). Once the data was cleaned, I had the following number of data points for each pair of levels:

- A to A+ : 1129
- A+ to AA: 1023
- AA to AAA: 705
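
The cleaning steps above can be sketched roughly as follows. I did the real work in R; this is a minimal Python illustration, and the field names (`player`, `level`, `pa`, `bb`) and the rule for handling a repeated level (keep the season with the most plate appearances) are assumptions made for the example, not a description of the exact pipeline.

```python
from collections import defaultdict

MIN_PA = 200  # minimum plate appearances per season, as stated above
LEVEL_PAIRS = [("A", "A+"), ("A+", "AA"), ("AA", "AAA")]  # full-season levels only


def pair_levels(seasons, min_pa=MIN_PA):
    """Turn one-row-per-season records into (lower BB%, upper BB%) pairs."""
    by_player = defaultdict(dict)
    for row in seasons:
        if row["pa"] < min_pa:
            continue  # drop rehab stints and injury-shortened seasons
        # Assumption: if a player repeats a level, keep the season with the most PA.
        best = by_player[row["player"]].get(row["level"])
        if best is None or row["pa"] > best["pa"]:
            by_player[row["player"]][row["level"]] = row

    pairs = {pair: [] for pair in LEVEL_PAIRS}
    for levels in by_player.values():
        for lower, upper in LEVEL_PAIRS:
            if lower in levels and upper in levels:
                lo, hi = levels[lower], levels[upper]
                pairs[(lower, upper)].append((lo["bb"] / lo["pa"], hi["bb"] / hi["pa"]))
    return pairs


# Made-up seasons, purely for illustration.
seasons = [
    {"player": "p1", "level": "A", "pa": 450, "bb": 45},
    {"player": "p1", "level": "A+", "pa": 500, "bb": 40},
    {"player": "p2", "level": "A", "pa": 80, "bb": 10},  # filtered out: under 200 PA
    {"player": "p2", "level": "A+", "pa": 300, "bb": 30},
]
pairs = pair_levels(seasons)
```

Here only the first player contributes an A to A+ pair, because the second player's A season falls below the 200 plate appearance cutoff.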

**Correlation and Graphs**

I was able to get strong correlation numbers for walk rate between minor league levels. You can see the results below:

- A to A+ : .6301594
- A+ to AA: .6141332
- AA to AAA: .620662
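
For reference, a correlation like the ones above is a Pearson correlation between a player's walk rate at the lower level and at the next level up. The sketch below computes it by hand in Python, assuming paired lists of walk rates; the sample numbers are made up, and this is an illustration of the statistic rather than the code I ran.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up walk rates for four hypothetical players at two consecutive levels.
lower_level = [0.06, 0.08, 0.10, 0.12]
upper_level = [0.07, 0.08, 0.09, 0.11]
r = pearson_r(lower_level, upper_level)
```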

Here are the graphs for each level:

**Model and Results**

The linear models for each level are:

- A to A+: A+ BB% = .63184*(A BB%) + .02882
- A+ to AA: AA BB% = .6182*(A+ BB%) + .0343
- AA to AAA: AAA BB% = .5682*(AA BB%) + .0342
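
To make the equations concrete, here is a small Python sketch that applies them. The slopes and intercepts are the ones listed above; the function name and the dictionary layout are just illustrative choices.

```python
# (slope, intercept) for each pair of levels, taken from the models above.
MODELS = {
    ("A", "A+"): (0.63184, 0.02882),
    ("A+", "AA"): (0.6182, 0.0343),
    ("AA", "AAA"): (0.5682, 0.0342),
}


def predict_bb_rate(lower, upper, bb_rate):
    """Project walk rate at the upper level from walk rate at the lower level."""
    slope, intercept = MODELS[(lower, upper)]
    return slope * bb_rate + intercept


# A hypothetical player who walked in 10% of his plate appearances at A ball.
projected = predict_bb_rate("A", "A+", 0.10)
```

Because every slope is below 1, each model pulls unusually high walk rates down and unusually low ones up at the next level.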

To judge the success or failure of my results, I looked at how close each prediction came to the actual walk rate. FanGraphs has a great rating scale for walk rate at the major league level:

The image above gives a classification for multiple levels of walk rates. While it is based on major league data, it’s a good starting point for deciding on a margin of error for my model. The mean difference between adjacent levels in the FanGraphs table is .0183333, which I rounded to a margin of error of .02. If my predicted value for a player’s walk rate was within .02 of the actual value, I counted the prediction as correct; if the error was greater than that, I counted it as wrong. Here are the model’s results for each level:

- A to A+
  - Incorrect: 450
  - Correct: 679
  - Percentage Correct: ~.6014
- A+ to AA
  - Incorrect: 445
  - Correct: 578
  - Percentage Correct: ~.565
- AA to AAA
  - Incorrect: 278
  - Correct: 427
  - Percentage Correct: ~.6056

When I moved the cutoff up a percentage point to .03, the model’s results improved drastically:

- A to A+
  - Incorrect: 228
  - Correct: 901
  - Percentage Correct: ~.798
- A+ to AA
  - Incorrect: 246
  - Correct: 777
  - Percentage Correct: ~.7595
- AA to AAA
  - Incorrect: 144
  - Correct: 561
  - Percentage Correct: ~.7957
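
The correct/incorrect tallies come from a simple threshold check on the absolute error. A minimal Python sketch of that check is below; the sample predictions are made up, and treating a prediction that lands exactly on the margin as correct is an assumption.

```python
def hit_rate(predicted, actual, margin=0.02):
    """Return (correct, incorrect, share correct) for a given margin of error."""
    # Assumption: a prediction exactly on the margin counts as correct.
    correct = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= margin)
    incorrect = len(predicted) - correct
    return correct, incorrect, correct / len(predicted)


# Made-up predicted and actual walk rates, purely for illustration.
predicted = [0.090, 0.100, 0.080, 0.110]
actual = [0.100, 0.140, 0.085, 0.085]

at_02 = hit_rate(predicted, actual, margin=0.02)
at_03 = hit_rate(predicted, actual, margin=0.03)
```

Widening the margin from .02 to .03 turns the fourth made-up prediction (off by .025) from a miss into a hit, which is the same effect that lifts the real percentages above.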

**Examples**

Numbers are cool, but where are the actual examples? OK, let’s start with my worst prediction. The largest error I had was in the A to A+ model, where the error was more than 10 percentage points (~.1105). The player in this case was Joey Gallo. A quick glance at his player page shows that his A walk rate was only .1076 and his A+ walk rate was .2073, an improvement of about 10 percentage points between levels. So why did this happen, and why didn’t my model do a better job of predicting it? Currently the model only accounts for the previous season’s walk rate, but what if a player gets a lot of hits at one level and stops swinging as much at the next? That wasn’t the case for Gallo, who only had a .245 BA in his year at A ball. More investigation is required to see how the model can get closer on edge cases like this.
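
The size of the Gallo miss can be reproduced by plugging his A ball walk rate into the A to A+ equation. The Python snippet below does only that arithmetic, using the coefficients and walk rates quoted in this article.

```python
# A to A+ model from above: A+ BB% = .63184 * (A BB%) + .02882
slope, intercept = 0.63184, 0.02882

gallo_a = 0.1076       # Gallo's walk rate at A
gallo_a_plus = 0.2073  # Gallo's actual walk rate at A+

predicted = slope * gallo_a + intercept  # roughly .0968
error = gallo_a_plus - predicted         # roughly .1105

print(f"predicted {predicted:.4f}, actual {gallo_a_plus:.4f}, error {error:.4f}")
```

The model expected his walk rate to dip slightly at A+, while it actually nearly doubled.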

The lowest I was able to set the error threshold and still get results back was ~.00004417. That very close prediction belongs to Erik Gonzalez. I don’t know Erik Gonzalez, so I kept looking. Setting the threshold to .0002 brought back Stephen Lombardozzi as one of my six results. Lombo is interesting to hard-core Nats fans (like myself), but I wanted to find a more notable name. Finally, after raising the threshold to .003 for the A to A+ data, I saw that the model predicted the walk rate of Jose Altuve, the Houston Astros’ multi-time All-Star second baseman, to within .003.

**What’s Next:**

- Improve model to get a lower max error
- Predict strikeout rate between levels
- Predict more advanced statistics like wOBA/OPS/wRC