## Modeling Strikeout Rate Between Minor League Levels

In this post I’ll go over my results for predicting strikeout rates between minor league levels. The article covers the data, the data wrangling, the graphs and correlations, and the model and its evaluation.

Data

This time around I’ve changed my approach so I can do some cross-validation. The article covers data from 2004-2015, but I train my model on data from 2004-2013 and evaluate it using the 2014-2015 data. The data consists of 39,349 data points and came from Baseball Reference. The data points represent minor league seasons from Short Season (SS-A) to AAA ball. I removed the SS-A data because I’m currently only modeling data between the full-season leagues (A to AAA). Also, a player’s data point was only included if it had >= 200 plate appearances.
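The filtering and the train/evaluation split described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout and field names (`player`, `year`, `level`, `pa`, `so`) are hypothetical, the sample rows are made up, and it is not the code used for this article.

```python
# Hypothetical record layout; the field names are assumptions for illustration.
FULL_SEASON_LEVELS = {"A", "A+", "AA", "AAA"}
MIN_PA = 200  # minimum plate appearances, as stated in the article


def filter_seasons(rows):
    """Keep full-season levels (drops SS-A) with at least MIN_PA plate appearances."""
    return [
        r for r in rows
        if r["level"] in FULL_SEASON_LEVELS and r["pa"] >= MIN_PA
    ]


def split_by_year(rows, last_train_year=2013):
    """Split into a training set (<= last_train_year) and an evaluation set."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


# Made-up sample rows, not real player data.
sample = [
    {"player": "p1", "year": 2012, "level": "A", "pa": 450, "so": 90},
    {"player": "p1", "year": 2013, "level": "A+", "pa": 180, "so": 40},
    {"player": "p2", "year": 2014, "level": "SS-A", "pa": 250, "so": 60},
    {"player": "p3", "year": 2015, "level": "AA", "pa": 500, "so": 110},
]

kept = filter_seasons(sample)
train, test = split_by_year(kept)
print(len(kept), len(train), len(test))
```

Splitting by season rather than at random keeps the evaluation data strictly later than the training data, which matches the 2004-2013 versus 2014-2015 setup.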

Data Wrangling

To model the data between minor league levels I need to do some data wrangling to get the dataframe into the format I need. The original data has each player’s season as a separate entry.

In order for me to graph and get correlation values between minor league levels I need all this data on one row with the stats for each level represented by a column. Below you can see a snippet of the dataframe I use for my analysis:

Notice how in the dataframe above all the stats I need for each level have been merged into one row.
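As a rough sketch of that reshaping step, the Python below turns one-row-per-season records into one entry per player with a strikeout rate for each level. Several things here are assumptions rather than details from this article: the field names are hypothetical, strikeout rate is taken to be strikeouts divided by plate appearances, and repeat seasons at the same level are pooled.

```python
from collections import defaultdict


def to_wide(rows):
    """Return {player: {level: strikeout rate}}, one entry per player.

    Assumption: repeat seasons at a level are pooled by summing SO and PA.
    """
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [so, pa]
    for r in rows:
        t = totals[r["player"]][r["level"]]
        t[0] += r["so"]
        t[1] += r["pa"]
    return {
        player: {lvl: so / pa for lvl, (so, pa) in levels.items()}
        for player, levels in totals.items()
    }


def level_pairs(wide, lower, upper):
    """Collect (lower-level rate, upper-level rate) for players with both."""
    return [
        (rates[lower], rates[upper])
        for rates in wide.values()
        if lower in rates and upper in rates
    ]


# Made-up sample rows, not real player data.
rows = [
    {"player": "p1", "level": "A", "pa": 300, "so": 60},
    {"player": "p1", "level": "A", "pa": 100, "so": 20},
    {"player": "p1", "level": "A+", "pa": 500, "so": 125},
    {"player": "p2", "level": "A", "pa": 300, "so": 60},
]

wide = to_wide(rows)
pairs = level_pairs(wide, "A", "A+")
print(pairs)
```

Only players who appear at both levels contribute a pair, which is one plausible reason the wrangled dataframe is much smaller than the raw data.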

Graphs and Correlation

As you can see from the graphs above, a positive linear relationship exists for strikeout rate between the minor league levels I’ve analyzed (A to A+, A+ to AA, AA to AAA). Here are the correlation values for each level:

- A to A+ : 0.7532319
- A+ to AA : 0.7717004
- AA to AAA : 0.7666475

From the numbers and graphs above you can see a ‘strong’ positive correlation exists for the strikeout rate between levels.
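For readers who want to reproduce this kind of number, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The input rates are made up for illustration and the function name `pearson_r` is my own; it is not taken from the analysis behind this post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up strikeout rates at a lower and a higher level for four players.
lower_level = [0.15, 0.20, 0.25, 0.30]
upper_level = [0.16, 0.19, 0.24, 0.27]

r = pearson_r(lower_level, upper_level)
print(f"r = {r:.4f}")
```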

Model and Evaluation

The models for the regression line in the graphs above are:

- A to A+: **A+ SO Rate** = 0.7598 * (A SO Rate) + 0.04591
- A+ to AA: **AA SO Rate** = 0.83204 * (A+ SO Rate) + 0.03608
- AA to AAA: **AAA SO Rate** = 0.80664 * (AA SO Rate) + 0.04147
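To make the models concrete, the sketch below applies the three regression lines to a strikeout rate. The slopes and intercepts are the ones listed above; the function name `predict_next_level`, the dictionary layout, and the 0.20 input rate are illustrative assumptions.

```python
# (slope, intercept) for each transition, copied from the models above.
MODELS = {
    "A": ("A+", 0.7598, 0.04591),
    "A+": ("AA", 0.83204, 0.03608),
    "AA": ("AAA", 0.80664, 0.04147),
}


def predict_next_level(level, so_rate):
    """Return (next level, predicted strikeout rate) for a given level."""
    next_level, slope, intercept = MODELS[level]
    return next_level, slope * so_rate + intercept


# Illustrative input: a hitter striking out in 20% of plate appearances.
for level in ("A", "A+", "AA"):
    next_level, predicted = predict_next_level(level, 0.20)
    print(f"{level} 0.200 -> {next_level} {predicted:.4f}")
```

Because each slope is below 1 and each intercept is positive, the lines pull extreme rates toward the middle: low-strikeout hitters are projected to strike out a bit more at the next level, and very high-strikeout hitters a bit less.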

The ‘Doing Data Science’ book suggests using R-squared, p-values, and cross-validation to validate linear models. For this article I’ll be using R-squared and cross-validation. The R-squared values for each model are:

- A to A+: .5674
- A+ to AA: .5955
- AA to AAA: .5877
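For a simple linear regression with a single predictor and an intercept, R-squared equals the square of the Pearson correlation between the two variables. That makes it possible to cross-check these values against the training correlations reported earlier. The short check below uses only numbers from this article; the variable names are my own.

```python
# Training-set correlations and R-squared values as reported in the article.
correlations = {
    "A to A+": 0.7532319,
    "A+ to AA": 0.7717004,
    "AA to AAA": 0.7666475,
}
reported_r_squared = {
    "A to A+": 0.5674,
    "A+ to AA": 0.5955,
    "AA to AAA": 0.5877,
}

for label, r in correlations.items():
    print(f"{label}: r^2 = {r ** 2:.4f}, reported R^2 = {reported_r_squared[label]:.4f}")
```

In practical terms, each model explains a little under 60% of the variance in strikeout rate at the next level.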

To do cross-validation I’m going to use the data from 2014-2015. This dataset consists of 8,198 points. I performed the same steps I described above in the data wrangling section, and that brought the dataframe I do my analysis on down to 427 points. The correlation numbers remained strong for each level:

- A to A+: 0.7366793
- A+ to AA: 0.729288
- AA to AAA: 0.7794951

Here is a graph showing the regression line against the 2014-2015 data:

To tell how often I’m correct, I once again used the classification provided by FanGraphs in this chart:

This time I used the average difference between adjacent K% classifications, which comes to .0291667. So if my model’s prediction is more than ~.03 off the actual strikeout rate, I count it as wrong for that data point. Here are my results for each level:

A to A+:

- Incorrect: 48
- Correct: 66
- Percentage Correct: 57.89%

A+ to AA:

- Incorrect: 78
- Correct: 93
- Percentage Correct: 54.39%

AA to AAA:

- Incorrect: 52
- Correct: 74
- Percentage Correct: 58.73%
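The scoring rule and the percentages above can be sketched as follows. One assumption needs flagging: the K% thresholds in the code are what I understand the FanGraphs hitter rating table to be (Excellent 10% through Awful 27.5%), and they do reproduce the .0291667 average gap, but the original chart is not shown here. The function names and the example predictions are hypothetical; the correct and incorrect counts are the ones reported above.

```python
# Assumed FanGraphs K% rating thresholds for hitters, Excellent through Awful.
K_PCT_THRESHOLDS = [0.100, 0.125, 0.160, 0.200, 0.220, 0.250, 0.275]


def mean_gap(values):
    """Average difference between adjacent values in a sorted list."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    return sum(gaps) / len(gaps)


TOLERANCE = mean_gap(K_PCT_THRESHOLDS)  # roughly 0.0291667


def is_correct(predicted, actual, tol=TOLERANCE):
    """A prediction counts as correct if it is within tol of the actual rate."""
    return abs(predicted - actual) <= tol


def percent_correct(correct, incorrect):
    """Share of correct predictions, as a percentage."""
    return 100.0 * correct / (correct + incorrect)


# Hypothetical predictions: the first is within tolerance, the second is not.
print(is_correct(0.198, 0.220), is_correct(0.198, 0.250))

# (correct, incorrect) counts reported in the article.
results = {"A to A+": (66, 48), "A+ to AA": (93, 78), "AA to AAA": (74, 52)}
for label, (correct, incorrect) in results.items():
    print(f"{label}: {percent_correct(correct, incorrect):.2f}% correct")
```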