Browsed by
Month: October 2016

Combining R and Java

I was curious whether there were any libraries for combining R and Java, so I did some research to figure out the best one. Why? Since I already know Java, combining it directly with R is something I was interested in. A use case could be a mathematician who is great at using R to produce models but not so good at writing the Java code that ties those models into your application. Combining R and Java (assuming your app is in Java) would be an easy way to let the mathematician focus on the modeling and let the Java developers integrate the mathematician’s model.

  • Libraries
  • Helpful Links
  • Example
  • Thoughts

Libraries

Below are some of the libraries available for combining R and Java.

  • Renjin: This is the library I went with. It appears to be actively developed, and it has a nice website and good docs that teach you how to use the library.
  • RCaller: I was close to choosing this one. It is actively developed and has good docs (I didn’t try them, but they appear intuitive).
  • JRI
  • Rserve
  • rJava

Helpful Links

Example

I ended up using Renjin as my Java-to-R library. This example is a simple web service that lets a user RESTfully submit ‘x’ and ‘y’ coordinates and generate a model using R. The web service also provides functionality for making predictions based on your model. I used Jetty, CXF, Renjin, Jackson and Maven in the example. The source can be found here. Here are some screenshots showing the example endpoints:

[Screenshot: using a REST client to post data for my model]
[Screenshot: hitting the GET endpoint to see my model]
[Screenshot: passing input to the model to get a prediction]
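
To give a feel for the Java side, here is a minimal sketch of fitting a linear model through the standard javax.script API, which Renjin supports. This is not the code from the example project: the class name, the helpers toRVector and buildLmScript, and the sample coordinates are all made up for illustration. It assumes the Renjin script engine jar is on the classpath; without it no engine is found and the sketch just prints the R script it would have run.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

class RenjinLmSketch {

    // Renders a double[] as an R vector literal such as c(1.0, 2.5).
    static String toRVector(double[] values) {
        StringBuilder sb = new StringBuilder("c(");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(values[i]);
        }
        return sb.append(")").toString();
    }

    // Builds an R script that fits y ~ x and evaluates to the coefficients.
    static String buildLmScript(double[] x, double[] y) {
        return "x <- " + toRVector(x) + "\n"
             + "y <- " + toRVector(y) + "\n"
             + "coef(lm(y ~ x))\n";
    }

    public static void main(String[] args) throws ScriptException {
        // Hypothetical sample coordinates, not data from the post.
        double[] x = {1, 2, 3, 4};
        double[] y = {2.1, 3.9, 6.2, 7.8};
        String script = buildLmScript(x, y);

        // getEngineByName returns null when no matching engine is on the classpath.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("Renjin");
        if (engine == null) {
            System.out.println("Renjin not found; the script would have been:\n" + script);
            return;
        }
        // Renjin hands back its own vector type; printing it is enough for a sketch.
        System.out.println(engine.eval(script));
    }
}
```

Building the script as a string keeps the sketch free of Renjin-specific classes, at the cost of the awkward result handling mentioned in the thoughts below.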

Thoughts

Renjin was pretty easy to use, but I wish it had more helper methods for converting a Vector, or the other custom data types it has, into native Java objects. I also don’t like that the toString() methods only list a certain number of values (e.g. if you have a Vector and want to see all the ‘names’ it has). At some point I’ll sit down and figure out how to use Shiny, but this could be a possible interim solution for using an R script directly in your webapp.

Modeling Hit Rates Between Minor League Levels

I’m working on figuring out the hit rates for minor league batters between levels. I’d like to take the hit rates (i.e. singles (1B/PA), doubles (2B/PA), triples (3B/PA) and HRs (HR/PA)) a player had at their previous minor league level and use that data to predict how the player will do at the following level. The data is similar to what I used in the previous articles on walk rates and strikeout rates. This data set covers 2011-2015, and only players with a minimum of 200 PAs were included in the resulting model. Below are the graphs for each level, the models and some thoughts.

A to A+

[Figure: A to A+ hit rates]

A theme throughout the graphs is that the correlation numbers for singles and home runs are high but very low for doubles and triples. These same low correlation numbers for doubles and triples were found in previous research by Matt Klassen at FanGraphs.

Linear models:

  • A+ Single Rate = (A Single Rate)*0.53520 + 0.07452
  • A+ Double Rate = (A Double Rate)*.36379 + .02929
  • A+ Triple Rate = (A Triple Rate)*.403826 + .004743
  • A+ HR Rate = (A HR Rate)*.633131 + .0006235

A+ to AA

[Figure: A+ to AA hit rates]

Linear models:

  • AA Single Rate = (A+ Single Rate)*.48235 + .07969
  • AA Double Rate = (A+ Double Rate)*.22680 + .03389
  • AA Triple Rate = (A+ Triple Rate)*.377505 + .003751
  • AA HR Rate = (A+ HR Rate)*.534897 + .007925

AA to AAA

[Figure: AA to AAA hit rates]

Linear models:

  • AAA Single Rate = (AA Single Rate)*.52767 + .07912
  • AAA Double Rate = (AA Double Rate)*.248769 + .03645
  • AAA Triple Rate = (AA Triple Rate)*.355865 + .003757
  • AAA HR Rate = (AA HR Rate)*.58037 + .00881
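
Since every model above has the same slope-and-intercept shape, applying one is a single multiply and add. Here is a small Java sketch that stores the coefficients listed for each level and predicts the next-level rate. The class name, the array layout and the sample player rate are my own for illustration; only the coefficients come from the lists above.

```java
class HitRateModels {

    // Rows: A -> A+, A+ -> AA, AA -> AAA. Columns: slope, intercept.
    static final double[][] SINGLE = {{0.53520, 0.07452}, {0.48235, 0.07969}, {0.52767, 0.07912}};
    static final double[][] DOUBLE = {{0.36379, 0.02929}, {0.22680, 0.03389}, {0.248769, 0.03645}};
    static final double[][] TRIPLE = {{0.403826, 0.004743}, {0.377505, 0.003751}, {0.355865, 0.003757}};
    static final double[][] HOME_RUN = {{0.633131, 0.0006235}, {0.534897, 0.007925}, {0.58037, 0.00881}};

    // Row indexes into the coefficient tables.
    static final int A_TO_A_PLUS = 0;
    static final int A_PLUS_TO_AA = 1;
    static final int AA_TO_AAA = 2;

    // Predicted rate at the next level = slope * (rate at current level) + intercept.
    static double predict(double[][] model, int transition, double currentRate) {
        return model[transition][0] * currentRate + model[transition][1];
    }

    public static void main(String[] args) {
        // Hypothetical player: 16% of plate appearances at A ball ended in a single.
        double singleRateAtA = 0.160;
        System.out.printf("Predicted A+ single rate: %.4f%n",
                predict(SINGLE, A_TO_A_PLUS, singleRateAtA));
    }
}
```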

What’s Next:

  • Perform some validation on the above models
  • Combine the models you’ve generated to predict OBP/SLG/OPS
  • Make models that skip levels
  • Make code more efficient so you can do this faster

Modeling Strikeout Rate between minor league levels

In this post I’ll go over my results for predicting strikeout rates between minor league levels. This article will cover the following:

  • Data
  • Data Wrangling
  • Graphs and Correlation
  • Model and Evaluation

Data

This time around I’ve changed up my approach so I can do some cross-validation. The article covers data from 2004-2015, but I’ll be training my model on data from 2004-2013 and evaluating it using the 2014-2015 data. The data itself consists of 39,349 data points and came from Baseball Reference. The data points represent minor league data from Short Season (SS-A) to AAA ball. I ended up removing the SS-A data because currently I’m only modeling data between the full-season leagues (A-AAA). Also, a player’s data points were only included if they had >=200 plate appearances.

Data Wrangling

In order to model the data between minor league levels I need to do some data wrangling to get the dataframe into the format I need. The original data has each player’s season as a separate entry.

[Figure: snippet from the original dataframe. Each entry represents a year and the minor league level the stats are for.]

In order for me to graph and get correlation values between minor league levels I need all this data on one row with the stats for each level represented by a column. Below you can see a snippet of the dataframe I use for my analysis:

[Figure: snippet of the correlation dataframe.]

Notice how in the dataframe above all the stats I need for each level have been merged into one row.
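
The reshaping described above, going from one row per player season to one row per player with a column per level, can be sketched in plain Java as a group-by. The class, the field names and the sample values below are hypothetical and only illustrate the shape of the transformation, not the code I actually ran.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LevelPivot {

    // One row of the original data: a player's strikeout rate at one level.
    static final class SeasonRow {
        final String player;
        final String level;
        final double soRate;

        SeasonRow(String player, String level, double soRate) {
            this.player = player;
            this.level = level;
            this.soRate = soRate;
        }
    }

    // Groups rows by player so each player maps to {level -> strikeout rate}.
    // If a player repeats a level, the later row overwrites the earlier one;
    // that is a simplification made for this sketch.
    static Map<String, Map<String, Double>> pivot(List<SeasonRow> rows) {
        Map<String, Map<String, Double>> wide = new LinkedHashMap<>();
        for (SeasonRow row : rows) {
            wide.computeIfAbsent(row.player, k -> new LinkedHashMap<>())
                .put(row.level, row.soRate);
        }
        return wide;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows.
        List<SeasonRow> rows = new ArrayList<>();
        rows.add(new SeasonRow("player-1", "A", 0.21));
        rows.add(new SeasonRow("player-1", "A+", 0.19));
        rows.add(new SeasonRow("player-2", "AA", 0.24));
        System.out.println(pivot(rows));
    }
}
```

Only players that have a value for both levels of a transition (say A and A+) would contribute a point to that transition's graph.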

Graphs and Correlation

[Figure: scatter plots and regression lines for the levels of minor league data I modeled.]

As you can see from the graphs above, a positive linear relationship exists for strikeout rate between the minor league levels I’ve analyzed (A to A+, A+ to AA, AA to AAA). Here are the correlation values for each level:

  • A to A+ :  0.7532319
  • A+ to AA : 0.7717004
  • AA to AAA : 0.7666475

From the numbers above and graphs you can see a ‘strong’ positive correlation exists for the strikeout rate between levels.
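
For reference, correlation values like the ones above are Pearson correlation coefficients. Here is a minimal Java sketch of that calculation for two equal-length arrays, where x would hold the rates at the lower level and y the rates at the next level. The class and method names are mine, and the sample data is made up.

```java
class Correlation {

    // Pearson correlation: covariance of x and y divided by the product
    // of their standard deviations (the 1/n factors cancel out).
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 points");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0;
        double sxx = 0;
        double syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Hypothetical strikeout rates for four players at two adjacent levels.
        double[] lower = {0.15, 0.20, 0.25, 0.30};
        double[] upper = {0.17, 0.19, 0.26, 0.28};
        System.out.println(pearson(lower, upper));
    }
}
```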

Model and Evaluation

The models for the regression lines in the graphs above are:

  • A to A+ : A+ SO Rate = .7598*(A SO Rate) + .04591
  • A+ to AA: AA SO Rate = .83204*(A+ SO Rate) + .03608
  • AA to AAA: AAA SO Rate = .80664*(AA SO Rate) + .04147
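
Slopes and intercepts like these come out of an ordinary least-squares fit. As a rough illustration, here is what that fit looks like in plain Java for one predictor. The class and method names are hypothetical, and the fitting data in the test is invented; only the A to A+ coefficients used in main come from the list above.

```java
class SimpleLinearFit {

    // Ordinary least squares for y = slope * x + intercept.
    // Returns {slope, intercept}.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 points");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0;
        double sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {slope, meanY - slope * meanX};
    }

    // Applies a fitted line to a new value.
    static double predict(double slope, double intercept, double x) {
        return slope * x + intercept;
    }

    public static void main(String[] args) {
        // Using the A to A+ model above on a hypothetical 20% strikeout rate at A ball.
        System.out.printf("Predicted A+ SO rate: %.5f%n", predict(0.7598, 0.04591, 0.20));
    }
}
```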

The ‘Doing Data Science‘ book suggests using R-squared, p-values, and cross-validation to validate linear models. For this article I’ll be using R-squared and cross-validation. The R-squared values for each model are:

  • A to A+: .5674
  • A+ to AA: .5955
  • AA to AAA: .5877
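
A quick sanity check on these numbers: for a least-squares line with one predictor and an intercept, R-squared equals the square of the Pearson correlation, so the values above should match the squared correlations from the previous section (for example 0.7532319 squared is about .5674). The Java sketch below computes R-squared from its definition and prints the squared correlations; the class and method names are mine.

```java
class RSquaredCheck {

    // R-squared from its definition: 1 - SS_res / SS_tot.
    static double rSquared(double[] x, double[] y, double slope, double intercept) {
        double meanY = 0;
        for (double v : y) {
            meanY += v;
        }
        meanY /= y.length;

        double ssRes = 0; // squared error of the fitted line
        double ssTot = 0; // squared error of always predicting the mean
        for (int i = 0; i < y.length; i++) {
            double predicted = slope * x[i] + intercept;
            ssRes += (y[i] - predicted) * (y[i] - predicted);
            ssTot += (y[i] - meanY) * (y[i] - meanY);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        // Correlations reported for each pair of levels in this post.
        String[] labels = {"A to A+", "A+ to AA", "AA to AAA"};
        double[] correlations = {0.7532319, 0.7717004, 0.7666475};
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%s: r^2 = %.4f%n", labels[i], correlations[i] * correlations[i]);
        }
    }
}
```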

To do cross-validation I’m going to use the data from 2014-2015. This dataset consists of 8198 points. I performed the same steps I described above in the data wrangling section, and that brought the dataframe I do my analysis on down to 427 points. The correlation numbers remained strong per level:

  • A to A+: 0.7366793
  • A+ to AA: 0.729288
  • AA to AAA: 0.7794951

Here is a graph showing the regression line against the 2014-2015 data:

[Figure: regression lines plotted against the 2014-2015 data.]

To tell how often my model is correct, I once again used the classifications provided by FanGraphs in this chart:

[Figure: FanGraphs rate stat classification chart. Picture retrieved from http://www.fangraphs.com/library/offense/rate-stats/]

This time I used the average difference between the K% classifications, which comes out to .0291667. So if my model’s prediction is more than ~.03 off the actual strikeout rate, I count it as wrong for that data point. Here are my results for each level:

A to A+:

  • Incorrect: 48
  • Correct: 66
  • Percentage Correct: 57.89%

A+ to AA:

  • Incorrect: 78
  • Correct: 93
  • Percentage Correct: 54.39%

AA to AAA:

  • Incorrect: 52
  • Correct: 74
  • Percentage Correct: 58.73%
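
The scoring rule behind these counts is simple enough to sketch in a few lines of Java: a prediction counts as correct when it lands within .0291667 of the actual strikeout rate. The class and method names are hypothetical, and treating a difference of exactly the tolerance as correct is my assumption. The counts in main are the ones reported above.

```java
class ToleranceScore {

    // Average gap between the K% classifications, as described in this post.
    static final double TOLERANCE = 0.0291667;

    // A prediction is correct when it is no more than TOLERANCE away from the actual rate.
    static boolean isCorrect(double predicted, double actual) {
        return Math.abs(predicted - actual) <= TOLERANCE;
    }

    // Counts how many predictions fall within the tolerance.
    static int countCorrect(double[] predicted, double[] actual) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (isCorrect(predicted[i], actual[i])) {
                correct++;
            }
        }
        return correct;
    }

    static double percentCorrect(int correct, int incorrect) {
        return 100.0 * correct / (correct + incorrect);
    }

    public static void main(String[] args) {
        System.out.printf("A to A+: %.2f%%%n", percentCorrect(66, 48));
        System.out.printf("A+ to AA: %.2f%%%n", percentCorrect(93, 78));
        System.out.printf("AA to AAA: %.2f%%%n", percentCorrect(74, 52));
    }
}
```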