Month: July 2016

Enriching Datasets with R

If you have a simple dataset and some additional statistics you'd like to add to it, you can do that easily with R. As an example, I'm going to add FIP, wOBA, wRC, and wRAA to a couple of baseball datasets.

To calculate FIP I first needed the following R functions:

[code language=”r”]
#Calculate FIP Constant
fip_constant_calc <- function(pitching_data){
  #FIP Constant = lgERA - (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
  era = sum(pitching_data["ER"])/sum(pitching_data["IP"]) * 9
  lgHR = sum(pitching_data["HR"])
  lgBB = sum(pitching_data["BB"])
  lgHBP = sum(pitching_data["HBP"])
  lgK = sum(pitching_data["SO"])
  lgIP = sum(pitching_data["IP"])

  fipConstant = era - (((13*lgHR)+(3*(lgBB+lgHBP))-(2*lgK))/lgIP)
  #print(paste("ERA", era))
  return(fipConstant)
}

#Calculate FIP
fip_calc <- function(pitching_data, fipConstant){
  #FIP = ((13*HR)+(3*(BB+HBP))-(2*K))/IP + constant
  hr = as.numeric(pitching_data["HR"])
  bb = as.numeric(pitching_data["BB"])
  hbp = as.numeric(pitching_data["HBP"])
  k = as.numeric(pitching_data["SO"])
  ip = as.numeric(pitching_data["IP"])

  fip = ((13*hr)+(3*(bb+hbp))-(2*k))/ip + fipConstant

  return(fip)
}
[/code]

Once you have the functions needed to calculate the stat, you can use R's apply to run them across your dataframe.

[code language=”r”]
#Get Data
fg_pitching_data = read.csv("data/2007/FG_MLB_Pitching_Std_All.csv")
constant <- fip_constant_calc(fg_pitching_data)
fg_pitching_data$fip <- apply(fg_pitching_data, 1, fip_calc, fipConstant=constant)
[/code]
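
To spot-check the enriched dataframe, you can peek at the leaders in the new column. This is just a quick sanity check and assumes the FanGraphs export includes a Name column:

[code language=”r”]
#Ten best (lowest) FIPs in the dataset
head(fg_pitching_data[order(fg_pitching_data$fip), c("Name", "IP", "fip")], 10)
[/code]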

Here are some R functions to calculate wOBA, wRC, and wRAA.

[code language=”r”]
#Calculate wOBA based on weights using FG formula
woba_calc_weights_fg <- function(row, weights){
  bb <- as.numeric(row["BB"])
  hbp <- as.numeric(row["HBP"])
  doubles <- as.numeric(row["X2B"])
  triples <- as.numeric(row["X3B"])
  hr <- as.numeric(row["HR"])
  hits <- as.numeric(row["H"])
  singles <- hits - triples - doubles - hr
  ab <- as.numeric(row["AB"])
  ibb <- as.numeric(row["IBB"])
  sf <- as.numeric(row["SF"])

  numerator <- as.numeric(weights["wobaBB"])*(bb-ibb)+as.numeric(weights["wobaHB"])*hbp+as.numeric(weights["woba1B"])*singles+as.numeric(weights["woba2B"])*doubles+as.numeric(weights["woba3B"])*triples+as.numeric(weights["wobaHR"])*hr
  denom <- ab + bb - ibb + sf + hbp

  return(numerator/denom)
}

#http://www.fangraphs.com/library/offense/wraa/
wraa_calc_fg <- function(row, weights){
  numerator <- as.numeric(row["woba"]) - as.numeric(weights["woba"])
  denom <- as.numeric(weights["wobaScale"])

  result = (numerator/denom) * (as.numeric(row["PA"]))

  return(result)
}

#Note: the weights passed in must also carry lgR and lgPA (league runs and plate appearances)
wrc_calc <- function(row, weights){
  woba = as.numeric(row["woba"])
  lgWOBA = as.numeric(weights["woba"])
  wobaScale = as.numeric(weights["wobaScale"])
  lgR = as.numeric(weights["lgR"])
  lgPA = as.numeric(weights["lgPA"])
  pa = as.numeric(row["PA"])
  wrc = (((woba-lgWOBA)/wobaScale)+(lgR/lgPA))*pa

  return(wrc)
}
[/code]

Once you have the necessary functions in place, you can add the statistics to your dataset with this code:

[code language=”r”]
fg_batting_data = read.csv("data/2007/FG_MLB_Batting_Std_All.csv")
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
fg_batting_data$woba = apply(fg_batting_data, 1, woba_calc_weights_fg, weights=wobaWeights)
fg_batting_data$wraa = apply(fg_batting_data, 1, wraa_calc_fg, weights=wobaWeights)
fg_batting_data$wrc = apply(fg_batting_data, 1, wrc_calc, weights=wobaWeights)
[/code]

Note: ‘get_woba_weights’ was addressed in the Learning wOBA blog post.

Why Reynaldo?

The Nationals sent Lucas Giolito back down to the minors and have called up Reynaldo Lopez for his Major League debut tomorrow, so I decided to take a look at possible reasons for that decision. Giolito did OK in his first, rain-shortened start, giving up only 1 hit in 4 innings, but the 2 BBs were a little concerning, especially since he's had a pattern of walking people this season, to the tune of 4.3 BB/9 in the Eastern League. That BB/9 currently ranks in the bottom 10 in the Eastern League among pitchers with more than 50 IP. Giolito's control problem reared its ugly head again in his second start, when he gave up 4 BB and 4 ER in 3.2 innings of work. The Major Leagues isn't the place for a young pitcher to work out his control issues, so the Nationals made a smart decision and sent Giolito down to figure them out. Last year in A+ Giolito was walking only 2.58 batters per 9 innings; if he can get his walk rate back down to those levels, I'm sure he'll be back up in no time. With that out of the way, let's look at the three best Nationals pitching prospects (Austin Voth, Reynaldo Lopez, and Giolito) to figure out 'Why Reynaldo':

[Table: pitching_comp, a side-by-side comparison of pitching stats for Voth, Lopez, and Giolito]

As you can see in the table above, they all have pretty similar ERAs, but Reynaldo is outpacing his two counterparts in K/9, K/BB, and FIP. Voth and Reynaldo have comparable WHIPs, and Voth is doing the best at stranding runners on base. Then again, when you're striking out as many people as Reynaldo, you probably don't have too many runners on. Here's a look at all three players' ERAs over their last 7 minor league starts:

So why Reynaldo? Well the answer seems clear to me. He’s been the best performing Nationals minor league pitcher thus far this season and has earned the start. I’m looking forward to seeing Reynaldo’s debut tomorrow hopefully he keeps striking people out and keeps the BBs at a manageable rate.
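
As an aside, the rate stats in that comparison are easy to derive in R. Here's a minimal sketch using made-up pitching lines, not the players' actual 2016 numbers:

[code language=”r”]
#Hypothetical pitching lines for illustration only
prospects <- data.frame(Name = c("Voth", "Lopez", "Giolito"),
                        IP = c(90, 76, 71),
                        SO = c(85, 100, 72),
                        BB = c(30, 22, 35))

#Rate stats: strikeouts and walks per nine innings, and K/BB
prospects$K9 <- prospects$SO / prospects$IP * 9
prospects$BB9 <- prospects$BB / prospects$IP * 9
prospects$KperBB <- prospects$SO / prospects$BB
prospects
[/code]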

An Example CSV to Mongo Dataflow

Do you have a bunch of CSV files lying around that you'd like to be able to run queries against? This post discusses a way to do that with Apache Camel and a custom processor I wrote.

Prerequisites:

  • Ability to run mvn camel:run
  • A MongoDB installation available

Technologies used: Apache Camel, Jackson, and MongoDB.

I wrote a custom Camel processor to take in a CSV file and turn each line into JSON; Jackson was my tool of choice for the conversion. Below is the code for the custom processor:

[code language=”java”]
package datadidit.helpful.hints.camel;

import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.ProducerTemplate;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;
import com.fasterxml.jackson.dataformat.csv.CsvSchema.Builder;

/**
 * Camel Processor that reads in a CSV file and produces a JSON Exchange for
 * each line.
 *
 * Helpful examples:
 * http://camel.apache.org/how-do-i-write-a-custom-processor-which-sends-multiple-messages.html
 * http://stackoverflow.com/questions/19766266/convert-csv-file-directly-to-json-file-using-jackson-library
 */
public class CSVToJson implements Processor {
    private Logger logger = Logger.getLogger(CSVToJson.class.getName());

    ProducerTemplate producer;

    private Boolean header;

    private String fieldNames;

    private CsvSchema schema;

    public CSVToJson(Boolean header, String fieldNames) throws Exception {
        this.header = header;
        this.fieldNames = fieldNames;

        if (!header && fieldNames != null) {
            Builder build = CsvSchema.builder();
            for (String field : fieldNames.split(",")) {
                build.addColumn(field, CsvSchema.ColumnType.NUMBER_OR_STRING);
            }
            schema = build.build();
        } else if (header && fieldNames != null && !fieldNames.equals("")) {
            schema = this.buildCsvSchema(fieldNames, header);
        } else if (!header && fieldNames == null) {
            throw new Exception("File must either contain headers or you must provide them..");
        } else {
            schema = CsvSchema.emptySchema().withHeader();
        }
    }

    //TODO: Should probably do it this way at some point....
    private CsvSchema buildCsvSchema(String fieldNames, Boolean withHeader) {
        Builder build = CsvSchema.builder();
        for (String field : fieldNames.split(",")) {
            String[] fieldWithType = field.split("#");
            if (fieldWithType.length == 2) {
                logger.info("Field: " + fieldWithType[0]);
                logger.info("Type: " + fieldWithType[1]);
                build.addColumn(fieldWithType[0], CsvSchema.ColumnType.valueOf(fieldWithType[1]));
            } else {
                build.addColumn(field);
            }
        }

        if (withHeader) {
            return build.build().withHeader();
        }

        return build.build();
    }

    public void process(Exchange arg0) throws Exception {
        InputStream stream = arg0.getIn().getBody(InputStream.class);
        List<Map<?, ?>> objects = readObjectsFromCsv(stream);

        for (Map<?, ?> map : objects) {
            //Create JSON and send each line on as its own Exchange
            final String json = writeAsJson(map);
            producer.send(new Processor() {
                public void process(Exchange outExchange) {
                    outExchange.getIn().setBody(json);
                }
            });
        }

        //TODO: If you don't close the stream this processor will continue to try and process the exchange...
        stream.close();
    }

    public List<Map<?, ?>> readObjectsFromCsv(InputStream file) throws IOException {
        CsvMapper csvMapper = new CsvMapper();
        String csv = IOUtils.toString(file, "UTF-8");
        MappingIterator<Map<?, ?>> mappingIterator = csvMapper.readerFor(Map.class).with(schema).readValues(csv);
        return this.fixMap(mappingIterator.readAll());
    }

    public List<Map<?, ?>> readObjectsFromCsv(String fileContent) throws JsonProcessingException, IOException {
        CsvMapper csvMapper = new CsvMapper();
        MappingIterator<Map<?, ?>> mappingIterator = csvMapper.readerFor(Map.class).with(schema).readValues(fileContent);
        return this.fixMap(mappingIterator.readAll());
    }

    //TODO: This is a HACK, use library or submit bug
    public List<Map<?, ?>> fixMap(List<Map<?, ?>> map) {
        List<Map<?, ?>> newList = new ArrayList<>();

        for (Map<?, ?> entry : map) {
            Map<String, Object> newMap = new HashMap<String, Object>();
            for (Map.Entry<?, ?> mEntry : entry.entrySet()) {
                String value = mEntry.getValue().toString();
                //Need to remove leading . for isNumeric to work with Doubles
                if (value.startsWith(".") && StringUtils.isNumeric(value.substring(1))) {
                    newMap.put(mEntry.getKey().toString(), Double.parseDouble(value));
                } else if (StringUtils.isNumeric(value)) {
                    newMap.put(mEntry.getKey().toString(), Integer.parseInt(value));
                } else {
                    newMap.put(mEntry.getKey().toString(), value);
                }
            }
            newList.add(newMap);
        }

        return newList;
    }

    public String writeAsJson(List<Map<?, ?>> data) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        return mapper.writeValueAsString(data);
    }

    /**
     * Taken from: https://itexpertsconsultant.wordpress.com/2016/08/03/how-to-readwrite-csv-file-to-map-in-java/
     */
    public void csvWriter(List<Map<?, ?>> listOfMap, Writer writer) throws IOException {
        CsvSchema schema = null;
        CsvSchema.Builder schemaBuilder = CsvSchema.builder();
        if (listOfMap != null && !listOfMap.isEmpty()) {
            for (Object col : listOfMap.get(0).keySet()) {
                schemaBuilder.addColumn(col.toString());
            }
            schema = schemaBuilder.build().withLineSeparator("\r").withHeader();
        }
        CsvMapper mapper = new CsvMapper();
        mapper.writer(schema).writeValues(writer).writeAll(listOfMap);
        writer.flush();
    }

    public String writeAsJson(Map<?, ?> data) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        return mapper.writeValueAsString(data);
    }

    public void setProducer(ProducerTemplate producer) {
        this.producer = producer;
    }

    public Boolean getHeader() {
        return header;
    }

    public void setHeader(Boolean header) {
        this.header = header;
    }

    //fieldNames holds the comma-separated column list, so the accessors use String
    public String getFieldNames() {
        return fieldNames;
    }

    public void setFieldNames(String fieldNames) {
        this.fieldNames = fieldNames;
    }
}
[/code]

Once the custom processor was in place, I used Camel's XML DSL, which lets you define routes without writing any additional code:

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:camel="http://camel.apache.org/schema/spring"
xsi:schemaLocation="
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://camel.apache.org/schema/spring http://camel.apache.org/schema/spring/camel-spring.xsd">
<camelContext xmlns="http://camel.apache.org/schema/spring">
<template id="myTemplate" defaultEndpoint="direct:handleJson"/>
<!--
Route that takes in a CSV file and outputs JSON
-->
<route id="csv-to-json">
<from uri="file:///Users/mkwyc_000/workspace/git/apache-camel/process-csv?delete=true" />
<process ref="csvToJson" />
<to uri="file:///Users/mkwyc_000/workspace/git/apache-camel/process-complete" />
</route>
<!--
Route to turn CSV into JSON and insert into Mongo
-->
<route id="handleJson">
<from uri="direct:handleJson" />
<log message="Got ${body}" />
<to uri="mongodb:mongoBean?database=example&amp;collection=testCollection&amp;operation=insert" />
</route>
</camelContext>
<!--
Bean instantiation for Mongo Client
-->
<bean id="mongoBean" class="com.mongodb.Mongo">
<constructor-arg name="host" value="127.0.0.1" />
<constructor-arg name="port" value="27017" />
</bean>
<!--
Bean instantiation of CSVToJson
-->
<bean id="csvToJson" class="datadidit.helpful.hints.camel.CSVToJson" >
<property name="producer" ref="myTemplate" />
<constructor-arg type="java.lang.Boolean" value="true"/>
<constructor-arg type="java.lang.String" value="" />
</bean>
</beans>
view raw camelContext.xml hosted with ❤ by GitHub

In the Camel context above there are two routes, csv-to-json and handleJson. The csv-to-json route first uses the file component to read in any data dropped into the directory defined in the 'from' URI, then passes that data to my custom processor, which is defined by this bean:

<bean id="csvToJson" class="datadidit.helpful.hints.camel.CSVToJson" >
<property name="producer" ref="myTemplate" />
<constructor-arg type="java.lang.Boolean" value="true"/>
<constructor-arg type="java.lang.String" value="" />
</bean>
view raw camelContext.xml hosted with ❤ by GitHub

The custom processor reads in the Exchange (the CSV file) and uses Jackson to convert each line of the incoming CSV to JSON. The ProducerTemplate then sends the produced JSON to the handleJson route, where Camel's MongoDB component inserts the data into MongoDB:

<route id="handleJson">
<from uri="direct:handleJson" />
<log message="Got ${body}" />
<to uri="mongodb:mongoBean?database=example&amp;collection=testCollection&amp;operation=insert" />
</route>
view raw camelContext.xml hosted with ❤ by GitHub

The mongo bean is defined here:

<bean id="mongoBean" class="com.mongodb.Mongo">
<constructor-arg name="host" value="127.0.0.1" />
<constructor-arg name="port" value="27017" />
</bean>
view raw camelContext.xml hosted with ❤ by GitHub

That's all the code and XML it takes to get CSV data into MongoDB. You can now load your own CSV files by following these steps:

  1. Update the Camel context URIs to monitor the appropriate directories for your environment. You can also update the MongoDB portion of the route to insert into the appropriate database and collection for your situation.
  2. Start MongoDB; there are multiple ways to get it running, I just used 'mongod --dbpath d:\mongodb-data'.
  3. Run 'mvn camel:run' to get your Camel route started.
  4. Drop your data into the 'process-csv' directory you defined in your Camel context.
  5. Log into MongoDB by typing 'mongo' at the console.
  6. In the console run 'db.<collection name in route>.find()' to see all your data, as shown below.
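
With the route above writing to the testCollection collection in the example database, that check might look like the session below. The fields on each document will mirror your CSV's headers; the values shown here are made up:

[code language=”text”]
$ mongo
> use example
switched to db example
> db.testCollection.find()
{ "_id" : ObjectId("..."), "team" : "WSH", "wins" : 95 }
[/code]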


Learning wOBA

As I continue to learn R and go down the road of becoming a data scientist, I need to learn how to use and compute advanced statistics. The first advanced statistic I'm going to learn to compute is weighted on-base average (wOBA). Weighted on-base average combines all the parts of a player's offensive game and gives each an appropriate weight for its impact on the game. For example, a HR is weighted more heavily than a BB or a single because a HR guarantees a team at least one run, while a BB or single scores a run a much lower percentage of the time. General statistics like BA/SLG/OBP only look at a piece of a player's offensive game, which is why wOBA is a better tool for measuring a player's overall offensive contribution. Fangraphs does a great job of describing the details.

Fangraphs gives a lot of insight into why you should use wOBA. Tom Tango describes the formula in The Book, and his site has the standard formula for wOBA both with and without the speed components. I used the standard formula from his site for my initial calculations with R.
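
For reference, the standard (non-speed) formula that the code below implements is:

[code]
wOBA = (.7*(BB + HBP) + .9*1B + 1.3*(2B + 3B) + 2*HR) / (PA - IBB - SH)
[/code]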

The following R code has the functions I use to calculate wOBA with and without the speed components:

[code language=”r”]
#Link for standard wOBA formula
#http://tangotiger.com/index.php/site/article/standard-woba
woba_calc_basic <- function(row){
  bb <- as.numeric(row["BB"])
  hbp <- as.numeric(row["HBP"])
  doubles <- as.numeric(row["X2B"])
  triples <- as.numeric(row["X3B"])
  hr <- as.numeric(row["HR"])
  hits <- as.numeric(row["H"])
  singles <- hits - triples - doubles - hr
  pa <- as.numeric(row["PA"])
  ibb <- as.numeric(row["IBB"])
  sh <- as.numeric(row["SH"])

  numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr
  denominator <- pa - ibb - sh

  return(numerator/denominator)
}

woba_calc_speed <- function(row){
  bb <- as.numeric(row["BB"])
  hbp <- as.numeric(row["HBP"])
  doubles <- as.numeric(row["X2B"])
  triples <- as.numeric(row["X3B"])
  hr <- as.numeric(row["HR"])
  hits <- as.numeric(row["H"])
  singles <- hits - triples - doubles - hr
  pa <- as.numeric(row["PA"])
  ibb <- as.numeric(row["IBB"])
  sh <- as.numeric(row["SH"])
  sb <- as.numeric(row["SB"])
  cs <- as.numeric(row["CS"])

  numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr + .25*sb - .5*cs
  denominator <- pa - ibb - sh

  return(numerator/denominator)
}
[/code]

Once I had those two formulas in place, I retrieved some Nationals data from Baseball-Reference, then used the code below to apply the two functions to a dataframe.

[code language=”r”]
data = read.csv("data/FG_MLB_Batting_Std_All.csv")

#Apply basic woba
data$woba_basic = apply(data, 1, woba_calc_basic)

#Apply speed woba
data$woba_speed = apply(data, 1, woba_calc_speed)

[/code]

Once I had that in place, I noticed that Fangraphs kept mentioning 'wOBA scale' and how it and wOBA could be used to generate another stat called Weighted Runs Above Average (wRAA). After doing some digging I found an old article by Tom Tango on how to compute yearly wOBA from scratch. Above, I just used some standard weights to compute a standard wOBA, but standard wOBA doesn't take into account the yearly offensive environment players are playing in: some years an offensive action is worth more than others based on how the whole league is performing. Tom Tango's article on computing yearly wOBA teaches you how to get the yearly weights and calculate the wOBA scale. Continuing down this R/wOBA rabbit hole, I transcribed his SQL script into R functions.

To compute the offensive environment (Steps 1 and 2 in Tom Tango's script), I use this function on pitching data:

[code language=”r”]
get_runvalues <- function(pitching_data){
  #Calculate runs per out
  stat <- sum(pitching_data["R"])/(sum(pitching_data["IP"])*3)
  rPerOut <- c(stat)

  #Calculate runs per BB
  stat <- rPerOut + .14
  rPerBB <- c(stat)

  #Runs per HB
  stat <- rPerBB + .025
  rPerHB <- c(stat)

  #Runs per single
  stat <- rPerBB + .155
  rPer1B <- c(stat)

  #Runs per double
  stat <- rPer1B + .3
  rPer2B <- c(stat)

  #Runs per triple
  stat <- rPer2B + .27
  rPer3B <- c(stat)

  #Runs per HR
  rHR <- 1.4

  rSB <- .2

  rCS <- 2*rPerOut + .075

  df <- data.frame(rPerOut, rPerBB, rPerHB, rPer1B, rPer2B, rPer3B, rHR, rSB, rCS)

  return(df)
}
[/code]

Step 3 in the SQL script uses the run values to generate the weights needed to calculate wOBA. The following R functions accomplish this:

[code language=”r”]
get_woba_runsMinus <- function(batting, runValues){
  bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) - sum(batting["IBB"]))
  hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

  singles <- sum(batting["H"]) - sum(batting["X2B"]) - sum(batting["X3B"]) - sum(batting["HR"])
  singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
  doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
  tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
  hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
  sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
  csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

  numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator - csNumerator

  denom <- sum(batting["AB"]) - sum(batting["H"]) + sum(batting["SF"])

  return(numerator/denom)
}

get_woba_runsPlus <- function(batting, runValues){
  #Same numerator as runsMinus; should probably combine them and output a df
  bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) - sum(batting["IBB"]))
  hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

  singles <- sum(batting["H"]) - sum(batting["X2B"]) - sum(batting["X3B"]) - sum(batting["HR"])
  singleNumerator <- as.numeric(runValues["rPer1B"]) * singles
  doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])
  tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])
  hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])
  sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])
  csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

  numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator - csNumerator
  denom <- (sum(batting["BB"]) - sum(batting["IBB"])) + sum(batting["HBP"]) + sum(batting["H"])

  return(numerator/denom)
}

woba_calc_lgAvgSpeed <- function(batting, runValues){
  numerator <- sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"]) + sum(batting["SB"]) - sum(batting["CS"])
  denom <- sum(batting["AB"]) + sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

  return(numerator/denom)
}

woba_calc_lgAvgReg <- function(batting, runValues){
  numerator <- sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"])
  denom <- sum(batting["AB"]) + sum(batting["BB"]) - sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

  return(numerator/denom)
}

get_woba_weights <- function(pitching, batting){
  #Get the run values
  run_values <- get_runvalues(pitching)

  #Compute runs minus and runs plus
  runs_minus <- get_woba_runsMinus(batting, run_values)
  runs_plus <- get_woba_runsPlus(batting, run_values)

  #woba <- woba_calc_lgAvgSpeed(batting, run_values)
  woba <- woba_calc_lgAvgReg(batting, run_values)

  wobaScale <- 1/(runs_plus + runs_minus)

  wobaBB <- (as.numeric(run_values["rPerBB"]) + runs_minus)*wobaScale
  wobaHB <- (as.numeric(run_values["rPerHB"]) + runs_minus)*wobaScale
  woba1B <- (as.numeric(run_values["rPer1B"]) + runs_minus)*wobaScale
  woba2B <- (as.numeric(run_values["rPer2B"]) + runs_minus)*wobaScale
  woba3B <- (as.numeric(run_values["rPer3B"]) + runs_minus)*wobaScale
  wobaHR <- (as.numeric(run_values["rHR"]) + runs_minus)*wobaScale
  wobaSB <- as.numeric(run_values["rSB"])*wobaScale
  wobaCS <- as.numeric(run_values["rCS"])*wobaScale

  df <- data.frame(run_values, runs_minus, runs_plus, woba, wobaScale, wobaBB, wobaHB, woba1B, woba2B, woba3B, wobaHR, wobaSB, wobaCS)

  return(df)
}
[/code]

With all this in place, I needed to test my functions against the table in Tom Tango's article. The data I used for the test is from Fangraphs. The code below calculates the wOBA weights:

[code language=”r”]
fg_batting_data = read.csv("data/FG_MLB_Batting_Std_All.csv")
fg_pitching_data = read.csv("data/FG_MLB_Pitching_Std_All.csv")
min_pa = subset(fg_batting_data, PA > 80)
wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)
wobaWeightsFiltered = get_woba_weights(fg_pitching_data, min_pa)
wobaWeights
[/code]

Here’s the output for the wOBA weights for 2007:

[code]
rPerOut rPerBB rPerHB rPer1B rPer2B rPer3B rHR rSB rCS
1 0.1796024 0.3196024 0.3446024 0.4746024 0.7746024 1.044602 1.4 0.2 0.4342048
runs_minus runs_plus woba wobaScale wobaBB wobaHB woba1B woba2B
1 0.2792543 0.5642809 0.3310523 1.185487 0.7099368 0.739574 0.8936873 1.249333
woba3B wobaHR wobaSB wobaCS
1 1.569415 1.990734 0.2370974 0.5147441
[/code]

These numbers line up with the numbers from Tom Tango's table for 2007 linked above. Here's a side-by-side look:

label      original calc   my calc
year       2007            2007
rPerOut    .179            0.1796024
rPerBB     .32             .3196024
rPerHB     .34             .3446024
rPer1B     .47             .4746024
rPer2B     .77             .7746024
rPer3B     1.04            1.044602
rHR        1.40            1.4
rSB        .2              .2
rCS        .43             .4342048
rMinus     .28             .2792543
rPlus      .56             .5642809
wOBA       .331            .3310523
wobaScale  1.19            1.185487
wobaBB     .71             .7099368
wobaHB     .74             .739574
woba1B     .89             .8936873
woba2B     1.25            1.249333
woba3B     1.57            1.569415
wobaHR     1.99            1.990734
wobaSB     .24             .2370974
wobaCS     .51             .5147441

With some rounding, the weights I produced with the R code above line up with Tom Tango's original chart for 2007. Since I'm now able to calculate wOBA weights from scratch, I'll be able to calculate wRAA and apply wOBA in other offensive environments.

JSON to CSV with Jackson

I needed to convert some JSON output to CSV this week on one of my REST endpoints. Previously on this endpoint I was using Gson to do the conversion, but Gson does not natively support JSON to CSV conversion. So though I love how easy it is for me to type:

[code language=”java”]
Gson gson = new Gson();

gson.toJson("Hello World");
[/code]

and easily retrieve my JSON, I switched over to Jackson because it also supports CSV conversion of my POJOs. When trying to figure this out, the following links came in very handy:

stackoverflow

cowtowncoder

jackson-csv

Since the post from cowtowncoder is a bit old, I figured I'd write a new blog post in case someone else runs into this problem.

[code language=”xml”]
<?xml version="1.0"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"
         xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>datadidit.helpful.hints</groupId>
    <artifactId>parent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </parent>
  <artifactId>json-to-csv</artifactId>
  <name>json-to-csv</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.dataformat</groupId>
      <artifactId>jackson-dataformat-csv</artifactId>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-annotations</artifactId>
    </dependency>
    <dependency>
      <groupId>org.junit.jupiter</groupId>
      <artifactId>junit-jupiter-engine</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.junit.vintage</groupId>
      <artifactId>junit-vintage-engine</artifactId>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.19</version>
        <dependencies>
          <dependency>
            <groupId>org.junit.platform</groupId>
            <artifactId>junit-platform-surefire-provider</artifactId>
            <version>1.0.0-M1</version>
          </dependency>
        </dependencies>
      </plugin>
    </plugins>
  </build>
</project>
[/code]

Note: Also took the time to try out JUnit 5.

Made a simple POJO for converting to JSON and CSV:

[code language=”java”]
package datadidit.helpful.hints.csv;

import java.util.Date;

import com.fasterxml.jackson.annotation.JsonPropertyOrder;

@JsonPropertyOrder({"firstName", "lastName", "dob"})
public class Simple {
    private String firstName;

    private String lastName;

    private Date dob;

    public Simple() {
    }

    public Simple(String firstName, String lastName, Date dob) {
        super();
        this.firstName = firstName;
        this.lastName = lastName;
        this.dob = dob;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    public Date getDob() {
        return dob;
    }

    public void setDob(Date dob) {
        this.dob = dob;
    }
}
[/code]

Then I wrote a unit test to try the code out:

[code language=”java”]
package datadidit.helpful.hints.csv;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class JsonToCsvTest {
    private ObjectMapper mapper;

    private CsvMapper csvMapper;

    @BeforeEach
    public void setup() {
        csvMapper = new CsvMapper();
        mapper = new ObjectMapper();
    }

    @Test
    public void CreateJson() throws JsonProcessingException {
        List<Simple> simpleList = new ArrayList<>();
        simpleList.add(new Simple("hello", "world", new Date()));

        String json = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(simpleList);

        System.out.println(json);

        System.out.println("Now as CSV: ");
        CsvSchema schema = csvMapper.schemaFor(Simple.class).withHeader();
        System.out.println(csvMapper.writer(schema).writeValueAsString(simpleList));
    }
}
[/code]

Output as JSON:

[code language=”text”]

[ {
  "firstName" : "hello",
  "lastName" : "world",
  "dob" : 1468293742743
} ]

[/code]

Output as CSV:

[code language=”text”]

firstName,lastName,dob
hello,world,1468075423437

[/code]
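
Note that dob comes out as epoch milliseconds, which is Jackson's default for java.util.Date. If you'd rather see formatted dates, one option (an assumption on my part, not something the original endpoint needed) is to configure a date format on both mappers, for example in setup():

[code language=”java”]
// Hypothetical tweak: emit Date fields as yyyy-MM-dd strings instead of epoch millis
// (requires an import of java.text.SimpleDateFormat)
mapper.setDateFormat(new SimpleDateFormat("yyyy-MM-dd"));
csvMapper.setDateFormat(new SimpleDateFormat("yyyy-MM-dd"));
[/code]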

Exploratory Data Analysis using Minor League Batting Statistics

Similar to graphically looking at the Nationals' minor league pitching stats, I wanted to do the same with their minor league hitting stats per team. I decided to look at how each Nationals minor league team's OPS compares to its league and level. OPS is a player's OBP added to their SLG, and it measures how well a player is doing offensively once both of those metrics are taken into account.
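
If your export didn't already include an OPS column, it's a one-liner to derive (assuming the dataframe has OBP and SLG columns):

[code language=”r”]
#OPS is just on-base percentage plus slugging percentage
minors_batting$OPS <- minors_batting$OBP + minors_batting$SLG
[/code]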

Since pitchers also bat, I needed to do some data cleaning or the numbers wouldn't make sense. To clean the data I removed all players from the data set that didn't have more than 20 plate appearances this season. The original data set had 3255 data points; after adding that filter I was down to 2384. Here's the layout by level:

Level Data Points
SS-A 273
A 514
A+ 507
AA 518
AAA 572

For each league at each level I wanted to get the average team OPS and compare it to how the Nationals affiliates are doing. You can see those numbers in the table below.

League/Team Level OPS
New York Penn League SS-A .628
Northwest League SS-A .653
Auburn Doubledays SS-A .599
Midwest A .645
South Atlantic A .675
Hagerstown Suns A .736
California A+ .702
Carolina A+ .672
Florida State A+ .653
Potomac Nationals A+ .679
Eastern AA .699
Texas AA .669
Southern AA .678
Harrisburg AA .685
Pacific Coast AAA .727
International AAA .677
Syracuse AAA .631

Note: Data covers the season up to 7/1/2016

Here’s a look at the data graphically:

[Figure: all_ops_2016, league-average and Nationals affiliate OPS by level]

Auburn is a small sample size, so I wouldn't pay too much attention to the short-season portion of the graph just yet. Hagerstown is our best-performing offensive team based on OPS; their team OPS is better than the average OPS for the two leagues at their level (the South Atlantic League and the California League). Overall, Hagerstown (.735) has the second-best team OPS in their league (1st is Asheville at .756) and the third-best at their level (1st is Bowling Green at .764). Harrisburg and Potomac are each performing a little over league average. On the other end of the spectrum from Hagerstown, Syracuse has a bottom-5 team OPS.

Here's a look at only the Nationals minor league affiliates' OPS:

[Figure: wsh_ops, Nationals affiliates' OPS by level]

In my next blog post I'm going to look at two of the catalysts of the Hagerstown offense, Max Schrock and Victor Robles.

Part 2 Source Code:

[code language=”r”]
#Packages used below
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)

#Summarize
#getDfFromDir is a helper defined elsewhere that loads the data directory into a dataframe
minors_batting <- getDfFromDir("dataDir")
#Only use cases with more than 20 PAs
minors_batting <- filter(minors_batting, PA > 20)
minors_batting <- minors_batting[complete.cases(minors_batting),]
summary(minors_batting$Lvl)

minors_tm_ops <- ddply(minors_batting, .(Tm, Lg, Lvl, Franchise), summarise, ops = mean(OPS))
minors_lg_ops <- ddply(minors_batting, .(Lg, Lvl), summarise, ops = mean(OPS))
aff_ops <- ddply(minors_batting[minors_batting$Franchise == "Washington Nationals",], .(Tm, Lvl), summarise, ops = mean(OPS))

lg_melt_data <- melt(minors_lg_ops)
aff_melt_data <- melt(aff_ops)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(aff_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data, aff_melt_data)
total_melt_data

ops_graph <- ggplot(data = total_melt_data, aes(x = lvl, value, fill = lg_tm)) + geom_bar(stat = "identity", position = "dodge") + ggtitle("WSH Minors OPS Per Level")
ops_graph

#Graph of only the Nationals affiliates
nats_ops_graph <- ggplot(data = aff_melt_data, aes(x = lvl, value, fill = lg_tm)) + geom_bar(stat = "identity", position = "dodge") + ggtitle("WSH Minors OPS")
nats_ops_graph
[/code]