
Exploratory Data Analysis of Nationals Minor League Pitching Stats

After attending SSAC this year, I decided that one of the skills I need to pick up is R. Now that I have finally finished grad school, I have the time. The best way for me to learn is to work with data I’m interested in. I look up Nationals minor league statistics daily to see how the team’s prospects are doing, so minor league data made a lot of sense to collect. Exploratory data analysis (getting a look at the data) is a necessary first step before going further in data science. I wanted to see how the Nationals affiliates were doing in comparison to their leagues, so for pitching I decided to compare each team’s ERA with the rest of its league. ERA (earned run average) is the number of earned runs a pitcher or staff allows per nine innings. A low ERA is a good thing: the higher the ERA, the more earned runs a team is giving up per nine innings.
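
To make the ERA definition concrete, here is a minimal sketch of the formula in R. The function name `era` and the sample numbers are made up for illustration and are not part of the analysis further down.

```r
# ERA = earned runs allowed, scaled to a nine-inning game.
era <- function(earned_runs, innings_pitched) {
  9 * earned_runs / innings_pitched
}

# Hypothetical staff: 40 earned runs allowed over 90 innings.
era(40, 90)  # 4
```

The same expression, `(sum(ER)/sum(IP))*9`, shows up in the source code at the end of this post.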

Note: Data covers the season up to 7/1/2016

So how many data points do I actually have per level?

Level Data Points
SS-A 396
A 629
A+ 737
AA 684
AAA 800

For each league at each level, I wanted to get the average team ERA and compare it to how the Nationals affiliate at that level is doing. The table below shows those numbers.

League/Team Level ERA
New York Penn League SS-A 3.46
Northwest League SS-A 3.71
Auburn Doubledays SS-A 3.57
Midwest A 3.55
South Atlantic A 3.79
Hagerstown Suns A 3.73
California A+ 4.15
Carolina A+ 3.80
Florida State A+ 3.49
Potomac Nationals A+ 3.61
Eastern AA 3.91
Texas AA 3.69
Southern AA 3.76
Harrisburg AA 3.72
Pacific Coast AAA 4.50
International AAA 3.65
Syracuse AAA 3.84

To view that table another way, I also put the data into a graph using R.

[Figure: eraPerLevel_7-2 (ERA per level for each league and Nationals affiliate)]

And here’s what the teams look like compared side by side in a graph:

[Figure: wsh_only_era (ERA of the Nationals affiliates only)]

When I originally wrote this post a couple of weeks ago, Harrisburg had the best-performing Nationals staff in comparison to its league. That came as no surprise, since the staff was headlined by two of the Nationals’ top pitching prospects, Lucas Giolito and Reynaldo Lopez.

Now, looking at the data up to 7/1, Potomac and Auburn (small sample size) have the best team ERAs overall. With Auburn, it is probably best to wait until later in the season before making any statements, since its season just started, but currently its staff ERA is not outperforming the league average. Harrisburg and Potomac are both outperforming their respective leagues, and Hagerstown is about even with its league average. Syracuse’s ERA is a bit higher than that of other staffs in the International League.
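
To put numbers on those comparisons, here is a small sketch that subtracts each league’s ERA from its affiliate’s ERA. The values are copied from the table earlier in the post. The league pairing is my assumption based on the league each affiliate played in during 2016, and the column names are made up for this example.

```r
# ERA values copied from the table earlier in the post (season through 7/1/2016).
# The team-to-league pairing is an assumption based on 2016 league membership.
affiliates <- data.frame(
  team       = c("Auburn", "Hagerstown", "Potomac", "Harrisburg", "Syracuse"),
  level      = c("SS-A", "A", "A+", "AA", "AAA"),
  league     = c("New York Penn", "South Atlantic", "Carolina",
                 "Eastern", "International"),
  team_era   = c(3.57, 3.73, 3.61, 3.72, 3.84),
  league_era = c(3.46, 3.79, 3.80, 3.91, 3.65)
)

# A negative diff means the staff allows fewer earned runs than its league.
affiliates$diff <- round(affiliates$team_era - affiliates$league_era, 2)

# Sort from best to worst relative to the league.
affiliates[order(affiliates$diff), c("team", "level", "league", "diff")]
```

Potomac and Harrisburg both come out at -0.19, Hagerstown at -0.06, Auburn at 0.11, and Syracuse at 0.19, which lines up with the read above.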

Part 2. Source Code

The data itself was retrieved in CSV format from Baseball-Reference. Each team’s data was put into a folder and then read into a data frame using the following function:

[code language="r"]
#Read files from a directory into a dataframe
#http://www.r-bloggers.com/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
getDfFromDir <- function(csvDir){
  #full.names=TRUE returns paths that already include csvDir
  fileList <- list.files(csvDir, full.names=TRUE)

  outputFromDfDir <- NULL
  for(file in fileList){
    #rbind(NULL, df) returns df, so the first file needs no special case
    outputFromDfDir <- rbind(outputFromDfDir, read.csv(file))
  }

  return(outputFromDfDir)
}
[/code]
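
As a quick check that a function like this stacks files the way I expect, here is a small sketch that writes two made-up CSV files to a temporary folder and reads them back. The function is repeated in a compact form so the example runs on its own. The folder name, file names, and sample rows are all hypothetical, and the `pattern` argument is my addition to skip non-CSV files.

```r
# Compact stand-in for getDfFromDir so this example is self-contained.
getDfFromDir <- function(csvDir) {
  files <- list.files(csvDir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv))
}

# Hypothetical two-team sample written to a temporary folder.
demo_dir <- file.path(tempdir(), "pitching_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Tm = "Team A", ER = c(10, 5), IP = c(30, 20)),
          file.path(demo_dir, "team_a.csv"), row.names = FALSE)
write.csv(data.frame(Tm = "Team B", ER = 8, IP = 18),
          file.path(demo_dir, "team_b.csv"), row.names = FALSE)

# Both files end up in one data frame with the same columns.
combined <- getDfFromDir(demo_dir)
nrow(combined)  # 3
```

Note that `rbind` needs every file to have the same column names, which holds here because each file uses the same export layout.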

Then to do the graphs and get the data I needed I used the following code:

[code language="r"]
#install.packages("ggplot2")
#install.packages("plyr")
#install.packages("reshape2")
library(ggplot2)
library(plyr) #provides ddply
library(reshape2) #provides melt

#Read in all the minors pitching data from directory
minors_pitching <- getDfFromDir("sport_data/product/all/baseball-reference/2016/csv/pitching/")
#Count rows per level (table works whether Lvl is read as a factor or as character)
table(minors_pitching$Lvl)

#See if you can compute staff era for each team
minors_tm_era <- ddply(minors_pitching,.(Tm,Lg,Lvl,Franchise), summarise, era=(sum(ER)/sum(IP))*9)
#Remove NA rows
minors_tm_era <- minors_tm_era[complete.cases(minors_tm_era),]
minors_tm_era

#Level Data
minors_lvl_era <- ddply(minors_pitching,.(Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_lvl_era <- minors_lvl_era[complete.cases(minors_lvl_era),]
minors_lvl_era

#League Data
minors_Lg_era <- ddply(minors_pitching,.(Lg,Lvl),summarise, era=(sum(ER)/sum(IP))*9)
minors_Lg_era <- minors_Lg_era[complete.cases(minors_Lg_era),]
minors_Lg_era

wsh_tm_era <- ddply(minors_pitching[minors_pitching$Franchise=="Washington Nationals",],.(Tm,Lvl), summarise, era=(sum(ER)/sum(IP))*9)
wsh_tm_era

lg_melt_data <- melt(minors_Lg_era)
team_melt_data <- melt(wsh_tm_era)

#Rename vars so you can bind
colnames(lg_melt_data) <- c("lg_tm", "lvl", "variable", "value")
colnames(team_melt_data) <- c("lg_tm", "lvl", "variable", "value")
total_melt_data <- rbind(lg_melt_data,team_melt_data)
total_melt_data

#Graph for era
era_graph <- ggplot(data=total_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA Per Level")
era_graph

#Graph of only the nationals
nats_era_graph <- ggplot(data=team_melt_data, aes(x=lvl, y=value, fill=lg_tm)) + geom_bar(stat="identity", position="dodge") + ggtitle("WSH Minors ERA")
nats_era_graph
[/code]
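
One caveat with `sum(IP)`: baseball box scores usually write partial innings as .1 and .2, meaning one and two outs (thirds of an inning) and not tenths. Assuming the exported `IP` column follows that convention, which I have not verified against the files, summing it directly slightly understates innings. Here is a minimal sketch of a conversion. The helper name `ip_to_decimal` is made up for this example.

```r
# Convert box-score innings (6.1 = 6 1/3, 6.2 = 6 2/3) to true decimal innings.
# Assumes the fractional digit is only ever 0, 1, or 2.
ip_to_decimal <- function(ip) {
  whole <- floor(ip)
  outs  <- round((ip - whole) * 10)  # outs recorded in the partial inning
  whole + outs / 3
}

ip_to_decimal(c(6, 6.1, 6.2))  # 6.000000 6.333333 6.666667
```

If the assumption holds, converting with `minors_pitching$IP <- ip_to_decimal(minors_pitching$IP)` before the `ddply` calls would tighten the ERA numbers a little. The effect is small, so I would not expect it to change the comparisons above.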