As I continue to learn R and go down the road of becoming a data scientist. I need to learn how to use and compute advanced statistics. The first advanced analytic I’m going to learn how to compute is weighted on-base average(wOBA). Weighted on-base average combines all the parts of a players offensive game and gives them all appropriate weights for their impact on the game. For example, a HR is given more weight than a BB or a Single because a HR is guarantees a team atleast one run while a BB/Single only scores a run a much lower percentage of the time. General statistics like BA/SLG/OBP only look at a piece of a players offensive game which is why wOBA is a better tool for looking at a players offensive contribution. Fangraphs does a great job of describing the details in the following links:

The links above give a lot of insight on why you should use wOBA. Tom Tango in The Book describes the formula. He also has the standard formula for wOBA that includes and doesn’t include speed on his site . I used the standard formula from his site for my initial calculations with R.

The following R code has the functions I use to calculate speed and non speed wOBA:

[code language=”r”]

#Link for standard WOBA formula

#http://tangotiger.com/index.php/site/article/standard-woba

woba_calc_basic <- function(row){

bb <- as.numeric(row["BB"])

hbp <- as.numeric(row["HBP"])

doubles <- as.numeric(row["X2B"])

triples <- as.numeric(row["X3B"])

hr <- as.numeric(row["HR"])

hits <- as.numeric(row["H"])

singles <- hits – triples – doubles – hr

pa <- as.numeric(row["PA"])

ibb <- as.numeric(row["IBB"])

sh <- as.numeric(row["SH"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr

denominator <- pa – ibb – sh

return(numerator/denominator)

}

woba_calc_speed <- function(row){

bb <- as.numeric(row["BB"])

hbp <- as.numeric(row["HBP"])

doubles <- as.numeric(row["X2B"])

triples <- as.numeric(row["X3B"])

hr <- as.numeric(row["HR"])

hits <- as.numeric(row["H"])

singles <- hits – triples – doubles – hr

pa <- as.numeric(row["PA"])

ibb <- as.numeric(row["IBB"])

sh <- as.numeric(row["SH"])

sb <- as.numeric(row["SB"])

cs <- as.numeric(row["CS"])

numerator <- .7*(bb+hbp) + .9*(singles) + 1.3*(doubles + triples) + 2*hr + .25*sb + -.5*cs

denominator <- pa – ibb – sh

return(numerator/denominator)

}

[/code]

Once I had those two formula’s in place I retrieved some data from Nationals data from baseball-reference . Then used the code below to apply the two functions to a dataframe.

[code language=”r”]

data = read.csv("data/FG_MLB_Batting_Std_All.csv")

#Apply basic woba

data$woba_basic = apply(data, 1, woba_calc_basic)

#Apply speed woba

data$woba_speed = apply(data, 1, woba_calc_speed)

[/code]

Once I had that in place I noticed that Fangraphs kept mentioning ‘wOBA scale’ and how it and wOBA could be used to generate another stat called Weighted Runs Above Average . After doing some digging I found an old article written by Tom Tango on how to compute yearly wOBA from scratch . Above I’ve just used some standard weights to compute a standard wOBA. But standard wOBA doesn’t take into account the yearly offensive environment that players are playing in. Basically some years an offensive action is worth more than others based upon how the whole league is performing. Tom Tango’s article on computing yearly wOBA teaches you how to get the yearly weights and calculate wOBA scale. Continuing down this R, wOBA rabbit hole I transcribe his SQL script into R functions.

To compute the offensive environment(Step 1 and Step 2) in Tom Tangos script I use this function on pitching data:

[code language=”r”]

get_runvalues <- function(pitching_data){

#Calculate runs per out

stat <- sum(pitching_data["R"])/(sum(pitching_data["IP"])*3)

rPerOut <- c(stat)

#Calculate runs per bb

stat <- rPerOut + .14

rPerBB <- c(stat)

#Runs per HB

stat <- rPerBB + .025

rPerHB <- c(stat)

#Runs Per Single

stat <- rPerBB + .155

rPer1B <- c(stat)

#Runs Per Double

stat <- rPer1B + .3

rPer2B <- c(stat)

#Runs Per Triple

stat <- rPer2B + .27

rPer3B <- c(stat)

#Runs per HR

rHR <- 1.4

rSB <- .2

rCS <- 2*rPerOut + .075

df <- data.frame(rPerOut, rPerBB, rPerHB, rPer1B, rPer2B, rPer3B, rHR, rSB, rCS)

return(df)

}

[/code]

Step 3 in the SQL script pulls uses the run values to generate the necessary weights to calculate wOBA. The following R methods accomplish this:

[code language=”r”]

get_woba_runsMinus <- function(batting, runValues){

bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) – sum(batting["IBB"]))

hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

singles <- sum(batting["H"]) – sum(batting["X2B"]) – sum(batting["X3B"]) – sum(batting["HR"])

singleNumerator <- as.numeric(runValues["rPer1B"]) * singles

doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])

tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])

hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])

sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])

csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator – csNumerator

#print(paste("numerator is, ", numerator))

denom <- sum(batting["AB"]) – sum(batting["H"]) + sum(batting["SF"])

#print(paste("denominator is, ", denom))

return(numerator/denom)

}

get_woba_runsPlus <- function(batting, runValues){

#Same as runMinus should probable combine and output df

bbNumerator <- as.numeric(runValues["rPerBB"]) * (sum(batting["BB"]) – sum(batting["IBB"]))

hbpNumerator <- as.numeric(runValues["rPerHB"]) * sum(batting["HBP"])

singles <- sum(batting["H"]) – sum(batting["X2B"]) – sum(batting["X3B"]) – sum(batting["HR"])

singleNumerator <- as.numeric(runValues["rPer1B"]) * singles

doubleNumerator <- as.numeric(runValues["rPer2B"]) * sum(batting["X2B"])

tripleNumerator <- as.numeric(runValues["rPer3B"]) * sum(batting["X3B"])

hrNumerator <- as.numeric(runValues["rHR"]) * sum(batting["HR"])

sbNumerator <- as.numeric(runValues["rSB"]) * sum(batting["SB"])

csNumerator <- as.numeric(runValues["rCS"]) * sum(batting["CS"])

numerator <- bbNumerator + hbpNumerator + singleNumerator + doubleNumerator + tripleNumerator + hrNumerator + sbNumerator – csNumerator

denom <- (sum(batting["BB"]) – sum(batting["IBB"])) + sum(batting["HBP"]) + sum(batting["H"])

return(numerator/denom)

}

woba_calc_lgAvgSpeed <- function(batting, runValues){

numerator <- sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"]) + sum(batting["SB"]) – sum(batting["CS"])

denom <- sum(batting["AB"]) + sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

return(numerator/denom)

}

woba_calc_lgAvgReg <- function(batting, runValues){

numerator <- sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["H"])

denom <- sum(batting["AB"]) + sum(batting["BB"]) – sum(batting["IBB"]) + sum(batting["HBP"]) + sum(batting["SF"])

return(numerator/denom)

}

get_woba_weights <- function(pitching, batting){

#Get the run values

run_values <- get_runvalues(pitching)

#Compute runs minus

runs_minus <- get_woba_runsMinus(batting, run_values)

runs_plus <- get_woba_runsPlus(batting, run_values)

#woba <- woba_calc_lgAvgSpeed(batting, run_values)

woba <- woba_calc_lgAvgReg(batting, run_values)

wobaScale <- 1/(runs_plus + runs_minus)

wobaBB <- (as.numeric(run_values["rPerBB"]) + runs_minus)*wobaScale

wobaHB <- (as.numeric(run_values["rPerHB"]) + runs_minus)*wobaScale

woba1B <- (as.numeric(run_values["rPer1B"]) + runs_minus)*wobaScale

woba2B <- (as.numeric(run_values["rPer2B"]) + runs_minus)*wobaScale

woba3B <- (as.numeric(run_values["rPer3B"]) + runs_minus)*wobaScale

wobaHR <- (as.numeric(run_values["rHR"]) + runs_minus)*wobaScale

wobaSB <- as.numeric(run_values["rSB"])*wobaScale

wobaCS <- as.numeric(run_values["rCS"])*wobaScale

df <- data.frame(run_values, runs_minus, runs_plus, woba, wobaScale, wobaBB, wobaHB, woba1B, woba3B, wobaHR, wobaSB, wobaCS)

return(df)

}

[/code]

With all this in place I now needed to test my functions against the table he has here . The data I ended up using for the test is from Fangraphs . The code below calculates the wOBA weights:

[code language=”r”]

fg_batting_data = read.csv("data/FG_MLB_Batting_Std_All.csv")

fg_pitching_data = read.csv("data/FG_MLB_Pitching_Std_All.csv")

min_pa = subset(fg_batting_data, PA > 80)

wobaWeights = get_woba_weights(fg_pitching_data, fg_batting_data)

wobaWeightsFiltered = get_woba_weights(fg_pitching_data, min_pa)

wobaWeights

[/code]

Here’s the output for the wOBA weights for 2007:

[code]

rPerOut rPerBB rPerHB rPer1B rPer2B rPer3B rHR rSB rCS

1 0.1796024 0.3196024 0.3446024 0.4746024 0.7746024 1.044602 1.4 0.2 0.4342048

runs_minus runs_plus woba wobaScale wobaBB wobaHB woba1B woba2B

1 0.2792543 0.5642809 0.3310523 1.185487 0.7099368 0.739574 0.8936873 1.249333

woba3B wobaHR wobaSB wobaCS

1 1.569415 1.990734 0.2370974 0.5147441

[/code]

These numbers line up with the numbers from Tom Tangos table for 2007 I linked above. Here’s a look at what the side by side numbers:

label |
original calc |
my calc |

year |
2007 |
2007 |

rPerOut |
.179 |
0.1796024 |

rPerBB |
.32 |
.3196024 |

rPerHB |
.34 |
.3446024 |

rPer1B |
.47 |
.4746024 |

rPer2B |
.77 |
.7746024 |

rPer3B |
1.04 |
1.044602 |

rHR |
1.40 |
1.4 |

rSB |
.2 |
.2 |

rCS |
.43 |
.4342048 |

rMinus |
.28 |
.2792543 |

rPlus |
.56 |
.5642809 |

wOBA |
.331 |
.3310523 |

wobaScale |
1.19 |
1.185487 |

wobaBB |
.71 |
.7099368 |

wobaHB |
.74 |
.739574 |

woba1B |
.89 |
.8936873 |

woba2B |
1.25 |
1.249333 |

woba3B |
1.57 |
1.569415 |

wobaHR |
1.99 |
1.990734 |

wobaSB |
.24 |
.2370974 |

wobaCS |
.51 |
.5147441 |

With some simplification the weights I produced with the R code above would line up with Tom Tango’s original chart for 2007. Since I’m now able to calculate wOBA weights from scratch I’ll be able to calculate wRAA and apply wOBA in other offensive environments.