Just like baseball … never assume :-þ

In the most recent simulation results, the total runs scored by all teams added up to 17572. At the time, I wondered if this might be an issue because the actual total in any MLB season is about 20000. I reasoned that the simulation total might not be a concern because: a) it’s a simulation, not a duplication, and b) the only pitchers involved are starters, so fewer runs might be expected given that relievers (presumably inferior to starters, otherwise they’d be starting) were not being used … although a shortfall of about 2500 runs did seem like a lot.

However, in trying to move on from simulating a season to asking questions about batting order optimization, I realized that there might be a flaw in the original programming.  I reviewed the functions that perform the nine-inning simulations and realized that there was a problem with the logic used to set the leadoff batter and subsequent batters of any next inning. The functions have been changed and I believe they are now performing as they should.

I’ve rerun the season simulation and now the runs total 18695; still short of 20000, but less of a worry … at least for now.

[table id=5 /]


Now it’s back to asking questions about actual batting order … which is a slow process because there are 362,880 (9!) possible orderings of nine batters. It takes my desktop about 37 hours to analyze all of them for one team. Part of the problem is the sheer number of orderings, but another issue is that each order must be simulated at least 10,000 times to reduce the error to an acceptable level. So even at about 25,000 innings per second, it takes a while. I need more power!
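A quick back-of-the-envelope check of those numbers in R. This is only a sketch: it assumes that each of the 10,000 simulations per order is a single simulated inning, which is my reading of the figures rather than something stated above.

```r
# Number of distinct batting orders for nine players (9!)
n_orders <- factorial(9)

# Figures quoted in the post
sims_per_order  <- 10000   # simulations needed per batting order
innings_per_sec <- 25000   # approximate simulator throughput

# ASSUMPTION: one simulation = one simulated inning
total_sims <- n_orders * sims_per_order
hours <- total_sims / innings_per_sec / 3600

cat("orders:", n_orders, "\n")
cat("estimated hours:", round(hours, 1), "\n")
```

Under that assumption the estimate comes out at roughly 40 hours, which is in the same range as the 37 hours observed.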

I’ve completed a trial run at simulating the optimal batting order for the Tigers. Their current order is projected to be: Kinsler, Davis, Cabrera, VMart, Cespedes, JMart, Castellanos, Avila, Iglesias … but simulation says it should be: Kinsler, Cabrera, Castellanos, VMart, Cespedes, JMart, Davis, Avila, Iglesias. In other words, Davis moves from 2 to 7, Castellanos from 7 to 3, and Cabrera from 3 to 2.  Plugging these two lineups into the nine-inning simulation suggests that the new order is about 20 runs better … or about 2 wins.


Baseball Simulation Redux

The underlying assumption of my first attempt at simulation was that a player will perform in 2015 as he did in 2014. This was quite reasonable in order to get the simulator working, but it’s a bit of a stretch in reality to say the least. I decided to generate transition-state probabilities based on the events of the past three seasons as a better (i.e. more reliable) source of information.

I combined the Retrosheet event files for 2012, 2013, and 2014 and reran the simulation. Think of the new transition-state probabilities as the result of one big season that was three times as long.
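Here is a minimal sketch of the pooling idea, not the actual script. The three tiny data frames and the column names 'state' and 'new_state' are hypothetical stand-ins for the parsed Retrosheet event files.

```r
# Toy stand-ins for three seasons of play-by-play data.
# ASSUMPTION: one row per plate appearance, with hypothetical columns
# 'state' (base-out state before) and 'new_state' (base-out state after).
season_2012 <- data.frame(state     = c("000 0", "000 0", "100 0"),
                          new_state = c("000 1", "100 0", "100 1"))
season_2013 <- data.frame(state     = c("000 0", "100 0", "100 0"),
                          new_state = c("000 1", "110 0", "100 1"))
season_2014 <- data.frame(state     = c("000 0", "000 0", "100 0"),
                          new_state = c("100 0", "000 1", "000 2"))

# Stack the seasons into "one big season"
pooled <- rbind(season_2012, season_2013, season_2014)

# Count transitions, then convert each row to probabilities
counts <- table(pooled$state, pooled$new_state)
probs  <- prop.table(counts, margin = 1)

print(round(probs, 2))
```

Each row of 'probs' sums to 1, so a row can be used directly as the sampling distribution for the next state.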

Here are the results:

[table id=3 /]


These seem to be quite reasonable. I like these better simply because they are based on the analysis of three times as much data. On the other hand (as confirmation bias rears its ugly head), I ‘like’ the suggestion that the Cubs might win ~81 a lot better than ~91 :-þ

Once again, these results are for the lineups of early January, and many changes have occurred over the past couple of weeks. Next on the list will be to update the lineups and rerun the simulation.

Parallel Processing with R

I realized that if I could simulate game outcomes from batting orders, I ought to be able to simulate the runs produced by different batting orders.

The first thing you find out when you start doing this is that you need more power. Preliminary testing showed that my desktop (2 x 2.7 GHz, 6GB RAM) was going to take about three days to test the 362,880 combinations of a nine-player batting order. One of the major considerations is that it takes 10,000 simulations of each lineup to get to a tolerable level of reproducibility. It’s quite likely that I could improve speed with better R scripting … but that’s probably down the road a bit, as I am learning R on the fly (I have only been using it for 3 months).

In the meantime, to make lineup simulation a little more practical, I have had to figure out how to get R to run using both processors.

Here’s how I do it:

1. Installed the ‘foreach’, ‘doParallel’, and ‘iterators’ packages (the ‘parallel’ package already ships with R).
2. Added these lines to the start of any R script for which I want to use both processors:

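library(doParallel)      # also attaches foreach, iterators, and parallel
cl <- makeCluster(2)     # start two worker processes
registerDoParallel(cl)   # tell foreach to use them for %dopar%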

3. Changed from running a for loop to using foreach. In other words, changed from:

for (i in 1:combos) {
  # program lines
}

to:

results_lineups <- foreach ( i = 1:combos ) %dopar% {
  # program lines
}

4. Added this line to the end of the script:

stopCluster(cl)

5. I had added a progress bar to the original script, but that doesn’t work in ‘foreach’. The answer begins with this post.

The code from that post makes a text file with an entry for each foreach iteration. In my case, this was a text file with ~362,880 lines. That’s a bit much and so I’ve changed it to this:

results_lineups <- foreach ( i = 1:combos ) %dopar% {

  # Every 100th iteration, append i to the log file
  if (i %% 100 == 0) {
    sink("log.txt", append=TRUE)
    cat(paste(i,","))
    sink()
  }

  # … program lines …

}

What this does is divide i by 100 and, if the remainder is 0, print i to the log file. In other words, it writes to the file every 100 passes. It also writes a stream of numbers separated by commas rather than putting each on a separate line. Now the log file is only about 30KB for 362,880 iterations. All I have to do is open the log file during the run to see how far along things are.
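Putting the steps together, a small self-contained version might look like the sketch below. The function simulate_lineup() is a hypothetical stand-in for the real lineup simulation, the log goes to a temporary file, and it uses cat() with its file argument rather than sink(), which is my substitution and not what the script above does.

```r
library(doParallel)   # attaches foreach, iterators, and parallel

# HYPOTHETICAL stand-in for the real lineup simulation:
# returns a made-up "average runs" figure for combination i
simulate_lineup <- function(i) {
  mean(rnorm(100, mean = 4.5))
}

combos   <- 1000                          # small number for illustration
log_file <- tempfile(fileext = ".txt")    # illustrative log location

cl <- makeCluster(2)       # two worker processes
registerDoParallel(cl)

results_lineups <- foreach(i = 1:combos) %dopar% {
  if (i %% 100 == 0) {
    # append i to the log every 100 iterations
    cat(i, ",", file = log_file, append = TRUE)
  }
  simulate_lineup(i)       # last expression is the value returned for i
}

stopCluster(cl)

length(results_lineups)    # one list element per combination
```

foreach returns a list by default, so results_lineups has one entry per value of i, in order, no matter which worker handled it.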

Processing time for the nine-inning simulator is now under 5 hours for 1000 seasons and for the lineup simulator, it’s about 37 hours  … and I’m working on a presentation for my wife to convince her that more computer power is needed :-þ

Results to follow …

Baseball Simulation

Okay, so I’m going to give blogging a whirl. Bear with me while I figure a few things out.

I’ve been working for the last three months on developing a baseball simulator using Chapter 9 of Analyzing Baseball Data with R as the starting point. The simulator uses transition probabilities (i.e. the probability of moving from one transition state to another) to generate an estimate of run production.   The simulation uses the projected batting lineups and pitching rotations found at Rotochamp and the results presented here are from the lineups of early January (which have changed considerably since).

I wrote a Perl script to generate a csv file that lists all 2430 games with a home team, away team, home team pitcher, and away team pitcher. The pitchers were used throughout the season on a five-man rotation. One of the main reasons for writing this script was to have a means to generate an updated file quickly whenever lineup changes occur.

The transition-state probabilities were prepared from data obtained free of charge and copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.

I started with batters only. Once this was working, I realized it would be better to consider each batter against a particular pitcher. The combined probability of a transition is the batter’s probability (PB) multiplied by the pitcher’s probability (PP), divided by the overall MLB probability of that transition (PMLB):

(PB * PP) / PMLB
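As a rough illustration of that formula, here it is applied to three possible transitions out of a single state. All of the probabilities are made up, and the final rescaling step is my assumption: the formula does not guarantee values that sum to 1, so something like it is needed before sampling.

```r
# HYPOTHETICAL probabilities of three transitions from one state
p_batter  <- c(out = 0.65, single = 0.25, homer = 0.10)   # PB
p_pitcher <- c(out = 0.72, single = 0.22, homer = 0.06)   # PP
p_mlb     <- c(out = 0.68, single = 0.24, homer = 0.08)   # PMLB

# (PB * PP) / PMLB, element by element
combined <- (p_batter * p_pitcher) / p_mlb

# ASSUMPTION: rescale so the probabilities sum to 1 before sampling
combined_norm <- combined / sum(combined)

print(round(combined_norm, 3))
```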

From there, I added in consideration of the park where the simulated game takes place.

The simulator was based on the assumption that each player would perform in 2015 as he did in 2014. Not a bad place to start, but not the best place either.

Other considerations:

  1. Batting lineup is unchanged throughout the game.
  2. Pitcher remains the same throughout the game.
  3. Pitching rotations are unchanged throughout the season.
  4. Each ‘game’ is nine simulated innings where the batter who leads off any inning is the one who was on deck at the end of the previous inning.
  5. Home team is given an advantage of 0.1 runs per game.  In effect, this enabled the breaking of ties at the end of the nine innings.
  6. The home team does not bat in the bottom of the ninth if they are ahead.
  7. Results are the average of 1000 simulated seasons.
  8. Players who did not play in 2014 have no Retrosheet ID and no transition-state probabilities.  I substituted an average player for them, but down the road would prefer to substitute a replacement-level player … and even replacement-level by position.
  9. If the game was in an NL park, the pitcher batted.
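Consideration 4 is the easiest one to get wrong. The sketch below shows one way to carry the batting order from inning to inning. The function name and arguments are hypothetical and this is not the simulator’s actual code.

```r
# Given the lineup slot (1-9) that led off this inning and the number of
# batters who came to the plate, return the slot that leads off the next
# inning (i.e. the batter who was on deck when the third out was made).
next_leadoff <- function(leadoff, batters_faced) {
  ((leadoff - 1 + batters_faced) %% 9) + 1
}

next_leadoff(1, 3)   # slots 1, 2, 3 batted, so slot 4 leads off next
next_leadoff(8, 4)   # slots 8, 9, 1, 2 batted, so slot 3 leads off next
```

The modulo wraps the order around from the ninth slot back to the first.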

The results seem reasonable (at least Cubs fans might think so) :

[table id=2 /]

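where xWins is the expected number of wins over the 162-game season, based on the winning percentage estimated from runs for (rf) and runs against (ra):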

( rf ^ 1.82 ) / ( rf ^ 1.82 + ra ^ 1.82 )
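In R this could be written as the small function below. The expression itself gives a winning percentage, so the sketch assumes it is multiplied by a 162-game schedule to arrive at xWins. The function name, argument names, and run totals are just for illustration.

```r
# Expected wins from runs for (rf) and runs against (ra).
# ASSUMPTION: the winning percentage is scaled by a 162-game season.
x_wins <- function(rf, ra, games = 162, exponent = 1.82) {
  games * rf^exponent / (rf^exponent + ra^exponent)
}

x_wins(700, 700)   # equal runs for and against: 81 wins
x_wins(750, 650)   # outscoring opponents pushes xWins above 81
```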