Pitchers who change their throwing hand during a game also drive me crazy!

It’s back to Hack28.

I’ve been working on this for some time.  I wanted to be able to generate an events list throughout the season using the data from MLB’s Gameday XML files – instead of having to wait for Retrosheet at the end of the season.

I’ve managed to get results that are very close to agreeing with Retrosheet (gotta figure that they’re right and mine are wrong) but had a difference of 62 events out of a total of ~189700.  So I made a massive spreadsheet (over 7 million cells) of the two datasets side-by-side and looked for lines that were different by totalling up the playerIDs of the baserunners.

The way to make the script generate results similar to Retrosheet is to change the atbat number in the XML file.  But this still left me with 62 additional events.  Turns out that every time Pat Venditte changed his throwing hand, it was recorded as an event.

Now, I have to either edit my results or add a line to my script to treat this special case.

I did it the hard way first and I’m here to tell you that Pat Venditte switched his throwing hand 62 times and I’ve found and deleted every damn one of them!  aarrgghh :-þ

But now I’ve got an event list that has the same total number of events as Retrosheet. … and all I have to do is add a couple of lines to the script for next season’s data collection.

Simulation 2016 … try, try again :-þ

Simulation 2016
First crack at simulating the 2016 MLB season.  Last year’s results weren’t very good.   Actually, they were bad …. since the RMSE was greater than that from guessing that every team would go 81-81. So, there’s room for improvement :-þ
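For anyone who wants to see what that comparison looks like, here is a minimal sketch in R. The win totals are made up purely for illustration (they are not my results), and the function name `rmse` is just my choice.

```r
# Root-mean-square error between predicted and actual win totals.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Made-up numbers for illustration only (NOT real results).
actual    <- c(95, 88, 81, 74, 67)
predicted <- c(80, 92, 70, 85, 78)

rmse_sim      <- rmse(predicted, actual)
rmse_baseline <- rmse(rep(81, length(actual)), actual)

# A simulation is only useful if it beats the "everyone goes 81-81" guess.
rmse_sim < rmse_baseline
```

With these toy numbers the simulation loses to the 81-81 guess, which is the situation described above.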

For 2016, I’ve changed how the transition state probabilities are calculated by tweaking the pseudocounts, and added in consideration of how each team plays at home versus on the road.  Right now, that means how they performed in 2015 at home and on the road … which of course doesn’t mean they’ll do the same this season.  I’m working on how to add regression into the equation to make a better estimate of how each team might perform at home and on the road _this_ season.
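To give an idea of what a pseudocount does, here is a rough sketch of additive smoothing for one set of transition counts. This is a generic version, not necessarily the exact scheme in my script: the function name `smooth_probs`, the weight `k`, and all of the numbers are assumptions for illustration.

```r
# Additive (pseudocount) smoothing of observed transition counts.
# counts: observed counts for each possible transition
# prior:  league-wide probabilities for the same transitions (sums to 1)
# k:      pseudocount weight, i.e. how many "league-average" events to add
smooth_probs <- function(counts, prior, k) {
  (counts + k * prior) / (sum(counts) + k)
}

# Hypothetical small sample for one player and a hypothetical league prior.
counts <- c(out = 6, single = 3, walk = 1, homer = 0)
prior  <- c(out = 0.68, single = 0.16, walk = 0.12, homer = 0.04)

p <- smooth_probs(counts, prior, k = 10)
p
```

The point of the pseudocounts is that a transition never seen in a small sample (the homer here) still gets a small non-zero probability, and the estimate is pulled toward the league average more strongly when the sample is small.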

Lineups and pitching rotations used for the simulation are those posted at rotochamp.

AL East                          NL East
team  wins   rf   ra  xWins      team  wins   rf   ra  xWins
tor     93  811  686     93      nyn     88  665  596     89
nya     84  701  688     82      was     88  670  608     88
bos     79  679  689     80      mia     80  599  623     78
bal     79  676  699     79      phi     69  560  657     69
tba     79  646  665     79      atl     69  541  640     69

AL Central                       NL Central
team  wins   rf   ra  xWins      team  wins   rf   ra  xWins
kca     85  682  661     83      sln     92  714  624     91
cle     85  679  658     83      chn     87  685  634     87
min     80  667  677     80      pit     87  691  635     87
cha     75  624  660     77      mil     75  617  671     75
det     74  622  673     75      cin     72  586  664     72

AL West                          NL West
team  wins   rf   ra  xWins      team  wins   rf   ra  xWins
hou     89  727  664     88      lan     89  687  626     88
tex     83  679  680     81      sfn     87  679  628     87
oak     80  635  659     78      ari     81  669  670     81
ana     79  638  658     79      sdn     76  597  646     75
sea     75  618  641     78      col     73  665  730     74

Are the Blue Jays the beasts of the AL East?

The Jays have been up and down this year but are now riding a six-game win streak which has brought them back to within one game of .500 – although they are still four games back of the Yankees.

However, the Jays are now +53 in run differential.  This, and the fact that they have a sorry 4-12 record in one-run games, suggests that they could have had a much better record at this point.  The run differential suggests it could be about 34-25 instead of 29-30.

Their rating has also been climbing recently.  Who knows?  Maybe they’re putting it together … finally.



Just like baseball … never assume :-þ

In the most recent simulation results, the total runs scored by all teams added up to 17572. At the time, I wondered if this might be an issue because the actual total in any MLB season is about 20000. I reasoned that the simulation total might not be a concern because: a) it’s a simulation, not a duplication, and b) the only pitchers involved are starters and so fewer runs might be expected given that relievers (presumably inferior to starters, otherwise they’d be starting) were not being used … although a shortfall of about 2500 runs did seem like a lot.

However, in trying to move on from simulating a season to asking questions about batting order optimization, I realized that there might be a flaw in the original programming.  I reviewed the functions that perform the nine-inning simulations and realized that there was a problem with the logic used to set the leadoff batter and subsequent batters of any next inning. The functions have been changed and I believe they are now performing as they should.

I’ve rerun the season simulation and now the runs total 18695; still short of 20000, but less of a worry … at least for now.

[Results tables by division: AL East, NL East, AL Central, NL Central, AL West, NL West]


Now it’s back to asking questions about actual batting order … which is a slow process because of the 362,880 possible combinations for nine batters.  It takes my desktop about 37 hours to analyze all possible combinations for a team.  Part of the problem is that there are so many possible combinations, but another issue is that each order must be simulated at least 10,000 times to generate a result that reduces error to a meaningful level.  So even at about 25,000 innings per second, it takes a while.  I need more power!
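The arithmetic behind those numbers can be checked in a couple of lines of R. This is only a back-of-the-envelope sketch, and it assumes that one "simulation" of a batting order costs about one simulated inning, which is my simplification for the estimate.

```r
n_orders        <- factorial(9)   # possible orders of nine batters
sims_per_order  <- 10000          # simulations per order (from the text)
innings_per_sec <- 25000          # rough simulator speed (from the text)

# Assumption: each simulation costs roughly one simulated inning.
hours <- n_orders * sims_per_order / innings_per_sec / 3600

n_orders
hours
```

That works out to roughly 40 hours, which is in the same ballpark as the 37 hours or so that a full run actually takes.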

I’ve completed a trial run at simulating the optimal batting order for the Tigers. Their current order is projected to be: Kinsler, Davis, Cabrera, VMart, Cespedes, JMart, Castellanos, Avila, Iglesias … but simulation says it should be: Kinsler, Cabrera, Castellanos, VMart, Cespedes, JMart, Davis, Avila, Iglesias. In other words, Davis moves from 2 to 7, Castellanos from 7 to 3, and Cabrera from 3 to 2.  Plugging these two lineups into the nine-inning simulation suggests that the new order is about 20 runs better … or about 2 wins.


Baseball Simulation: Is there life after Max?

Maybe so.

There has been plenty of player movement over the last few weeks culminating with the Nats signing Max for megabucks.  I have updated the batting and pitching lineup information and rerun the simulation.  Interestingly, neither the Tigers nor the Nats changed much.

[Results tables by division: AL East, NL East, AL Central, NL Central, AL West, NL West]

Baseball Simulation Redux

The underlying assumption of my first attempt at simulation was that a player will perform in 2015 as he did in 2014. This was quite reasonable in order to get the simulator working, but it’s a bit of a stretch in reality to say the least. I decided to generate transition-state probabilities based on the events of the past three seasons as a better (i.e. more reliable) source of information.

I combined the Retrosheet event files for 2012, 2013, and 2014 and reran the simulation. Think of the new transition-state probabilities as the result of one big season that was three times as long.
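The "one big season" idea amounts to adding up the transition counts before turning them into probabilities. Here is a toy sketch of that, with made-up counts and placeholder state names; the real version works on the full set of transition states built from the Retrosheet event files.

```r
# Placeholder state names and made-up counts, for illustration only.
from <- c("state A", "state B")
to   <- c("state X", "state Y", "state Z")

make_counts <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE, dimnames = list(from, to))
}

seasons <- list(
  "2012" = make_counts(c(10, 30, 60,  5, 25, 70)),
  "2013" = make_counts(c(12, 28, 60,  7, 23, 70)),
  "2014" = make_counts(c( 8, 32, 60,  6, 24, 70))
)

# Pool the counts first, then convert each row to probabilities.
pooled <- Reduce(`+`, seasons)
probs  <- prop.table(pooled, margin = 1)
probs
```

Pooling the counts (rather than averaging three sets of probabilities) means a season with more events automatically carries more weight.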

Here are the results:

[Results tables by division: AL East, NL East, AL Central, NL Central, AL West, NL West]


These seem to be quite reasonable. I like these better simply because they are based on the analysis of three times as much data. On the other hand (as confirmation bias rears its ugly head), I ‘like’ the suggestion that the Cubs might win ~81 a lot better than ~91 :-þ

Once again, these results are for the lineups of early January, and many changes have occurred over the past couple of weeks. Next on the list will be to update the lineups and rerun the simulation.

Parallel Processing with R

I realized that if I could simulate game outputs using batting orders, I ought to be able to simulate runs produced by different batting order combinations.

The first thing you find out when you start doing this is that you need more power. Preliminary testing showed that my desktop (2 x 2.7 GHz, 6GB RAM) was going to take about three days to test the 362,880 combinations of a nine-player batting order.  One of the major considerations is that it takes 10,000 simulations of each lineup to get to a tolerable level of reproducibility. It’s quite likely that I could improve programming speed with better R scripting … but that’s probably down the road a bit as I am learning R on the fly (I have only been using it for 3 months).

In the meantime, to make lineup simulation a little more practical, I have had to figure out how to get R to run using both processors.

Here’s how I do it:

1. Installed the ‘parallel’, ‘foreach’, ‘doParallel’, and ‘iterators’ packages.
2. Added these lines to the start of any R script for which I want to use both processors:

library(doParallel)     # also loads foreach, iterators, and parallel
cl <- makeCluster(2)
registerDoParallel(cl)

3. Changed from running a for loop to using foreach. In other words, changed from:

for (i in 1:combos) {
  # program lines
}

to:

results_lineups <- foreach ( i = 1:combos ) %dopar% {
  # program lines
}

4. Added this line to the end of the script:

stopCluster(cl)

5. I had added a progress bar to the original script, but that doesn’t work in ‘foreach’. The answer begins with this post.

The code from that post makes a text file with an entry for each foreach iteration. In my case, this was a text file with ~362,880 lines. That’s a bit much and so I’ve changed it to this:

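results_lineups <- foreach ( i = 1:combos ) %dopar% {

  if (i %% 100 == 0) {
    sink("log.txt", append=TRUE)   # send output to the log file
    cat(i, ",", sep = "")          # write i followed by a comma
    sink()                         # restore normal output
  }

  # … program lines …

}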

What this does is divide i by 100 and, if the remainder is 0, print i to the log file. In other words, it prints to the file every 100 passes. It also prints a stream of numbers separated by commas as opposed to using a separate line each time. Now the log file is only about 30KB for 362,880 iterations. All I have to do is open the log file during the run to see where things are at.

Processing time for the nine-inning simulator is now under 5 hours for 1000 seasons and for the lineup simulator, it’s about 37 hours  … and I’m working on a presentation for my wife to convince her that more computer power is needed :-þ

Results to follow …

Baseball Simulation

Okay, so I’m going to give blogging a whirl. Bear with me while I figure a few things out.

I’ve been working for the last three months on developing a baseball simulator using Chapter 9 of Analyzing Baseball Data with R as the starting point. The simulator uses transition probabilities (i.e. the probability of moving from one transition state to another) to generate an estimate of run production.   The simulation uses the projected batting lineups and pitching rotations found at Rotochamp and the results presented here are from the lineups of early January (which have changed considerably since).

I wrote a Perl script to generate a csv file that lists all 2430 games with a home team, away team, home team pitcher, and away team pitcher.  The pitchers were used throughout the season on a five-man rotation.  One of the main reasons for writing this script was to have a means to generate an updated file quickly whenever lineup changes occur.

The transition-state probabilities were prepared from data obtained free of charge and copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.

I started with batters only. Once this was working, I realized it would be better to consider each batter against a particular pitcher. The combined probability of a transition is the batter’s probability (PB) multiplied by the pitcher’s probability (PP), divided by the league-wide MLB probability of that transition (PMLB):

(PB * PP) / PMLB
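Here is a small sketch of that formula in R, using made-up probabilities for three outcomes. The function name `combine_probs` is hypothetical, and the final rescaling step (so that the combined probabilities add up to 1) is my assumption for the sketch, not something spelled out above.

```r
# Combine batter and pitcher probabilities relative to the league average.
combine_probs <- function(p_batter, p_pitcher, p_mlb) {
  raw <- p_batter * p_pitcher / p_mlb   # (PB * PP) / PMLB for each outcome
  raw / sum(raw)                        # assumption: rescale to sum to 1
}

# Made-up probabilities, for illustration only.
p_mlb     <- c(out = 0.68, hit = 0.23, walk = 0.09)
p_batter  <- c(out = 0.62, hit = 0.28, walk = 0.10)
p_pitcher <- c(out = 0.72, hit = 0.20, walk = 0.08)

combine_probs(p_batter, p_pitcher, p_mlb)
```

A handy sanity check on the formula: against an exactly league-average pitcher (PP equal to PMLB), the result is just the batter's own probabilities.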

From there, I added in consideration of the park where the simulated game takes place.

The simulator was based on the assumption that each player would perform in 2015 as he did in 2014. Not a bad place to start, but not the best place either.

Other considerations:

  1. Batting lineup is unchanged throughout the game.
  2. Pitcher remains the same throughout the game.
  3. Pitching rotations are unchanged throughout the season.
  4. Each ‘game’ is nine simulated innings where the batter who leads off any inning is the one who was on deck at the end of the previous inning.
  5. Home team is given an advantage of 0.1 runs per game.  In effect, this enabled the breaking of ties at the end of the nine innings.
  6. The home team does not bat in the bottom of the ninth if they are ahead.
  7. Results are the average of 1000 simulated seasons.
  8. Players who did not play in 2014 have no retrosheet ID and no transition-state probabilities.  I substituted an average player for them, but down the road would prefer to substitute a replacement-level player … and even replacement-level by position.
  9. If the game was in an NL park, the pitcher batted.
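Item 4 is the one that is easy to get wrong, so here is a toy sketch of the bookkeeping. It is not my simulator: the half-inning here is a crude stand-in in which every batter simply makes an out with probability `p_out`, and both function names are hypothetical. The only point is how the lineup slot carries over from one inning to the next.

```r
# Toy half-inning: each batter makes an out with probability p_out,
# otherwise reaches base. Returns how many batters came to the plate.
batters_in_half_inning <- function(p_out = 0.68) {
  outs <- 0
  batters <- 0
  while (outs < 3) {
    batters <- batters + 1
    if (runif(1) < p_out) outs <- outs + 1
  }
  batters
}

# Returns the lineup slot (1 to 9) that leads off each inning.
leadoff_slots <- function(n_innings = 9, p_out = 0.68) {
  leadoff <- numeric(n_innings)
  slot <- 1                                  # slot 1 leads off the game
  for (inning in seq_len(n_innings)) {
    leadoff[inning] <- slot
    faced <- batters_in_half_inning(p_out)
    slot <- (slot - 1 + faced) %% 9 + 1      # on-deck batter leads off next
  }
  leadoff
}

set.seed(1)
leadoff_slots()
```

The modulo step wraps the order back to slot 1 after slot 9, so the batter who was on deck when the third out was made always starts the next inning.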

The results seem reasonable (at least Cubs fans might think so) :

[Results tables by division: AL East, NL East, AL Central, NL Central, AL West, NL West]

where rf is runs for, ra is runs against, and xWins is the expected number of wins over a 162-game season from:

162 * ( rf ^ 1.82 ) / ( rf ^ 1.82 + ra ^ 1.82 )
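
As a quick sketch, the same calculation in R looks like this. The function name `x_wins` is just for illustration, and the 162-game multiplier is stated as an assumption (it is what turns the winning fraction into a win total).

```r
# Expected wins from runs for (rf) and runs against (ra).
# Assumption: a 162-game season; exponent 1.82 as in the formula above.
x_wins <- function(rf, ra, games = 162, exponent = 1.82) {
  games * rf^exponent / (rf^exponent + ra^exponent)
}

round(x_wins(811, 686))   # 811 runs for, 686 against -> 93
round(x_wins(700, 700))   # equal runs for and against -> 81
```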