Unleash the hounds :-þ

I’ve had it in the back of my mind for some time now that it wouldn’t be too much of a leap to alter my parser from the collection of batter-related transition state changes to the collection of pitch data … both sets of information are in the same xml file.  So, last night, after I got down to the last three differences between my parsing results and those from Retrosheet, I decided to give it a whirl.

Oh, oh, oh … it was a piece of cake compared to generating transition state changes … and now I have a database table of 715,819 pitches with 51 fields for each pitch.  Woohoo!

Now the question is:  what questions to ask about pitches???  I think it’s PCA time … and also time to revisit charts in R.

Pitch locations for the first three games of the season:

Pitches from the first three games that were ‘called balls’:


… not bad there, ump!

On the other hand … pitches that were ‘called strikes’:


… no wonder the players get cheesed at the umpires :-þ

On another other hand though … pitches that were swung at:


… pretty sure the umpires figure the batters are blind too!

And now … hits!

Miguel Cabrera’s spray chart for 2016.  Representative field: 330′ down the lines and 371′ to center.



Parsing is even harder …

… when different sources (i.e. Retrosheet and MLB) record a play differently.

I’m down to the short strokes with my new parser.  Rather than having to interpret myriad text descriptions of plays that don’t involve the batter-runner, I am processing the movement of players from base to base using runner events only.  I’ve had to figure out how to do a secondary sort of xml child elements to group the two types of runner events in the correct sequence, and I also had to figure out how to loop through the analysis without writing to the database unless a batter-involved transition state change had occurred.  The effort has been worth it, though: the parser now needs much less code and leaves little to no ambiguity.

However, results are only as good as the original data … so here’s an interesting play:

In the September 17, 2016 game between the Royals and the White Sox, top of the 4th … Todd Frazier steals second on the same pitch on which Jason Coats is called out on strikes. Did Coats strike out with a man on first (transition from 100 1 to 100 2) or with a man on second (010 1 to 010 2)?  When I get some time, I’m going to try to wade through the Official Rules at MLB to see if there is a description of how this situation should be handled for scoring.

MLB has it all happening as one event, which I think is incorrect, resulting in the transition 100 1 to 010 2.  Retrosheet has the strikeout occurring with a man at second, 010 1 to 010 2.  Doesn’t sound like much, but to me different is different.
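
The two scorings can be compared with a minimal Python sketch.  The state encoding ('100 1' = runner on first only, one out) follows the notation above; the helper name apply_event and its arguments are hypothetical, invented for illustration, and are not taken from the parser.

```python
def apply_event(state, moves=(), outs_added=0):
    """Return the new 'bases outs' state after one event.

    state      -- e.g. '100 1': runner on 1st only, one out
    moves      -- (from_base, to_base) pairs, bases numbered 1-3;
                  to_base 0 means the runner left the bases
    outs_added -- outs recorded on the event
    """
    bases, outs = state.split()
    occupied = [c == "1" for c in bases]
    # Move lead runners first so a trailing runner never overwrites one ahead of him.
    for start, end in sorted(moves, reverse=True):
        occupied[start - 1] = False
        if end:
            occupied[end - 1] = True
    new_bases = "".join("1" if b else "0" for b in occupied)
    return f"{new_bases} {int(outs) + outs_added}"

start = "100 1"

# Retrosheet-style: the steal is its own event, then the strikeout.
after_steal = apply_event(start, moves=[(1, 2)])
retrosheet = [(start, after_steal),
              (after_steal, apply_event(after_steal, outs_added=1))]

# MLB-style: steal and strikeout recorded as a single event.
mlb = [(start, apply_event(start, moves=[(1, 2)], outs_added=1))]

print(retrosheet)
print(mlb)
```

Scored as two events, the batter-involved transition is 010 1 to 010 2; scored as one event, it becomes 100 1 to 010 2, which is the disagreement described above.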

I’ve also found a few events where MLB doesn’t appear to have been consistent with using a separate event ID for runner events.  These result in transition state changes that aren’t correct.  I’d love to tell MLB about them but there doesn’t seem to be any way to do that … at least not yet.

Parsing is hard work …

… especially when the source contains odd descriptions.

One thing I have to do with my parser is to decipher text descriptions of base-running plays.  It’s clunky but it works.  Except when this sort of thing appears:

<action b="0" s="1" o="0" des="With Welington Castillo batting, Michael Bourn advances to 3rd base on a caught stealing error by Tommy Joseph, assist to pitcher Adam Morgan to third baseman Maikel Franco to second baseman Cesar Hernandez to third baseman Maikel Franco." des_es="Con Welington Castillo bateando, Michael Bourn atrapado robando, avanza a 3ra por error de Tommy Joseph, asistencia para lanzador Adam Morgan a tercera base Maikel Franco a segunda base Cesar Hernandez a tercera base Maikel Franco." event="Caught Stealing 3B" event_es="Retirado en Intento de Robo 3B" tfs="003748" tfs_zulu="2016-06-18T00:37:48Z" player="456422" pitch="1" event_num="300" play_guid="3bc5d844-1d09-46b7-8d49-8365b8466403" home_team_runs="2" away_team_runs="3"/>

This is from the 5th inning of the PHI-ARI game on June 17.  Bourn actually winds up at 2nd base when the ball is dropped there.  Officially, he is caught stealing, but he doesn’t wind up at 3rd.  So, I have to go into the original xml file and edit out the offending passage in order to generate the correct transition state change.

Looks like I’m going to have to live with this sort of thing for now.  And it’s good motivation to rewrite the parser so that it can ignore text descriptions and just use the information from the ‘runner’ child elements in atbats … assuming that everything else is correct.

The bright side is that it’s only one event out of almost 200,000 …. btw I’m down to the last 20 differences.

Transition state ch-ch-ch-changes

In order to use the probability of transition state change as the basis for a game simulator, I want to have up-to-date information.  To that end, I collect game data from MLB and have written a parser that updates transition state probabilities for each player as they progress through the season.

The problem is that using MLB gameday xml files means I need to have the parser read through descriptions of plays to determine when a transition state change actually occurred and what that change was.  One of the major headaches has been that MLB does not list the runner events in the correct order.  The parser needs to proceed around the bases: by this I mean that I have to figure out what has happened at 3rd base before I can figure out what has happened at 2nd base etc.

To enable this, I have had to figure out how to sort xml child elements by their content. It was a struggle but I figured it out.
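
One way to do this kind of sort, sketched in Python with the standard library's xml.etree.ElementTree.  The sample XML is invented for illustration, the attribute names (id, start, end) are assumed to resemble the 'runner' child elements in the Gameday files, and this is not the parser's actual code.

```python
import xml.etree.ElementTree as ET

# Invented sample: runner elements listed out of base order.
SAMPLE = """
<atbat num="1">
  <runner id="111" start="1B" end="2B"/>
  <runner id="333" start="3B" end=""/>
  <runner id="222" start="2B" end="3B"/>
  <runner id="444" start="" end="1B"/>
</atbat>
"""

# Lead runner first: 3rd base, then 2nd, then 1st, then the batter (empty start).
BASE_ORDER = {"3B": 0, "2B": 1, "1B": 2, "": 3}

def sort_runners(atbat):
    """Reorder the <runner> children of an atbat element in place."""
    runners = atbat.findall("runner")
    for r in runners:
        atbat.remove(r)
    runners.sort(key=lambda r: BASE_ORDER[r.get("start", "")])
    atbat.extend(runners)   # sorted runners go after any other children
    return atbat

atbat = sort_runners(ET.fromstring(SAMPLE))
ordered_ids = [r.get("id") for r in atbat.findall("runner")]
print(ordered_ids)
```

Sorting on a lookup table rather than on the raw attribute text keeps the order explicit, and an unexpected start value raises a KeyError instead of being sorted silently into the wrong place.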

Another issue is that for certain events (in particular, base-running plays such as steals), MLB will use a player’s name and not his numerical ID.  Next on the list will be to figure out how to look up a player’s ID from the parser so that I can do a better job of determining who is at each base … without having to parse the myriad text descriptions that appear.  The problem with doing it this way is that each time the slightest variation in a text description appears, I have to write a new line of code to process it.  It would be much easier if the parser just looked at who is at each base.

Another issue is that MLB lists each base-running event such as steals in two places … which means it has to be processed once and ignored the second time.

So, my parser isn’t completely accurate.  I know this because I can compare the transition-state-change table that I get from the data I retrieve from MLB with the table that I get from parsing Retrosheet’s data with their parser.  Unfortunately for me, I can only do this once a year, in December, when Retrosheet posts the data from the previous season.  I was able to make my parser’s results agree with Retrosheet’s data after the 2015 season, but the 2016 results show that things still aren’t quite correct.

Here is the table that results from parsing Retrosheet’s data:

And here is the table from parsing MLB’s xml files:

And here is a table showing the differences between the two:


Actually, it’s not as bad as it looks.  There are only 344 differences to track down, and each fix may eliminate more than one of them.

[This was definitely the case.  I found that MLB had added an attribute to a particular child element that broke a regular expression search.  Fixing this leaves only 94 differences.]

So, it’s back to the code … to figure out why the differences have occurred and to eliminate them …




Pitchers who change their throwing hand during a game also drive me crazy!

It’s back to Hack28.

I’ve been working on this for some time.  I wanted to be able to generate an events list throughout the season using the data from MLB’s Gameday XML files – instead of having to wait for Retrosheet at the end of the season.

I’ve managed to get results that are very close to agreeing with Retrosheet (gotta figure that they’re right and mine are wrong) but had a difference of 62 events out of a total of ~189700.  So I made a massive spreadsheet (over 7 million cells) of the two datasets side-by-side and looked for lines that were different by totalling up the playerIDs of the baserunners.

The way to make the script generate results similar to Retrosheet is to change the atbat number in the xml file.  But this still left me with 62 additional events.  Turns out that every time Pat Venditte changed his throwing hand, it was an event.

Now, I have to either edit my results or add a line to my script to treat this special case.

I did it the hard way first and I’m here to tell you that Pat Venditte switched his throwing hand 62 times and I’ve found and deleted every damn one of them!  aarrgghh :-þ

But now I’ve got an event list that has the same total number of events as Retrosheet … and all I have to do is add a couple of lines to the script for next season’s data collection.

Simulation 2016 … try, try again :-þ

Simulation 2016
First crack at simulating the 2016 MLB season.  Last year’s results weren’t very good.   Actually, they were bad …. since the RMSE was greater than that from guessing that every team would go 81-81. So, there’s room for improvement :-þ
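
The comparison against the 81-81 baseline can be sketched as follows.  The win totals in the usage lines are made-up toy numbers for three hypothetical teams, not results from the simulation or from any real season.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of win totals."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Toy numbers only (hypothetical teams, not simulation output).
actual = [95, 81, 67]
simulated = [70, 81, 92]
baseline = [81] * len(actual)   # guess 81-81 for every team

sim_error = rmse(simulated, actual)
base_error = rmse(baseline, actual)
print(sim_error > base_error)
```

A simulation is only adding information if its RMSE comes in below the baseline's.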

For 2016, I’ve changed how the transition state probabilities are calculated by tweaking the pseudocounts, and added in consideration of how each team plays at home versus on the road.  Right now, that means how they performed in 2015 at home and on the road … which of course doesn’t mean they’ll do the same this season.  I’m working on how to add regression into the equation to make a better estimate of how each team might perform at home and on the road _this_ season.
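
As an illustration of what pseudocounts do, here is a minimal additive-smoothing sketch in Python.  The post does not spell out the actual scheme, so the function name, the uniform pseudocount alpha, and the toy counts are all assumptions.

```python
from collections import Counter

def transition_probs(observed, possible_states, alpha=1.0):
    """Estimate P(next state) from observed counts plus a pseudocount.

    observed        -- Counter mapping next-state -> times seen
    possible_states -- every next-state reachable from the current state
    alpha           -- pseudocount added to each possible state (assumed uniform)
    """
    total = sum(observed[s] for s in possible_states) + alpha * len(possible_states)
    return {s: (observed[s] + alpha) / total for s in possible_states}

# Toy counts for a batter starting from '000 0' (bases empty, no outs).
observed = Counter({"000 1": 7, "100 0": 2, "000 0": 1})
states = ["000 1", "100 0", "010 0", "001 0", "000 0"]

probs = transition_probs(observed, states, alpha=1.0)
print(probs["010 0"])   # never observed, but still gets a non-zero probability
```

A larger alpha pulls a player with few plate appearances harder toward the uniform prior; tweaking the pseudocounts amounts to choosing how strong that pull should be.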

Lineups and pitching rotations used for the simulation are those posted at rotochamp.

AL East     wins   rf   ra  xWins    NL East     wins   rf   ra  xWins
tor           93  811  686     93    nyn           88  665  596     89
nya           84  701  688     82    was           88  670  608     88
bos           79  679  689     80    mia           80  599  623     78
bal           79  676  699     79    phi           69  560  657     69
tba           79  646  665     79    atl           69  541  640     69

AL Central  wins   rf   ra  xWins    NL Central  wins   rf   ra  xWins
kca           85  682  661     83    sln           92  714  624     91
cle           85  679  658     83    chn           87  685  634     87
min           80  667  677     80    pit           87  691  635     87
cha           75  624  660     77    mil           75  617  671     75
det           74  622  673     75    cin           72  586  664     72

AL West     wins   rf   ra  xWins    NL West     wins   rf   ra  xWins
hou           89  727  664     88    lan           89  687  626     88
tex           83  679  680     81    sfn           87  679  628     87
oak           80  635  659     78    ari           81  669  670     81
ana           79  638  658     79    sdn           76  597  646     75
sea           75  618  641     78    col           73  665  730     74
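
The post does not define the xWins column.  It looks consistent with the Pythagorean expectation computed from rf (runs for) and ra (runs against) with an exponent of 1.83, but that reading is an assumption, as is the function name below.  A minimal Python sketch:

```python
def pythagorean_wins(runs_for, runs_against, games=162, exponent=1.83):
    """Expected wins from runs scored and allowed (Pythagorean expectation)."""
    rf = runs_for ** exponent
    ra = runs_against ** exponent
    return games * rf / (rf + ra)

# (rf, ra) pairs taken from the table above.
for team, rf, ra in [("tor", 811, 686), ("sln", 714, 624), ("col", 665, 730)]:
    print(team, round(pythagorean_wins(rf, ra)))
```

For these three rows the rounded results match the xWins values in the table (93, 91 and 74).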

Are the Blue Jays the beasts of the AL East?

The Jays have been up and down this year but are now riding a six-game win streak which has brought them back to within one game of .500 – although they are still four games back of the Yankees.

However, the Jays are now +53 in run differential.  This, and the fact that they have a sorry 4-12 record in one-run games suggests that they could have had a much better record at this point.  The run differential suggests it could be about 34-25 instead of 29-30.

Their rating has also been climbing recently.  Who knows?  Maybe they’re putting it together … finally.



Just like baseball … never assume :-þ

In the most recent simulation results, the total runs scored by all teams added up to 17572. At the time, I wondered if this might be an issue because the actual total in any MLB season is about 20000. I reasoned that the simulation total might not be a concern because: a) it’s a simulation, not a duplication, and b) the only pitchers involved are starters and so fewer runs might be expected given that relievers (presumably inferior to starters, otherwise they’d be starting) were not being used … although 2500 runs did seem like a lot.

However, in trying to move on from simulating a season to asking questions about batting order optimization, I realized that there might be a flaw in the original programming.  I reviewed the functions that perform the nine-inning simulations and realized that there was a problem with the logic used to set the leadoff batter and subsequent batters of any next inning. The functions have been changed and I believe they are now performing as they should.
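
The rule itself is simple to state: the leadoff batter of an inning is whoever follows the last batter of the previous inning.  A minimal Python sketch, in which simulate_inning is a hypothetical stand-in that only reports how many batters came to the plate (the real transition-state logic is not shown in this post):

```python
import random

LINEUP_SIZE = 9

def simulate_inning(rng):
    """Hypothetical stand-in: return the number of batters in the inning (3 to 6)."""
    return 3 + rng.randrange(4)

def leadoff_slots(innings=9, seed=0):
    """Return the lineup slot (0-8) that leads off each inning."""
    rng = random.Random(seed)
    slot = 0                      # slot 0 leads off the first inning
    leadoffs = []
    for _ in range(innings):
        leadoffs.append(slot)
        batters = simulate_inning(rng)
        # Carry the position forward instead of resetting to the top of the order.
        slot = (slot + batters) % LINEUP_SIZE
    return leadoffs

print(leadoff_slots())
```

Resetting the slot to the top of the order every inning would overuse the first few batters in the lineup and distort the run totals.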

I’ve rerun the season simulation and now the runs total 18695; still short of 20000, but less of a worry … at least for now.

AL East    NL East
AL Central    NL Central
AL West    NL West


Now it’s back to asking questions about actual batting order … which is a slow process because of the 362,880 possible combinations for nine batters.  It takes my desktop about 37 hours to analyze all possible combinations for a team.  Part of the problem is that there are so many possible combinations, but another issue is that each order must be simulated at least 10,000 times to generate a result that reduces error to a meaningful level.  So even at about 25,000 innings per second, it takes a while.  I need more power!
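
The size of the search can be checked with a few lines of Python.  The counts and rates come from the paragraph above; treating each of the 10,000 simulations as a single inning is an assumption, and the batter names are placeholders.

```python
import math
from itertools import islice, permutations

orders = math.factorial(9)            # distinct batting orders for nine batters
sims_per_order = 10_000               # from the text
innings_per_second = 25_000           # from the text

# Assumption: each "simulation" is a single inning.
hours = orders * sims_per_order / innings_per_second / 3600
print(orders, round(hours, 1))

# The orders can be generated lazily rather than stored all at once.
batters = ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8", "b9"]  # placeholder names
first_three = list(islice(permutations(batters), 3))
print(first_three[0])
```

Under that assumption the estimate comes out at about 40 hours, in the same range as the 37 hours observed; if each simulation were a full nine-inning game, the figure would be nine times larger.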

I’ve completed a trial run at simulating the optimal batting order for the Tigers. Their current order is projected to be: Kinsler, Davis, Cabrera, VMart, Cespedes, JMart, Castellanos, Avila, Iglesias … but simulation says it should be: Kinsler, Cabrera, Castellanos, VMart, Cespedes, JMart, Davis, Avila, Iglesias. In other words, Davis moves from 2 to 7, Castellanos from 7 to 3, and Cabrera from 3 to 2.  Plugging these two lineups into the nine-inning simulation suggests that the new order is about 20 runs better … or about 2 wins.


Baseball Simulation: Is there life after Max?

Maybe so.

There has been plenty of player movement over the last few weeks culminating with the Nats signing Max for megabucks.  I have updated the batting and pitching lineup information and rerun the simulation.  Interestingly, neither the Tigers nor the Nats changed much.

AL East    NL East
AL Central    NL Central
AL West    NL West