In order to use the probability of transition state change as the basis for a game simulator, I want to have up-to-date information. To that end, I collect game data from MLB and have written a parser that updates transition state probabilities for each player as they progress through the season.
The problem is that using MLB Gameday XML files means the parser has to read through descriptions of plays to determine when a transition state change actually occurred and what that change was. One of the major headaches has been that MLB does not list the runner events in the correct order. The parser needs to proceed around the bases from the lead runner back: I have to figure out what has happened at 3rd base before I can figure out what has happened at 2nd base, and so on.
To enable this, I had to figure out how to sort XML child elements by their content. It was a struggle, but I figured it out.
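For illustration, here is a minimal sketch of that kind of sort using Python's standard `xml.etree.ElementTree`. It is not my actual parser code: the `atbat` and `runner` tag names, the `start` and `end` attributes, and the sample IDs are all assumptions made up for the example, and it sorts on an attribute value rather than on whatever content a real file would require.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names, attribute names, and IDs are assumptions
# for this example, not a verified copy of MLB's schema.
SAMPLE = """
<atbat num="12">
  <runner id="111" start="1B" end="2B"/>
  <runner id="222" start="3B" end=""/>
  <runner id="333" start="" end="1B"/>
  <runner id="444" start="2B" end="3B"/>
</atbat>
"""

# Lead runner first: 3rd base, then 2nd, then 1st, then the batter (empty start).
BASE_ORDER = {"3B": 0, "2B": 1, "1B": 2, "": 3}


def sort_runners(atbat):
    """Reorder the <runner> children of one at-bat so the lead runner comes first."""
    runners = [child for child in atbat if child.tag == "runner"]
    others = [child for child in atbat if child.tag != "runner"]
    # Unknown start values sort last instead of raising an error.
    runners.sort(key=lambda r: BASE_ORDER.get(r.get("start", ""), len(BASE_ORDER)))
    atbat[:] = others + runners  # slice assignment replaces the child list in place
    return atbat


atbat = sort_runners(ET.fromstring(SAMPLE))
order = [runner.get("id") for runner in atbat.findall("runner")]
print(order)
```

Because Python's sort is stable, runners that share a start base keep the order in which MLB listed them.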
Another issue is that for certain events (in particular, base-running plays such as steals), MLB will use a player’s name and not his numerical ID. Next on the list will be to figure out how to look up a player’s ID from the parser so that I can do a better job of determining who is at each base … without having to parse the myriad text descriptions that appear. The problem with parsing the descriptions is that each time the slightest variation in a text description appears, I have to write a new line of code to process it. It would be much easier if the parser just looked at who is at each base.
Another issue is that MLB lists each base-running event such as steals in two places … which means it has to be processed once and ignored the second time.
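One way to handle the double listing is to build a key for each base-running event and skip any event whose key has already been seen. The sketch below assumes each event has already been reduced to a dictionary; the field names in `KEY_FIELDS` and the sample values are hypothetical, and the right key depends on what actually distinguishes two events in the real files.

```python
# Hypothetical field names: these are assumptions for the example,
# not the attributes MLB actually uses.
KEY_FIELDS = ("inning", "half", "runner", "event", "start", "end")


def unique_events(events):
    """Keep the first occurrence of each base-running event and drop repeats."""
    seen = set()
    kept = []
    for event in events:
        key = tuple(event[field] for field in KEY_FIELDS)
        if key in seen:
            continue  # second listing of the same play: ignore it
        seen.add(key)
        kept.append(event)
    return kept


steal_of_second = {"inning": 3, "half": "top", "runner": "111",
                   "event": "Stolen Base 2B", "start": "1B", "end": "2B"}
steal_of_third = {"inning": 3, "half": "top", "runner": "111",
                  "event": "Stolen Base 3B", "start": "2B", "end": "3B"}

# The steal of second appears twice, as if listed in two places.
events = [steal_of_second, dict(steal_of_second), steal_of_third]
kept = unique_events(events)
print(len(kept))
```

Including the start and end bases in the key keeps two different steals by the same runner in the same half inning from being mistaken for one duplicated event.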
So, my parser isn’t completely accurate. I know this because I can compare the transition-state-change table that I get from MLB’s data with the table that I get from parsing Retrosheet’s data with their parser. Unfortunately for me, I can only do this once a year, in December, when Retrosheet posts the data from the previous season. I was able to make my parser’s results agree with Retrosheet’s data after the 2015 season, but the 2016 results show that things still aren’t quite correct.
Here is the table that results from parsing Retrosheet’s data:
And here is the table from parsing MLB’s XML files:
And here is a table showing the differences between the two:
Actually, it’s not as bad as it looks. There are only 344 differences to track down, and each fix may eliminate more than one of them.
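As a rough sketch of the comparison, suppose each table is stored as a dictionary that maps a (start state, end state) pair to a count. The state labels and the counts below are invented purely for the example; they are not values from either table, and the way the total is tallied here is only one possible way of counting differences.

```python
def table_differences(reference, candidate):
    """Return {transition: candidate count minus reference count} for cells that disagree."""
    diffs = {}
    for key in set(reference) | set(candidate):
        delta = candidate.get(key, 0) - reference.get(key, 0)
        if delta:
            diffs[key] = delta
    return diffs


# Invented sample counts, keyed by (start state, end state).
retrosheet = {("0:---", "1:---"): 100, ("0:---", "0:1--"): 40, ("1:1--", "1:-2-"): 7}
mlb = {("0:---", "1:---"): 98, ("0:---", "0:1--"): 42, ("1:1--", "1:-2-"): 7}

diffs = table_differences(retrosheet, mlb)
total = sum(abs(delta) for delta in diffs.values())
print(sorted(diffs.items()), total)
```

A single misread play usually shows up twice in a comparison like this, once as a count missing from the correct cell and once as an extra count in the wrong cell, which is one reason a single fix can remove more than one difference.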
[This was definitely the case. I found that MLB had added an attribute to a particular child element that broke a regular expression search. Fixing this leaves only 94 differences.]
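To show how a new attribute can break a regular expression search, here is a small sketch. The element, the attribute names, and the added `extra` attribute are all hypothetical stand-ins, not the element or attribute MLB actually changed, and the tolerant approach shown is just one option rather than the fix I applied.

```python
import re

# Hypothetical lines: the element and attribute names are assumptions.
OLD_LINE = '<runner id="123" start="1B" end="2B" event="Stolen Base 2B"/>'
NEW_LINE = '<runner id="123" start="1B" end="2B" extra="1" event="Stolen Base 2B"/>'

# Brittle pattern: it hard-codes the exact attribute sequence, so any
# attribute inserted between "end" and "event" makes the search fail.
BRITTLE = re.compile(r'<runner id="(\d+)" start="(\w*)" end="(\w*)" event="([^"]*)"/>')

# More tolerant: collect every name="value" pair, then look up what is needed.
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')


def attributes(line):
    """Return the attributes of a single-element line as a dictionary."""
    return dict(ATTR.findall(line))


print(BRITTLE.search(OLD_LINE) is not None)
print(BRITTLE.search(NEW_LINE) is not None)
print(attributes(NEW_LINE)["event"])
```

Reading attributes by name, whether with a pattern like `ATTR` or with an XML parser, means an added attribute is simply ignored instead of silently dropping the play.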
So, it’s back to the code … to figure out why the differences have occurred and to eliminate them …