Unleash the hounds :-þ

I’ve had it in the back of my mind for some time now that it wouldn’t be too much of a leap to alter my parser from the collection of batter-related transition state changes to the collection of pitch data … both sets of information are in the same xml file.  So, last night, after I got down to the last three differences between my parsing results and those from Retrosheet, I decided to give it a whirl.

Oh, oh, oh … it was a piece of cake compared to generating transition state changes … and now I have a database table of 715819 pitches by 51 fields for each pitch.  Woohoo!

Now the question is:  what questions to ask about pitches???  I think it’s PCA time … and also time to revisit charts in R.

Pitch locations for the first three games of the season:

Pitches from the first three games that were ‘called balls’:


… not bad there, ump!

On the other hand … pitches that were ‘called strikes’:


… no wonder the players get cheesed at the umpires :-þ

On another other hand though … pitches that were swung at:


… pretty sure the umpires figure the batters are blind too!

And now … hits!

Miguel Cabrera’s spray chart for 2016.  Representative field 330′ down the lines and 371 to center.



Parsing is even harder …

… when different sources (i.e. Retrosheet and MLB) record a play differently.

I’m down to the short strokes with my new parser.  Rather than having to interpret myriad text descriptions of plays that don’t involve the batter-runner, I am processing the movement of players from base to base using runner events only.  I’ve had to figure out how to do a secondary sort of xml child elements to group the two types of runner events in the correct sequences, and I also had to figure out how to loop through the analysis without writing to the database unless a batter-involved transition state change had occurred.  However, the results have been worth it as the results are now achieved with much less code and with little to no ambiguity.

However, results are only as good as the original data … so here’s an interesting play:

In the September 17, 2016 game between the Royals and the WhiteSox, top of 4 … Todd Frazier steals second on the same pitch on which Jason Coats is called out on strikes. Did Coats strike out with a man on first (transition from 100 1 to 100 2) or with a man on second (010 1 to 010 2)?  When I get some time, I’m going to try to wade through the Official Rules at MLB to see if there is a description of how this situation should be handled for scoring.

MLB has it all happening as one event, which I think is incorrect, resulting in the transition 100 1 to 010 2.  Retrosheet has the strikeout occurring with a man at second, 010 1 to 010 2.  Doesn’t sound like much, but to me different is different.

I’ve also found a few events where MLB doesn’t appear to have been consistent with using a separate event ID for runner events.  These result in transition state changes that aren’t correct.  I’d love to tell MLB about them but there doesn’t seem to be any way to do that … at least not yet.

Parsing is hard work …

… especially when the source contains odd descriptions.

One thing I have to do with my parser is to decipher text descriptions of base-running plays.  It’s clunky but it works.  Except when this sort of thing appears:

<action b=”0 s=”1 o=”0 des=”With Welington Castillo batting, Michael Bourn advances to 3rd base on a caught stealing error by Tommy Joseph, assist to pitcher Adam Morgan to third baseman Maikel Franco to second baseman Cesar Hernandez to third baseman Maikel Franco. des_es=”Con Welington Castillo bateando, Michael Bourn atrapado robando, avanza a 3ra por error de Tommy Joseph, asistencia para lanzador Adam Morgan a tercera base Maikel Franco a segunda base Cesar Hernandez a tercera base Maikel Franco. event=”Caught Stealing 3B event_es=”Retirado en Intento de Robo 3B tfs=”003748 tfs_zulu=”2016-06-18T00:37:48Z player=”456422 pitch=”1 event_num=”300 play_guid=”3bc5d844-1d09-46b7-8d49-8365b8466403 home_team_runs=”2 away_team_runs=”3/>

This is from the 5th inning of the PHI-ARI game on June 17.  Bourn actually winds up at 2nd base when the ball is dropped there.  Officially, he is caught stealing, but he doesn’t wind up at 3rd.  So, I have to go into the original xml file and edit out the offending passage in order to generate the correct transition state change.

Looks like I’m going to have to live with this sort of thing for now.  And it’s good motivation to rewrite the parser so that it can ignore text descriptions and just use the information from the ‘runner’ child elements in atbats … assuming that everything else is correct.

The bright side is that it’s only one event out of almost 200,000 …. btw I’m down to the last 20 differences.