Your team is under .500 … give this a whirl

Steve Phillips wrote an article about the Jays’ batting order … which got me thinking … what does my simulator say? The simulation uses everyone’s tendencies over the past four seasons as input … not just who is hot right now.

And … the simulator says the Jays should be using this order (and it’s not even close):

Kevin Pillar
Troy Tulowitzki
Jose Bautista
Russell Martin
Josh Donaldson
Steve Pearce
Justin Smoak
Darwin Barney
Luke Maile

C’mon boys … today’s order didn’t work so give it a whirl.

Bold predictions for 2017 … part I

Here is what my simulator thinks is going to happen in 2017.  Based on the lineups posted at rotochamp.  2430 games x 1000 seasons.  Same assumptions as last year:  constant, uninterrupted 5-man rotations (haha), starter pitches the whole game (bwahahaha) and, everyone performs the same way this year as they did last year (equally hilarious) … however, you have to start somewhere :-þ

Next up … adding in provisions for starters pitching 6 innings and considering bullpens …

After some experimentation with pseudocounts … I like this better … not so great a spread between highest and lowest win totals.
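For anyone wondering what a pseudocount does here, a minimal sketch follows. It assumes the pseudocounts are added to each player's outcome counts as imaginary league-average plate appearances; the outcome categories, the league rates, and the `smoothed_rates` helper are all made up for illustration and are not the simulator's actual code.

```python
def smoothed_rates(player_counts, league_rates, pseudo_total):
    """Blend observed outcome counts with league-average rates.

    pseudo_total is the number of imaginary league-average plate
    appearances added to the player's record (a hypothetical tuning knob).
    """
    n = sum(player_counts.values())
    return {
        outcome: (player_counts.get(outcome, 0) + pseudo_total * rate)
        / (n + pseudo_total)
        for outcome, rate in league_rates.items()
    }


# Hypothetical league rates and a small-sample hitter (20 plate appearances).
league = {"out": 0.68, "single": 0.15, "walk": 0.08, "extra_base": 0.09}
hot_hitter = {"out": 10, "single": 6, "walk": 2, "extra_base": 2}

raw_out_rate = hot_hitter["out"] / sum(hot_hitter.values())
smooth = smoothed_rates(hot_hitter, league, pseudo_total=100)
print(raw_out_rate, round(smooth["out"], 3))
```

The bigger `pseudo_total` gets, the harder every player is pulled toward the league average, which is one way the spread between the best and worst teams could shrink.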

2013 was a bad year for pitchf/x data

I’ve run this query to find games where the average pfx_z movement of fastballs (PITCH_TYPE = 'FF') for the whole game came out below zero. There is probably a better way using a nested SELECT, but this worked for me:

-- Every FF pitch, keyed by game_id + home team
CREATE TEMPORARY TABLE pfx_temp_01 AS
SELECT CONCAT(game_id, home) AS new_game_id, pfx_z FROM baseball_13.mlb_pitches WHERE PITCH_TYPE = 'FF';
-- Average pfx_z per game
CREATE TEMPORARY TABLE pfx_temp_02 AS
SELECT new_game_id, AVG(pfx_z) AS avg_pfx_z FROM pfx_temp_01 GROUP BY new_game_id;
-- Games whose average is below zero
SELECT new_game_id, avg_pfx_z FROM pfx_temp_02 WHERE avg_pfx_z < 0;
DROP TEMPORARY TABLE pfx_temp_01, pfx_temp_02;

This gives me a list of 173 games from 2013 where the average pfx_z for a game was less than zero.  The average (of the average) for these games is -4.01 (sd = 0.53).  The average (of the average) for all the other games of that year (i.e. where the average pfx_z was greater than zero) is 9.06 (sd = 1.38).
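As for the "better way" with a nested SELECT, here is a rough sketch of what it might look like. It runs against an in-memory SQLite table with a few invented rows so it can be tried anywhere; SQLite uses `||` where MySQL uses `CONCAT()`, and the table layout is only assumed to mirror `mlb_pitches`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mlb_pitches (game_id TEXT, home TEXT, pitch_type TEXT, pfx_z REAL)"
)
# Invented toy rows: g1 looks normal, g2 has negative fastball pfx_z.
rows = [
    ("g1", "det", "FF", 9.0), ("g1", "det", "FF", 8.0), ("g1", "det", "CU", -6.0),
    ("g2", "cle", "FF", -4.5), ("g2", "cle", "FF", -3.5), ("g2", "cle", "CH", 0.5),
]
conn.executemany("INSERT INTO mlb_pitches VALUES (?, ?, ?, ?)", rows)

# One pass: aggregate per game and filter on the aggregate with HAVING.
odd_games = conn.execute(
    """
    SELECT game_id || home AS new_game_id, AVG(pfx_z) AS avg_pfx_z
    FROM mlb_pitches
    WHERE pitch_type = 'FF'
    GROUP BY game_id || home
    HAVING AVG(pfx_z) < 0
    """
).fetchall()

# Nested SELECT: the individual fastballs from just those games.
odd_pitches = conn.execute(
    """
    SELECT game_id || home AS new_game_id, pfx_z
    FROM mlb_pitches
    WHERE pitch_type = 'FF'
      AND game_id || home IN (
          SELECT game_id || home
          FROM mlb_pitches
          WHERE pitch_type = 'FF'
          GROUP BY game_id || home
          HAVING AVG(pfx_z) < 0
      )
    ORDER BY pfx_z
    """
).fetchall()
print(odd_games)
print(odd_pitches)
```

No temporary tables to create or drop, at the cost of repeating the fastball filter inside the subquery.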

Then I ran this query to extract the individual pitches from the subset of games where the average pfx_z was less than zero:

-- Same first two steps as before: FF pitches per game, then the per-game average
CREATE TEMPORARY TABLE pfx_temp_01 AS
SELECT CONCAT(game_id, home) AS new_game_id, pfx_z FROM baseball_13.mlb_pitches WHERE PITCH_TYPE = 'FF';
CREATE TEMPORARY TABLE pfx_temp_02 AS
SELECT new_game_id, AVG(pfx_z) AS avg_pfx_z FROM pfx_temp_01 GROUP BY new_game_id;
-- Keep only the games whose average is below zero
CREATE TEMPORARY TABLE pfx_temp_03 AS
SELECT new_game_id, avg_pfx_z FROM pfx_temp_02 WHERE avg_pfx_z < 0;
-- Pull the individual pitches for those games
SELECT a.new_game_id, a.pfx_z FROM pfx_temp_01 a JOIN pfx_temp_03 b ON a.new_game_id = b.new_game_id;
DROP TEMPORARY TABLE pfx_temp_01, pfx_temp_02, pfx_temp_03;

The average pfx_z of the 18,548 individual FF pitches for these games is -3.99 (sd = 1.07)

The average pfx_z of the 227,067 individual FF pitches for the games where the average pfx_z for the game was greater than zero is 9.11 (sd = 2.34)

These have got to be two very distinct groups.  Now to try and figure out why …

total pitches thrown in 2013 = 709630
FF pitches in 2013 = 245615
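A quick sanity check on those numbers, as a sketch: the two groups should add up to the FF total, and the gap between the group means can be expressed in pooled standard deviations. The pooled-SD formula is my choice of yardstick here, not something from the original analysis.

```python
from math import sqrt

# Summary figures quoted above for 2013 FF pitches.
n_neg, mean_neg, sd_neg = 18_548, -3.99, 1.07   # games with average pfx_z < 0
n_pos, mean_pos, sd_pos = 227_067, 9.11, 2.34   # all other games
total_ff = 245_615

# The two groups should account for every FF pitch.
counts_match = (n_neg + n_pos == total_ff)

# Pooled variance weights each group's variance by its degrees of freedom.
pooled_var = ((n_neg - 1) * sd_neg**2 + (n_pos - 1) * sd_pos**2) / (n_neg + n_pos - 2)
gap_in_sds = (mean_pos - mean_neg) / sqrt(pooled_var)
print(counts_match, round(gap_in_sds, 1))
```

The counts do reconcile, and the means sit well over five pooled standard deviations apart, which backs up the "two very distinct groups" hunch.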


And, as it turns out, 2013 isn’t the only year.  I’ve downloaded game data back to 2010 and the results are:

Year # games where average FF pfx_z < 0
2010 0
2011 1
2012 72
2013 173
2014 0
2015 0
2016 0

The mystery deepens …

I’ve looked at pfx_x and pfx_z movement on a day-by-day basis to see if there are other days with the same issue.  And, yes there are; here are the plots for the games from June 26 to June 30:

The game_id 201306282 is the second game of a doubleheader between Cleveland and the White Sox, so I looked up the starters to see if their pitches that day were different from the rest of their starts.

Here are Jose Quintana’s plots for 2013:

And here are Carlos Carrasco’s plots for 2013:

So, both pitchers had the same change in overall pitch movement on the same day.  This just can’t be right.

Explorations continue …


Three days of the pitchf/x condor …

There appears to be something funky about at least three games worth of pitchf/x data from 2013.

I was playing around with learning to use ggplot2, using Justin Verlander’s data from 2013. This revealed a strange pattern of pitch movement during the three starts on 4/25, 7/9, and 9/7, where the pfx_x and pfx_z data appear to be scaled differently.

pfx_x vs pfx_z for JV 2013-2016

I’ve taken two other three-game samples from 2013 and also the entire season (without the three odd games) and the average pfx_x and pfx_z data look like this:


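Pitch  Games                            pfx_x    pfx_z
FF     All games except 4/25 7/9 9/7   -1.577    8.204
FF     Games of 5/5 7/4 9/13           -1.961    9.155
FF     Games of 5/6 7/6 9/16           -1.599    8.059
FF     Games of 4/25 7/9 9/7           -0.849    3.501

CH     All games except 4/25 7/9 9/7   -1.080    3.923
CH     Games of 5/5 7/4 9/13           -1.227    4.475
CH     Games of 5/6 7/6 9/16           -0.725    4.079
CH     Games of 4/25 7/9 9/7           -1.246    0.694

CU     All games except 4/25 7/9 9/7    3.129   -5.930
CU     Games of 5/5 7/4 9/13            3.114   -5.580
CU     Games of 5/6 7/6 9/16            3.079   -5.952
CU     Games of 4/25 7/9 9/7            2.202   -9.302

SL     All games except 4/25 7/9 9/7    1.809    0.333
SL     Games of 5/5 7/4 9/13            1.851    1.257
SL     Games of 5/6 7/6 9/16            1.013    0.114
SL     Games of 4/25 7/9 9/7            1.426   -3.141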
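One way to poke at the table is to ask whether the odd games look like the normal ones multiplied by something or shifted by something. The sketch below only uses the pfx_z averages from the table (odd games versus all other games); it is a back-of-the-envelope comparison, not a diagnosis.

```python
# Average pfx_z per pitch type, copied from the table above.
baseline = {"FF": 8.204, "CH": 3.923, "CU": -5.930, "SL": 0.333}   # all other games
odd = {"FF": 3.501, "CH": 0.694, "CU": -9.302, "SL": -3.141}       # 4/25, 7/9, 9/7

# If the odd games were shifted, the differences would be similar across pitch types.
shift = {p: round(odd[p] - baseline[p], 3) for p in baseline}
# If they were rescaled, the ratios would be similar instead.
ratio = {p: round(odd[p] / baseline[p], 3) for p in baseline}

for pitch in baseline:
    print(pitch, shift[pitch], ratio[pitch])
```

The differences all land in the same neighborhood (a few inches downward) while the ratios are all over the map, so an offset might be worth considering alongside a scaling problem.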

I’d love to hear from anyone who has seen anything similar, or who might have an idea as to what might be going on.  I’m sure the problem lies with the data, but I’ve :cough: been wrong before.

And … bear with me while I figure out how to make tables and plots.  Tons of fun, but I’m at the simple stage for now :-þ

Well, this is odd …

I’ve reloaded data from MLB for the last four years to include pitchf/x and hit data.  In learning how to use ggplot2, I’ve come across something perplexing.

Looking at the pitchf/x data for Justin Verlander over the years 2013-2016, there appear to be three games in 2013 where the PFX_X and PFX_Z data are scaled differently. These three games are from 20130425, 20130709, and 20130907 and can be seen in the center of the bottom half of the 2013 plot.

It’s not a plotting issue … the pitchf/x data for those pitches is different.  Anybody else seen this?  More to follow as I try to figure out what is going on …




Unleash the hounds :-þ

I’ve had it in the back of my mind for some time now that it wouldn’t be too much of a leap to alter my parser from the collection of batter-related transition state changes to the collection of pitch data … both sets of information are in the same xml file.  So, last night, after I got down to the last three differences between my parsing results and those from Retrosheet, I decided to give it a whirl.

Oh, oh, oh … it was a piece of cake compared to generating transition state changes … and now I have a database table of 715819 pitches by 51 fields for each pitch.  Woohoo!

Now the question is:  what questions to ask about pitches???  I think it’s PCA time … and also time to revisit charts in R.

Pitch locations for the first three games of the season:

Pitches from the first three games that were ‘called balls’:


… not bad there, ump!

On the other hand … pitches that were ‘called strikes’:


… no wonder the players get cheesed at the umpires :-þ

On another other hand though … pitches that were swung at:


… pretty sure the umpires figure the batters are blind too!

And now … hits!

Miguel Cabrera’s spray chart for 2016. The representative field is 330′ down the lines and 371′ to center.



Parsing is even harder …

… when different sources (i.e. Retrosheet and MLB) record a play differently.

I’m down to the short strokes with my new parser. Rather than having to interpret myriad text descriptions of plays that don’t involve the batter-runner, I am processing the movement of players from base to base using runner events only. I’ve had to figure out how to do a secondary sort of xml child elements to group the two types of runner events in the correct sequence, and how to loop through the analysis without writing to the database unless a batter-involved transition state change had occurred. The effort has been worth it: the same results now come from much less code, with little to no ambiguity.

However, results are only as good as the original data … so here’s an interesting play:

In the September 17, 2016 game between the Royals and the White Sox, top of the 4th … Todd Frazier steals second on the same pitch on which Jason Coats is called out on strikes. Did Coats strike out with a man on first (transition from 100 1 to 100 2) or with a man on second (010 1 to 010 2)? When I get some time, I’m going to try to wade through the Official Rules at MLB to see if there is a description of how this situation should be handled for scoring.

MLB has it all happening as one event, which I think is incorrect, resulting in the transition 100 1 to 010 2.  Retrosheet has the strikeout occurring with a man at second, 010 1 to 010 2.  Doesn’t sound like much, but to me different is different.
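To make the notation concrete, here is a small sketch of how the two readings of that play turn into different table entries. The `encode` helper is hypothetical; it just builds the "bases outs" strings used above (1st, 2nd, 3rd occupancy followed by the number of outs).

```python
from collections import Counter


def encode(first, second, third, outs):
    """Build a state string such as '100 1' from base occupancy and outs."""
    bases = "".join("1" if occupied else "0" for occupied in (first, second, third))
    return f"{bases} {outs}"


# MLB's reading: steal and strikeout recorded as one batter event.
mlb_style = [(encode(True, False, False, 1), encode(False, True, False, 2))]

# Retrosheet's reading: the steal is a runner-only event (no batter transition),
# and the strikeout then happens with the runner already on second.
runner_only_event = (encode(True, False, False, 1), encode(False, True, False, 1))
retrosheet_style = [(encode(False, True, False, 1), encode(False, True, False, 2))]

mlb_counts = Counter(mlb_style)
retrosheet_counts = Counter(retrosheet_style)
print(mlb_style[0], retrosheet_style[0], mlb_counts == retrosheet_counts)
```

One play, two different cells in the transition table, which is exactly the kind of mismatch that shows up when the tables are compared.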

I’ve also found a few events where MLB doesn’t appear to have been consistent with using a separate event ID for runner events.  These result in transition state changes that aren’t correct.  I’d love to tell MLB about them but there doesn’t seem to be any way to do that … at least not yet.

Parsing is hard work …

… especially when the source contains odd descriptions.

One thing I have to do with my parser is to decipher text descriptions of base-running plays.  It’s clunky but it works.  Except when this sort of thing appears:

<action b="0" s="1" o="0" des="With Welington Castillo batting, Michael Bourn advances to 3rd base on a caught stealing error by Tommy Joseph, assist to pitcher Adam Morgan to third baseman Maikel Franco to second baseman Cesar Hernandez to third baseman Maikel Franco." des_es="Con Welington Castillo bateando, Michael Bourn atrapado robando, avanza a 3ra por error de Tommy Joseph, asistencia para lanzador Adam Morgan a tercera base Maikel Franco a segunda base Cesar Hernandez a tercera base Maikel Franco." event="Caught Stealing 3B" event_es="Retirado en Intento de Robo 3B" tfs="003748" tfs_zulu="2016-06-18T00:37:48Z" player="456422" pitch="1" event_num="300" play_guid="3bc5d844-1d09-46b7-8d49-8365b8466403" home_team_runs="2" away_team_runs="3"/>

This is from the 5th inning of the PHI-ARI game on June 17.  Bourn actually winds up at 2nd base when the ball is dropped there.  Officially, he is caught stealing, but he doesn’t wind up at 3rd.  So, I have to go into the original xml file and edit out the offending passage in order to generate the correct transition state change.

Looks like I’m going to have to live with this sort of thing for now.  And it’s good motivation to rewrite the parser so that it can ignore text descriptions and just use the information from the ‘runner’ child elements in atbats … assuming that everything else is correct.
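A rough sketch of what "just use the runner elements" could look like. It assumes each `runner` child carries `id`, `start`, and `end` attributes, with `start=""` for the batter and `end=""` for a runner who scored or was put out; the sample XML and player ids are invented.

```python
import xml.etree.ElementTree as ET

BASES = ("1B", "2B", "3B")


def apply_runners(bases, atbat):
    """Return the new {base: player_id} map after applying every <runner> child."""
    new_bases = dict(bases)
    runners = atbat.findall("runner")
    # Pass 1: every listed runner leaves the base he started on.
    for r in runners:
        start = r.get("start")
        if start in new_bases and new_bases[start] == r.get("id"):
            del new_bases[start]
    # Pass 2: anyone who ends on a base occupies it; end="" means out or scored.
    for r in runners:
        if r.get("end") in BASES:
            new_bases[r.get("end")] = r.get("id")
    return new_bases


def occupancy(bases):
    """Turn a {base: player_id} map into a string such as '101'."""
    return "".join("1" if b in bases else "0" for b in BASES)


# Invented example: a single with a runner going from first to third.
atbat = ET.fromstring(
    "<atbat>"
    '<runner id="b1" start="" end="1B" event="Single"/>'
    '<runner id="r1" start="1B" end="3B" event="Single"/>'
    "</atbat>"
)
before = {"1B": "r1"}
after = apply_runners(before, atbat)
print(occupancy(before), "->", occupancy(after))
```

Doing it in two passes means the order of the runner elements doesn't matter for the final occupancy, and no text description is ever read.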

The bright side is that it’s only one event out of almost 200,000 …. btw I’m down to the last 20 differences.

Transition state ch-ch-ch-changes

In order to use the probability of transition state change as the basis for a game simulator, I want to have up-to-date information.  To that end, I collect game data from MLB and have written a parser that updates transition state probabilities for each player as they progress through the season.

The problem is that using MLB gameday xml files means I need to have the parser read through descriptions of plays to determine when a transition state change actually occurred and what that change was.  One of the major headaches has been that MLB does not list the runner events in the correct order.  The parser needs to proceed around the bases: by this I mean that I have to figure out what has happened at 3rd base before I can figure out what has happened at 2nd base etc.

To enable this, I have had to figure out how to sort xml child elements by their content. It was a struggle but I figured it out.
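For the curious, here is one way such a sort might be done with Python's standard library. This is a sketch rather than my parser's actual code: it assumes the runner elements have a `start` attribute and sorts on that, lead runner first, while leaving the non-runner children where they were at the front.

```python
import xml.etree.ElementTree as ET

# Lead runner first: 3rd base, then 2nd, then 1st, then the batter (start="").
BASE_ORDER = {"3B": 0, "2B": 1, "1B": 2, "": 3}


def sort_runners(atbat):
    """Reorder <runner> children in place so they proceed from 3rd base back to home."""
    runners = atbat.findall("runner")
    others = [child for child in atbat if child.tag != "runner"]
    # list.sort is stable, so runners with the same start keep their original order.
    runners.sort(key=lambda r: BASE_ORDER.get(r.get("start", ""), 3))
    atbat[:] = others + runners
    return atbat


# Invented example with the runners listed in an unhelpful order.
atbat = ET.fromstring(
    "<atbat>"
    '<pitch des="In play, run(s)"/>'
    '<runner id="b1" start="" end="1B"/>'
    '<runner id="r1" start="1B" end="3B"/>'
    '<runner id="r3" start="3B" end=""/>'
    "</atbat>"
)
sort_runners(atbat)
order = [r.get("start") for r in atbat.findall("runner")]
print(order)
```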

Another issue is that for certain events (in particular, base-running plays such as steals), MLB will use a player’s name and not his numerical ID. Next on the list will be to figure out how to look up a player’s ID from the parser so that I can do a better job of determining who is at each base … without having to parse the myriad text descriptions that appear. The problem with doing it this way is that each time the slightest variation in a text description appears, I have to write a new line of code to process it. It would be much easier if the parser just looked at who is at each base.

Another issue is that MLB lists each base-running event such as steals in two places … which means it has to be processed once and ignored the second time.

So, my parser isn’t completely accurate. I know this because I can compare the transition-state-change table that I get from the data I retrieve from MLB with the table that I get from parsing Retrosheet’s data with their parser. Unfortunately for me, I can only do this once a year in December when Retrosheet posts the data from the previous season. I was able to make my parser’s results agree with Retrosheet’s data after the 2015 season, but the 2016 results show that things still aren’t quite correct.

Here is the table that results from parsing Retrosheet’s data:

And here is the table from parsing MLB’s xml files:

And here is a table showing the differences between the two:


Actually, it’s not as bad as it looks.  There are only 344 to track down and each fix may eliminate more than one difference.

[This was definitely the case. I found that MLB had added an attribute to a particular child element that broke a regular expression search. Fixing this leaves only 94 differences.]
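The comparison itself is simple enough to sketch. Assuming each table is a mapping from a (before, after) transition to a count, the differences are just the cells where the counts disagree; the counts below are invented, and summing absolute differences is only one possible way of tallying them.

```python
from collections import Counter


def diff_tables(mine, reference):
    """Return {transition: mine - reference} for every cell whose counts disagree."""
    keys = set(mine) | set(reference)
    return {
        k: mine.get(k, 0) - reference.get(k, 0)
        for k in sorted(keys)
        if mine.get(k, 0) != reference.get(k, 0)
    }


# Invented counts: the two tables agree except for one disputed play.
mlb_parse = Counter({("000 0", "000 1"): 50, ("100 1", "100 2"): 11, ("100 1", "010 2"): 1})
retrosheet = Counter({("000 0", "000 1"): 50, ("100 1", "100 2"): 11, ("010 1", "010 2"): 1})

diffs = diff_tables(mlb_parse, retrosheet)
total = sum(abs(v) for v in diffs.values())
print(diffs, total)
```

Note how a single mis-scored play shows up as two differing cells, which is why one fix can knock out more than one difference.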

So, it’s back to the code … to figure out why the differences have occurred and to eliminate them …