2013 was a bad year for pitchf/x data

I’ve run this query to find games where average pfx_z movement of fastballs (PITCH_TYPE  = FF) differed from one game to another.  There is probably a better way using a nested SELECT, but this worked for me:

create temporary table pfx_temp_01 as
Select concat(game_id, home) as new_game_id, pfx_z from baseball_13.mlb_pitches WHERE PITCH_TYPE = ‘FF’;
create temporary table pfx_temp_02 as
select new_game_id, avg(pfx_z) as avg_pfx_z from pfx_temp_01 group by new_game_id;
select new_game_id, avg_pfx_z from pfx_temp_02 where avg_pfx_z < 0 ;
drop temporary table pfx_temp_01, pfx_temp_02;

This gives me a list of 173 games from 2013 where the average pfx_z for a game was less than zero.  The average (of the average) for these games is -4.01 (sd = 0.53).  The average (of the average) for all the other games of that year (i.e. where the average pfx_z was greater than zero) is 9.06 (sd = 1.38).

Then I ran this query to extract the individual pitches from the subset of games where the average pfx_z was less than zero:

create temporary table pfx_temp_01 as
Select concat(game_id, home) as new_game_id, pfx_z from baseball_13.mlb_pitches WHERE PITCH_TYPE = ‘FF’;
create temporary table pfx_temp_02 as
select new_game_id, avg(pfx_z) as avg_pfx_z from pfx_temp_01 group by new_game_id;
create temporary table pfx_temp_03 as
select new_game_id, avg_pfx_z from pfx_temp_02 where avg_pfx_z < 0 ;
select a.new_game_id, a.pfx_z from pfx_temp_01 a join pfx_temp_03 b where a.new_game_id = b.new_game_id;
drop temporary table pfx_temp_01, pfx_temp_02, pfx_temp_03;

The average pfx_z of the 18,548 individual FF pitches for these games is -3.99 (sd = 1.07)

The average pfx_z of the 227,067 individual FF pitches for the games where the average pfx_z for the game was greater than zero is 9.11 (sd = 2.34)

These have got to be two very distinct groups.  Now to try and figure out why …

………………
total pitches thrown in 2013 = 709630
FF pitches in 2013 = 245615

………………….

And, as it turns out, 2013 isn’t the only year.  I’ve downloaded game data back to 2010 and the results are:

Year # games where average FF pfx_z < 0
2010 0
2011 1
2012 72
2013 173
2014 0
2015 0
2016 0

The mystery deepens …

I’ve looked at pfx_x and pfx_z movement on a day-by-day basis to see if there are other days with the same issue.  And, yes there are; here are the plots for the games from June 26 to June 30:

The game_id 201306282 is the second game of a doubleheader between Cleveland and the WhiteSox and so I looked up the starters to see if their pitches of this day were different from the rest of their starts.

Here are Jose Quintana’s plots for 2013:

And here are Carlos Carrasco’s plots for 2013:

So, both pitchers had the same change in overall pitch movement on the same day.  This just can’t be right.

Explorations continue …

 

Three days of the pitchf/x condor …

There appears to be something funky about at least three games worth of pitchf/x data from 2013.

I was playing around with learning to use ggplot2 and using Justin Verlanders data from 2013.  This revealed a strange pattern of pitch movement during the three starts on 4/25, 7/9, and 9/7 where the pfx_x and pfx_z data appear to be scaled differently.

pfx_x vs pfx_z for JV 2013-2016

I’ve taken two other three-game samples from 2013 and also the entire season (without the three odd games) and the average pfx_x and pfx_z data look like this:

 

Dataset
All games except 4/25 7/9 9/7
Games of 5/5 7/4 9/13
Games of 5/6 7/6 9/16
Games of 4/25 7/9 9/7
PITCH pfx_x pfx_z
FF -1.577 8.204
FF -1.961 9.155
FF -1.599 8.059
FF -0.849 3.501
PITCH pfx_x pfx_z
CH -1.08 3.923
CH -1.227 4.475
CH -0.725 4.079
CH -1.246 0.694
PITCH pfx_x pfx_z
CU 3.129 -5.93
CU 3.114 -5.58
CU 3.079 -5.952
CU 2.202 -9.302
PITCH pfx_x pfx_z
SL 1.809 0.333
SL 1.851 1.257
SL 1.013 0.114
SL 1.426 -3.141

I’d love to hear from anyone who has seen anything similar, or who might have an idea as to what might be going on.  I’m sure the problem lies with the data, but I’ve :cough: been wrong before.

And … bear with me while I figure out how to make tables and plots.  Tons of fun, but I’m at the simple stage for now :-þ

Well, this is odd …

I’ve reloaded data from MLB for the last four years to include pitchf/x and hit data.  In learning how to use ggplot2, I’ve come across something perplexing.

Looking at the pitchf/x data for Justin Verlander over the years 2013-2016, there appears to be three games in 2013 where the PFX_X and PFX_Z data are scaled differently.  These three games are from 20130425, 20130709, and 20130907 and can be seen in the center of the bottom half of the 2013 plot.

It’s not a plotting issue … the pitchf/x data for those pitches is different.  Anybody else seen this?  More to follow as I try to figure out what is going on …