Finding the maximum sum path through a two-dimensional array

I’m interested in simulating batting orders to compare the results with the actual orders used by major league teams. There are two reasons: the first is the hope that testing will confirm that the simulator works reasonably well; the second is to generate orders that can be used to re-simulate the season after the fact, as another way to test the simulator and to make improvements.

I retrieve gameday files every day of the season, so I have an events file that lists each batter. I queried the database table to copy the list of events to Excel and then used MATCH to find the first batter for each team in each game. From there, I took the first nine batters, assuming that this was the batting order. Results were pooled, and a list of all batters who played for each team during the season was generated. Some more Excel logic tabulated the number of times each batter hit in each position (1-9) in the batting order. There are more players involved than you might think, and more variation from a set order as well. The result for each team is a table of about 20 players by the nine spots in the order.
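The tabulation steps above (find each game's first batter, take the next eight as the order, count spots) can be sketched programmatically. Here is a minimal Python illustration; the `events` rows are made up, standing in for the real gameday events table:

```python
from collections import Counter

# Hypothetical events: (game_id, team, batter) in plate-appearance order,
# standing in for the gameday events table described above.
events = [
    ("g1", "TOR", "Pillar"), ("g1", "TOR", "Tulowitzki"), ("g1", "TOR", "Bautista"),
    ("g2", "TOR", "Pillar"), ("g2", "TOR", "Bautista"), ("g2", "TOR", "Tulowitzki"),
]

def spot_counts(events, team, lineup_size=9):
    """Count how often each batter occupied each spot (1..lineup_size)."""
    orders = {}  # game_id -> batting order seen so far
    for game_id, tm, batter in events:
        if tm != team:
            continue
        order = orders.setdefault(game_id, [])
        if batter not in order and len(order) < lineup_size:
            order.append(batter)  # first N distinct batters = the batting order
    counts = Counter()  # (batter, spot) -> number of games
    for order in orders.values():
        for spot, batter in enumerate(order, start=1):
            counts[(batter, spot)] += 1
    return counts

counts = spot_counts(events, "TOR")
print(counts[("Pillar", 1)])  # 2: Pillar led off both hypothetical games
```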

The next step is to determine the consensus order for each team. First, players with only a few games can be eliminated. Then it’s a matter of finding the maximum sum path through the array. You can eyeball this, but to be sure it’s better to use the computer. The problem is the immense number of combinations that need to be tested. And there is no method on the web that does the job either … at least I couldn’t find one. So, I decided to figure it out.

A list of 12 players by nine spots in the order yields 12^9 (5,159,780,352) possible combinations. Even with eight processors, it’s going to take a very long time … and that’s just for one team. So more players needed to be eliminated. I sorted each list and took the player who started the most games at each position, giving me a list of nine players per team. The 9 x 9 array still has 9^9 = 387,420,489 combinations, but this can be reviewed in about 20 minutes.
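The counts above are easy to verify directly:

```python
from math import factorial

# Position-by-position choices for 12 candidate players over 9 spots
print(12 ** 9)       # 5159780352
# Same thing for the trimmed 9-player list
print(9 ** 9)        # 387420489
# Orders where no player is repeated -- the only ones worth testing
print(factorial(9))  # 362880
```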

That’s still too long, though, as that’s just for one team. Then I realized that a great many combinations did not need to be tested, i.e. all the ones with duplicate positions. The problem was how to do that programmatically. Then it hit me: what I really wanted to test was the 362,880 (9!) permutations of 9 players. If I could generate a list of these permutations, I could use the numbers as coordinates to indicate which array element was needed.

So, what I needed was a list of permutations like this:

1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 9 8
1 2 3 4 5 6 9 7 8
1 2 3 4 5 6 9 8 7
1 2 3 4 5 9 6 7 8

9 8 7 6 5 4 3 2 1

i.e. all the combinations of 1-9 where no number is repeated. The problem was: how do I generate this list? I tried a few things and then decided to poke around the web. What a struggle … nothing was straightforward, and nothing was going to give me the list I wanted … until I stumbled across TextMechanic. This site has a whole host of text manipulation tools, and among them is a permutations generator. Now I have my list.

All I had to do was revise the R script to use each row of the permutations file as a list of array elements. Now I can find the maximum sum path through a 9 x 9 array in about 7 seconds.
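For illustration, here is a Python sketch of the same trick (the actual script is in R). The `table` below is a made-up 3 x 3 counts array standing in for the real 9 x 9 one, and `itertools.permutations` stands in for the pregenerated permutations file:

```python
from itertools import permutations

# Hypothetical counts table: rows = players, columns = batting spots.
table = [
    [5, 1, 0],
    [2, 4, 1],
    [0, 3, 6],
]

def max_sum_path(table):
    """Best assignment of one player per spot, brute-forced over permutations."""
    n = len(table)
    best_sum, best_perm = -1, None
    for perm in permutations(range(n)):
        # perm[spot] is the player assigned to that spot -- no duplicates possible
        total = sum(table[player][spot] for spot, player in enumerate(perm))
        if total > best_sum:
            best_sum, best_perm = total, perm
    # Report 1-based player numbers per spot, matching the post's output format
    return best_sum, tuple(p + 1 for p in best_perm)

print(max_sum_path(table))  # (15, (1, 2, 3))
```

Brute force over 9! permutations is entirely manageable; for much larger arrays, the same problem is the classic assignment problem and can be solved in polynomial time (e.g. the Hungarian algorithm, or `scipy.optimize.linear_sum_assignment`).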

Here is the R script that finds the path … all it takes is about 30 lines.


team.csv is the table of players by batting-order spot, giving the number of times each player hit in each spot.

output.csv is the list of permutations obtained by using the app at

The result for the Jays is:

3  7  4  9  6  5  2  8  1

Looks like this:

Name              Position   Times in each spot (1-9)
Jose Bautista     OF         5119533310000
Russell Martin    C          034416212111
Kendrys Morales   DH         001274534000
Darwin Barney     2B         030006222737
Steve Pearce      OF         1130521211750
Justin Smoak      1B         005145394720
Josh Donaldson    3B         07437100000
Ryan Goins        SS         0000013213840
Kevin Pillar      OF         5810092913337



More fun with the Jays batting order

The simulation summarized in the last post was run from the start of the game only. In other words, it measured how each lineup would do at the start of the game.

I reran the simulation by reprogramming the simulator to run for nine innings – beginning each new inning with the batter who was on deck at the end of the previous inning.

Each combination (there are 362,880) was simulated 1,000 times. I took the top 363 results and totalled the number of times each player appeared in each position in the order.
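The tallying step can be sketched like this, with a hypothetical set of three-player "lineups" standing in for the real top 363 nine-player results:

```python
from collections import Counter

# Hypothetical top lineups (tuples of names in batting order),
# standing in for the top 363 simulated results.
top_lineups = [
    ("Pillar", "Donaldson", "Bautista"),
    ("Pillar", "Tulowitzki", "Bautista"),
    ("Pillar", "Donaldson", "Smoak"),
]

tally = Counter()  # (player, spot) -> appearances among the top lineups
for lineup in top_lineups:
    for spot, player in enumerate(lineup, start=1):
        tally[(player, spot)] += 1

print(tally[("Pillar", 1)])     # 3: Pillar leads off in every top lineup
print(tally[("Donaldson", 2)])  # 2
```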

The overwhelming favorite to lead off is Kevin Pillar and the top four seem quite certain.

Position in Batting Order
Player 1 2 3 4 5 6 7 8 9
Kevin Pillar 335 0 0 0 2 7 9 3 7
Troy Tulowitzki 8 85 35 6 71 54 42 28 34
Jose Bautista 16 17 151 66 29 35 19 19 11
Steve Pearce 0 3 13 175 44 37 28 32 31
Josh Donaldson 4 247 51 3 40 10 1 2 5
Luke Maile 0 0 0 10 35 62 90 81 85
Russell Martin 0 9 13 71 63 66 49 52 40
Justin Smoak 0 0 97 31 47 46 48 49 45
Darwin Barney 0 2 3 1 32 46 77 97 105

So, here is what the maximum path through the table says the Jays need to do:


… and then send me a modest slice of the pie after they win it all :-þ


Your team is under .500 … give this a whirl

Steve Phillips wrote an article about the Jays batting order … which got me thinking: what does my simulator say? The simulation uses everyone’s tendencies over the past four seasons as input … not just who is hot at this time.

And … the simulator says the Jays should be using this order (and it’s not even close):

Kevin Pillar
Troy Tulowitzki
Jose Bautista
Russell Martin
Josh Donaldson
Steve Pearce
Justin Smoak
Darwin Barney
Luke Maile

C’mon boys … today’s order didn’t work so give it a whirl.

Bold predictions for 2017 … part I

Here is what my simulator thinks is going to happen in 2017, based on the lineups posted at rotochamp. 2430 games x 1000 seasons. Same assumptions as last year: constant, uninterrupted 5-man rotations (haha), the starter pitches the whole game (bwahahaha), and everyone performs the same way this year as they did last year (equally hilarious) … however, you have to start somewhere :-þ

Next up … adding in provisions for starters pitching 6 innings and considering bullpens …

After some experimentation with pseudocounts … I like this better … there isn’t as great a spread between the highest and lowest win totals.
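The post doesn't spell out how the pseudocounts were applied; the standard approach is additive smoothing, which pulls each player's observed rate toward a prior and thereby compresses the spread between extreme and average performers. A sketch with made-up numbers:

```python
def smoothed_rate(successes, trials, prior_rate, pseudocount):
    """Shrink an observed rate toward a prior by adding pseudocount 'virtual' trials."""
    return (successes + pseudocount * prior_rate) / (trials + pseudocount)

# A hypothetical hitter: 30 hits in 80 at-bats looks like a .375 hitter,
# but adding 100 pseudocounts of a .250 prior pulls the estimate back.
raw = 30 / 80
smoothed = smoothed_rate(30, 80, 0.250, 100)
print(round(raw, 3), round(smoothed, 3))  # 0.375 0.306
```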

2013 was a bad year for pitchf/x data

I’ve run this query to find games where the average pfx_z movement of fastballs (PITCH_TYPE = FF) was out of line with the rest of the season. There is probably a better way using a nested SELECT, but this worked for me:

CREATE TEMPORARY TABLE pfx_temp_01 AS
SELECT CONCAT(game_id, home) AS new_game_id, pfx_z FROM baseball_13.mlb_pitches WHERE PITCH_TYPE = 'FF';
CREATE TEMPORARY TABLE pfx_temp_02 AS
SELECT new_game_id, AVG(pfx_z) AS avg_pfx_z FROM pfx_temp_01 GROUP BY new_game_id;
SELECT new_game_id, avg_pfx_z FROM pfx_temp_02 WHERE avg_pfx_z < 0;
DROP TEMPORARY TABLE pfx_temp_01, pfx_temp_02;

This gives me a list of 173 games from 2013 where the average pfx_z for a game was less than zero.  The average (of the average) for these games is -4.01 (sd = 0.53).  The average (of the average) for all the other games of that year (i.e. where the average pfx_z was greater than zero) is 9.06 (sd = 1.38).

Then I ran this query to extract the individual pitches from the subset of games where the average pfx_z was less than zero:

CREATE TEMPORARY TABLE pfx_temp_01 AS
SELECT CONCAT(game_id, home) AS new_game_id, pfx_z FROM baseball_13.mlb_pitches WHERE PITCH_TYPE = 'FF';
CREATE TEMPORARY TABLE pfx_temp_02 AS
SELECT new_game_id, AVG(pfx_z) AS avg_pfx_z FROM pfx_temp_01 GROUP BY new_game_id;
CREATE TEMPORARY TABLE pfx_temp_03 AS
SELECT new_game_id, avg_pfx_z FROM pfx_temp_02 WHERE avg_pfx_z < 0;
SELECT a.new_game_id, a.pfx_z FROM pfx_temp_01 a JOIN pfx_temp_03 b ON a.new_game_id = b.new_game_id;
DROP TEMPORARY TABLE pfx_temp_01, pfx_temp_02, pfx_temp_03;

The average pfx_z of the 18,548 individual FF pitches in these games is -3.99 (sd = 1.07).

The average pfx_z of the 227,067 individual FF pitches in the games where the average pfx_z for the game was greater than zero is 9.11 (sd = 2.34).

These have got to be two very distinct groups.  Now to try and figure out why …
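As a rough sanity check using only the summary statistics reported above, the standardized difference (Cohen's d) between the two groups is enormous:

```python
from math import sqrt

# Summary statistics reported above for individual FF pfx_z values
n1, m1, s1 = 18548, -3.99, 1.07    # games where the game-level average was < 0
n2, m2, s2 = 227067, 9.11, 2.34    # all other games

# Pooled standard deviation, then Cohen's d for the difference in means
pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
cohens_d = (m2 - m1) / pooled_sd
print(round(cohens_d, 2))  # 5.77 -- far beyond any conventional "large" effect
```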

total pitches thrown in 2013 = 709,630
FF pitches in 2013 = 245,615


And, as it turns out, 2013 isn’t the only year.  I’ve downloaded game data back to 2010 and the results are:

Year # games where average FF pfx_z < 0
2010 0
2011 1
2012 72
2013 173
2014 0
2015 0
2016 0

The mystery deepens …

I’ve looked at pfx_x and pfx_z movement on a day-by-day basis to see if there are other days with the same issue.  And, yes there are; here are the plots for the games from June 26 to June 30:

The game_id 201306282 is the second game of a doubleheader between Cleveland and the White Sox, so I looked up the starters to see if their pitches that day were different from the rest of their starts.

Here are Jose Quintana’s plots for 2013:

And here are Carlos Carrasco’s plots for 2013:

So, both pitchers had the same change in overall pitch movement on the same day.  This just can’t be right.

Explorations continue …


Three days of the pitchf/x condor …

There appears to be something funky about at least three games’ worth of pitchf/x data from 2013.

I was playing around with learning to use ggplot2, working with Justin Verlander’s data from 2013. This revealed a strange pattern of pitch movement in three starts, on 4/25, 7/9, and 9/7, where the pfx_x and pfx_z data appear to be scaled differently.

pfx_x vs pfx_z for JV 2013-2016

I’ve taken two other three-game samples from 2013 and also the entire season (without the three odd games) and the average pfx_x and pfx_z data look like this:


PITCH   Games                        pfx_x    pfx_z
FF      All except 4/25 7/9 9/7     -1.577    8.204
FF      5/5 7/4 9/13                -1.961    9.155
FF      5/6 7/6 9/16                -1.599    8.059
FF      4/25 7/9 9/7                -0.849    3.501
CH      All except 4/25 7/9 9/7     -1.08     3.923
CH      5/5 7/4 9/13                -1.227    4.475
CH      5/6 7/6 9/16                -0.725    4.079
CH      4/25 7/9 9/7                -1.246    0.694
CU      All except 4/25 7/9 9/7      3.129   -5.93
CU      5/5 7/4 9/13                 3.114   -5.58
CU      5/6 7/6 9/16                 3.079   -5.952
CU      4/25 7/9 9/7                 2.202   -9.302
SL      All except 4/25 7/9 9/7      1.809    0.333
SL      5/5 7/4 9/13                 1.851    1.257
SL      5/6 7/6 9/16                 1.013    0.114
SL      4/25 7/9 9/7                 1.426   -3.141

I’d love to hear from anyone who has seen anything similar, or who has an idea as to what might be going on. I’m sure the problem lies with the data, but I’ve :cough: been wrong before.

And … bear with me while I figure out how to make tables and plots.  Tons of fun, but I’m at the simple stage for now :-þ

Well, this is odd …

I’ve reloaded data from MLB for the last four years to include pitchf/x and hit data.  In learning how to use ggplot2, I’ve come across something perplexing.

Looking at the pitchf/x data for Justin Verlander over the years 2013-2016, there appear to be three games in 2013 where the PFX_X and PFX_Z data are scaled differently. These three games are from 20130425, 20130709, and 20130907 and can be seen in the center of the bottom half of the 2013 plot.

It’s not a plotting issue … the pitchf/x data for those pitches is different.  Anybody else seen this?  More to follow as I try to figure out what is going on …




Unleash the hounds :-þ

I’ve had it in the back of my mind for some time now that it wouldn’t be too much of a leap to alter my parser from the collection of batter-related transition state changes to the collection of pitch data … both sets of information are in the same xml file.  So, last night, after I got down to the last three differences between my parsing results and those from Retrosheet, I decided to give it a whirl.

Oh, oh, oh … it was a piece of cake compared to generating transition state changes … and now I have a database table of 715,819 pitches by 51 fields per pitch. Woohoo!

Now the question is:  what questions to ask about pitches???  I think it’s PCA time … and also time to revisit charts in R.
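As a starting point for the PCA idea, principal components can be computed with nothing more than numpy's SVD. `X` here is random stand-in data, not the real 51-field pitch table:

```python
import numpy as np

# Random stand-in data: rows = pitches, columns = numeric pitchf/x fields
# (e.g. start_speed, pfx_x, pfx_z, ...); the real table has 51 fields.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

def pca(X, k):
    """Top-k principal components via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                 # coordinates along the top-k components
    explained = (S ** 2) / (S ** 2).sum()  # fraction of variance per component
    return scores, explained[:k]

scores, ratio = pca(X, 2)
print(scores.shape)  # (200, 2)
```

In practice the numeric pitch fields would need standardizing first (speed in mph and movement in inches are on very different scales), but the mechanics are the same.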

Pitch locations for the first three games of the season:

Pitches from the first three games that were ‘called balls’:


… not bad there, ump!

On the other hand … pitches that were ‘called strikes’:


… no wonder the players get cheesed at the umpires :-þ

On another other hand though … pitches that were swung at:


… pretty sure the umpires figure the batters are blind too!

And now … hits!

Miguel Cabrera’s spray chart for 2016. Representative field: 330′ down the lines and 371′ to center.



Parsing is even harder …

… when different sources (i.e. Retrosheet and MLB) record a play differently.

I’m down to the short strokes with my new parser. Rather than having to interpret myriad text descriptions of plays that don’t involve the batter-runner, I am processing the movement of players from base to base using runner events only. I had to figure out how to do a secondary sort of xml child elements to group the two types of runner events in the correct sequence, and I also had to figure out how to loop through the analysis without writing to the database unless a batter-involved transition state change had occurred. The effort has been worth it, though: the same results are now achieved with much less code and with little to no ambiguity.

However, results are only as good as the original data … so here’s an interesting play:

In the September 17, 2016 game between the Royals and the White Sox, top of the 4th … Todd Frazier steals second on the same pitch on which Jason Coats is called out on strikes. Did Coats strike out with a man on first (transition from 100 1 to 100 2) or with a man on second (010 1 to 010 2)? When I get some time, I’m going to wade through the Official Rules at MLB to see if there is a description of how this situation should be scored.

MLB has it all happening as one event, which I think is incorrect, resulting in the transition 100 1 to 010 2.  Retrosheet has the strikeout occurring with a man at second, 010 1 to 010 2.  Doesn’t sound like much, but to me different is different.
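Reading the post's notation as three base-occupied flags plus the out count, the two scorings can be made explicit with a tiny formatter (a sketch for illustration, not code from the actual parser):

```python
def transition(bases, outs):
    """Format a base/out state in the post's notation, e.g. (1, 0, 0), 1 -> '100 1'."""
    return "".join("1" if b else "0" for b in bases) + f" {outs}"

# MLB scores the steal and the strikeout as a single event:
print(transition((1, 0, 0), 1), "->", transition((0, 1, 0), 2))  # 100 1 -> 010 2
# Retrosheet lets the steal happen first, then scores the strikeout:
print(transition((0, 1, 0), 1), "->", transition((0, 1, 0), 2))  # 010 1 -> 010 2
```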

I’ve also found a few events where MLB doesn’t appear to have been consistent with using a separate event ID for runner events.  These result in transition state changes that aren’t correct.  I’d love to tell MLB about them but there doesn’t seem to be any way to do that … at least not yet.