I never take no for an answer from the computer … or, how to get an R function to return two values

Whenever I run a batting order simulation, I test all 362,880 possible combinations.  Simulating all those combinations 1000 times each takes 8 hours using an 8-core, 4 GHz, 32GB RAM computer.

As it stands, I can return only one value (in this case: total runs) from the function.  As part of determining if the simulation was working correctly, I thought it might be interesting to see how many batters came to the plate in a 9-inning simulation.

So, I wanted to return both runs and batters each time a simulation is run.  However, an R function can return only one value.  I poked around and found out that you can return a list as a way to return more than one value … but have been unable to make it work.  Part of the problem (I think) is that I am using ‘replicate’.

Here is the very much simplified version of what is being done in R:


create the cluster

#read in the list of all possible lineup combinations
#do the whole thing using parellel processing (8 processors)

results_lineups <- foreach(i = 1:combos) %dopar% {

#read in the transition-state-changes table for the combination being simulated
#simulate the combination 1000 times

results <- replicate(1000, simulate.lineups.game( variables list ))
results_lineups[i] <- mean(results)
#print the results to a csv file and stop the cluster


I’ve tried several different bits of syntax but nothing has worked.  So time for a brainstorm …

I’ve used CONCATENATE in Excel to create unique strings from two pieces of information and I wondered if I couldn’t create an artificial value that would do the same thing.

So, I modified the function to determine total runs and total batters and created a single value by:

total = (total.runs * 10000) + (batters/10000)

which creates a single value, for example: 40960.0038034

Then all I have to do is deconstruct the returned value into it’s component parts, in this case: 4.096 runs and 38.034 batters for that particular simulation.

So, it works, but I still wonder if there isn’t a way to return two values in R.  If anyone knows, or can point me in the right direction, I’m all ears … :-þ

So … where could Lorenzo Cain go in the Jays batting order?

I had a great time this weekend at the 2nd Annual Canadian Baseball History Symposium in St. Marys, Ontario.  Lots of fascinating presentations about baseball history in Canada from Addy to Zastryzny.

One of the conversations I had centered around the Jays batting order and the question was asked: If they sign Lorenzo, where should he go in the order?

I have run the simulation substituting Lorenzo for Jose (thanks for the memories … can’t wait to see your number installed in the Level of Excellence) Bautista.

… and the answer is: eighth

Russell MartinC13241245742490454379359346
Kendrys MoralesDH231435156663632531406413
Darwin Barney2B08810235295498675753823
Troy TulowitzkiSS124980449107383353309296268
Steve PearceLF221752831194343328320299305
Lorenzo CainRF1720213476343466508543548
Justin Smoak1B12853399401370388440415
Josh Donaldson3B148175268515730989524433
Kevin PillarCF27870434279107129118


Same methodology … 1000 simulations of each possible combination … consensus of top 3629 out of 362,880 … shortest path through results array.

Of course, it’s all moot if they don’t sign Donaldson :-þ


Thanks to Scott Crawford of the Canadian Baseball Hall of Fame for permission to include my photo of the Roy Halladay tribute that was on display at the symposium.  Roy was inducted into the Canadian Baseball Hall of Fame this past summer.

Finding the maximum sum path through a two-dimensional array

I’m interested in simulating batting orders to compare the results with the actual orders used by major league teams. There are two reasons: first is the hope that testing will confirm that the simulator works reasonably well, second is to generate orders that can be used to resimulate the season after the fact as another way to test function and to make improvements.

I retrieve gameday files every day of the season, so I have an events file that lists the batter. I queried the database table to copy the list of events to excel and then used MATCH to find the first batter for each team in each game. From there, I looked for the first nine batters assuming that this was the batting order. Results were pooled and a list of all batters who played for each team during the season was generated. Some more excel logic tabulated the number of times each batter batted in each position (1-9) in the batting order. There are more players involved than you might think, and more variation from a set order as well. The result for each team is a table of about 20 players x the nine spots in the order.

The next step is to determine the consensus order for each team. First, players with only a few games can be eliminated. Then it’s a matter of finding the maximum sum path through the array. You can eyeball this, but to be sure it’s better to use the computer. The problem is the immense number of combinations that need to be tested.  And there is no method on the web that does the job either … at least I couldn’t find one.   So, I decided to figure it out.

A list of 12 players x nine spots in the order yields 12^9 possible (5,159,780,352) combinations. Even with eight processors, it’s going to take a very long time … and that’s just for one team. So more players need to be eliminated. I sorted each list and took the player who started the most games at each position … giving me a list of nine players per team. The 9 x 9 array still has 387,420,489 combinations but this can be reviewed in about 20 minutes.

That’s still too long though as that’s just for one team. Then I realized that there were a great many combinations that did not need to be tested, i.e. all the ones with duplicate positions.  The problem was how to do that programmatically.  Then it hit me: what I really wanted to test is the 362,880 (9!) permutations for 9 players. I realized that if I could generate a list of these permutations, I could use the numbers as coordinates to indicate which array element was needed.

So, what I needed was a list of permutations like this:

1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 9 8
1 2 3 4 5 6 9 7 8
1 2 3 4 5 6 9 8 7
1 2 3 4 5 9 6 7 8

9 8 7 6 5 4 3 2 1

i.e. all the combinations of 1-9 where no number is repeated. The problem was: how do I generate this list. I tried a few things and then decided to poke around the web. What a struggle … nothing was straightforward, and nothing was going to give me the list I wanted … until I stumbled across TextMechanic. This site has a whole host of text manipulation tools and among them is a permutations generator. Now, I have my list.  … [ And then I remembered that I had already figured out how to do this in R … what a dope :-þ ]

All I had to do was revise the R script to use each row of the permutations file as a list of array elements. Now I can find the maximum sum path through a 9 x 9 array in about 7 seconds.

Here is the R script that finds the path … all it takes is about 30 lines.


team.csv is a list of batting order spots by the number of times each player hit in that spot.

output.csv is the list of permutations obtained by using the app at TextMechanic.com

The result for the Jays is:

3  7  4  9  6  5  2  8  1

Looks like this:

namepositionspot in batting order
Jose BautistaOF5119533310000
Russell MartinC034416212111
Kendrys MoralesDH001274534000
Darwin Barney2B030006222737
Steve Pearce OF1130521211750
Justin Smoak1B005145394720
Josh Donaldson3B07437100000
Ryan GoinsSS0000013213840
Kevin PillarOF5810092913337



More fun with the Jays batting order

The simulation summarized in the last post was the result from the start of the game only.  In other words, how would each lineup do at the start of the game.

I reran the simulation by reprogramming the simulator to run for nine innings – beginning each new inning with the batter who was on deck at the end of the previous inning.

Each combination (there are 362,880) was simulated 1000 times.  I took the top 363 results and totalled up the number of times each player appeared in each position in the order.

The overwhelming favorite to lead off is Kevin Pillar and the top four seem quite certain.

Position in Batting Order
Player 1 2 3 4 5 6 7 8 9
Kevin Pillar 335 0 0 0 2 7 9 3 7
Troy Tulowitzki 8 85 35 6 71 54 42 28 34
Jose Bautista 16 17 151 66 29 35 19 19 11
Steve Pearce 0 3 13 175 44 37 28 32 31
Josh Donaldson 4 247 51 3 40 10 1 2 5
Luke Maile 0 0 0 10 35 62 90 81 85
Russell Martin 0 9 13 71 63 66 49 52 40
Justin Smoak 0 0 97 31 47 46 48 49 45
Darwin Barney 0 2 3 1 32 46 77 97 105

So, here is what the maximum path through the table says the Jays need to do:


… and then send me a modest slice of the pie after they win it all :-þ


Your team is under .500 … give this a whirl

Steve Phillips of TSN.ca wrote an article about the Jays batting order … which got me thinking … what does my simulator say?  The simulation uses everyone’s tendencies of the past four seasons as input … not just who is hot at this time.

And … the simulator says the Jays should be using this order (and it’s not even close):

Kevin Pillar
Troy Tulowitzki
Jose Bautista
Russell Martin
Josh Donaldson
Steve Pearce
Justin Smoak
Darwin Barney
Luke Maile

C’mon boys … today’s order didn’t work so give it a whirl.

Bold predictions for 2017 … part I

Here is what my simulator thinks is going to happen in 2017.  Based on the lineups posted at rotochamp.  2430 games x 1000 seasons.  Same assumptions as last year:  constant, uninterrupted 5-man rotations (haha), starter pitches the whole game (bwahahaha) and, everyone performs the same way this year as they did last year (equally hilarious) … however, you have to start somewhere :-þ

Next up … adding in provisions for starters pitching 6 innings and considering bullpens …

After some experimentation with pseudocounts … I like this better … not so great a spread between highest and lowest win totals.

2013 was a bad year for pitchf/x data

I’ve run this query to find games where average pfx_z movement of fastballs (PITCH_TYPE  = FF) differed from one game to another.  There is probably a better way using a nested SELECT, but this worked for me:

create temporary table pfx_temp_01 as
Select concat(game_id, home) as new_game_id, pfx_z from baseball_13.mlb_pitches WHERE PITCH_TYPE = ‘FF’;
create temporary table pfx_temp_02 as
select new_game_id, avg(pfx_z) as avg_pfx_z from pfx_temp_01 group by new_game_id;
select new_game_id, avg_pfx_z from pfx_temp_02 where avg_pfx_z < 0 ;
drop temporary table pfx_temp_01, pfx_temp_02;

This gives me a list of 173 games from 2013 where the average pfx_z for a game was less than zero.  The average (of the average) for these games is -4.01 (sd = 0.53).  The average (of the average) for all the other games of that year (i.e. where the average pfx_z was greater than zero) is 9.06 (sd = 1.38).

Then I ran this query to extract the individual pitches from the subset of games where the average pfx_z was less than zero:

create temporary table pfx_temp_01 as
Select concat(game_id, home) as new_game_id, pfx_z from baseball_13.mlb_pitches WHERE PITCH_TYPE = ‘FF’;
create temporary table pfx_temp_02 as
select new_game_id, avg(pfx_z) as avg_pfx_z from pfx_temp_01 group by new_game_id;
create temporary table pfx_temp_03 as
select new_game_id, avg_pfx_z from pfx_temp_02 where avg_pfx_z < 0 ;
select a.new_game_id, a.pfx_z from pfx_temp_01 a join pfx_temp_03 b where a.new_game_id = b.new_game_id;
drop temporary table pfx_temp_01, pfx_temp_02, pfx_temp_03;

The average pfx_z of the 18,548 individual FF pitches for these games is -3.99 (sd = 1.07)

The average pfx_z of the 227,067 individual FF pitches for the games where the average pfx_z for the game was greater than zero is 9.11 (sd = 2.34)

These have got to be two very distinct groups.  Now to try and figure out why …

total pitches thrown in 2013 = 709630
FF pitches in 2013 = 245615


And, as it turns out, 2013 isn’t the only year.  I’ve downloaded game data back to 2010 and the results are:

Year # games where average FF pfx_z < 0
2010 0
2011 1
2012 72
2013 173
2014 0
2015 0
2016 0

The mystery deepens …

I’ve looked at pfx_x and pfx_z movement on a day-by-day basis to see if there are other days with the same issue.  And, yes there are; here are the plots for the games from June 26 to June 30:

The game_id 201306282 is the second game of a doubleheader between Cleveland and the WhiteSox and so I looked up the starters to see if their pitches of this day were different from the rest of their starts.

Here are Jose Quintana’s plots for 2013:

And here are Carlos Carrasco’s plots for 2013:

So, both pitchers had the same change in overall pitch movement on the same day.  This just can’t be right.

Explorations continue …


Three days of the pitchf/x condor …

There appears to be something funky about at least three games worth of pitchf/x data from 2013.

I was playing around with learning to use ggplot2 and using Justin Verlanders data from 2013.  This revealed a strange pattern of pitch movement during the three starts on 4/25, 7/9, and 9/7 where the pfx_x and pfx_z data appear to be scaled differently.

pfx_x vs pfx_z for JV 2013-2016

I’ve taken two other three-game samples from 2013 and also the entire season (without the three odd games) and the average pfx_x and pfx_z data look like this:


All games except 4/25 7/9 9/7
Games of 5/5 7/4 9/13
Games of 5/6 7/6 9/16
Games of 4/25 7/9 9/7
PITCH pfx_x pfx_z
FF -1.577 8.204
FF -1.961 9.155
FF -1.599 8.059
FF -0.849 3.501
PITCH pfx_x pfx_z
CH -1.08 3.923
CH -1.227 4.475
CH -0.725 4.079
CH -1.246 0.694
PITCH pfx_x pfx_z
CU 3.129 -5.93
CU 3.114 -5.58
CU 3.079 -5.952
CU 2.202 -9.302
PITCH pfx_x pfx_z
SL 1.809 0.333
SL 1.851 1.257
SL 1.013 0.114
SL 1.426 -3.141

I’d love to hear from anyone who has seen anything similar, or who might have an idea as to what might be going on.  I’m sure the problem lies with the data, but I’ve :cough: been wrong before.

And … bear with me while I figure out how to make tables and plots.  Tons of fun, but I’m at the simple stage for now :-þ

Well, this is odd …

I’ve reloaded data from MLB for the last four years to include pitchf/x and hit data.  In learning how to use ggplot2, I’ve come across something perplexing.

Looking at the pitchf/x data for Justin Verlander over the years 2013-2016, there appears to be three games in 2013 where the PFX_X and PFX_Z data are scaled differently.  These three games are from 20130425, 20130709, and 20130907 and can be seen in the center of the bottom half of the 2013 plot.

It’s not a plotting issue … the pitchf/x data for those pitches is different.  Anybody else seen this?  More to follow as I try to figure out what is going on …