Side-tracked … again

I’m having some fun looking at batting orders through the years thanks to the data available from Retrosheet … and MySQL queries that I have written to summarize who was in the lineup at the start of a game.

Today I’ve been looking at the 1960 season (the first one with complete pbp data) and I’m trying to determine if the queries are doing what I hope they are doing. I look at the Baltimore information … and there is one player (retroID = bthom105) who appeared in the starting lineup once in the 1960 season. My totals don’t add up quite the way I expected, so I backtrack through the query line-by-line to check … and sure enough everything is OK. But I start to wonder about bthom105 … so I look him up at Retrosheet and it’s Bobby Thomson … THE Bobby Thomson.

So I look him up at Baseball-Reference (his ID there is thomsbo01 but that’s another story) and double-check the game logs and everything is good there … and then I click on the link to read his SABR bio …

All told , it’s an hour later … and I haven’t accomplished much at all … but I sure had a helluva lot of fun doing it.

I’m never going to finish at this rate :-þ

Evolution of MLB batting over the years …

Batting data from 1930 to 2018 summarized by transition-state changes and analyzed by Principal Component Analysis. Now … where to go with this … or, what I’ll be doing this winter when I’m not working on my golf swing :-þ

The information used here was obtained free of charge from and is copyrighted by Retrosheet.  Interested parties may contact Retrosheet at "".

Data was analyzed using R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL

Data table to pdf … via PERL … with a side-trip to EXCEL :-þ

So …. you have a data table and you want to make a pdf of it from within a PERL script and can’t make sense of the available modules? Well then, let Excel do it for you.

This is pieced in part from stuff found out there, but I had to solve a few problems to make it all work.


Create an Excel document containing the data of interest and format (landscape/portrait etc.) it the way you want it. Save the file.

Record a macro called ‘save2pdf’ in that workbook. Stop recording without actually doing anything … creates an empty macro.

View macros and step into the macro you just made. Paste this text into the VB editor.

Option Explicit

Sub save2pdf()

‘ save2pdf Macro

ActiveSheet.ExportAsFixedFormat Type:=xlTypePDF, Filename:= _
“C:\Users\username\Documents\foldername\filename.pdf”, Quality:=xlQualityMinimum, _
IncludeDocProperties:=True, IgnorePrintAreas:=False, OpenAfterPublish:= _
End Sub

Save the file as a macro-enabled workbook (ie. as filename.xlsm)

Now for the PERL script:

use strict;
use warnings;
use Win32::OLE;

my $excel= Win32::OLE->new(‘Excel.Application’)
or die “Could not create Excel.Application!\n”
$excel->Workbooks->open( “C:\\Users\\username\\Documents\\foldername\\filename\.xlsm” );
$excel->run( ‘save2pdf’ );
$excel->Workbooks->save( “C:\\Users\\username\\Documents\\foldername\\filename\.xlsm” );

Don’t leave out the \\ in the PERL script.  You have to escape the \ in the path.

That’s all it takes.

I use this in a task-scheduled event for a table that I import from MySQL. The table is refreshed every day via a PERL script. When the script here is run, the table in Excel is refreshed upon opening and saved to a pdf file. The computer does all the work!

From chaos, (batting) order

Well, this has probably been done before but, among other things, I like to figure out how to do things myself. So …

For each player who batted during the 2017 season, where did they bat in the order, and how many times did they bat in each spot?

First things first:

the information used here was obtained free of charge from and is copyrighted by Retrosheet.

Regular season event files were downloaded from Retrosheet and stored in database tables. The data for each season was queried to determine who the first nine batters were in each and every game. Sounds simple enough, but the data had to be segregated into visiting team and home team for all games and then queried. The query was designed to find the first nine players who had a batting event (and not a baserunning event) from the start of each game. There were separate queries for home and away and these were then combined to provide the final table.

For 2017 it looks like this:



Coming soon (well, reasonably soon) the batting order tables for all seasons back to 1930 … and … regularly updated data for the 2018 season.

After the deluge … dealing with changes to the gameday xml repository at MLB

I retrieve MLB’s gameday xml files on a daily basis.  Of course, at the start of the season, it’s a good idea to make sure things have worked as expected.  So, when I checked after day 1 of the season, it was a bit of a shock to see that no data had been downloaded.

Turns out, MLB made a very slight change to the syntax of directory listings from:



The only difference?  Missing the slash at the end, but that broke the contruction of all URLs.

If you use anything based on the scripts in Baseball Hacks, you’ll need to change your PERL scripts.


Here are the changes I had to make:

$dayurl = “$baseurl/year_$year/month_$mon/day_$mday/”;


$dayurl = “$baseurl/year_$year/month_$mon/day_$mday”;


while($html =~ m/<a href=\”(gid_\w+\/)\”/g ) {


while($html =~ m/<a href=\”day\_[0-9]{1,2}\/(gid_\w+\/)\”/g ) {


and then wherever you have:


change it to:

” . “filename.xml”



Pretty sure that’s it.  Goodness, I thought the world had ended for a bit, but it’s all good now …


I never take no for an answer from the computer … or, how to get an R function to return two values

Whenever I run a batting order simulation, I test all 362,880 possible combinations.  Simulating all those combinations 1000 times each takes 8 hours using an 8-core, 4 GHz, 32GB RAM computer.

As it stands, I can return only one value (in this case: total runs) from the function.  As part of determining if the simulation was working correctly, I thought it might be interesting to see how many batters came to the plate in a 9-inning simulation.

So, I wanted to return both runs and batters each time a simulation is run.  However, an R function can return only one value.  I poked around and found out that you can return a list as a way to return more than one value … but have been unable to make it work.  Part of the problem (I think) is that I am using ‘replicate’.

Here is the very much simplified version of what is being done in R:


create the cluster

#read in the list of all possible lineup combinations
#do the whole thing using parellel processing (8 processors)

results_lineups <- foreach(i = 1:combos) %dopar% {

#read in the transition-state-changes table for the combination being simulated
#simulate the combination 1000 times

results <- replicate(1000, variables list ))
results_lineups[i] <- mean(results)
#print the results to a csv file and stop the cluster


I’ve tried several different bits of syntax but nothing has worked.  So time for a brainstorm …

I’ve used CONCATENATE in Excel to create unique strings from two pieces of information and I wondered if I couldn’t create an artificial value that would do the same thing.

So, I modified the function to determine total runs and total batters and created a single value by:

total = (total.runs * 10000) + (batters/10000)

which creates a single value, for example: 40960.0038034

Then all I have to do is deconstruct the returned value into it’s component parts, in this case: 4.096 runs and 38.034 batters for that particular simulation.

So, it works, but I still wonder if there isn’t a way to return two values in R.  If anyone knows, or can point me in the right direction, I’m all ears … :-þ

So … where could Lorenzo Cain go in the Jays batting order?

I had a great time this weekend at the 2nd Annual Canadian Baseball History Symposium in St. Marys, Ontario.  Lots of fascinating presentations about baseball history in Canada from Addy to Zastryzny.

One of the conversations I had centered around the Jays batting order and the question was asked: If they sign Lorenzo, where should he go in the order?

I have run the simulation substituting Lorenzo for Jose (thanks for the memories … can’t wait to see your number installed in the Level of Excellence) Bautista.

… and the answer is: eighth

Russell MartinC13241245742490454379359346
Kendrys MoralesDH231435156663632531406413
Darwin Barney2B08810235295498675753823
Troy TulowitzkiSS124980449107383353309296268
Steve PearceLF221752831194343328320299305
Lorenzo CainRF1720213476343466508543548
Justin Smoak1B12853399401370388440415
Josh Donaldson3B148175268515730989524433
Kevin PillarCF27870434279107129118


Same methodology … 1000 simulations of each possible combination … consensus of top 3629 out of 362,880 … shortest path through results array.

Of course, it’s all moot if they don’t sign Donaldson :-þ


Thanks to Scott Crawford of the Canadian Baseball Hall of Fame for permission to include my photo of the Roy Halladay tribute that was on display at the symposium.  Roy was inducted into the Canadian Baseball Hall of Fame this past summer.

Finding the maximum sum path through a two-dimensional array

I’m interested in simulating batting orders to compare the results with the actual orders used by major league teams. There are two reasons: first is the hope that testing will confirm that the simulator works reasonably well, second is to generate orders that can be used to resimulate the season after the fact as another way to test function and to make improvements.

I retrieve gameday files every day of the season, so I have an events file that lists the batter. I queried the database table to copy the list of events to excel and then used MATCH to find the first batter for each team in each game. From there, I looked for the first nine batters assuming that this was the batting order. Results were pooled and a list of all batters who played for each team during the season was generated. Some more excel logic tabulated the number of times each batter batted in each position (1-9) in the batting order. There are more players involved than you might think, and more variation from a set order as well. The result for each team is a table of about 20 players x the nine spots in the order.

The next step is to determine the consensus order for each team. First, players with only a few games can be eliminated. Then it’s a matter of finding the maximum sum path through the array. You can eyeball this, but to be sure it’s better to use the computer. The problem is the immense number of combinations that need to be tested.  And there is no method on the web that does the job either … at least I couldn’t find one.   So, I decided to figure it out.

A list of 12 players x nine spots in the order yields 12^9 possible (5,159,780,352) combinations. Even with eight processors, it’s going to take a very long time … and that’s just for one team. So more players need to be eliminated. I sorted each list and took the player who started the most games at each position … giving me a list of nine players per team. The 9 x 9 array still has 387,420,489 combinations but this can be reviewed in about 20 minutes.

That’s still too long though as that’s just for one team. Then I realized that there were a great many combinations that did not need to be tested, i.e. all the ones with duplicate positions.  The problem was how to do that programmatically.  Then it hit me: what I really wanted to test is the 362,880 (9!) permutations for 9 players. I realized that if I could generate a list of these permutations, I could use the numbers as coordinates to indicate which array element was needed.

So, what I needed was a list of permutations like this:

1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 9 8
1 2 3 4 5 6 9 7 8
1 2 3 4 5 6 9 8 7
1 2 3 4 5 9 6 7 8

9 8 7 6 5 4 3 2 1

i.e. all the combinations of 1-9 where no number is repeated. The problem was: how do I generate this list. I tried a few things and then decided to poke around the web. What a struggle … nothing was straightforward, and nothing was going to give me the list I wanted … until I stumbled across TextMechanic. This site has a whole host of text manipulation tools and among them is a permutations generator. Now, I have my list.  … [ And then I remembered that I had already figured out how to do this in R … what a dope :-þ ]

All I had to do was revise the R script to use each row of the permutations file as a list of array elements. Now I can find the maximum sum path through a 9 x 9 array in about 7 seconds.

Here is the R script that finds the path … all it takes is about 30 lines.


team.csv is a list of batting order spots by the number of times each player hit in that spot.

output.csv is the list of permutations obtained by using the app at

The result for the Jays is:

3  7  4  9  6  5  2  8  1

Looks like this:

namepositionspot in batting order
Jose BautistaOF5119533310000
Russell MartinC034416212111
Kendrys MoralesDH001274534000
Darwin Barney2B030006222737
Steve Pearce OF1130521211750
Justin Smoak1B005145394720
Josh Donaldson3B07437100000
Ryan GoinsSS0000013213840
Kevin PillarOF5810092913337



More fun with the Jays batting order

The simulation summarized in the last post was the result from the start of the game only.  In other words, how would each lineup do at the start of the game.

I reran the simulation by reprogramming the simulator to run for nine innings – beginning each new inning with the batter who was on deck at the end of the previous inning.

Each combination (there are 362,880) was simulated 1000 times.  I took the top 363 results and totalled up the number of times each player appeared in each position in the order.

The overwhelming favorite to lead off is Kevin Pillar and the top four seem quite certain.

Position in Batting Order
Player 1 2 3 4 5 6 7 8 9
Kevin Pillar 335 0 0 0 2 7 9 3 7
Troy Tulowitzki 8 85 35 6 71 54 42 28 34
Jose Bautista 16 17 151 66 29 35 19 19 11
Steve Pearce 0 3 13 175 44 37 28 32 31
Josh Donaldson 4 247 51 3 40 10 1 2 5
Luke Maile 0 0 0 10 35 62 90 81 85
Russell Martin 0 9 13 71 63 66 49 52 40
Justin Smoak 0 0 97 31 47 46 48 49 45
Darwin Barney 0 2 3 1 32 46 77 97 105

So, here is what the maximum path through the table says the Jays need to do:


… and then send me a modest slice of the pie after they win it all :-þ