The Major League Baseball All-Star Game occurs a little more than halfway through every season. All-Star rosters consist of 32 players on each side, made up of twenty position players and twelve pitchers, and each team’s starting lineup is determined by a fan vote that takes place from May to July. Reserves are voted in by a combination of fans, players, and the Commissioner’s Office, and every MLB team is ensured at least one All-Star on their league’s roster.
In this project, I will create a model which predicts whether or not a given player will make his league’s All-Star team. This model will focus only on position players, although I eventually intend to produce a pitcher-specific model.
Using a small web scraper I wrote in Python, I quickly extracted the last thirty years of first-half player data from FanGraphs, and filtered my query to include only players with at least two hundred plate appearances at the time of the All-Star break. Additionally, I utilized the 2017 version of the Lahman Database for All-Star Game roster data, as well as regular-season player appearance data (used to determine the defensive position that each player most often played during a given season).
Ultimately, I compiled the following features. Note that not every feature is included in the final model.
- Player statistics (batting average, on-base percentage, slugging percentage, home runs, Def, WAR, etc.)
- Defensive position (the position at which each player most often appeared during the season, with any first-place ties broken at random)
- League rank in each statistic at appropriate position (for example: In 2003, Barry Bonds led all National League outfielders in home runs at the All-Star break, and will therefore receive a league rank of 1 in home runs)
- Team rank in each statistic (for example: In 2017, Bryce Harper led the Washington Nationals in home runs at the All-Star break, and will therefore receive a team rank of 1 in home runs)
- Separate dummy variables for whether the player was a member of the Yankees, Red Sox, Dodgers, and Cubs, as well as a flag for whether the player was on any of the four teams
- Approximate age at start of the season
- Flags for whether the player’s team won or lost the prior-year World Series (inspired by the Royals’ 2015 All-Star turnout)
The final dataset consists of roughly 6,800 rows, each one representing a player’s first-half numbers for a particular season. Of these rows, approximately 1,200 (or 18%) are players who made their league’s All-Star roster. Additionally, the dataset includes 65 possible features. As evidenced by the dark red and blue tiles below, many of these features are strongly correlated with one another:
Not every feature is listed on the axes above, but AVG, BABIP, OBP, SLG, WAR, wOBA, and wRC+ all share correlations of at least 0.60. This stands to reason, as each statistic is a reflection of a batter’s offensive output. Similarly, team and league ranks for these metrics are highly correlated, and there is a strong negative correlation between offensive statistics and player ranks (since a player’s home run rank, for example, decreases as his home run total rises).
I found that a strong model could be produced without the vast majority of these features, and included in the model only a player’s position, AVG, HR, K%, SB, SLG, WAR, and Def, as well as whether his team won or lost the prior-year World Series and whether the player is a member of the Yankees, Red Sox, Cubs, or Dodgers. Unexpectedly, player rank – whether against the league or against the player’s team – was completely unnecessary in the model, and did not predict an All-Star Game appearance any better than by simply using the statistics themselves.
First, I will discuss a few unsuccessful modeling strategies - that is to say, techniques that did not significantly improve over my initial, simpler model. For instance, I had little success oversampling from the minority class with SMOTE and ADASYN. Only 18% of first-half player seasons in our dataset resulted in that player making the All-Star team, so I was concerned that this imbalance would prove problematic when training and testing models. Surprisingly, though, oversampling did not improve this process (and the class imbalance, in the end, was not problematic at all).
Additionally, I initially suspected that ridge or lasso regression would work well for the project, as they penalize - or, in the case of lasso regression, completely remove – highly correlated variables (of which there were many). As it turned out, logistic regression consistently outperformed both lasso and ridge regression, and, as a result, my final model is built using logistic regression. I achieved similar results using random forest modeling as with logistic regression, but would like to maintain the explanatory power that logistic regression allows. All models were trained and tested with an 80/20 split, respectively.
The logistic regression model achieved an AUC of 92.5. Using a classification threshold of 0.40 (which allows for a roughly equal number of Type I and Type II errors), the model performs as follows on the test dataset:
The model performed very well, both in precision and recall, for non-All-Stars. For players that made the All-Star team, the model is 70% correct in each measure. This means that of every actual All-Star in the dataset, the model correctly identifies 70% of them, and of all times that the model classifies a test row as an All-Star, 70% of these classifications are correct.
As follows is the set of feature coefficients. A one-unit change in each predictor will result in a corresponding change to the odds of the given player being named an All-Star. Positive weights correspond with increased chances at playing in the Midsummer Classic, and negative weights the opposite.
|Won prior-year WS||1.00|
|Lost prior-year WS||1.17|
Unsurprisingly, WAR is one of the most important predictors of an All-Star Game appearance. It is also interesting that playing on a prior-year World Series team – whether or not that team won – is another key positive predictor (although there may be some inherent bias in this conclusion; World Series teams, generally being better than an average team, are more likely to have All-Star-quality players).
Additionally, first basemen and designated hitters are the least likely to make the All-Star Game, and catchers and shortstops are the positions most likely to do so. This phenomenon may relate to position specialty; first base and designated hitter are arguably the least specialized of any defensive position, and very few designated hitters typically get selected for the All-Star Game at all. Since a higher number of players is likely to qualify for either of these positions, the chances of any individual first baseman making the All-Star Game are much lower than for a catcher or shortstop - two of the most specialized defensive positions in the game. There are, in fact, nearly two hundred more primary first baseman than catchers in the dataset, adding further credence to this theory.
By deploying our model on the full set of players from 1988 to 2017, we can see the biggest All-Star snubs (i.e., those who did not make their league’s All-Star roster, but deserved to). Percentages refer to the model’s predicted probability of each player making the All-Star Game.
|(1) Gary Sheffield, 2007 (96.4%)||DH, Tigers||.303/.410/.560, 21 HR, 3.2 WAR|
|(2) Travis Hafner, 2006 (95.4%)||DH, Indians||.322/.461/.650, 25 HR, 4.2 WAR|
|(3) J.D. Drew, 2004 (94.7%)||OF, Braves||.312/.434/.628, 21 HR, 5.1 WAR|
|(4) Paul Lo Duca, 2001 (93.9%)||C, Dodgers||.346/.384/.615, 14 HR, 2.8 WAR|
|(5) Hanley Ramirez, 2007 (93.6%)||SS, Marlins||.331/.388/.538, 14 HR, 2.6 WAR|
Conversely, as follows are the players who were voted to the All-Star Game, but should not have been. For at least a few of these players, their defense appears to have played a large role in their respective elections, although the rule requiring at least one player per Major League team was likely the key contributor. Cal Ripken’s 2001 All-Star appearance, on the other hand, is likely more a result of voter sentimentality than anything else.
|(1) Cal Ripken, Jr., 2001 (0.46%)||3B, Orioles||.240/.270/.324, 4 HR, -0.6 WAR|
|(2) Lenny Dykstra, 1995 (0.68%)||OF, Phillies||.262/.347/.325, 0 HR, 0.7 WAR|
|(3) Carlos Garcia, 1994 (0.80%)||2B, Pirates||.267/.307/.332, 3 HR, 0.1 WAR|
|(4) Ken Caminiti, 1997 (0.82%)||3B, Padres||.247/.337/.379, 6 HR, 0.2 WAR|
|(5) Scott Rolen, 2011 (1.36%)||3B, Reds||.241/.276/.398, 5 HR, 1.0 WAR|
Just for fun, here are the three least-deserving first halves overall (whether or not the player actually made the All-Star team):
|(1) Adam Dunn, 2011 (0.008%)||DH, White Sox||.160/.292/.305, 9 HR, -1.7 WAR|
|(2) Ike Davis, 2013 (0.009%)||1B, Mets||.165/.255/.250, 5 HR, -1.5 WAR|
|(3) Ryan Howard, 2016 (0.015%)||1B, Phillies||.154/.214/.353, 12 HR, -1.6 WAR|
Finally, we will deploy the model on players from the 2018 season. With the same 0.40 threshold here as used above, the model correctly classifies approximately 62% of All-Stars and 92% of non-All-Stars.
Besides the Dodgers’ Max Muncy (94.3%), there were no particularly heinous snubs in the National League; the biggest missed predictions were the Mets’ Asdrubal Cabrera (60.7%), the Dodgers’ Yasmani Grandal (58.9%), and the Nationals’ Anthony Rendon (45.8%), none of whom made the All-Star Game. On the other hand, Bryce Harper was voted in despite a probability of just 12.3%.
The American League, however, had a number of deserving players left off of the All-Star roster. The Angels’ Andrelton Simmons (77.9%), Eddie Rosario (76.7%), and Didi Gregorius (76.6%) all achieved statistics worthy of an All-Star bid, but none could surpass fellow American League infielders Jose Ramirez (99.9%), Francisco Lindor (99.0%), Alex Bregman (98.9%), Manny Machado (98.9%), or Jose Altuve (96.4%). Also noteworthy is Orioles first baseman Chris Davis, whose historically bad 2018 resulted in an All-Star probability of just 0.004%, and the Indians’ Jose Ramirez (99.87%), who narrowly beat out Boston’s Mookie Betts (99.86%) for the highest All-Star probability in the majors.
Thanks for reading – I hope you found this a fun project! All code is available on my GitHub.