The 2018-19 season doesn’t start until October 16, but there’s no need to wait until then to observe league results – instead, we can simulate the season before its first game even begins. I won’t be spending too much time discussing the technical details of my work, but will instead outline my steps in preparing and running the simulation.
1. Web scraping - 2016-17 and 2017-18 seasons
We will be training a model on the past two seasons’ worth of NBA games, and will deploy the model on a dataset of the 2018-19 schedule. First, though, comes data collection. Scraped datasets include the following; when not otherwise specified, data comes directly from basketball-reference:
- The last two seasons’ worth of starting lineup and bench history (not scraped; directly from Kaggle
- The last four seasons’ worth of player-by-player statistics
- Current team-by-team depth charts (from ESPN)
- Every team’s 2018-19 schedule
- The league’s draft history since 2000, including every player’s respective draft position
- Every individual season of rookie-year statistics from the last ten seasons
2. Lineup-level data aggregation – 2016-17 and 2017-18 seasons
For every game over the last two seasons, I subset each team’s lineup to only the five starting players and the three bench players with the most minutes in the game. Any ties (for example, the third and fourth bench players playing the same number of minutes) were broken at random. Note that each of the two lineups in a game are given separate rows in the dataset – one row from the home team’s perspective and one row from the away team’s perspective.
Using the past three seasons of player data, I computed weighted totals of four separate statistics – Win Shares (WS), Box Plus/Minus (BPM), minutes played per game (MP/G), and percentage of games started (GS%) – for each NBA player. All four statistics were weighted by the number of minutes each player played in a given season. The Celtics’ Gordon Heyward, for instance, played only five minutes in the 2017-18 campaign, so his statistics from that season bear virtually no weight in his metrics.
I then calculated the average WS, BPM, MP/G, and GS% of both teams’ respective starting lineups and bench players, as well as the differences between a team’s WS, BPM, MP/G, and GS% and their opponent’s WS, BPM, MP/G, and GS%, respectively. Finally, for each lineup, I added columns for the time and distance (Euclidean, based on city latitude and longitude) since the team’s previous game.
3. Train model
Before training my model, I performed principal components analysis (PCA) on the differences between team and opponent starting lineup and bench WS, MP/G, and GS%. Because Win Shares are a counting statistic, players accumulate more Win Shares the more they play, which will presumably lead to high correlation between the three metrics. The combination of MP/G and GS% has the same problem, as players who starts a large percentage of their games are also likely to play a large number of minutes per game.
BPM, however, I excluded from PCA, as it is a rate statistic rather than a counting one. A player might have a relatively high WS total simply by virtue of receiving a large share of minutes, but this is no guarantee of a high BPM. Andrew Wiggins and Dwight Howard, for instance, both started virtually all of their teams’ games in 2017-18, but neither played particularly well; both players have lower BPMs than their minutes totals would suggest.
Condensing WS, MP/G, and GS% with PCA (separately for starters and bench players) confirms that the three variables are very highly correlated; the first principal component contains 92% of starters’ total variance, and 97% of bench players’ total variance.
After trying different types of modeling – logistic regression, decision trees, random forests, and gradient boosted trees – I found that random forests’ results consistently outperformed those of the other three models. The best model consisted of the following features, with win/loss as a response:
- Team location (home/away)
- Time since previous game
- (Euclidean) distance from previous game
- Team BPM less opponent BPM (starters)
- Team BPM less opponent BPM (bench players)
- Principal component 1 (starters)
- Principal component 2 (bench players)
Aside from the seven features above, I also initially included team winning percentage less opponent winning percentage as a feature, and ran a version of the model with starter and bench BPM included in their respective principal component analyses.
To tune the random forest model, I performed a repeated grid search across three parameters: maximum tree depth, maximum number of features, and minimum samples per leaf. Because the training & testing sets differed with each iteration, the optimal parameters changed slightly each time. The most common set of optimal parameters, however, were a maximum tree depth of four, five maximum features, and five minimum samples per leaf.
Of the seven final features, by far the most important for win/loss classification is the difference in starting BPM. Team location and starters’ principal component 1 come in a distant second and third, respectively. Overall model accuracy varies with each training & testing split, but for the most part hovers somewhere between 62% and 66%. This result serves our purposes perfectly well, although it is less accurate than Vegas predictions.
4. Player-level data aggregation – 2018-19 season
For the 2018-19 season, lineups are uncertain, since each of our season simulations will generate new player injuries and will randomly choose different bench players to fill injured starters’ spots in the starting lineup. Therefore, my script does not aggregate any 2018-19 lineup-level statistics until it actually determines what each game’s lineups will be. In Step 2, I described the process of collecting and weighting player WS, BPM, MP/G, and GS%; this same process is used to gather 2018-19 player metrics.
This method, however, is ineffective for any rookies, as they have no prior-year numbers to aggregate. To combat this issue, I imputed all 2018-19 rookie statistics based on their draft positions. Looking the medians of the last ten years of rookie-season numbers, there are clear trends in each of the four relevant player metrics as we move from pick #1 to pick #60. Take GS%, for example:
For #1 pick DeAndre Ayton, then, I assumed his GS% to be approximately 80% - in line with the orange data point that corresponds with the first overall pick. Similar intuition is used for all rookie WS, BPM, and MP/G.
5. Simulate the 2018-19 season
To simulate the upcoming season, I initialized a dictionary of player injuries – sourced from CBS Sports and RotoWorld – which determines how many games at the start of the season each player will miss. DeMarcus Cousins, for example, is still recovering from his Achilles injury, so I projected him to miss the first ten games of the year. Jimmy Butler, on the other hand, I have conservatively projected to miss the entirety of the season, since he is still yet to be traded from the Timberwolves at the time of writing.
Using the 2018-19 schedule we scraped earlier, we can now produce simulated sets of starters and bench players for every game in the season, all of which come directly from the first- and second-string players on the ESPN depth chart. Because my model aggregates bench statistics from only the first three bench players, I randomly select three players from all second-stringers. When randomly selecting bench players, probabilities are weighted by each player’s respective weighted MP/G (or, for rookies, imputed MP/G), so that bench players who have historically received the most playing time will be selected more often.
Next, for every game, my script iterates through every player on each team and assigns them a random number. If that random number is suitably small, that player has a one-in-eight chance of a “severe” injury, and a seven-in-eight chance of a “mild” injury. Depending on the severity of a player’s injury, their number of games missed will be randomly chosen from one of two exponential distributions. The nature of an exponential distribution makes it more likely for a player to miss only a small number of games, but the larger scale of the “severe” distribution increases the odds of a severely injured player missing a larger number of games than a mildly injured player.
Once we know which players will miss each game, we will be left with new gaps in starting lineups and benches. Any lineup with fewer than five starters will have their gaps filled with second-string players, and any bench with under four players will have their gaps filled with third-stringers. Just as bench players were randomly selected with weights based on their MP/G, gaps in starting lineups and benches are filled in the same manner.
By applying the random forest model on the full 2018-19 schedule (now with full sets of starting and opposing lineups and bench players), we now have continuous zero-to-one predictions of a given lineup winning their matchup. Again, remember that each game occurs from two teams’ perspectives, and therefore comprises two distinct rows in the dataset. As such, each team for any given game will have its own winning likelihood - and both teams’ likelihoods do not necessarily sum to 100%. A game between Boston and Philadelphia, for example, might result in the Celtics being given a 57% winning probability and the Sixers being given a 48% winning probability. (These, of course, add up to 105%, not 100%.) For more separation between team-opponent winning likelihoods, I raise each percentage to power of 1.4. Boston and Philadelphia’s respective 57% and 48% winning probabilities, then, become 46% and 36%. After standardizing this set of figures to sum to 100%, the Celtics are left with a 56% chance to win, and the Sixers with a 44% chance. A winner for the game is randomly chosen using these two probabilities.
Once the above process completes for all 2,460 matchups, we will have successfully simulated the 2018-19 NBA season. To properly account for the in-season and between-season variance that can occur, though, we need to perform more than just one simulation. Certain teams, like Golden State, will win the majority of their games even after an injury to one of their star players; the Lakers, on the other hand, will struggle without LeBron. Each of our simulations has different randomly generated injuries, so by simulating many different times – say, five hundred – we can gain a much clearer picture of every team’s projected win total for the 2018-19 season.
After five hundred simulations, as follows are the projected final standings of the upcoming NBA season:
With the exception of the Raptors, who project to win the East by thirteen games, the East’s playoff picture appears very competitive. Boston, Philadelphia, Milwaukee, and Washington all finish a game apart from one another, and the Hornets surprisingly manage to sneak ahead of the Heat for the sixth seed in the conference. Indiana, unexpectedly, misses the playoffs altogether, and the Knicks bumble their way to the worst record in the league. In the West, Golden State outpaces both the Rockets and Thunder for the top seed in the conference, and LeBron’s Lakers finish with a respectable 45 wins. Notably, neither the Butler-less Timberwolves nor the Pelicans reached the playoffs, and Portland finished with eight fewer wins than in 2017-18. Even with these surprises, though, these projections appear totally plausible, and by next spring we’ll know just how accurate they were.
Thank you for reading – I hope you found the project interesting!
All code is available on my GitHub: https://github.com/saisenberg