Pitcher Similarities, By Way of Their Pitch Mixes


To directly access the interactive web application, click here!


In this project, I’ll be creating a web application that computes similarities between pitchers based on the metrics of their respective pitch mixes. These metrics consist of pitch velocities, spin rates, and the actual pitch mixes themselves (i.e. which pitch types each pitcher throws). Teams might use this tool to identify underrated pitchers with high chances of big-league success. If a specific pitcher’s pitches are very similar to those of perennial All-Stars, for instance, it might suggest that the pitcher himself has similar potential.

Of course, a major caveat involved in this analysis is that the similarity metric does not consider pitcher command – clearly a crucial component of pitcher success. While the pitch mixes of many relievers contain impressive velocities and spin rates, their struggles with control ultimately limit their respective ceilings. Conversely, some starting pitchers – Dallas Keuchel being a notable example – have sufficient command to compensate for a lack of top-notch “stuff.”

The data used for this project was collected directly from BaseballSavant and covers the past three seasons. Only fastball (four-seam, two-seam, and cutter), changeup, curveball, slider, sinker, and splitter data was collected. I applied a separate cutoff to each pitch type, based on how many times each pitcher threw that pitch, in order to exclude pitchers without much big-league playing time, as well as pitchers with mislabeled pitch types. For instance, only the slider metrics of pitchers who have thrown at least two hundred cumulative sliders over the last three seasons are included in the dataset.
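As a rough illustration of the cutoff step, here is a minimal sketch in plain Python. The row layout, the `apply_cutoffs` helper, and the curveball threshold are all made up for this example; only the two-hundred-slider minimum comes from the description above.

```python
# Hypothetical per-pitch-type minimums. Only the slider value (200) is taken
# from the write-up; the curveball value is a placeholder for illustration.
MIN_PITCHES = {"slider": 200, "curveball": 150}


def apply_cutoffs(rows, min_pitches):
    """Keep only (pitcher, pitch type) rows that meet that pitch type's cutoff.

    Pitch types without an entry in min_pitches are kept unconditionally
    (an assumption made to keep the sketch short).
    """
    return [
        row for row in rows
        if row["count"] >= min_pitches.get(row["pitch_type"], 0)
    ]


# Made-up rows: one per pitcher and pitch type, with a three-season pitch count.
rows = [
    {"pitcher": "Pitcher A", "pitch_type": "slider", "count": 412},
    {"pitcher": "Pitcher B", "pitch_type": "slider", "count": 37},
    {"pitcher": "Pitcher B", "pitch_type": "curveball", "count": 150},
]

kept = apply_cutoffs(rows, MIN_PITCHES)
```

Here Pitcher B's slider row is dropped because 37 sliders falls short of the cutoff, while his curveball row survives.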

Depending on the pitch mix a user inputs into the application, behind-the-scenes functions collect the data of every pitcher in the dataset with that pitch mix. On the Select by pitcher tab, users can also enter a specific pitcher, rather than a list of pitch types, and the functions will use the pitcher’s n most commonly thrown pitches as the pitch mix.
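The selection logic might look something like the sketch below. The function names and the dictionary layout are hypothetical, and I am assuming that a pitcher "has" a pitch mix whenever the dataset contains data for every pitch type in it; the application's actual functions may differ.

```python
from collections import Counter


def top_n_pitches(pitch_counts, n):
    """Return a pitcher's n most frequently thrown pitch types, sorted by name."""
    most_common = Counter(pitch_counts).most_common(n)
    return tuple(sorted(pitch for pitch, _ in most_common))


def pitchers_with_mix(dataset, mix):
    """Return the pitchers who have data for every pitch type in the mix."""
    wanted = set(mix)
    return sorted(name for name, counts in dataset.items() if wanted <= set(counts))


# Made-up pitch counts, keyed by pitcher and then by pitch type.
dataset = {
    "Pitcher A": {"four-seam": 1500, "slider": 600, "changeup": 300},
    "Pitcher B": {"four-seam": 900, "slider": 450},
    "Pitcher C": {"sinker": 1200, "slider": 500, "changeup": 250},
}

mix = top_n_pitches(dataset["Pitcher A"], 2)
matches = pitchers_with_mix(dataset, mix)
```

With n = 2, Pitcher A's mix is the four-seam and the slider, so Pitchers A and B are collected and Pitcher C, who throws no four-seam, is left out.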

Before calculating the distances between different pitchers’ respective mixes, however, it is crucial to scale the data. Revolutions per minute and miles per hour (the units for spin rate and velocity, respectively) operate on entirely different scales. A ten-mile-per-hour difference between two pitchers’ fastballs suggests a significant overall difference between the pitchers, while a ten-RPM difference between fastballs is next to negligible. Leaving the data unscaled would place a disproportionate emphasis on spin rate: raw spin-rate gaps of hundreds of RPM would swamp velocity gaps of a few miles per hour, so every pitcher would be considered relatively “similar” in terms of velocity. For this reason, the web application scales each set of pitch metrics separately before running any sort of distance algorithm.
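To make the scaling step concrete, here is a small sketch using z-scores (subtract the mean, divide by the standard deviation). This is one common choice and an assumption on my part for illustration; the numbers are invented, and the application may scale differently.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up fastball metrics for three pitchers.
velocity_mph = [88.0, 92.0, 96.0]
spin_rpm = [2000.0, 2300.0, 2600.0]

# Each metric is scaled separately, so neither unit dominates the distance.
velocity_z = standardize(velocity_mph)
spin_z = standardize(spin_rpm)
```

After scaling, a four-mile-per-hour gap and a three-hundred-RPM gap in this toy data end up the same size, because each is measured relative to the spread of its own metric.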

The nearest-neighbor calculations are performed using Euclidean distance. I considered reducing the dimensions of the pitch mix datasets prior to running any distance algorithm, since pitchers’ velocities across different pitch types are likely correlated. However, I ultimately determined that retaining a hundred percent of the variance in the data was too important to forgo, and refrained from dimension reduction. I also considered assigning different “importances” to each pitch type, commensurate with how often each pitcher throws it. In the end, though, I decided this was unnecessary, as pitchers can easily change the relative frequencies of their pitches (but cannot always do the same with their pitches’ velocities and spin rates). I did, however, add a feature that allows users to place an overall emphasis on velocity or spin rate. With the slider at -1, distance calculations weigh pitch velocities twice as heavily as spin rates; with the slider at 1, the opposite occurs.
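Here is a minimal sketch of a Euclidean distance with a velocity/spin emphasis control. The mapping from slider position to weights, and the choice to apply the weights to squared differences, are assumptions made for illustration; the only things taken from the description above are the two-to-one weighting at -1 and 1 and the use of Euclidean distance. The helper names and scaled values are invented.

```python
import math


def emphasis_weights(slider):
    """Map a slider position in [-1, 1] to (velocity_weight, spin_weight).

    Assumed mapping: -1 gives (2, 1), 0 gives (1, 1), and 1 gives (1, 2).
    """
    if not -1.0 <= slider <= 1.0:
        raise ValueError("slider must be between -1 and 1")
    return 2.0 ** max(-slider, 0.0), 2.0 ** max(slider, 0.0)


def weighted_distance(a, b, slider=0.0):
    """Weighted Euclidean distance between two pitchers with the same pitch mix.

    a and b map each pitch type to (scaled velocity, scaled spin rate).
    """
    w_velo, w_spin = emphasis_weights(slider)
    total = 0.0
    for pitch in a:
        velo_a, spin_a = a[pitch]
        velo_b, spin_b = b[pitch]
        total += w_velo * (velo_a - velo_b) ** 2 + w_spin * (spin_a - spin_b) ** 2
    return math.sqrt(total)


# Made-up, already-scaled metrics for two pitchers.
pitcher_a = {"four-seam": (0.5, -0.2), "slider": (1.0, 0.3)}
pitcher_b = {"four-seam": (-0.5, -0.2), "slider": (1.0, 0.8)}

neutral = weighted_distance(pitcher_a, pitcher_b)
velo_heavy = weighted_distance(pitcher_a, pitcher_b, slider=-1.0)
spin_heavy = weighted_distance(pitcher_a, pitcher_b, slider=1.0)
```

These two pitchers differ mostly in fastball velocity, so emphasizing velocity pushes them further apart than emphasizing spin does.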

As a user selects a specific pitch mix or pitcher, a technique called Multi-Dimensional Scaling (or MDS) approximates distances between players on a two-dimensional plot. Because it is impossible for us to visualize any more than three dimensions at a time, MDS maps the data into however many dimensions one wishes (here, two) while attempting to maintain distances between players as accurately as possible. Unlike t-SNE, which primarily maintains local structures of distances (i.e. Player A should be close to Player B), MDS works to maintain distances throughout the entire dataset. You may notice that pitcher neighbors, highlighted in orange on the Select by pitcher tab, occasionally stray from the data point that represents the selected pitcher. This is simply because MDS only approximates the distances between each point, as only so much variance can be captured in just two dimensions. Also, keep in mind that the x- and y-axes do not represent any specific metrics, and that the data tables that appear below the scatterplot are more accurate than the scatterplot itself in terms of player distances.
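To see why plotted neighbors can stray, it helps to measure how much a two-dimensional layout distorts the original distances. The sketch below computes raw stress, the sum of squared gaps between original and plotted pairwise distances, which is the quantity that metric MDS tries to minimize. It does not run MDS itself: the 2-D layout here is a naive projection that simply drops the third coordinate, and the points are made up. Which MDS variant the application uses is not something I am assuming.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def raw_stress(original, layout):
    """Sum of squared gaps between original and plotted pairwise distances."""
    return sum(
        (euclidean(original[i], original[j]) - euclidean(layout[i], layout[j])) ** 2
        for i, j in combinations(range(len(original)), 2)
    )


# Four made-up "pitchers" in three dimensions.
original = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]

# A naive 2-D layout (not MDS): keep the first two coordinates only.
layout = [(x, y) for x, y, _ in original]

stress = raw_stress(original, layout)
```

The last two points land on top of each other in the layout even though they are one unit apart in the original space, so the stress is greater than zero. A layout that reproduced every distance exactly would score zero, which is generally not achievable once the data has more dimensions than the plot.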

The full web application can be found here, and all code is available on my GitHub. Thanks for reading!
