Datascience: WTA Height Advantages

September 11, 2018
python datascience

I’ve been playing with Pandas and the Jupyter Notebook to learn how to clean up and extract insights from large datasets. Here’s an example of discovering the relationship between player height and win %.

What advantage does height confer in women’s tennis?

Using dataset: https://www.kaggle.com/joaoevangelista/wta-matches-and-rankings#wta.zip

import pandas as pd 
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_csv('/Users/sam/Downloads/wta/matches.csv', low_memory=False)
df.shape
(50577, 33)
df.head()
best_of draw_size loser_age loser_entry loser_hand loser_ht loser_id loser_ioc loser_name loser_rank ... winner_hand winner_ht winner_id winner_ioc winner_name winner_rank winner_rank_points winner_seed year Unnamed: 32
0 3 128 17.859001 NaN R NaN 200002 CRO Mirjana Lucic 49.0 ... R 170.0 200001.0 SUI Martina Hingis 1.0 6003.0 1.0 2000.0 NaN
1 3 128 27.118412 Q R NaN 200004 AUS Kerry Anne Guse 133.0 ... R 167.0 200003.0 BEL Justine Henin 63.0 510.0 NaN 2000.0 NaN
2 3 128 31.378508 NaN R NaN 200005 USA Jolene Watanabe Giltz 118.0 ... R NaN 200006.0 SVK Karina Habsudova 53.0 574.0 NaN 2000.0 NaN
3 3 128 22.006845 NaN R NaN 200007 CRO Silvija Talaja 23.0 ... R 182.0 200008.0 AUS Alicia Molik 116.0 245.0 NaN 2000.0 NaN
4 3 128 24.821355 NaN R NaN 200010 ITA Rita Grande 60.0 ... R 165.0 200009.0 THA Tamarine Tanasugarn 72.0 439.0 NaN 2000.0 NaN

5 rows × 33 columns

df.year.value_counts().sort_index()
2000.0    2893
2001.0    3098
2002.0    3140
2003.0    2930
2004.0    2805
2005.0    2843
2006.0    2787
2007.0    2778
2008.0    2790
2009.0    2722
2010.0    2781
2011.0    2804
2012.0    2910
2013.0    2776
2014.0    2785
2015.0    2651
2016.0    2900
2017.0    2181
Name: year, dtype: int64

We have data on 50,577 matches from 2000 to 2017.

# Clean data
df.drop(df[df['winner_ht'] == 'R'].index, inplace=True, axis='rows')
df['winner_ht'] = df['winner_ht'].astype(float)
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)], axis=1, inplace=True)
# New subset
# Take an explicit copy so the new columns below are not written to a view of df
hts = df[['winner_ht', 'loser_ht']].copy()
hts.dropna(inplace=True)

# Add absolute height difference column
hts['ht_diff'] = abs(hts['winner_ht'] - hts['loser_ht'])

# Add boolean 'taller player won' column
hts['winner_taller'] = hts['winner_ht'] > hts['loser_ht']

hts.head()
    winner_ht  loser_ht  ht_diff  winner_taller
42      163.0     168.0      5.0          False
53      180.0     170.0     10.0           True
58      165.0     169.0      4.0          False
64      170.0     167.0      3.0           True
79      168.0     158.0     10.0           True
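
The cleaning step above drops the rows where `winner_ht` holds the stray value 'R'. Here is a minimal sketch of a more general alternative, assuming any other non-numeric stray should be treated as missing too: `pd.to_numeric` with `errors='coerce'` turns unparseable values into NaN, and `dropna` then removes them. The small `raw` frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for two columns of matches.csv.
raw = pd.DataFrame({
    "winner_ht": ["170.0", "R", "182.0", None],
    "loser_ht": [168.0, 175.0, None, 160.0],
})

# Coerce anything that is not a number (such as the stray 'R') to NaN.
clean = raw[["winner_ht", "loser_ht"]].apply(pd.to_numeric, errors="coerce")

# Keep only the matches where both heights are known.
clean = clean.dropna().copy()
clean["ht_diff"] = (clean["winner_ht"] - clean["loser_ht"]).abs()
clean["winner_taller"] = clean["winner_ht"] > clean["loser_ht"]
print(clean)
```

Only the first row survives here, because every other row has a height that is missing or cannot be parsed.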

Let’s check how many matches we have for each absolute height difference:

hts.ht_diff.value_counts()
2.0     1764
3.0     1704
5.0     1627
7.0     1499
1.0     1434
4.0     1397
6.0     1304
8.0     1248
10.0    1209
9.0      997
12.0     899
11.0     892
0.0      889
13.0     632
14.0     604
15.0     553
16.0     450
17.0     430
19.0     290
18.0     278
20.0     186
21.0     159
22.0     104
23.0      78
24.0      46
25.0      39
27.0      27
26.0      20
29.0      13
28.0      12
31.0       6
32.0       3
30.0       2
Name: ht_diff, dtype: int64
# Remove rows where there was no difference in height ('winner_taller' is always False for these)
hts.drop(hts[hts.ht_diff == 0].index, inplace=True)

# Remove height differences for which we don't have enough data
hts.drop(hts[hts.ht_diff > 23].index, inplace=True)
hts.shape
(19738, 4)

That leaves 19,738 matches with a non-zero height difference of at most 23 cm, the range where we have enough data to make a meaningful comparison.
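
The 23 cm cutoff above was read off the value counts by eye. As a rough sketch of another way to do it, the same filter can be driven by a minimum number of matches per height difference. `hts_demo` and `MIN_MATCHES` are hypothetical names, and both the data and the threshold are made up for this toy example.

```python
import pandas as pd

# Hypothetical miniature of the `hts` frame; the values are made up.
hts_demo = pd.DataFrame({
    "ht_diff": [1.0, 1.0, 1.0, 2.0, 2.0, 30.0],
    "winner_taller": [True, False, True, True, True, False],
})

MIN_MATCHES = 2  # assumed threshold, chosen only to suit this toy data

# Count how many matches share each height difference, aligned row by row.
group_size = hts_demo.groupby("ht_diff")["ht_diff"].transform("count")

# Keep rows with a real height difference and enough matches in their group.
kept = hts_demo[(hts_demo["ht_diff"] > 0) & (group_size >= MIN_MATCHES)]
print(kept["ht_diff"].value_counts().sort_index())
```

The single 30 cm match is dropped because its group is too small, while the 1 cm and 2 cm groups are kept.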

# Plot the absolute height difference against the share of matches won by the taller player.
# The mean of the boolean column is a fraction (0.0 -> 1.0), so scale it to a percentage.
plt.figure(figsize=(10,5))
plt.plot(hts.groupby('ht_diff')['winner_taller'].mean() * 100)
plt.xlabel('Height difference (cm)')
plt.ylabel('Taller player win %')
plt.show()

[Figure: taller player win % plotted against height difference (cm)]
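
A line plot of group means hides how many matches sit behind each point. As a hedged sketch, the win rate can be tabulated next to the match count and a normal-approximation standard error, which gives a rough feel for how noisy each point is. The `hts_demo` frame and the column names `win_rate`, `n` and `std_err` are assumptions for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the `hts` frame; the values are made up.
hts_demo = pd.DataFrame({
    "ht_diff": [1.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0],
    "winner_taller": [True, False, True, False, True, True, True, False],
})

# Win rate of the taller player and number of matches per height difference.
summary = hts_demo.groupby("ht_diff")["winner_taller"].agg(win_rate="mean", n="count")

# Standard error of a proportion (normal approximation): sqrt(p * (1 - p) / n).
summary["std_err"] = np.sqrt(summary["win_rate"] * (1 - summary["win_rate"]) / summary["n"])
print(summary)
```

On the real data, groups with only a few dozen matches would show a much larger standard error than the well-populated small differences, so the right-hand end of the plot is best read with that in mind.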