Body Height and BMI in European Soccer Players over the years 2008-2016

Change of body parameters in soccer players

How did the body mass index (BMI) and body height of professional football players change over the years? Height, not so much: it stayed roughly constant between 2008 and 2016, and the average might even have fallen slightly. Body weight and, correspondingly, BMI have fallen more noticeably. One possible reading is that top clubs prefer leaner players. It may also suggest that, in soccer, the possibilities for increasing players’ athleticism are limited.

As mentioned in previous blog posts, I’ve downloaded a zip file (36 MB) with football data from the data science community Kaggle.com. The archive contains an SQLite database.

library(DBI)
con <- dbConnect(odbc::odbc(), "well-sqlite-footballdb")

For brevity, the R packages used to process the data are listed in the previous blog post. Only the tidyr package is new here.

library(tidyr)   # gather

The database consists of 7 tables. We’ll read in all of them. The Player table contains basic data on about 11,000 soccer players (11,060 rows) from the top leagues of 11 European countries. They are not strictly “European soccer players”: they come from all over the globe and compete in the top leagues of these countries.

Tables in the SQLite database:

##                     Rows Columns
## Player_Attributes 183978      42
## Player             11060       7
## Match              25979     115
## League                11       3
## Country               11       2
## Team                 299       5
## Team_Attributes     1458      25
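
The code that reads the tables is not shown in this post. A minimal sketch of how it could look with DBI follows. It uses an in-memory SQLite database with a made-up Player table as a stand-in for the real connection, and it assumes the RSQLite package is installed. The names con_demo, table_names and tables are hypothetical.

```r
library(DBI)

# Stand-in for the real database: in-memory SQLite with one made-up table
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "Player",
             data.frame(player_api_id = c(1, 2), height = c(180, 190)))

# Read every table into a named list of data frames
table_names <- dbListTables(con_demo)
tables <- lapply(setNames(table_names, table_names),
                 function(tbl) dbReadTable(con_demo, tbl))

# Rows and columns per table, similar to the overview above
t(sapply(tables, dim))

dbDisconnect(con_demo)
```

With the real connection, the same lapply() call would return all 7 tables in one list.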

The Player table lists the soccer players’ data.

Add some new attributes to the table.

kg_per_pound <- 0.453592                   # 1 lb = 0.453592 kg
sizes <- c("large" = 190, "small" = 175)   # height thresholds in cm

Player <- Player %>%
        mutate(birthday = as_date(as_datetime(birthday))) %>% # was string
        mutate(weight = weight * kg_per_pound) %>%            # pounds -> kg
        mutate(bmi = weight / ((height / 100)^2)) %>%         # kg / m^2
        mutate(size = factor(
                if_else(height >= sizes["large"], "large",
                        if_else(height >= sizes["small"], "medium", "small"))))
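
As a quick sanity check on the unit conversion and the BMI formula, here is a worked example in base R. It assumes, as the conversion above implies, that weight is stored in pounds and height in centimetres. The player values are made up, and bmi_from_raw is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: BMI from weight in pounds and height in cm
kg_per_pound <- 0.453592

bmi_from_raw <- function(weight_lb, height_cm) {
        weight_kg <- weight_lb * kg_per_pound
        height_m  <- height_cm / 100
        weight_kg / height_m^2          # BMI = kg / m^2
}

# Made-up example player: 187 lb, 182.88 cm
example_bmi <- bmi_from_raw(187, 182.88)
round(example_bmi, 1)                   # about 25.4
```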

# from sqlite database foreign key constraints
#   FOREIGN KEY(`home_player_1`) REFERENCES `Player`(`player_api_id`),

The Match table contains data from 25979 matches, played between 2008-07-18 and 2016-05-25. The table contains 115 columns.

The players making up the two line-ups are listed in 22 columns named home_player_1 to home_player_11 and away_player_1 to away_player_11. Each column holds a player_api_id rather than a name. The substitutes nominated for the match are not known.

This table has an untidy (wide) format, with one column per line-up slot. Let’s tidy it with the tidyr::gather() function:

Match_players  <- Match %>%
        select(match_api_id, season, league_id,
        home_player_1:away_player_11) %>%
        rename("match_id" = match_api_id)  %>%
        gather( k, player_api_id,  -match_id, -season, -league_id)
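
In recent tidyr versions gather() is superseded by pivot_longer(). Below is a minimal sketch of the same reshaping on a made-up two-match table. The table toy_match and all its values are invented for illustration, and only two players per side are shown.

```r
library(tidyr)

# Made-up wide table: one row per match, one column per line-up slot
toy_match <- data.frame(
        match_id      = c(1, 2),
        season        = c("2008/2009", "2008/2009"),
        home_player_1 = c(101, 103),
        home_player_2 = c(102, NA),
        away_player_1 = c(201, 203),
        away_player_2 = c(202, 204)
)

# One row per (match, line-up slot): the slot name goes to k,
# the player id to player_api_id
toy_long <- pivot_longer(toy_match,
                         cols      = home_player_1:away_player_2,
                         names_to  = "k",
                         values_to = "player_api_id")

dim(toy_long)   # 2 matches x 4 slots = 8 rows, 4 columns
```

pivot_longer() orders the rows by match rather than by column, which should not matter for the join that follows. Missing player ids are kept as NA unless values_drop_na = TRUE is set.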

This table now looks like this:

head(Match_players %>% filter(!is.na(player_api_id)), 5)
##   match_id    season league_id             k player_api_id
## 1   493016 2008/2009         1 home_player_1         39890
## 2   493017 2008/2009         1 home_player_1         38327
## 3   493018 2008/2009         1 home_player_1         95597
## 4   493020 2008/2009         1 home_player_1         30934
## 5   493021 2008/2009         1 home_player_1         37990

I’ve only shown 5 rows of this 571538 x 5 table (25979 matches x 22 player columns). This table can now be joined with the Player table:

Player_in_match <- Player %>%
        inner_join(Match_players, by = c("player_api_id" = "player_api_id")) %>%
        mutate(year_of_birth = year(birthday)) %>%
        rename("player_id" = id)

We’ll start with body height, because it is the easier quantity to understand.

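# there are too few players in the dataset
# born before 1975 or after 1997
# note the naming: yob_thresh_max is used as the lower bound (earliest year kept),
# yob_thresh_min as the upper bound (latest year kept)
yob_thresh_max <- 1975
yob_thresh_min <- 1997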


Player_height_by_year <- Player_in_match %>%
        select(year_of_birth, size, height ) %>%
        filter(year_of_birth >= yob_thresh_max) %>%
        filter(year_of_birth <= yob_thresh_min) %>%
        group_by(year_of_birth) %>%
        summarize( avg_height = mean(height))

# average body height seems to go down a little bit
Player_height_by_year %>%
        ggplot(aes(year_of_birth, avg_height)) +
        geom_point() +
        geom_smooth(method="loess", span=1, se=TRUE) +
        ylab("Average Player Height [cm]") +
        xlab("Player's Year of Birth") +
        ylim(c(175, 190))
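
Note that Player_in_match has one row per player per match, so mean(height) above is weighted by match appearances: a player who starts often counts more than one who rarely plays. The sketch below, with invented numbers, shows how the two ways of averaging can differ. The names toy, by_appearance and by_player are hypothetical.

```r
# Made-up appearance table: player 1 (180 cm) played 3 matches,
# player 2 (190 cm) played 1 match
toy <- data.frame(
        player_api_id = c(1, 1, 1, 2),
        year_of_birth = c(1990, 1990, 1990, 1990),
        height        = c(180, 180, 180, 190)
)

# Mean over appearances: frequent starters count more
by_appearance <- mean(toy$height)                  # 182.5

# Mean over distinct players: each player counts once
players   <- unique(toy[, c("player_api_id", "height")])
by_player <- mean(players$height)                  # 185
```

Which of the two is appropriate depends on the question: the appearance-weighted mean describes who is actually on the pitch, the per-player mean describes the squad.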

Average body height might even have fallen a bit in professional soccer. This is unlike ice hockey, where another data scientist collected evidence in a blog post that professional players had an average height of 183 cm in 2010, increasing by about 0.1 cm per year.

BMI per year

How did Body Mass Index change over the years?

The following plots show that BMI seems to fall. This may be an artifact: for 2016 the database contains many more young players, who are not yet fully grown and have less massive bodies.
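
One way to probe the age explanation would be to compute each player’s approximate age per season and then compare BMI within age groups. Below is a minimal sketch in base R, assuming season strings of the form "2008/2009" as in the Match table. The helper age_in_season and the toy data are hypothetical.

```r
# Hypothetical helper: approximate age at the start of a season like "2008/2009"
age_in_season <- function(season, year_of_birth) {
        season_start <- as.integer(substr(season, 1, 4))
        season_start - year_of_birth
}

# Made-up players
toy <- data.frame(
        season        = c("2008/2009", "2008/2009", "2015/2016", "2015/2016"),
        year_of_birth = c(1980, 1985, 1990, 1996)
)
toy$age <- age_in_season(toy$season, toy$year_of_birth)

# Average age per season
avg_age <- tapply(toy$age, toy$season, mean)
avg_age
```

If average age drops across seasons in the real data, a falling BMI could partly reflect a younger player pool rather than a change in body type.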

Player_bmi_by_year <- Player_in_match %>%
        select(year_of_birth, season, player_api_id, bmi) %>%
        group_by(year_of_birth, season) %>%
        #        summarize(distinct_players = n())
        summarize(distinct_players =n_distinct(player_api_id)) %>% 
        ungroup()




Player_bmi_by_year %>%
        group_by(year_of_birth) %>%
        summarize(distinct_players = sum(distinct_players)) %>%
        ggplot(aes(year_of_birth, distinct_players)) +
        geom_point() +
        geom_smooth(method="loess", se=FALSE) +
        ylab("Number of Players (counted once per season)") +
        xlab("Players' Year of Birth") +
        geom_vline(xintercept = c(yob_thresh_max, yob_thresh_min),
                   color="maroon") +
        ggtitle("European Soccer Database: Players' Appearances in Matches",
                subtitle = "By year of birth. Double counted for each season. Seasons 2008-2016.")

# remove players where we have too few counts 
Player_bmi_by_year <- Player_in_match %>%
        select(year_of_birth, bmi ) %>%
        filter(year_of_birth >= yob_thresh_max) %>%
        filter(year_of_birth <= yob_thresh_min) 

Player_bmi_by_season <- Player_in_match %>%
        select(season, bmi ) 

Player_bmi_by_year.2 <- Player_bmi_by_year %>%
        group_by(year_of_birth) %>%
        summarize( avg_bmi = mean(bmi))

# average BMI over the years
Player_bmi_by_year.2 %>%
        ggplot(aes(year_of_birth, avg_bmi)) +
        geom_point() +
        geom_smooth(method="loess", span=1, se=FALSE) +
        ylab("Average Player BMI") +
        xlab("Players' Year of Birth")

# 1 boxplot per year - trend is less obvious
Player_bmi_by_year %>%  
        mutate(year_of_birth = factor(year_of_birth)) %>%  
        ggplot(aes(year_of_birth, bmi)) +
        geom_boxplot() +
        ylab("BMI") +
        ggtitle("Body Mass Index of Soccer Players by Year of Birth")

# BMI per season
Player_bmi_by_season %>%  
        ggplot(aes(season, bmi)) +
        geom_violin() +
        ylab("BMI") +
        theme(axis.text.x=element_text(angle=60,hjust=1)) +
        ggtitle("Body Mass Index of Soccer Players by Season")

These violin plots show the distribution of players’ BMIs per season. The distributions look very much alike. You might think that the range of values shrinks, indicating that players are becoming more uniform with respect to BMI. My own guess is that clubs prefer thinner, younger, more agile players. Alternatively, players may stay at clubs for shorter periods, or younger players may reach top clubs at an earlier age.
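
Whether the spread really shrinks is hard to judge by eye. A sketch of how it could be quantified per season, using the interquartile range and the full range on made-up BMI values (toy, bmi_iqr and bmi_range are hypothetical names):

```r
# Made-up BMI values for two seasons
toy <- data.frame(
        season = rep(c("2008/2009", "2015/2016"), each = 5),
        bmi    = c(21, 22, 23, 24, 25,   22, 22.5, 23, 23.5, 24)
)

# Spread per season: interquartile range and full range (max - min)
bmi_iqr   <- tapply(toy$bmi, toy$season, IQR)
bmi_range <- tapply(toy$bmi, toy$season, function(x) diff(range(x)))

rbind(bmi_iqr, bmi_range)
```

The full range is sensitive to single outliers, so the interquartile range is probably the more robust of the two measures for this comparison.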