12 min read

The oldest active European Soccer Player in season 2008

Still playing with a Soccer Database from Kaggle.com

As an exercise, I wanted to find the oldest players participating in any match. During my analysis I also learned something about the data quality of this Kaggle dataset.

As mentioned in a previous blogpost, I’ve downloaded a zipfile (36 MB) with Football data from data science community Kaggle.com. The archive contained an SQLite Database.

library(DBI)
con <- dbConnect(odbc::odbc(), "well-sqlite-footballdb")

For the sake of brevity, to see the R packages I’ve included to process the data, see the previous blogpost. Only the tidyr package is new here.

library(tidyr)   # gather

The database consists of 7 tables. We’ll read in all of them. The Player table contains basic data of ~10000 soccer players from the top leagues of 14 European countries. They are not strictly “European soccer players”, but players from all over the globe, competing in the top leagues of certain Western European countries.

Tables in the Sqlite database:

##                     Rows Columns
## Player_Attributes 183978      42
## Player             11060       7
## Match              25979     115
## League                11       3
## Country               11       2
## Team                 299       5
## Team_Attributes     1458      25

The Player table lists the soccer players’ data.

The Match table contains data from 25979 matches, played between 2008-07-18 and 2016-05-25. The table contains 115 columns.

The names of the players comprising the teams are listed in 22 columns named home_player_1 to home_player_11, and away_player_1 to away_player_11. The substitute players nominated for the match are not known.

This database table has an untidy format. Let’s tidy it with the tidyr::gather() function:

Match_players  <- Match %>%
        select(match_api_id, season, league_id, 
               home_player_1:away_player_11) %>%
        rename("match_id" = match_api_id)  %>%
        gather( k, player_api_id,  -match_id, -season, -league_id)

This table now looks like this:

head(Match_players %>% filter(!is.na(player_api_id)), 5)
##   match_id    season league_id             k player_api_id
## 1   493016 2008/2009         1 home_player_1         39890
## 2   493017 2008/2009         1 home_player_1         38327
## 3   493018 2008/2009         1 home_player_1         95597
## 4   493020 2008/2009         1 home_player_1         30934
## 5   493021 2008/2009         1 home_player_1         37990

I’ve only shown 5 rows of this 571538 x 5 table. This table can now be joined with the Player table:

Player_in_match <- Player %>%
        inner_join(Match_players, by = c("player_api_id" = "player_api_id")) %>%
        mutate(year_of_birth = year(birthday)) %>%
        rename("player_id" = id)

This table, Player_in_match is a 542281 x 12 table. Remember, the original, standalone Match_players table had 571538 records. The join reduced this to 542281 athletes. Hence, many players didn’t actually play, or the screenscraping process did not yield any information about the matches they ’ve participated in; or the parse-process of many downloaded files containing the matches data failed. In any case, they actually played or were nominated for the starting formation of their squads.

Player_distinct_by_year <- Player_in_match %>%
        select(year_of_birth, season, player_api_id) %>%
        group_by(year_of_birth, season) %>%
        summarize(distinct_players =n_distinct(player_api_id))

Player_distinct_by_year %>%
        ggplot(aes(year_of_birth, distinct_players, color=season)) +
        geom_point( size=.4) +
        geom_smooth(method="loess", se=FALSE, size=0.5) +
        ylab("Number of distinct Players in Database") +
        xlab("Player's Year of Birth") +
        ggtitle("European Soccer Database: Players' appearances",
                subtitle = "Counting Players' appearances in the squads. Seasons 2008-2016.\n
Each player is counted once per season no matter how often he played.")

Some players were rarely nominated for the squad of any match. This can best be seen at the left side of the graph. These data points represent older players born before 1970.

Why do the humps have different heights? I think for older seasons, there is simply less data available at the websites of the commercial data vendors, where the data was scraped from.

Histogram of matches played, by year of birth

This code generates a histogram of the “age distribution” of players in the dataset. These are players for which match data is known. I think the axes are pretty self-explanatory.

Pl_lbl <- Player_in_match %>%
        arrange(birthday) %>% 
        select(year_of_birth, player_name) %>%
        filter(year_of_birth <= 1970) %>% 
        group_by(year_of_birth) %>% 
        arrange(year_of_birth) %>% 
        unique()

Pl_lbl  <- Pl_lbl %>% unique() %>% head(1)


Player_in_match %>%
        ggplot(aes(year_of_birth)) + 
        geom_histogram(binwidth=1, fill="maroon") +
        geom_text(data = Pl_lbl, 
                  aes(x=I(year_of_birth+3.5), y=I(4500), 
                      label = player_name, angle = 45), size=3) +
        ggtitle(sprintf("Matches by %s soccer players in top leagues in Europe, 2008-2016", nrow(Player_in_match)),
                subtitle="By year of birth. Histogram shows players for which match participation data is known.") +
        xlab("Year of Birth") +
        ylab("Appearances in Matches")

Find the oldest player in any match

We can use the Player_in_match table to find the oldest players who participated in any match. This corresponds to the leftmost bars in the histogram above. This bar corresponds to the matches of a the oldest soccer players, born in 1970 or earlier.

oldest_players_in_any_match <- Player_in_match %>%
        inner_join(League, by = c("league_id" = "id") ) %>%
        filter(year_of_birth < 1971) %>%
        select(year_of_birth, season, player_name, match_id, name) %>%
        group_by(year_of_birth, season, player_name, match_id) %>%
        summarize(n_matches = n()) %>%
        arrange(season) %>%
        ungroup()

Result:

(oldest <- oldest_players_in_any_match %>% 
        select(player_name, year_of_birth, n_matches) %>% 
        group_by(player_name, year_of_birth) %>% 
         summarize(n_matches=sum(n_matches)) %>% 
         arrange(year_of_birth, player_name) %>% 
         ungroup())
## # A tibble: 20 x 3
##    player_name         year_of_birth n_matches
##    <chr>                       <dbl>     <int>
##  1 Alberto Fontana             1967.         4
##  2 Paolo Maldini               1968.        30
##  3 Dean Windass                1969.         1
##  4 Francesco Antonioli         1969.        98
##  5 Jens Lehmann                1969.        66
##  6 Luca Bucci                  1969.         1
##  7 Michael Tarnat              1969.         6
##  8 Rob van Dijk                1969.        59
##  9 Antonio Chimenti            1970.         2
## 10 Chris Swailes               1970.        22
## 11 David James                 1970.        61
## 12 David Weir                  1970.       111
## 13 Dean Kiely                  1970.         3
## 14 Edwin van der Sar           1970.        87
## 15 Eugenio Corini              1970.        12
## 16 Hans Vonk                   1970.         4
## 17 Nico van Kerckhoven         1970.        32
## 18 Pedro Roma                  1970.         2
## 19 Sander Boschker             1970.        58
## 20 Tugay Kerimoglou            1970.        15

Before I studied this dataset, the oldest player, Alberto Fontana, was completely unknown to me. So let’s lookup some metadata about this player, and the other players born before 1971, in the Google Knowledge Graph.

The calculation in the following code block is a bit awkward and verbose, because

  • Google Knowledge Graph does not return records with a perfectly consistent structure. I have to transform it to a consistent 2-column (key-value) table.
  • the Google Knowledge Graph data contains -in this context- irrelevant info such as license data about photographs etc that I had too filter out.
  • the database does not keep track of accents and special characters of the Turkish alphabet, so I better use the correctly encoded string from Wikipedia/Google.
# set/replace this one manually. Google returned another wrong athlete (wheelchair)
oldest[oldest$player_name == "David Weir", "player_name"] <- "David Gillespie Weir"

oldest.2 <- map(oldest$player_name, 
                kgapi_lookup_kv, 
                apikey=Sys.getenv("GOO_KGR_KEY"))



oldest.3 <- map_df(oldest.2, function(x) { 
        
    as_data_frame(x ) %>% 
    filter(str_detect(key, "result.name|detailedDescription") == TRUE) %>% 
    select(key, value) %>% 
    mutate(key = str_replace(key, "result.detailedDescription.", ""), 
           value = unlist(value)) %>% 
    spread(key=key, value=value) %>% 
    select(-license) %>% 
    mutate(Name =str_c("[", result.name, "](", url, ")"),
           Description = str_replace_all(articleBody, "\n", "")) %>% 
    select(result.name, Name,  Description) })



oldest.3 <- oldest.3 %>% 
   left_join(oldest, by=c("result.name" = "player_name")) %>%
   rename("Born" = "year_of_birth", "N" = "n_matches") %>%
   select(-result.name) %>% 
        mutate(IsG = ifelse(str_detect( Description, "goal"), "Y", "")) %>% 
        mutate_if(is.numeric, function(x) ifelse(is.na(x), "", x)) %>% 
  select(Name, Description, Born, IsG, N)

In the table below, column ‘IsG’ means “Is Goalie”, ‘N’ means number of matches in Database.

It seems to be quite common that goalkeepers have long careers:

(Click on the Wikipedia links below for more information).

Name Description Born IsG N
1 Alberto Fontana Alberto Fontana is a retired Italian footballer who played as a goalkeeper.During his career, in which he represented nine different clubs, he played until the age of 42. 1967 Y 4
2 Paolo Maldini Paolo Cesare Maldini is an Italian former professional footballer who played as a left-back and central defender for A.C. Milan and the Italy national team. 1968 30
3 Dean Windass Dean Windass is an English former professional footballer who played as a striker. He is best known for his spells at Bradford City and contributing to his hometown team Hull City’s promotion to the Premier League in 2008. He currently plays for AFC. 1969 1
4 Francesco Antonioli Francesco Antonioli is an Italian former footballer who played as goalkeeper. He was the oldest footballer in Serie A until his club Cesena were relegated to Serie B at the end of the 2011–12 season, after which he retired from professional football. 1969 Y 98
5 Jens Lehmann Jens Gerhard Lehmann is a retired German footballer who played as a goalkeeper. He was voted UEFA Club Goalkeeper of the Year for the 1996–97 and 2005–06 seasons, and was selected for three World Cup squads. 1969 Y 66
6 Luca Bucci Luca Bucci is a retired Italian football goalkeeper. Bucci played for several Italian clubs throughout his career; he is mostly remembered for his successful spell with Parma, where he won various domestic and European titles. 1969 Y 1
7 Michael Tarnat Michael Tarnat is a German former footballer, currently employed by Bayern Munich as the leader of the U12–U16 youth teams. He is a left-footed full-back who has also played wingback and defensive midfield. 1969 6
8 Rob van Dijk Robert ‘Rob’ van Dijk is a Dutch retired footballer who played as a goalkeeper.Over the course of exactly 20 seasons as a professional, he appeared in 364 Eredivisie games, mainly with RKC and Feyenoord. He retired at the age of 43. 1969 Y 59
9 Antonio Chimenti Antonio Chimenti is a retired Italian football goalkeeper. 1970 Y 2
10 Chris Swailes Christopher William Swailes is an English former professional footballer. He is currently manager of Dunston UTS. 1970 22
11 David James David Benjamin James MBE is an English former footballer who played as a goalkeeper. He is the current manager at Kerala Blasters FC in the Indian Super League. He is also a pundit on BT Sport’s football coverage. 1970 Y 61
12 David Weir David Gillespie Weir is a Scottish football player and coach.Born in Falkirk, Weir began his professional career with his home-town club, Falkirk, after having attended the University of Evansville in the United States.
13 Dean Kiely Dean Laurence Kiely is a former professional footballer who played as a goalkeeper. He is the goalkeeping coach at Premier League club Crystal Palace. He won eleven caps for the Republic of Ireland as a player. Kiely has previously worked as the goalkeeping coach at West Bromwich Albion and Norwich City. 1970 Y 3
14 Edwin van der Sar Edwin van der Sar OON is a Dutch former professional footballer who played as a goalkeeper. During his career he played for Ajax, Juventus, Fulham and Manchester United. He is the second most capped player in the Netherlands national team’s history. He currently works as the chief executive officer at Ajax. He came out of retirement to play for Dutch amateur team VV Noordwijk. 1970 Y 87
15 Eugenio Corini Eugenio Corini is an Italian association football coach and former player. He was also captained as a player in the 1990s. He is currently the head coach of Novara. 1970 12
16 Hans Vonk Johannes Adrianus “Hans” Vonk is a retired South African professional football goalkeeper of Dutch parentage who last played for Ajax Cape Town FC. 1970 Y 4
17 Nico Van Kerckhoven Nicolas “Nico” Van Kerckhoven is a retired Belgian football left fullback, last on the books of K.V.C. Westerlo where he was also the captain of the team. His former clubs include Lierse and FC Schalke 04. Van Kerckhoven was part of the Belgian national team for the 1998 and 2002 World Cups.
18 Pedro Roma Pedro Miguel da Mota Roma is a retired Portuguese footballer who played as a goalkeeper.Having played with Académica for over 15 years, he later worked with his main club as a goalkeepers’ coach. 1970 Y 2
19 Sander Boschker Sander Bernard Jozef Boschker is a Dutch former footballer who played as a goalkeeper.During a professional career which spanned 25 years he played mostly for Twente, appearing in 555 Eredivisie games and winning four major titles. 1970 Y 58
20 Tugay Kerimoğlu Tugay Kerimoğlu is a Turkish former footballer who is a current manager of Şanlıurfaspor and played the majority of his career for Galatasaray and Blackburn Rovers.

More quality checks: Tugay’s last matches

One of the older players is Tugay Kerimoğlu, born 1970. I didn’t know anything about him before. Wikipedia says Tugay played for the Blackburn Rovers during his last years.

Does the database also say that? Luckily, yes:

Let’s find out which teams Tugay played for during his last seasons. For each record in the Match table, there exists an entry with home_team_api_id and away_team_api_id. In order to resolve anonymous team-ids to the actual team names, I join the team-ids twice to the oldest_players_in_any_match table. This yields two augmented temporary tables.

# get a smaller version of the Team table
Team_sm <- Team %>%  select(id, team_api_id, team_long_name)

oldest_augm <- oldest_players_in_any_match %>%
        select(-season) %>%
        inner_join(Match, by = c("match_id" = "match_api_id")) %>%
        inner_join(Team_sm, by = c("home_team_api_id" = "team_api_id")) %>%
        rename("home_team" = "team_long_name") %>% 
        filter(player_name == "Tugay Kerimoglou")


oldest_augm2 <- oldest_augm %>%
        inner_join(Team_sm, by = c("away_team_api_id" = "team_api_id")) %>%
        rename("away_team" = "team_long_name")

oldest_augm2 %>%
        select(season, match_id, home_team, away_team) %>%
        arrange(season, match_id) %>% 
        ht()           # head() and tail()
## # A tibble: 12 x 4
##    season    match_id home_team        away_team           
##    <chr>        <int> <chr>            <chr>               
##  1 2008/2009   489097 Newcastle United Blackburn Rovers    
##  2 2008/2009   489107 Blackburn Rovers Manchester United   
##  3 2008/2009   489119 Bolton Wanderers Blackburn Rovers    
##  4 2008/2009   489188 Portsmouth       Blackburn Rovers    
##  5 2008/2009   489197 Blackburn Rovers Liverpool           
##  6 2008/2009   489210 Wigan Athletic   Blackburn Rovers    
##  7 2008/2009   489353 Liverpool        Blackburn Rovers    
##  8 2008/2009   489378 Blackburn Rovers Wigan Athletic      
##  9 2008/2009   489388 Manchester City  Blackburn Rovers    
## 10 2008/2009   489398 Blackburn Rovers Portsmouth          
## 11 2008/2009   489412 Chelsea          Blackburn Rovers    
## 12 2008/2009   489419 Blackburn Rovers West Bromwich Albion

So the database says that Tugay Kerimoğlu has played for Blackburn rovers, and did so until the 2008/2009 season. According to the Wikipedia article about him, this is true.

My take home-message

Needless to say, the creators of the database did a great job in compiling this dataset. The database design is pretty clean, and the fact that there are even some Primary-Foreign Key Constraints between the table inside the database make it stand out. The mappings of player-ids to match-ids seem to be correct in most places.

Minor quality problem: players with digits in their names

Further examination revealed that another small problem with the Players table. I mention this here because it is not worth an independent blog-post.

This R code reveals that some player names have been parsed incorrectly.

players_weird_names <- Player %>% 
        select(player_name) %>% 
        filter(str_detect(player_name, "\\d") == TRUE)
player_name
Abdoulaye Diallo Sadio,22
Afonso Alves,24
Alberto Frison,18
Alberto Luque,21
Theo Weeks,24
Yazid Mansouri,30
Yohan Hautcoeur,17
Yohan Lachor,29

There are 80 Players with digits in their names. I’ve only shown a few of them in the table above. This formatting issue looks so systematic that I suspect that it is the result of an incorrect parse, and maybe a misplaced age calculation.