Still playing with a Soccer Database from Kaggle.com
As an exercise, I wanted to find the oldest players participating in any match. During my analysis I also learned something about the data quality of this Kaggle dataset.
As mentioned in a previous blogpost, I’ve downloaded a zipfile (36 MB) with Football data from data science community Kaggle.com. The archive contained an SQLite Database.
library(DBI)
con <- dbConnect(odbc::odbc(), "well-sqlite-footballdb")
For the sake of brevity, to see the R packages I’ve included to process the data, see the previous blogpost. Only the tidyr
package is new here.
library(tidyr) # gather
The database consists of 7 tables. We’ll read in all of them. The Player table contains basic data of ~10000 soccer players from the top leagues of 14 European countries. They are not strictly “European soccer players”, but players from all over the globe, competing in the top leagues of certain Western European countries.
Tables in the Sqlite database:
## Rows Columns
## Player_Attributes 183978 42
## Player 11060 7
## Match 25979 115
## League 11 3
## Country 11 2
## Team 299 5
## Team_Attributes 1458 25
The Player
table lists the soccer players’ data.
The Match
table contains data from 25979 matches, played between 2008-07-18 and 2016-05-25. The table contains 115 columns.
The names of the players comprising the teams are listed in 22 columns named home_player_1
to home_player_11
, and away_player_1
to away_player_11
. The substitute players nominated for the match are not known.
This database table has an untidy format. Let’s tidy it with the tidyr::gather()
function:
Match_players <- Match %>%
select(match_api_id, season, league_id,
home_player_1:away_player_11) %>%
rename("match_id" = match_api_id) %>%
gather( k, player_api_id, -match_id, -season, -league_id)
This table now looks like this:
head(Match_players %>% filter(!is.na(player_api_id)), 5)
## match_id season league_id k player_api_id
## 1 493016 2008/2009 1 home_player_1 39890
## 2 493017 2008/2009 1 home_player_1 38327
## 3 493018 2008/2009 1 home_player_1 95597
## 4 493020 2008/2009 1 home_player_1 30934
## 5 493021 2008/2009 1 home_player_1 37990
I’ve only shown 5 rows of this 571538 x 5 table. This table can now be joined with the Player
table:
Player_in_match <- Player %>%
inner_join(Match_players, by = c("player_api_id" = "player_api_id")) %>%
mutate(year_of_birth = year(birthday)) %>%
rename("player_id" = id)
This table, Player_in_match
is a 542281 x 12 table. Remember, the original, standalone Match_players
table had 571538 records. The join reduced this to 542281 athletes. Hence, many players didn’t actually play, or the screenscraping process did not yield any information about the matches they ’ve participated in; or the parse-process of many downloaded files containing the matches data failed. In any case, they actually played or were nominated for the starting formation of their squads.
Player_distinct_by_year <- Player_in_match %>%
select(year_of_birth, season, player_api_id) %>%
group_by(year_of_birth, season) %>%
summarize(distinct_players =n_distinct(player_api_id))
Player_distinct_by_year %>%
ggplot(aes(year_of_birth, distinct_players, color=season)) +
geom_point( size=.4) +
geom_smooth(method="loess", se=FALSE, size=0.5) +
ylab("Number of distinct Players in Database") +
xlab("Player's Year of Birth") +
ggtitle("European Soccer Database: Players' appearances",
subtitle = "Counting Players' appearances in the squads. Seasons 2008-2016.\n
Each player is counted once per season no matter how often he played.")
Some players were rarely nominated for the squad of any match. This can best be seen at the left side of the graph. These data points represent older players born before 1970.
Why do the humps have different heights? I think for older seasons, there is simply less data available at the websites of the commercial data vendors, where the data was scraped from.
Histogram of matches played, by year of birth
This code generates a histogram of the “age distribution” of players in the dataset. These are players for which match data is known. I think the axes are pretty self-explanatory.
Pl_lbl <- Player_in_match %>%
arrange(birthday) %>%
select(year_of_birth, player_name) %>%
filter(year_of_birth <= 1970) %>%
group_by(year_of_birth) %>%
arrange(year_of_birth) %>%
unique()
Pl_lbl <- Pl_lbl %>% unique() %>% head(1)
Player_in_match %>%
ggplot(aes(year_of_birth)) +
geom_histogram(binwidth=1, fill="maroon") +
geom_text(data = Pl_lbl,
aes(x=I(year_of_birth+3.5), y=I(4500),
label = player_name, angle = 45), size=3) +
ggtitle(sprintf("Matches by %s soccer players in top leagues in Europe, 2008-2016", nrow(Player_in_match)),
subtitle="By year of birth. Histogram shows players for which match participation data is known.") +
xlab("Year of Birth") +
ylab("Appearances in Matches")
Find the oldest player in any match
We can use the Player_in_match
table to find the oldest players who participated in any match. This corresponds to the leftmost bars in the histogram above. This bar corresponds to the matches of a the oldest soccer players, born in 1970 or earlier.
oldest_players_in_any_match <- Player_in_match %>%
inner_join(League, by = c("league_id" = "id") ) %>%
filter(year_of_birth < 1971) %>%
select(year_of_birth, season, player_name, match_id, name) %>%
group_by(year_of_birth, season, player_name, match_id) %>%
summarize(n_matches = n()) %>%
arrange(season) %>%
ungroup()
Result:
(oldest <- oldest_players_in_any_match %>%
select(player_name, year_of_birth, n_matches) %>%
group_by(player_name, year_of_birth) %>%
summarize(n_matches=sum(n_matches)) %>%
arrange(year_of_birth, player_name) %>%
ungroup())
## # A tibble: 20 x 3
## player_name year_of_birth n_matches
## <chr> <dbl> <int>
## 1 Alberto Fontana 1967. 4
## 2 Paolo Maldini 1968. 30
## 3 Dean Windass 1969. 1
## 4 Francesco Antonioli 1969. 98
## 5 Jens Lehmann 1969. 66
## 6 Luca Bucci 1969. 1
## 7 Michael Tarnat 1969. 6
## 8 Rob van Dijk 1969. 59
## 9 Antonio Chimenti 1970. 2
## 10 Chris Swailes 1970. 22
## 11 David James 1970. 61
## 12 David Weir 1970. 111
## 13 Dean Kiely 1970. 3
## 14 Edwin van der Sar 1970. 87
## 15 Eugenio Corini 1970. 12
## 16 Hans Vonk 1970. 4
## 17 Nico van Kerckhoven 1970. 32
## 18 Pedro Roma 1970. 2
## 19 Sander Boschker 1970. 58
## 20 Tugay Kerimoglou 1970. 15
Before I studied this dataset, the oldest player, Alberto Fontana, was completely unknown to me. So let’s lookup some metadata about this player, and the other players born before 1971, in the Google Knowledge Graph.
The calculation in the following code block is a bit awkward and verbose, because
- Google Knowledge Graph does not return records with a perfectly consistent structure. I have to transform it to a consistent 2-column (key-value) table.
- the Google Knowledge Graph data contains -in this context- irrelevant info such as license data about photographs etc that I had too filter out.
- the database does not keep track of accents and special characters of the Turkish alphabet, so I better use the correctly encoded string from Wikipedia/Google.
# set/replace this one manually. Google returned another wrong athlete (wheelchair)
oldest[oldest$player_name == "David Weir", "player_name"] <- "David Gillespie Weir"
oldest.2 <- map(oldest$player_name,
kgapi_lookup_kv,
apikey=Sys.getenv("GOO_KGR_KEY"))
oldest.3 <- map_df(oldest.2, function(x) {
as_data_frame(x ) %>%
filter(str_detect(key, "result.name|detailedDescription") == TRUE) %>%
select(key, value) %>%
mutate(key = str_replace(key, "result.detailedDescription.", ""),
value = unlist(value)) %>%
spread(key=key, value=value) %>%
select(-license) %>%
mutate(Name =str_c("[", result.name, "](", url, ")"),
Description = str_replace_all(articleBody, "\n", "")) %>%
select(result.name, Name, Description) })
oldest.3 <- oldest.3 %>%
left_join(oldest, by=c("result.name" = "player_name")) %>%
rename("Born" = "year_of_birth", "N" = "n_matches") %>%
select(-result.name) %>%
mutate(IsG = ifelse(str_detect( Description, "goal"), "Y", "")) %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), "", x)) %>%
select(Name, Description, Born, IsG, N)
In the table below, column ‘IsG’ means “Is Goalie”, ‘N’ means number of matches in Database.
It seems to be quite common that goalkeepers have long careers:
(Click on the Wikipedia links below for more information).
Name | Description | Born | IsG | N | |
---|---|---|---|---|---|
1 | Alberto Fontana | Alberto Fontana is a retired Italian footballer who played as a goalkeeper.During his career, in which he represented nine different clubs, he played until the age of 42. | 1967 | Y | 4 |
2 | Paolo Maldini | Paolo Cesare Maldini is an Italian former professional footballer who played as a left-back and central defender for A.C. Milan and the Italy national team. | 1968 | 30 | |
3 | Dean Windass | Dean Windass is an English former professional footballer who played as a striker. He is best known for his spells at Bradford City and contributing to his hometown team Hull City’s promotion to the Premier League in 2008. He currently plays for AFC. | 1969 | 1 | |
4 | Francesco Antonioli | Francesco Antonioli is an Italian former footballer who played as goalkeeper. He was the oldest footballer in Serie A until his club Cesena were relegated to Serie B at the end of the 2011–12 season, after which he retired from professional football. | 1969 | Y | 98 |
5 | Jens Lehmann | Jens Gerhard Lehmann is a retired German footballer who played as a goalkeeper. He was voted UEFA Club Goalkeeper of the Year for the 1996–97 and 2005–06 seasons, and was selected for three World Cup squads. | 1969 | Y | 66 |
6 | Luca Bucci | Luca Bucci is a retired Italian football goalkeeper. Bucci played for several Italian clubs throughout his career; he is mostly remembered for his successful spell with Parma, where he won various domestic and European titles. | 1969 | Y | 1 |
7 | Michael Tarnat | Michael Tarnat is a German former footballer, currently employed by Bayern Munich as the leader of the U12–U16 youth teams. He is a left-footed full-back who has also played wingback and defensive midfield. | 1969 | 6 | |
8 | Rob van Dijk | Robert ‘Rob’ van Dijk is a Dutch retired footballer who played as a goalkeeper.Over the course of exactly 20 seasons as a professional, he appeared in 364 Eredivisie games, mainly with RKC and Feyenoord. He retired at the age of 43. | 1969 | Y | 59 |
9 | Antonio Chimenti | Antonio Chimenti is a retired Italian football goalkeeper. | 1970 | Y | 2 |
10 | Chris Swailes | Christopher William Swailes is an English former professional footballer. He is currently manager of Dunston UTS. | 1970 | 22 | |
11 | David James | David Benjamin James MBE is an English former footballer who played as a goalkeeper. He is the current manager at Kerala Blasters FC in the Indian Super League. He is also a pundit on BT Sport’s football coverage. | 1970 | Y | 61 |
12 | David Weir | David Gillespie Weir is a Scottish football player and coach.Born in Falkirk, Weir began his professional career with his home-town club, Falkirk, after having attended the University of Evansville in the United States. | |||
13 | Dean Kiely | Dean Laurence Kiely is a former professional footballer who played as a goalkeeper. He is the goalkeeping coach at Premier League club Crystal Palace. He won eleven caps for the Republic of Ireland as a player. Kiely has previously worked as the goalkeeping coach at West Bromwich Albion and Norwich City. | 1970 | Y | 3 |
14 | Edwin van der Sar | Edwin van der Sar OON is a Dutch former professional footballer who played as a goalkeeper. During his career he played for Ajax, Juventus, Fulham and Manchester United. He is the second most capped player in the Netherlands national team’s history. He currently works as the chief executive officer at Ajax. He came out of retirement to play for Dutch amateur team VV Noordwijk. | 1970 | Y | 87 |
15 | Eugenio Corini | Eugenio Corini is an Italian association football coach and former player. He was also captained as a player in the 1990s. He is currently the head coach of Novara. | 1970 | 12 | |
16 | Hans Vonk | Johannes Adrianus “Hans” Vonk is a retired South African professional football goalkeeper of Dutch parentage who last played for Ajax Cape Town FC. | 1970 | Y | 4 |
17 | Nico Van Kerckhoven | Nicolas “Nico” Van Kerckhoven is a retired Belgian football left fullback, last on the books of K.V.C. Westerlo where he was also the captain of the team. His former clubs include Lierse and FC Schalke 04. Van Kerckhoven was part of the Belgian national team for the 1998 and 2002 World Cups. | |||
18 | Pedro Roma | Pedro Miguel da Mota Roma is a retired Portuguese footballer who played as a goalkeeper.Having played with Académica for over 15 years, he later worked with his main club as a goalkeepers’ coach. | 1970 | Y | 2 |
19 | Sander Boschker | Sander Bernard Jozef Boschker is a Dutch former footballer who played as a goalkeeper.During a professional career which spanned 25 years he played mostly for Twente, appearing in 555 Eredivisie games and winning four major titles. | 1970 | Y | 58 |
20 | Tugay Kerimoğlu | Tugay Kerimoğlu is a Turkish former footballer who is a current manager of Şanlıurfaspor and played the majority of his career for Galatasaray and Blackburn Rovers. |
More quality checks: Tugay’s last matches
One of the older players is Tugay Kerimoğlu, born 1970. I didn’t know anything about him before. Wikipedia says Tugay played for the Blackburn Rovers during his last years.
Does the database also say that? Luckily, yes:
Let’s find out which teams Tugay played for during his last seasons. For each record in the Match table, there exists an entry with home_team_api_id
and away_team_api_id
. In order to resolve anonymous team-ids to the actual team names, I join the team-ids twice to the oldest_players_in_any_match table. This yields two augmented temporary tables.
# get a smaller version of the Team table
Team_sm <- Team %>% select(id, team_api_id, team_long_name)
oldest_augm <- oldest_players_in_any_match %>%
select(-season) %>%
inner_join(Match, by = c("match_id" = "match_api_id")) %>%
inner_join(Team_sm, by = c("home_team_api_id" = "team_api_id")) %>%
rename("home_team" = "team_long_name") %>%
filter(player_name == "Tugay Kerimoglou")
oldest_augm2 <- oldest_augm %>%
inner_join(Team_sm, by = c("away_team_api_id" = "team_api_id")) %>%
rename("away_team" = "team_long_name")
oldest_augm2 %>%
select(season, match_id, home_team, away_team) %>%
arrange(season, match_id) %>%
ht() # head() and tail()
## # A tibble: 12 x 4
## season match_id home_team away_team
## <chr> <int> <chr> <chr>
## 1 2008/2009 489097 Newcastle United Blackburn Rovers
## 2 2008/2009 489107 Blackburn Rovers Manchester United
## 3 2008/2009 489119 Bolton Wanderers Blackburn Rovers
## 4 2008/2009 489188 Portsmouth Blackburn Rovers
## 5 2008/2009 489197 Blackburn Rovers Liverpool
## 6 2008/2009 489210 Wigan Athletic Blackburn Rovers
## 7 2008/2009 489353 Liverpool Blackburn Rovers
## 8 2008/2009 489378 Blackburn Rovers Wigan Athletic
## 9 2008/2009 489388 Manchester City Blackburn Rovers
## 10 2008/2009 489398 Blackburn Rovers Portsmouth
## 11 2008/2009 489412 Chelsea Blackburn Rovers
## 12 2008/2009 489419 Blackburn Rovers West Bromwich Albion
So the database says that Tugay Kerimoğlu has played for Blackburn rovers, and did so until the 2008/2009 season. According to the Wikipedia article about him, this is true.
My take home-message
Needless to say, the creators of the database did a great job in compiling this dataset. The database design is pretty clean, and the fact that there are even some Primary-Foreign Key Constraints between the table inside the database make it stand out. The mappings of player-ids to match-ids seem to be correct in most places.
Minor quality problem: players with digits in their names
Further examination revealed that another small problem with the Players table. I mention this here because it is not worth an independent blog-post.
This R code reveals that some player names have been parsed incorrectly.
players_weird_names <- Player %>%
select(player_name) %>%
filter(str_detect(player_name, "\\d") == TRUE)
player_name |
---|
Abdoulaye Diallo Sadio,22 |
Afonso Alves,24 |
Alberto Frison,18 |
Alberto Luque,21 |
… |
Theo Weeks,24 |
Yazid Mansouri,30 |
Yohan Hautcoeur,17 |
Yohan Lachor,29 |
There are 80 Players with digits in their names. I’ve only shown a few of them in the table above. This formatting issue looks so systematic that I suspect that it is the result of an incorrect parse, and maybe a misplaced age calculation.