
How to get Football Data with a Python Package

Last Updated : 23 Apr, 2024

Football (soccer) is one of the most popular sports worldwide, captivating millions of fans with its thrilling matches and compelling narratives. In this article, we’ll explore how to easily access football data using Python.

Specifically, we’ll look at the free football data that StatsBomb shares through its Python package, statsbombpy.

Steps to get Football Data with a Python Package

Step 1: Installing and Importing the StatsBombPy Package

Begin by installing the StatsBombPy package via pip:

pip install statsbombpy

Import the package’s sb module with the following code:

Python3
from statsbombpy import sb


Step 2: Exploring Available Competitions

To view the available competitions within the StatsBomb dataset, use:

Python3
sb.competitions()

Output:


competition_id season_id country_name competition_name competition_gender competition_youth competition_international season_name match_updated match_updated_360 match_available_360 match_available
0 9 27 Germany 1. Bundesliga male False False 2015/2016 2023-12-12T07:43:33.436182 None None 2023-12-12T07:43:33.436182
1 1267 107 Africa African Cup of Nations male False True 2023 2024-02-14T05:41:27.566989 None None 2024-02-14T05:41:27.566989
2 16 4 Europe Champions League male False False 2018/2019 2023-03-07T12:20:48.118250 2021-06-13T16:17:31.694 None 2023-03-07T12:20:48.118250
3 16 1 Europe Champions League male False False 2017/2018 2021-08-27T11:26:39.802832 2021-06-13T16:17:31.694 None 2021-01-23T21:55:30.425330
4 16 2 Europe Champions League male False False 2016/2017 2021-08-27T11:26:39.802832 2021-06-13T16:17:31.694 None 2020-07-29T05:00
... ... ... ... ... ... ... ... ... ... ... ... ...
66 55 43 Europe UEFA Euro male False True 2020 2023-02-24T21:26:47.128979 2023-04-27T22:38:34.970148 2023-04-27T22:38:34.970148 2023-02-24T21:26:47.128979
67 35 75 Europe UEFA Europa League male False False 1988/1989 2023-06-18T19:28:39.443883 2021-06-13T16:17:31.694 None 2023-06-18T19:28:39.443883
68 53 106 Europe UEFA Women's Euro female False True 2022 2023-10-24T03:36:54.066267 2023-10-24T03:37:29.085948 2023-10-24T03:37:29.085948 2023-10-24T03:36:54.066267
69 72 107 International Women's World Cup female False True 2023 2023-12-12T14:06:50.626363 2023-12-12T14:12:41.561162 2023-12-12T14:12:41.561162 2023-12-12T14:06:50.626363
70 72 30 International Women's World Cup female False True 2019 2023-07-27T10:33:48.273734 2021-06-13T16:17:31.694 None 2023-07-27T10:33:48.273734
71 rows × 12 columns

To filter out duplicate entries and display each competition only once:

  • drop_duplicates(['country_name', 'competition_name']) removes duplicate rows from the DataFrame based on the specified columns (‘country_name’ and ‘competition_name’). If there are multiple rows with the same country name and competition name, only the first occurrence is kept, and the rest are dropped.
Python3
sb.competitions().drop_duplicates(['country_name', 'competition_name'])

Output:

    competition_id    season_id    country_name    competition_name    competition_gender    competition_youth    competition_international    season_name    match_updated    match_updated_360    match_available_360    match_available
0 9 27 Germany 1. Bundesliga male False False 2015/2016 2023-12-12T07:43:33.436182 None None 2023-12-12T07:43:33.436182
1 1267 107 Africa African Cup of Nations male False True 2023 2024-02-14T05:41:27.566989 None None 2024-02-14T05:41:27.566989
2 16 4 Europe Champions League male False False 2018/2019 2023-03-07T12:20:48.118250 2021-06-13T16:17:31.694 None 2023-03-07T12:20:48.118250
20 87 84 Spain Copa del Rey male False False 1983/1984 2020-07-29T05:00 2021-06-13T16:17:31.694 None 2020-07-29T05:00
23 37 90 England FA Women's Super League female False False 2020/2021 2023-02-25T14:52:09.326729 2021-06-13T16:17:31.694 None 2023-02-25T14:52:09.326729
26 1470 274 International FIFA U20 World Cup male False False 1979 2023-06-28T10:55:11.501179 None None 2023-06-28T10:55:11.501179
27 43 106 International FIFA World Cup male False True 2022 2023-11-05T04:23:26.649917 2023-11-21T15:37:11.589616 2023-11-21T15:37:11.589616 2023-11-05T04:23:26.649917

The result lists each available competition once, including the FIFA World Cup, Champions League, La Liga, and more.
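The competition_id and season_id values in this table are what the later steps need. As a minimal sketch, the lookup can be wrapped in a small helper. The function name find_season is hypothetical (it is not part of statsbombpy), and the sample frame is a hand-built stand-in that copies only four columns and three rows from the outputs shown in this article, so the code runs without a network connection.

```python
import pandas as pd


def find_season(competitions: pd.DataFrame, competition_name: str, season_name: str) -> tuple:
    """Return (competition_id, season_id) for one competition/season pair.

    Hypothetical helper: expects the column names shown in the
    sb.competitions() output above.
    """
    mask = (
        (competitions["competition_name"] == competition_name)
        & (competitions["season_name"] == season_name)
    )
    rows = competitions.loc[mask, ["competition_id", "season_id"]]
    if len(rows) != 1:
        raise LookupError(f"expected exactly one row, found {len(rows)}")
    first = rows.iloc[0]
    return int(first["competition_id"]), int(first["season_id"])


# Stand-in for sb.competitions(), reduced to the columns the helper uses.
sample_competitions = pd.DataFrame(
    {
        "competition_id": [16, 43, 43],
        "season_id": [4, 106, 3],
        "competition_name": ["Champions League", "FIFA World Cup", "FIFA World Cup"],
        "season_name": ["2018/2019", "2022", "2018"],
    }
)

world_cup_2018 = find_season(sample_competitions, "FIFA World Cup", "2018")
print(world_cup_2018)
```

With statsbombpy installed, passing sb.competitions() instead of the stand-in frame should work the same way, provided the column names match those shown above.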

Step 3: Exploring Specific Matches (e.g., FIFA World Cup 2018)

  1. sb.matches(competition_id=43, season_id=3): This method fetches match data for a specific competition and season. Here, competition_id=43 is the ID of the FIFA World Cup (see the competitions table above), and season_id=3 is the ID of the 2018 season.
  2. df_2018 = sb.matches(competition_id=43, season_id=3): This line assigns the retrieved match data to a DataFrame called df_2018.
  3. df_2018.head(5): This line displays the first 5 rows of the df_2018 DataFrame, providing a glimpse of the match data for the 2018 World Cup.
Python3
df_2018 = sb.matches(competition_id=43, season_id=3)
df_2018.head(5)

Output:

    match_id    match_date    kick_off    competition    season    home_team    away_team    home_score    away_score    match_status    ...    last_updated_360    match_week    competition_stage    stadium    referee    home_managers    away_managers    data_version    shot_fidelity_version    xy_fidelity_version
0 7585 2018-07-03 20:00:00.000 International - FIFA World Cup 2018 Colombia England 1 1 available ... 2021-06-13T16:17:31.694 4 Round of 16 Otkritie Bank Arena Mark Geiger José Néstor Pekerman Gareth Southgate 1.0.2 None None
1 7570 2018-06-28 20:00:00.000 International - FIFA World Cup 2018 England Belgium 0 1 available ... 2021-06-13T16:17:31.694 3 Group Stage Stadion Kaliningrad Damir Skomina Gareth Southgate Roberto Martínez Montoliú 1.0.2 None None
2 7586 2018-07-03 16:00:00.000 International - FIFA World Cup 2018 Sweden Switzerland 1 0 available ... 2021-06-13T16:17:31.694 4 Round of 16 Saint-Petersburg Stadium Damir Skomina Jan Olof Andersson Vladimir Petković 1.0.2 None None
3 7557 2018-06-25 20:00:00.000 International - FIFA World Cup 2018 Iran Portugal 1 1 available ... 2021-06-13T16:17:31.694 3 Group Stage Mordovia Arena Enrique Cáceres Carlos Manuel Brito Leal Queiróz Fernando Manuel Fernandes da Costa Santos 1.0.2 None None
4 7542 2018-06-20 14:00:00.000 International - FIFA World Cup 2018 Portugal Morocco 1 0 available ... 2021-06-13T16:17:31.694 2 Group Stage Stadion Luzhniki Mark Geiger Fernando Manuel Fernandes da Costa Santos Hervé Renard 1.0.2 None None
5 rows × 22 columns
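Once the matches are in a DataFrame, ordinary pandas filtering is enough to narrow them down. The sketch below selects the matches of one team. The helper name matches_for_team is hypothetical, and the sample frame is a stand-in that copies three rows and five columns from the output above so the example runs offline.

```python
import pandas as pd


def matches_for_team(matches: pd.DataFrame, team: str) -> pd.DataFrame:
    """Return the matches in which `team` played at home or away, oldest first."""
    played = (matches["home_team"] == team) | (matches["away_team"] == team)
    columns = ["match_id", "match_date", "home_team", "away_team", "competition_stage"]
    return matches.loc[played, columns].sort_values("match_date").reset_index(drop=True)


# Stand-in for sb.matches(competition_id=43, season_id=3), using rows shown above.
sample_matches = pd.DataFrame(
    {
        "match_id": [7585, 7570, 7586],
        "match_date": ["2018-07-03", "2018-06-28", "2018-07-03"],
        "home_team": ["Colombia", "England", "Sweden"],
        "away_team": ["England", "Belgium", "Switzerland"],
        "competition_stage": ["Round of 16", "Group Stage", "Round of 16"],
    }
)

england_matches = matches_for_team(sample_matches, "England")
print(england_matches)
```

The match_id values returned here are the identifiers that the lineup and event functions take as input.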

Step 4: Retrieving Lineups

The following code retrieves the lineups for a single match from the 2018 World Cup, the final between France and Croatia. Let’s break down the code:

  1. id_final_2018 = 8658: This line defines the id_final_2018 variable and assigns it the match ID 8658. This ID is used to uniquely identify the specific match for which we want to retrieve the lineups.
  2. lineups = sb.lineups(match_id=id_final_2018): This line calls the sb.lineups() method with the match_id=id_final_2018 argument to retrieve the lineups for the match with the specified ID. The result is stored in the lineups variable.
  3. lineups.keys(): sb.lineups() returns a dictionary, not a single DataFrame. Its keys are the names of the two teams, and each value is a DataFrame with information about the players in that team’s lineup. This line therefore lists the two team names.
Python3
id_final_2018 = 8658
lineups = sb.lineups(match_id=id_final_2018)
lineups.keys()

Output:

dict_keys(['France', 'Croatia'])
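Because the result is a dictionary keyed by team name, each team's lineup is reached by indexing with that name or by looping over the items. The sketch below shows the pattern on a hand-built stand-in. The helper name squad_sizes is hypothetical, the player_name column is an assumption about the lineup frames, and the placeholder rows are not real data.

```python
import pandas as pd


def squad_sizes(lineups: dict) -> dict:
    """Map each team name to the number of players listed in its lineup frame."""
    return {team: len(players) for team, players in lineups.items()}


# Stand-in for sb.lineups(match_id=8658): one DataFrame per team name.
# The column name and the placeholder rows are illustrative assumptions.
sample_lineups = {
    "France": pd.DataFrame({"player_name": ["Hugo Lloris", "placeholder player"]}),
    "Croatia": pd.DataFrame({"player_name": ["placeholder player"]}),
}

sizes = squad_sizes(sample_lineups)
print(sizes)

# A single team's lineup is an ordinary DataFrame.
france_lineup = sample_lineups["France"]
print(france_lineup.head())
```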

Step 5: Retrieving Match Events

  1. df_events = sb.events(match_id=id_final_2018): This line calls the sb.events() method with the match_id=id_final_2018 argument to retrieve event data for the match with the specified ID (id_final_2018). The result is stored in the df_events variable, which is a DataFrame containing information about various events that occurred during the match (e.g., goals, fouls, substitutions).
  2. df_events.columns: This line retrieves the column names (keys) of the df_events DataFrame. Each column represents a different attribute or piece of information about the events recorded during the match.
Python3
df_events = sb.events(match_id=id_final_2018)
df_events.columns

Output:

Index(['ball_receipt_outcome', 'ball_recovery_recovery_failure',
'block_deflection', 'carry_end_location', 'clearance_aerial_won',
'counterpress', 'dribble_outcome', 'dribble_overrun', 'duel_outcome',
'duel_type', 'duration', 'foul_committed_advantage',
'foul_committed_card', 'foul_committed_penalty', 'foul_committed_type',
'foul_won_advantage', 'foul_won_defensive', 'goalkeeper_body_part',
'goalkeeper_end_location', 'goalkeeper_outcome', 'goalkeeper_position',
'goalkeeper_technique', 'goalkeeper_type', 'id', 'index',
'injury_stoppage_in_chain', 'interception_outcome', 'location',
'match_id', 'minute', 'pass_aerial_won', 'pass_angle',
'pass_assisted_shot_id', 'pass_backheel', 'pass_body_part',
'pass_cross', 'pass_cut_back', 'pass_deflected', 'pass_end_location',
'pass_goal_assist', 'pass_height', 'pass_length', 'pass_outcome',
'pass_recipient', 'pass_recipient_id', 'pass_shot_assist',
'pass_switch', 'pass_type', 'period', 'play_pattern', 'player',
'player_id', 'position', 'possession', 'possession_team',
'possession_team_id', 'related_events', 'second', 'shot_aerial_won',
'shot_body_part', 'shot_deflected', 'shot_end_location',
'shot_first_time', 'shot_freeze_frame', 'shot_key_pass_id',
'shot_outcome', 'shot_statsbomb_xg', 'shot_technique', 'shot_type',
'substitution_outcome', 'substitution_outcome_id',
'substitution_replacement', 'substitution_replacement_id', 'tactics',
'team', 'team_id', 'timestamp', 'type', 'under_pressure'],
dtype='object')
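With this many columns, a quick summary helps before going further. As a minimal sketch, the team and type columns listed above can be grouped to count how often each event type occurs per team. The helper name event_counts is hypothetical, and the sample frame is a small hand-built stand-in for the real events DataFrame.

```python
import pandas as pd


def event_counts(events: pd.DataFrame) -> pd.DataFrame:
    """Count events per team and event type, sorted by team and then type."""
    return events.groupby(["team", "type"]).size().reset_index(name="count")


# Stand-in for sb.events(match_id=...), reduced to the two columns used here.
sample_events = pd.DataFrame(
    {
        "team": ["France", "France", "France", "France", "Croatia"],
        "type": ["Carry", "Goal Keeper", "Pass", "Half End", "Half End"],
    }
)

counts = event_counts(sample_events)
print(counts)
```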

Step 6: Filtering and sorting of event data

  1. df_events = df_events[['timestamp','team', 'type', 'minute', 'location', 'pass_end_location', 'player']]: This line selects only the specified columns (‘timestamp’, ‘team’, ‘type’, ‘minute’, ‘location’, ‘pass_end_location’, ‘player’) from the df_events DataFrame and assigns the result back to df_events. This step filters the DataFrame to include only these columns for further analysis.
  2. df_events = df_events.sort_values(['minute', 'timestamp']): This line sorts the df_events DataFrame by the ‘minute’ and ‘timestamp’ columns in ascending order, so that events are ordered chronologically within each minute of the match. Note that ‘timestamp’ restarts at the beginning of each period, so first-half stoppage-time events can interleave with early second-half events that share the same minute value; keeping the ‘period’ column and sorting by it first avoids this.
  3. df_events.tail(5): This line displays the last 5 rows of the sorted df_events DataFrame, that is, the final events recorded in the match. Each row represents a specific event (e.g., pass, shot, foul) along with the corresponding details such as the team, player, location, and minute of the event.
Python3
df_events = df_events[['timestamp','team', 'type', 'minute', 'location', 'pass_end_location', 'player']]
df_events = df_events.sort_values(['minute', 'timestamp'])
df_events.tail(5)

Output:

         timestamp     team         type  minute      location pass_end_location       player
2215  00:49:45.427   France        Carry      94   [5.0, 33.0]               NaN  Hugo Lloris
2960  00:49:45.427   France  Goal Keeper      94   [5.0, 33.0]               NaN  Hugo Lloris
851   00:50:01.987   France         Pass      95  [18.0, 31.0]      [52.0, 25.0]  Hugo Lloris
2967  00:50:03.760   France     Half End      95           NaN               NaN          NaN
2968  00:50:03.760  Croatia     Half End      95           NaN               NaN          NaN
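The location and pass_end_location columns hold [x, y] lists, which are awkward to use directly. The sketch below keeps only the passes and splits the coordinates into separate numeric columns, then derives a straight-line length in the same units as the coordinates. The helper name pass_table is hypothetical, and the sample frame is a stand-in built from rows in the output above.

```python
import math

import pandas as pd


def pass_table(events: pd.DataFrame) -> pd.DataFrame:
    """Return one row per pass with start/end coordinates split into columns."""
    passes = events[events["type"] == "Pass"]
    start = pd.DataFrame(
        passes["location"].tolist(), index=passes.index, columns=["x", "y"]
    )
    end = pd.DataFrame(
        passes["pass_end_location"].tolist(), index=passes.index, columns=["end_x", "end_y"]
    )
    table = pd.concat([passes[["minute", "team", "player"]], start, end], axis=1)
    # Straight-line distance between the start and end points of the pass.
    table["length"] = (
        (table["end_x"] - table["x"]) ** 2 + (table["end_y"] - table["y"]) ** 2
    ) ** 0.5
    return table


# Stand-in for the filtered df_events frame, using rows shown above.
sample_events = pd.DataFrame(
    {
        "type": ["Carry", "Pass", "Half End"],
        "minute": [94, 95, 95],
        "team": ["France", "France", "Croatia"],
        "player": ["Hugo Lloris", "Hugo Lloris", None],
        "location": [[5.0, 33.0], [18.0, 31.0], None],
        "pass_end_location": [None, [52.0, 25.0], None],
    }
)

passes = pass_table(sample_events)
print(passes)
```

The full events DataFrame already includes a pass_length column, so the computed length here is only an illustration of working with the split coordinates.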

With the statsbombpy package, StatsBomb’s free football data is only a few function calls away. The steps in this guide cover the competitions, matches, lineups, and events tables, which together provide the raw material for a wide range of football analytics projects.
