Analyzing Mobile Data Speeds from TRAI with Pandas

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Let’s use a real dataset from TRAI to analyze mobile dataspeeds and try to see the average speeds for a particular operator or state in that month. This will also show how easily Pandas could be used on any real world data to derive interesting results.

About Dataset –
Telecom Regulatory Authority of India (TRAI) releases a monthly dataset of the internet speeds it measures through the MySpeed (TRAI) app. This includes speed tests initiated by the user itself or periodic background tests done by the app. We will try to analyze this dataset and see the average speeds for a particular operator or state in that month.

Inspecting the raw structure of data:

  • Go to TRAI MySpeed Portal and download the latest month’s csv file under the Download section. You can also download the csv file used in this article: sept18_publish.csv or sept18_publish_drive.csv

  • Open this spreadsheet file.
    NOTE: As the dataset is huge, the software may give you an warning that all rows could not be loaded. This is fine. Also if you are using Microsoft Excel, there might be a warning about opening of a SYLK file. This error could be ignored as it is a common bug in Excel.
    Now, let’s take a look at the arrangement of the data-

    Column Names

    Column Names in the dataset

    1st column is of the Network OperatorJIO, Airtel etc.
    2nd column is of the Network Technology3G or 4G.
    3rd column is the Type of Test initiated – upload or download.
    4th column is the Speed Measured in Kilobytes per second.
    5th column is the Signal Strength during the measurement.
    6th column is the Local Service Area(LSA), or the circle where the test was done – Delhi, Orissa etc. We will refer to this as simply ‘states’.

  • NOTE: The Signal Strength may have na (Not Available) values due to some devices unable to capture signal. We will ignore using this parameter in our calculations to make things simpler. However, it could be easily added as a condition while filtering.

    Packages required –

    Pandas – a popular data analysis toolkit. Very powerful for crunching large sets of data.
    Numpy – provides fast and efficient operations on arrays of homogeneous data. We will use this to along with pandas and matplotlib.
    Matplotlib – is a plotting library. We will use its bar plotting function to make bar graphs.

    Lets start analyzing the data.

    Step #1: Import the packages and define some constants.

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
      
    # we will define some constants
      
    # name of the csv dataset
    DATASET_FILENAME = 'sept18_publish.csv'
      
    # define the operator to be filtered upon.
    CONST_OPERATOR = 'JIO'
      
    # define the state to be filtered upon.
    CONST_STATE = 'Delhi'
      
    # define the the technology to be filtered upon
    CONST_TECHNOLOGY = '4G'

    chevron_right

    
    

     
    Step #2: Define some lists that will store the final calculated results, so that it could be passed on to the bar plotting function easily. The state (or operator), download speed and upload speed will be stored serially so that for an index, the state (or operator), it’s corresponding download and upload speeds can be accessed.

    For Example, final_states[2], final_download_speeds[2] and final_upload_speeds[2] will give the corresponding values for the 3rd state.

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    # define lists
    final_download_speeds = []
    final_upload_speeds = []
    final_states = []
    final_operators = []

    chevron_right

    
    

    Step #3: Import the file using Pandas read_csv() function and store it in ‘df’. This will create a DataFrame of the csv contents on which we will work on.

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    df = pd.read_csv(DATASET_FILENAME)
      
    # assign headers for each of the columns based on the data
    # this allows us to access columns easily
      
    df.columns = ['Service Provider', 'Technology', 'Test Type',
                       'Data Speed', 'Signal Strength', 'State']

    chevron_right

    
    

    Step #4: First lets find all the unique states and operators in this dataset and store them into corresponding states and operators list.

    We will use the unique() method of the Pandas dataframe.

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    # find and display the unique states
    states = df['State'].unique()
    print('STATES Found: ', states)
      
    # find and display the unique operators
    operators = df['Service Provider'].unique()
    print('OPERATORS Found: ', operators)

    chevron_right

    
    

    Output:

    STATES Found:  ['Kerala' 'Rajasthan' 'Maharashtra' 'UP East' 'Karnataka' nan
     'Madhya Pradesh' 'Kolkata' 'Bihar' 'Gujarat' 'UP West' 'Orissa'
     'Tamil Nadu' 'Delhi' 'Assam' 'Andhra Pradesh' 'Haryana' 'Punjab'
     'North East' 'Mumbai' 'Chennai' 'Himachal Pradesh' 'Jammu & Kashmir'
     'West Bengal']
    OPERATORS Found:  ['IDEA' 'JIO' 'AIRTEL' 'VODAFONE' 'CELLONE']

     

    Step #5: Define the function fixed_operator, which will keep the operator constant and iterate through all the available states for that operator. We can construct a similar function for a fixed state.

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    # filter out the operator and technology
    # first as this will be common for all
    filtered = df[(df['Service Provider'] == CONST_OPERATOR) 
                   & (df['Technology'] == CONST_TECHNOLOGY)]
      
    # iterate through each of the states
    for state in states:
      
        # create new dataframe which contains
        # only the data of the current state
        base = filtered[filtered['State'] == state]
      
        # filter only download speeds based on test type
        down = base[base['Test Type'] == 'download']
      
        # filter only upload speeds based on test type
        up = base[base['Test Type'] == 'upload']
      
        # calculate mean of speeds in Data Speed
        # column using the Pandas.mean() method
        avg_down = down['Data Speed'].mean()
      
        # calculate mean of speeds
        # in Data Speed column
        avg_up = up['Data Speed'].mean()
      
        # discard values if mean is not a number(nan)
        # and append only the valid ones
        if (pd.isnull(avg_down) or pd.isnull(avg_up)):
            down, up = 0, 0
          
        else:
            final_states.append(state)
            final_download_speeds.append(avg_down)
            final_upload_speeds.append(avg_up)
      
            # print output upto 2 decimal places
            print(str(state) + ' -- Avg. Download: ' +
                              str('%.2f' % avg_down) + 
             '  Avg. Upload: ' + str('%.2f' % avg_up))

    chevron_right

    
    

    Output:

    Kerala -- Avg. Download: 26129.27  Avg. Upload: 5193.46
    Rajasthan -- Avg. Download: 27784.86  Avg. Upload: 5736.18
    Maharashtra -- Avg. Download: 20707.88  Avg. Upload: 4130.46
    UP East -- Avg. Download: 22451.35  Avg. Upload: 5727.95
    Karnataka -- Avg. Download: 16950.36  Avg. Upload: 4720.68
    Madhya Pradesh -- Avg. Download: 23594.85  Avg. Upload: 4802.89
    Kolkata -- Avg. Download: 26747.80  Avg. Upload: 5655.55
    Bihar -- Avg. Download: 31730.54  Avg. Upload: 6599.45
    Gujarat -- Avg. Download: 16377.43  Avg. Upload: 3642.89
    UP West -- Avg. Download: 23720.82  Avg. Upload: 5280.46
    Orissa -- Avg. Download: 31502.05  Avg. Upload: 6895.46
    Tamil Nadu -- Avg. Download: 16689.28  Avg. Upload: 4107.44
    Delhi -- Avg. Download: 20308.30  Avg. Upload: 4877.40
    Assam -- Avg. Download: 5653.49  Avg. Upload: 2864.47
    Andhra Pradesh -- Avg. Download: 32444.07  Avg. Upload: 5755.95
    Haryana -- Avg. Download: 7170.63  Avg. Upload: 2680.02
    Punjab -- Avg. Download: 14454.45  Avg. Upload: 4981.15
    North East -- Avg. Download: 6702.29  Avg. Upload: 2966.84
    Mumbai -- Avg. Download: 14070.97  Avg. Upload: 4118.21
    Chennai -- Avg. Download: 20054.47  Avg. Upload: 4602.35
    Himachal Pradesh -- Avg. Download: 7436.99  Avg. Upload: 4020.09
    Jammu & Kashmir -- Avg. Download: 8759.20  Avg. Upload: 4418.21
    West Bengal -- Avg. Download: 16821.17  Avg. Upload: 3628.78

    Plotting the data –

    Use the arange() method of Numpy which returns evenly spaced values within a given interval. Here, passing the length of the final_states list, hence we get values from 0 to the number of states in the list like [0, 1, 2, 3 …]
    We can then use these indices to plot a bar at that location. The second bar is plotted by offsetting the location of the first bar by the bar width.

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    fig, ax = plt.subplots()
       
    # the width of each bar
    bar_width = 0.25
       
    # opacity of each bar
    opacity = 0.8
       
    # store the positions
    index = np.arange(len(final_states))
       
    # the plt.bar() takes in the position
    # of the bars, data to be plotted,
    # width of each bar and some other 
    # optional parameters, like the opacity and colour
       
    # plot the download bars
    bar_download = plt.bar(index, final_download_speeds,
                           bar_width, alpha=opacity,
                           color='b', label='Download')
    # plot the upload bars  
    bar_upload = plt.bar(index + bar_width, final_upload_speeds, 
                            bar_width, alpha=opacity, color='g',
                                                 label='Upload')
       
    # title of the graph
    plt.title('Avg. Download/Upload speed for '
                         + str(CONST_OPERATOR))
       
    # the x-axis label
    plt.xlabel('States')
       
    # the y-axis label
    plt.ylabel('Average Speeds in Kbps')
       
    # the label below each of the bars,
    # corresponding to the states
    plt.xticks(index + bar_width, final_states, rotation=90)
       
    # draw the legend
    plt.legend()
       
    # make the graph layout tight
    plt.tight_layout()
       
    # show the graph
    plt.show()

    chevron_right

    
    

    Barplot Output

    Bar plot of the calculated speeds

    Comparing data of two months –

    Let’s take some another month’s data as well and plot them together to observe the difference in the data speeds.

    For this example, the previous month’s dataset will be same sept18_publish.csv and the next month’s dataset is oct18_publish.csv.

    We just need to execute the same steps again. Read the another month’s data. Filter it out to subsequent dataframes and then plot it using a slightly different method. During plotting of the bars, we will increment the 3rd and 4th bars(corresponding to the second file’s upload and download) by 2 and 3 times the bar width, so that they are in their correct positions.

    Bar Widths in Compare

    Offsetting logic when plotting 4 bars

    Below is the implementation for comparing 2-months of data:

    filter_none

    edit
    close

    play_arrow

    link
    brightness_4
    code

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import time
       
    # older month
       
    # newer month
       
    CONST_OPERATOR = 'JIO'
    CONST_STATE = 'Delhi'
    CONST_TECHNOLOGY = '4G'
       
    # read file with Pandas and store as Dataframe
    df = pd.read_csv(DATASET_FILENAME)
    df2 = pd.read_csv(DATASET_FILENAME2)
       
    # assign column names
    df.columns = ['Service Provider', 'Technology', 'Test Type',
                       'Data Speed', 'Signal Strength', 'State']
    df2.columns = ['Service Provider', 'Technology', 'Test Type',
                        'Data Speed', 'Signal Strength', 'State']
       
    # find and display the unique states
    states = df['State'].unique()
    print('STATES Found: ', states)
       
    # find and display the unique operators
    operators = df['Service Provider'].unique()
    print('OPERATORS Found: ', operators)
       
    # define lists
    final_download_speeds = []
    final_upload_speeds = []
       
    final_download_speeds_second =[]
    final_upload_speeds_second = []
           
    final_states = []
    final_operators = []
       
    # assign column names to the data
    df.columns = ['Service Provider', 'Technology', 'Test Type'
                       'Data Speed', 'Signal Strength', 'State']
    df2.columns = ['Service Provider', 'Technology', 'Test Type',
                        'Data Speed', 'Signal Strength', 'State']
       
      
    print('\n\nComparing data for' + str(CONST_OPERATOR))
    filtered = df[(df['Service Provider'] == CONST_OPERATOR)
                   & (df['Technology'] == CONST_TECHNOLOGY)]
      
    filtered2 = df2[(df2['Service Provider'] == CONST_OPERATOR)
                     & (df2['Technology'] == CONST_TECHNOLOGY)]
       
    for state in states:
        base = filtered[filtered['State'] == state]
      
        # calculate mean of download speeds
        avg_down = base[base['Test Type'] ==
                   'download']['Data Speed'].mean()
      
        # calculate mean of upload speeds
        avg_up = base[base['Test Type'] ==
                     'upload']['Data Speed'].mean()
      
        base2 = filtered2[filtered2['State'] == state]
      
        # calculate mean of download speeds
        avg_down2 = base2[base2['Test Type'] == 
                 'download']['Data Speed'].mean()
      
        # calculate mean of upload speeds
        avg_up2 = base2[base2['Test Type'] == 
                    'upload']['Data Speed'].mean()
      
      
        # discard values if mean is not a number(nan)
        # and append only the needed speeds
        if (pd.isnull(avg_down) or pd.isnull(avg_up) or
            pd.isnull(avg_down2) or pd.isnull(avg_up2)):
            avg_down = 0
            avg_up = 0
            avg_down2 = 0
            avg_up2 = 0
        else:
            final_states.append(state)
            final_download_speeds.append(avg_down)
            final_upload_speeds.append(avg_up)
            final_download_speeds_second.append(avg_down2)
            final_upload_speeds_second.append(avg_up2)
              
            print('Older: ' + str(state) + ' -- Download: ' +
                  str('%.2f' % avg_down) + '  Upload: ' +
                  str('%.2f' % avg_up))
      
            print('Newer: ' + str(state) + ' -- Download: ' +
                  str('%.2f' % avg_down2) + '  Upload: ' +
                  str('%.2f' % avg_up2))
      
    # plot bargraph
    fig, ax = plt.subplots()
    index = np.arange(len(final_states))
    bar_width = 0.2
    opacity = 0.8
      
    rects1 = plt.bar(index, final_download_speeds, 
                     bar_width, alpha=opacity, color='b',
                         label='Older Month\'s Download')
      
    rects2 = plt.bar(index + bar_width, final_upload_speeds, 
                        bar_width, alpha=opacity, color='g',
                              label='Older Month\'s Upload')
      
    rects3 = plt.bar(index + 2 * bar_width, final_download_speeds_second, 
                                     bar_width, alpha=opacity, color='y',
                                         label='Newer Month\'s Download')
      
    rects4 = plt.bar(index + 3 * bar_width, final_upload_speeds_second, 
                                   bar_width, alpha=opacity, color='r',
                                         label='Newer Month\'s Upload')
      
    plt.xlabel('States')
    plt.ylabel('Average Speeds')
    plt.title('Avg. Download/Upload speed for '
                         + str(CONST_OPERATOR))
      
    plt.xticks(index + bar_width, final_states, rotation=90)
    plt.legend()
    plt.tight_layout()
      
    plt.show()

    chevron_right

    
    

    Output:

    STATES Found:  ['Kerala' 'Rajasthan' 'Maharashtra' 'UP East' 'Karnataka' nan
     'Madhya Pradesh' 'Kolkata' 'Bihar' 'Gujarat' 'UP West' 'Orissa'
     'Tamil Nadu' 'Delhi' 'Assam' 'Andhra Pradesh' 'Haryana' 'Punjab'
     'North East' 'Mumbai' 'Chennai' 'Himachal Pradesh' 'Jammu & Kashmir'
     'West Bengal']
    OPERATORS Found:  ['IDEA' 'JIO' 'AIRTEL' 'VODAFONE' 'CELLONE']
    Comparing data forJIO
    Older: Kerala -- Download: 26129.27  Upload: 5193.46
    Newer: Kerala -- Download: 18917.46  Upload: 4290.13
    Older: Rajasthan -- Download: 27784.86  Upload: 5736.18
    Newer: Rajasthan -- Download: 13973.66  Upload: 4721.17
    Older: Maharashtra -- Download: 20707.88  Upload: 4130.46
    Newer: Maharashtra -- Download: 26285.47  Upload: 5848.77
    Older: UP East -- Download: 22451.35  Upload: 5727.95
    Newer: UP East -- Download: 24368.81  Upload: 6101.20
    Older: Karnataka -- Download: 16950.36  Upload: 4720.68
    Newer: Karnataka -- Download: 33521.31  Upload: 5871.38
    Older: Madhya Pradesh -- Download: 23594.85  Upload: 4802.89
    Newer: Madhya Pradesh -- Download: 16614.49  Upload: 4135.70
    Older: Kolkata -- Download: 26747.80  Upload: 5655.55
    Newer: Kolkata -- Download: 23761.85  Upload: 5153.29
    Older: Bihar -- Download: 31730.54  Upload: 6599.45
    Newer: Bihar -- Download: 34196.09  Upload: 5215.58
    Older: Gujarat -- Download: 16377.43  Upload: 3642.89
    Newer: Gujarat -- Download: 9557.90  Upload: 2684.55
    Older: UP West -- Download: 23720.82  Upload: 5280.46
    Newer: UP West -- Download: 35035.84  Upload: 5797.93
    Older: Orissa -- Download: 31502.05  Upload: 6895.46
    Newer: Orissa -- Download: 31826.96  Upload: 6968.59
    Older: Tamil Nadu -- Download: 16689.28  Upload: 4107.44
    Newer: Tamil Nadu -- Download: 27306.54  Upload: 5537.58
    Older: Delhi -- Download: 20308.30  Upload: 4877.40
    Newer: Delhi -- Download: 25198.16  Upload: 6228.81
    Older: Assam -- Download: 5653.49  Upload: 2864.47
    Newer: Assam -- Download: 5243.34  Upload: 2676.69
    Older: Andhra Pradesh -- Download: 32444.07  Upload: 5755.95
    Newer: Andhra Pradesh -- Download: 19898.16  Upload: 4002.25
    Older: Haryana -- Download: 7170.63  Upload: 2680.02
    Newer: Haryana -- Download: 8496.27  Upload: 2862.61
    Older: Punjab -- Download: 14454.45  Upload: 4981.15
    Newer: Punjab -- Download: 17960.28  Upload: 4885.83
    Older: North East -- Download: 6702.29  Upload: 2966.84
    Newer: North East -- Download: 6008.06  Upload: 3052.87
    Older: Mumbai -- Download: 14070.97  Upload: 4118.21
    Newer: Mumbai -- Download: 26898.04  Upload: 5539.71
    Older: Chennai -- Download: 20054.47  Upload: 4602.35
    Newer: Chennai -- Download: 36086.70  Upload: 6675.70
    Older: Himachal Pradesh -- Download: 7436.99  Upload: 4020.09
    Newer: Himachal Pradesh -- Download: 9277.45  Upload: 4622.25
    Older: Jammu & Kashmir -- Download: 8759.20  Upload: 4418.21
    Newer: Jammu & Kashmir -- Download: 9290.38  Upload: 4533.08
    Older: West Bengal -- Download: 16821.17  Upload: 3628.78
    Newer: West Bengal -- Download: 9763.05  Upload: 2627.28
    Compare - Bar Plot

    Bar plot of the output

    We just learned how to analyze some real world data and draw some interesting observations from it. But note that not all the data will be as nicely formatted and simple to deal with, Pandas makes it incredibly easy to work with such datasets.



    My Personal Notes arrow_drop_up

    Technical Writer Lets have an update

    If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

    Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.