Groupby

  • pandas groupby splits data into categories (groups) and applies a function to each group. This makes aggregations such as sums and means per category efficient to compute.
  • The groupby() method splits a DataFrame into groups based on one or more keys, such as the values in a column.
    • Syntax:
      DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)

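      Parameters:

    • by: mapping, function, label, or list of labels. Determines how the rows are grouped.

    • axis: int, default 0. Split along rows (0) or columns (1).

    • level: If the axis is a MultiIndex, group by a particular level or levels.

    • as_index: For aggregated output, return an object with group labels as the index. Only relevant for DataFrame input. as_index=False gives “SQL-style” grouped output, with the group labels as ordinary columns.

    • sort: Sort group keys. Turning this off can improve performance. It does not influence the order of observations within each group; groupby preserves the order of rows within each group.

    • group_keys: When calling apply, add group keys to the index to identify pieces.

    • squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0.

    • observed: Only applies when a grouping key is a Categorical. If True, show only the observed category values; if False, show all category values.

    • dropna: If True (the default), rows whose group key is NaN are dropped. If False, NaN is treated as a group key of its own.

    • Returns: GroupBy object.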
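
The effect of as_index and sort is easiest to see side by side. The following is a minimal sketch; the column names team and score and the data are made up for illustration and are not part of the examples in this section.

```python
import pandas as pd

# Hypothetical data: the keys appear in the order b, a, b, a.
df = pd.DataFrame({
    "team": ["b", "a", "b", "a"],
    "score": [1, 2, 3, 4],
})

# Default: group labels become the index and the keys are sorted.
indexed = df.groupby("team").sum()

# as_index=False keeps the group labels as an ordinary column.
flat = df.groupby("team", as_index=False).sum()

# sort=False keeps the groups in order of first appearance.
unsorted = df.groupby("team", sort=False).sum()

print(indexed)
print(flat)
print(unsorted)
```

With the default settings the result is indexed by team in the order a, b. With sort=False the order follows the data (b, then a).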

  • Example: 

    import pandas as pd
    
    arr = [[11, 12, 13], [41, 76, 34], [23, None, 37], [91, 12, 20]]
    
    df = pd.DataFrame(arr, columns=['p','q','r'])
    
    print(df)
    
    sk = df.groupby('q')  # split the data on column 'q'; the row with a missing 'q' is dropped (dropna=True)
    
    print(sk.first())
    print('----------------')
    print(df.groupby('q').sum())  # sum of the other columns for each distinct value of 'q'
    print('----------------')
    
    

    Output:

        p     q   r
    0  11  12.0  13
    1  41  76.0  34
    2  23   NaN  37
    3  91  12.0  20
           p   r
    q           
    12.0  11  13
    76.0  41  34
    ----------------
            p   r
    q            
    12.0  102  33
    76.0   41  34
    ----------------
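
In the output above, the row with index 2 does not appear in either result because its q value is NaN and dropna defaults to True. The sketch below reuses the same data and passes dropna=False to keep that row as its own group. It assumes pandas 1.1 or later, where the dropna parameter is available.

```python
import pandas as pd

arr = [[11, 12, 13], [41, 76, 34], [23, None, 37], [91, 12, 20]]
df = pd.DataFrame(arr, columns=['p', 'q', 'r'])

# Default behaviour: the row whose key is NaN is left out.
without_nan = df.groupby('q').sum()

# dropna=False treats NaN as a group key of its own.
with_nan = df.groupby('q', dropna=False).sum()

print(without_nan)
print('----------------')
print(with_nan)
```

The second result has one more row than the first: the group whose key is NaN, holding the values from row 2.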

     

  • Example:

    import pandas as pd
    
    df = pd.DataFrame({'Avengers': ['Falcon', 'Falcon',
                                  'Iron Man', 'Iron Man'],
                       'Max Speed': [380., 370., 424., 226.]})
    
    print(df)
    print('---------------------------')
    print(df.groupby('Avengers').mean())

    Output:

       Avengers  Max Speed
    0    Falcon      380.0
    1    Falcon      370.0
    2  Iron Man      424.0
    3  Iron Man      226.0
    ---------------------------
              Max Speed
    Avengers           
    Falcon        375.0
    Iron Man      325.0
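
A grouped column can also be summarised with several functions in one call. As a minimal sketch using the same data as the example above, agg() accepts a list of function names and returns one column per function:

```python
import pandas as pd

df = pd.DataFrame({'Avengers': ['Falcon', 'Falcon',
                              'Iron Man', 'Iron Man'],
                   'Max Speed': [380., 370., 424., 226.]})

# Select one column from the grouped data, then compute several statistics per group.
stats = df.groupby('Avengers')['Max Speed'].agg(['min', 'max', 'mean'])

print(stats)
```

Each row of stats corresponds to one group, and the mean column matches the result of mean() shown above (375.0 for Falcon, 325.0 for Iron Man).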