Department of Data Science

Course: Tools and Techniques for Data Science

Instructor: Muhammad Arif Butt, Ph.D.

Lecture 3.17 (Pandas-09)

Aggregating and Grouping Dataframes.ipynb¶

In [ ]:

Instructor: Muhammad Arif Butt, Ph.D.

In [ ]:

Learning agenda of this notebook¶

Overview of Aggregation Functions and the agg() method
- Applying a Built-in Aggregation Function on Entire Dataframe Object
- Applying a Built-in Aggregation Function on a Series Object
- Applying a User-Defined/Lambda Function on a Series Object
Computing the Minimum Temperature of each City using hard way
Computing the Minimum Temperature of each City using groupby
Practice GroupBy on Stack Overflow Survey Dataset

In [ ]:

1. Overview of Aggregation Functions and the `agg()` Method¶

An aggregation function is one which takes multiple individual values and returns a result.

In [1]:

import pandas as pd
df = pd.read_csv('datasets/groupbydata2.csv')
df

Out[1]:

	date	city	temperature	humidity
0	01/01/2022	lahore	8	60
1	02/01/2022	lahore	10	58
2	03/01/2022	lahore	5	51
3	04/01/2022	lahore	6	49
4	05/01/2022	lahore	12	54
5	01/01/2022	karachi	18	74
6	02/01/2022	karachi	10	71
7	03/01/2022	karachi	12	78
8	04/01/2022	karachi	15	76
9	05/01/2022	karachi	16	70
10	01/01/2022	murree	-5	61
11	02/01/2022	murree	-3	68
12	03/01/2022	murree	-4	69
13	04/01/2022	murree	-1	63
14	05/01/2022	murree	-7	72

a. Applying a Built-in Aggregation Function on Entire Dataframe Object¶

In [2]:

df.min()

Out[2]:

date           01/01/2022
city              karachi
temperature            -7
humidity               49
dtype: object

In [3]:

df.count()

Out[3]:

date           15
city           15
temperature    15
humidity       15
dtype: int64

In [4]:

# Should be applied to numeric columns only, may raise a warning
df.median()

/var/folders/1t/g3ylw8h50cjdqmk5d6jh1qmm0000gn/T/ipykernel_3651/2026153972.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df.median()

Out[4]:

temperature     8.0
humidity       68.0
dtype: float64

In [5]:

df.median(numeric_only=True)

Out[5]:

temperature     8.0
humidity       68.0
dtype: float64

In [ ]:

We can call the agg() method on the dataframe to apply multiple aggregation functions at a time, by passing the agg() function a list of aggregation functions as strings.

In [7]:

df.agg(['min', 'max',  'count'])

Out[7]:

	date	city	temperature	humidity
min	01/01/2022	karachi	-7	49
max	05/01/2022	murree	18	78
count	15	15	15	15

In [ ]:

We can call the describe() method on the dataframe to get descriptive statistical measures on all its numeric columns.

In [8]:

df.describe()

Out[8]:

	temperature	humidity
count	15.000000	15.000000
mean	6.133333	64.933333
std	8.253715	9.153194
min	-7.000000	49.000000
25%	-2.000000	59.000000
50%	8.000000	68.000000
75%	12.000000	71.500000
max	18.000000	78.000000

In [ ]:

b. Applying a Built-in Aggregation Function on a Series Object¶

In [9]:

df['temperature'].min()

Out[9]:

-7

In [10]:

df['temperature'].max()

Out[10]:

In [11]:

df['temperature'].mean()

Out[11]:

6.133333333333334

We can call the agg() method on a series to apply multiple aggregation functions at a time, by passing the agg() function a list of aggregation functions as strings.

In [ ]:

df['temperature'].agg(['min', 'max', 'mean', 'count'])

We can call the describe() method on the dataframe to get descriptive statistical measures on all its numeric columns.

In [ ]:

df['temperature'].describe()

In [ ]:

c. Applying a User-Defined/Lambda Function on a Series Object using the `apply()` Method¶

We have used this apply() method before as well that is used to invoke function on values of Series and return a resulting series.

In [12]:

df.temperature

Out[12]:

0      8
1     10
2      5
3      6
4     12
5     18
6     10
7     12
8     15
9     16
10    -5
11    -3
12    -4
13    -1
14    -7
Name: temperature, dtype: int64

In [13]:

def ctof(x):
    return x*9/5+32

df.temperature.apply(ctof)

Out[13]:

0     46.4
1     50.0
2     41.0
3     42.8
4     53.6
5     64.4
6     50.0
7     53.6
8     59.0
9     60.8
10    23.0
11    26.6
12    24.8
13    30.2
14    19.4
Name: temperature, dtype: float64

In [ ]:

In [14]:

df.temperature.apply(lambda x: x*9/5+32)

Out[14]:

0     46.4
1     50.0
2     41.0
3     42.8
4     53.6
5     64.4
6     50.0
7     53.6
8     59.0
9     60.8
10    23.0
11    26.6
12    24.8
13    30.2
14    19.4
Name: temperature, dtype: float64

In [ ]:

How to Compute the Minimum Temperature of Each City?¶

2. Doing it the Hard Way¶

In [15]:

import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df

Out[15]:

	date	city	temperature
0	01/01/2022	lahore	8
1	02/01/2022	lahore	10
2	03/01/2022	lahore	5
3	04/01/2022	lahore	6
4	05/01/2022	lahore	12
5	01/01/2022	karachi	18
6	02/01/2022	karachi	10
7	03/01/2022	karachi	12
8	04/01/2022	karachi	15
9	05/01/2022	karachi	16
10	01/01/2022	murree	-5
11	02/01/2022	murree	-3
12	03/01/2022	murree	-4
13	04/01/2022	murree	-1
14	05/01/2022	murree	-7

a. Splitting the Dataframe¶

We need to use conditional selection technique, in which we pass a Boolean mask for the appropriate city column to be selected. Can do it using two ways:
- Using df[] subscript operator
- Using df.loc method

In [16]:

df[df['city']=='karachi']

Out[16]:

	date	city	temperature
5	01/01/2022	karachi	18
6	02/01/2022	karachi	10
7	03/01/2022	karachi	12
8	04/01/2022	karachi	15
9	05/01/2022	karachi	16

In [17]:

df[df['city']=='lahore']
df.loc[df.city=='lahore', :]

Out[17]:

	date	city	temperature
0	01/01/2022	lahore	8
1	02/01/2022	lahore	10
2	03/01/2022	lahore	5
3	04/01/2022	lahore	6
4	05/01/2022	lahore	12

In [18]:

df[df['city']=='karachi']
df.loc[df.city=='karachi', :]

Out[18]:

	date	city	temperature
5	01/01/2022	karachi	18
6	02/01/2022	karachi	10
7	03/01/2022	karachi	12
8	04/01/2022	karachi	15
9	05/01/2022	karachi	16

In [19]:

df[df['city']=='murree']
df.loc[df.city=='murree', :]

Out[19]:

	date	city	temperature
10	01/01/2022	murree	-5
11	02/01/2022	murree	-3
12	03/01/2022	murree	-4
13	04/01/2022	murree	-1
14	05/01/2022	murree	-7

Limitation:

We have to repeat this process for every city separately.

What if there are over 100 cities in the dataset?

In [ ]:

b. Applying the `min()` Function¶

We need to apply the min() function on the temperature column of all of the above dataframes separately

In [ ]:

df.loc[df.city=='lahore', :].temperature.min()

In [ ]:

df.loc[df.city=='lahore', :].temperature.min()

In [ ]:

df.loc[df.city=='karachi', :].temperature.min()

In [ ]:

df.loc[df.city=='murree', :].temperature.min()

Limitation:

We have to repeat this process for every city separately.

What if there are over 100 cities in the dataset?

In [ ]:

c. Combining the Result¶

Since, we have got the minimum temperature of all the cities, we need to combine them to an appropriate series object to be used for later processing.

In [ ]:

lhr = df.loc[df.city=='lahore', :].temperature.min()
kci = df.loc[df.city=='karachi', :].temperature.min()
murree = df.loc[df.city=='murree', :].temperature.min()

s = pd.Series(data=[lhr, kci, murree], index=['L_min', 'K_min', 'M_min'] )
s.name= 'Min Temperatures'
s

In [ ]:

How to Compute the Minimum Temperature of Each City?¶

3. An Elegant Way¶

In [20]:

import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df

Out[20]:

	date	city	temperature
0	01/01/2022	lahore	8
1	02/01/2022	lahore	10
2	03/01/2022	lahore	5
3	04/01/2022	lahore	6
4	05/01/2022	lahore	12
5	01/01/2022	karachi	18
6	02/01/2022	karachi	10
7	03/01/2022	karachi	12
8	04/01/2022	karachi	15
9	05/01/2022	karachi	16
10	01/01/2022	murree	-5
11	02/01/2022	murree	-3
12	03/01/2022	murree	-4
13	04/01/2022	murree	-1
14	05/01/2022	murree	-7

a. Step 1: Split Step¶

In the split step we divide the data inside the dataframe into multiple groups
Since we need to calculate the minimum temperature of each city, therefore, we will use groupby() method on the city column of the dataframe.
This will result a DataFrameGroupBy object, which is an iterable containing multiple small dataframes based on the by argument passed to the groupby() method

In [25]:

dfgb = df.groupby('city')
dfgb

Out[25]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92ad60d90>

In [ ]:

Since this is an iterable, so let us iterate :)

In [26]:

for mydf in dfgb:
    print(mydf)

('karachi',          date     city  temperature
5  01/01/2022  karachi           18
6  02/01/2022  karachi           10
7  03/01/2022  karachi           12
8  04/01/2022  karachi           15
9  05/01/2022  karachi           16)
('lahore',          date    city  temperature
0  01/01/2022  lahore            8
1  02/01/2022  lahore           10
2  03/01/2022  lahore            5
3  04/01/2022  lahore            6
4  05/01/2022  lahore           12)
('murree',           date    city  temperature
10  01/01/2022  murree           -5
11  02/01/2022  murree           -3
12  03/01/2022  murree           -4
13  04/01/2022  murree           -1
14  05/01/2022  murree           -7)

In [ ]:

To display indices of every group in the dataframe, use groups attribute of DataFrameGroupBy object.

Returns a Dictionary object (PrettyDict) with keys as the group value and value as list of corresponding row indices

In [27]:

dfgb.groups   # df.groupby('city').groups

Out[27]:

{'karachi': [5, 6, 7, 8, 9], 'lahore': [0, 1, 2, 3, 4], 'murree': [10, 11, 12, 13, 14]}

In [ ]:

To display records of a specific group, use get_group() method on DataFrameGroupBy object.

Construct and return DataFrame from DataFrameGroupBy object with provided name.

In [29]:

# Display DataFrame of a specific group from groupby object by providing the specific group value
dfgb.get_group('murree') # df.groupby('city').get_group('karachi') 

Out[29]:

	date	city	temperature
10	01/01/2022	murree	-5
11	02/01/2022	murree	-3
12	03/01/2022	murree	-4
13	04/01/2022	murree	-1
14	05/01/2022	murree	-7

In [ ]:

To find the size of each group, use size() method of DataFrameGroupBy object.

It return a series containing number of rows in each each group of the DataFrameGroupBy object as a Series

In [30]:

dfgb.size()  #df.groupby('city').size()

Out[30]:

city
karachi    5
lahore     5
murree     5
dtype: int64

After understanding the groupby() method let us move to step 2, and that is Applying a Function

In [ ]:

b. Step 2: Apply Step¶

Now second step is that we apply appropriate aggregate function on all the groups inside the DataFrameGroupBy object

Let us first apply aggregate function on a specific column of DataFrameGroupBy object, which is a SeriesGroupBy object

In [31]:

df

Out[31]:

	date	city	temperature
0	01/01/2022	lahore	8
1	02/01/2022	lahore	10
2	03/01/2022	lahore	5
3	04/01/2022	lahore	6
4	05/01/2022	lahore	12
5	01/01/2022	karachi	18
6	02/01/2022	karachi	10
7	03/01/2022	karachi	12
8	04/01/2022	karachi	15
9	05/01/2022	karachi	16
10	01/01/2022	murree	-5
11	02/01/2022	murree	-3
12	03/01/2022	murree	-4
13	04/01/2022	murree	-1
14	05/01/2022	murree	-7

In [ ]:

In [32]:

df.groupby('city')

Out[32]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92ade70d0>

In [33]:

df.groupby('city').get_group('lahore')

Out[33]:

	date	city	temperature
0	01/01/2022	lahore	8
1	02/01/2022	lahore	10
2	03/01/2022	lahore	5
3	04/01/2022	lahore	6
4	05/01/2022	lahore	12

In [34]:

df.groupby('city').get_group('lahore').temperature.min()

Out[34]:

In [35]:

df.groupby('city').get_group('lahore').temperature.min()

Out[35]:

In [36]:

df.groupby('city').get_group('karachi').temperature.min()

Out[36]:

In [37]:

df.groupby('city').get_group('murree').temperature.min()

Out[37]:

-7

In [ ]:

b. Step 3: Combine Step¶

Now we have got minimum temperature of all the three cities, let us combine the result into a series object

In [38]:

kci = df.groupby('city').get_group('karachi').temperature.min()
lhr = df.groupby('city').get_group('lahore').temperature.min()
murree = df.groupby('city').get_group('murree').temperature.min()

s1 = pd.Series(data=[kci, lhr, murree], index=['K_min', 'L_min', 'M_min'] )
s1.name= 'Min Temperatures'
s1

Out[38]:

K_min    10
L_min     5
M_min    -7
Name: Min Temperatures, dtype: int64

In [ ]:

Let us perform the apply + combine steps in one go, by applying the min() function on the temperature series of all the dataframes inside the DataFrameGroupBy object.

This saves us from the hassle of applying min() method explicitly as done above

In [ ]:

df.groupby('city')

In [ ]:

df.groupby('city').temperature

In [ ]:

df.groupby('city').temperature.min()

In [ ]:

We can also apply agg() method on the temperature series of all the dataframes inside the DataFrameGroupBy object

In [ ]:

df.groupby('city').temperature.agg(['min', 'max', 'sum', 'mean'])

-Note that we have got a dataframe this time

In [ ]:

4. Practice GroupBy on Stack Overflow Survey Dataset¶

Visit to Download Data: https://insights.stackoverflow.com/survey/

In [ ]:

a. Understand the Data Set¶

In [39]:

import pandas as pd
df = pd.read_csv('datasets/so_survey_subset.csv', index_col='Respondent')
df.shape

Out[39]:

(88883, 9)

In [40]:

df.head()

Out[40]:

	MainBranch	Hobbyist	Country	YearsCode	ConvertedComp	LanguageWorkedWith	SocialMedia	Age	Gender
Respondent
1	I am a student who is learning to code	Yes	United Kingdom	4	NaN	HTML/CSS;Java;JavaScript;Python	Twitter	14.0	Man
2	I am a student who is learning to code	No	Bosnia and Herzegovina	NaN	NaN	C++;HTML/CSS;Python	Instagram	19.0	Man
3	I am not primarily a developer, but I write co...	Yes	Thailand	3	8820.0	HTML/CSS	Reddit	28.0	Man
4	I am a developer by profession	No	United States	3	61000.0	C;C++;C#;Python;SQL	Reddit	22.0	Man
5	I am a developer by profession	Yes	Ukraine	16	NaN	C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA	Facebook	30.0	Man

In [ ]:

df.loc[df['Country']=='Pakistan', :]

In [ ]:

In [41]:

import pandas as pd
schema = pd.read_csv('datasets/so_survey_subset_schema.csv', index_col='Column')
schema

Out[41]:

	QuestionText
Column
Respondent	Randomized respondent ID number (not in order ...
MainBranch	Which of the following options best describes ...
Hobbyist	Do you code as a hobby?
Country	In which country do you currently reside?
YearsCode	Including any education, how many years have y...
ConvertedComp	Salary converted to annual USD salaries using ...
LanguageWorkedWith	Which of the following programming, scripting,...
SocialMedia	What social media site do you use the most?
Age	What is your age (in years)? If you prefer not...
Gender	Which of the following do you currently identi...

In [ ]:

In [42]:

schema.loc['Hobbyist']

Out[42]:

QuestionText    Do you code as a hobby?
Name: Hobbyist, dtype: object

In [ ]:

df['Hobbyist']

In [ ]:

schema.loc['Country']

In [ ]:

df['Country']

In [ ]:

schema.loc['ConvertedComp']

In [ ]:

df['ConvertedComp']

In [ ]:

schema.loc['LanguageWorkedWith']

In [ ]:

!cat datasets/so_survey_subset_schema.csv

In [ ]:

df['LanguageWorkedWith']

In [ ]:

schema.loc['SocialMedia']

In [ ]:

df['SocialMedia']

In [ ]:

df

In [ ]:

Let us perform some basic statistical analysis on the Dataset¶

In [ ]:

# Returns the count of non-NA values for a series object.
df['Hobbyist'].count()

In [ ]:

# Returns a Series containing counts of unique rows.
df['Hobbyist'].value_counts()

In [ ]:

# Returns the count of non-NA values for a series object.
df['Country'].count()

In [ ]:

# Returns a Series containing counts of unique rows.
df['Country'].value_counts()

In [ ]:

# To get the count of countries whose developers participated in the survey
df['Country'].value_counts().count()

In [ ]:

# Returns the count of non-NA values for a series object.
df['ConvertedComp'].count()

In [ ]:

# Returns a Series containing counts of unique rows.
df['ConvertedComp'].value_counts()

In [ ]:

df['ConvertedComp'].mean()

In [ ]:

df['ConvertedComp'].median()

In [ ]:

df.describe()

In [ ]:

Let us try answering certain Questions

In [ ]:

Question 1:¶

List the most popular SocialMedia web site for every Country

Let us first do the easy task, and that is to list the most popular SocialMedia website of a single country (lets say Pakistan)

In [43]:

df

Out[43]:

	MainBranch	Hobbyist	Country	YearsCode	ConvertedComp	LanguageWorkedWith	SocialMedia	Age	Gender
Respondent
1	I am a student who is learning to code	Yes	United Kingdom	4	NaN	HTML/CSS;Java;JavaScript;Python	Twitter	14.0	Man
2	I am a student who is learning to code	No	Bosnia and Herzegovina	NaN	NaN	C++;HTML/CSS;Python	Instagram	19.0	Man
3	I am not primarily a developer, but I write co...	Yes	Thailand	3	8820.0	HTML/CSS	Reddit	28.0	Man
4	I am a developer by profession	No	United States	3	61000.0	C;C++;C#;Python;SQL	Reddit	22.0	Man
5	I am a developer by profession	Yes	Ukraine	16	NaN	C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA	Facebook	30.0	Man
...	...	...	...	...	...	...	...	...	...
88377	NaN	Yes	Canada	NaN	NaN	HTML/CSS;JavaScript;Other(s):	YouTube	NaN	Man
88601	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88802	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88816	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88863	NaN	Yes	Spain	8	NaN	Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...	WhatsApp	18.0	Man

88883 rows × 9 columns

In [47]:

df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()

Out[47]:

WhatsApp                    266
Facebook                    232
YouTube                     182
LinkedIn                     71
Twitter                      58
Instagram                    41
Reddit                       28
I don't use social media     23
Snapchat                      5
Hello                         1
VK ВКонта́кте                 1
Name: SocialMedia, dtype: int64

In [ ]:

In [44]:

df.loc[df.Country =='Pakistan', :]
df.loc[df.Country =='Pakistan', 'SocialMedia'].head(10)
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts(normalize=True)
df.loc[df.Country =='China', 'SocialMedia'].value_counts()

Out[44]:

WeChat 微信                   403
YouTube                      53
Weibo 新浪微博                   42
I don't use social media     27
Twitter                      27
Reddit                       12
LinkedIn                     11
Facebook                      8
Instagram                     7
Youku Tudou 优酷                7
WhatsApp                      3
VK ВКонта́кте                 1
Name: SocialMedia, dtype: int64

In [ ]:

In [48]:

df.groupby('Country')

Out[48]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92e94b250>

In [49]:

df.groupby('Country').get_group("Pakistan")

Out[49]:

	MainBranch	Hobbyist	Country	YearsCode	ConvertedComp	LanguageWorkedWith	SocialMedia	Age	Gender
Respondent
84	I am a developer by profession	No	Pakistan	3	3468.0	C;C++;C#;Java;Kotlin;PHP;SQL	WhatsApp	26.0	Man
119	I am a developer by profession	No	Pakistan	10	NaN	C;C++;C#;HTML/CSS;Java;JavaScript;SQL	Facebook	28.0	Man
298	I am a developer by profession	Yes	Pakistan	4	NaN	HTML/CSS;JavaScript;PHP;SQL;Other(s):	LinkedIn	23.0	Man
299	I am a developer by profession	Yes	Pakistan	19	NaN	Assembly;C;C++;Java;Python;SQL	Facebook	25.0	Man
311	I am a developer by profession	No	Pakistan	5	2600.0	Assembly;C;C++;C#;HTML/CSS;Java;Python;Scala;SQL	LinkedIn	24.0	Man
...	...	...	...	...	...	...	...	...	...
88862	I am a student who is learning to code	Yes	Pakistan	3	NaN	Java	WhatsApp	21.0	Man
5439	NaN	Yes	Pakistan	2	NaN	NaN	Instagram	24.0	Woman
39117	NaN	Yes	Pakistan	4	NaN	C;C++;C#;HTML/CSS;Java;JavaScript;SQL	WhatsApp	22.0	Man
60066	NaN	Yes	Pakistan	4	NaN	Assembly;C++;C#;HTML/CSS;Java;PHP;Python;SQL	YouTube	20.0	Man
88182	NaN	Yes	Pakistan	1	NaN	HTML/CSS;Java;JavaScript	Twitter	NaN	Man

923 rows × 9 columns

In [50]:

df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia']

Out[50]:

Respondent
84        WhatsApp
119       Facebook
298       LinkedIn
299       Facebook
311       LinkedIn
           ...    
88862     WhatsApp
5439     Instagram
39117     WhatsApp
60066      YouTube
88182      Twitter
Name: SocialMedia, Length: 923, dtype: object

In [51]:

df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia'].value_counts()

Out[51]:

WhatsApp                    266
Facebook                    232
YouTube                     182
LinkedIn                     71
Twitter                      58
Instagram                    41
Reddit                       28
I don't use social media     23
Snapchat                      5
Hello                         1
VK ВКонта́кте                 1
Name: SocialMedia, dtype: int64

In [54]:

df.groupby('Country')['SocialMedia'].value_counts().head(60)

Out[54]:

Country              SocialMedia             
Afghanistan          Facebook                     15
                     YouTube                       9
                     I don't use social media      6
                     WhatsApp                      4
                     Instagram                     1
                     LinkedIn                      1
                     Twitter                       1
Albania              WhatsApp                     18
                     Facebook                     16
                     Instagram                    13
                     YouTube                      10
                     Twitter                       8
                     LinkedIn                      7
                     Reddit                        6
                     I don't use social media      4
                     Snapchat                      1
                     WeChat 微信                     1
Algeria              YouTube                      42
                     Facebook                     41
                     Twitter                      14
                     LinkedIn                      9
                     I don't use social media      7
                     Instagram                     7
                     Reddit                        2
                     WhatsApp                      2
Andorra              Facebook                      2
                     I don't use social media      2
                     Reddit                        1
                     Snapchat                      1
                     YouTube                       1
Angola               Twitter                       2
                     Facebook                      1
                     YouTube                       1
Antigua and Barbuda  YouTube                       2
                     Facebook                      1
                     Instagram                     1
                     Twitter                       1
                     VK ВКонта́кте                 1
Argentina            WhatsApp                    202
                     Twitter                      80
                     YouTube                      62
                     Instagram                    53
                     Reddit                       50
                     Facebook                     45
                     I don't use social media     22
                     LinkedIn                     16
Armenia              Facebook                     38
                     YouTube                      23
                     LinkedIn                      5
                     Instagram                     3
                     Reddit                        3
                     Twitter                       3
                     WhatsApp                      3
                     VK ВКонта́кте                 2
                     I don't use social media      1
Australia            Reddit                      407
                     Facebook                    320
                     Twitter                     283
                     YouTube                     242
                     I don't use social media    156
Name: SocialMedia, dtype: int64

In [ ]:

df.groupby('Country')['SocialMedia'].value_counts().head(50)

In [ ]:

df.groupby('Country')['SocialMedia'].value_counts().head(50)

In [ ]:

Question 2:¶

What percentage of people in each country knows Python programming?

tc = Total count of people from each country who participated in the survey?

pc = Python People: Count of people from each country who knows Python

tc (option 1):

In [55]:

df

Out[55]:

	MainBranch	Hobbyist	Country	YearsCode	ConvertedComp	LanguageWorkedWith	SocialMedia	Age	Gender
Respondent
1	I am a student who is learning to code	Yes	United Kingdom	4	NaN	HTML/CSS;Java;JavaScript;Python	Twitter	14.0	Man
2	I am a student who is learning to code	No	Bosnia and Herzegovina	NaN	NaN	C++;HTML/CSS;Python	Instagram	19.0	Man
3	I am not primarily a developer, but I write co...	Yes	Thailand	3	8820.0	HTML/CSS	Reddit	28.0	Man
4	I am a developer by profession	No	United States	3	61000.0	C;C++;C#;Python;SQL	Reddit	22.0	Man
5	I am a developer by profession	Yes	Ukraine	16	NaN	C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA	Facebook	30.0	Man
...	...	...	...	...	...	...	...	...	...
88377	NaN	Yes	Canada	NaN	NaN	HTML/CSS;JavaScript;Other(s):	YouTube	NaN	Man
88601	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88802	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88816	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88863	NaN	Yes	Spain	8	NaN	Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...	WhatsApp	18.0	Man

88883 rows × 9 columns

In [56]:

df.loc[:, 'Country']

Out[56]:

Respondent
1                United Kingdom
2        Bosnia and Herzegovina
3                      Thailand
4                 United States
5                       Ukraine
                  ...          
88377                    Canada
88601                       NaN
88802                       NaN
88816                       NaN
88863                     Spain
Name: Country, Length: 88883, dtype: object

In [57]:

tc = df['Country'].value_counts()
tc.name = 'Total'
tc

Out[57]:

United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: Total, Length: 179, dtype: int64

In [ ]:

tc (option 2):

In [58]:

dfgb = df.groupby('Country')
dfgb

Out[58]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92ea07b20>

In [59]:

df.groupby('Country')['Country']

Out[59]:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc92ea07220>

In [60]:

df.groupby('Country')['Country'].apply(lambda x: x.value_counts())

Out[60]:

Country                                                                   
Afghanistan                           Afghanistan                              44
Albania                               Albania                                  86
Algeria                               Algeria                                 134
Andorra                               Andorra                                   7
Angola                                Angola                                    5
                                                                             ... 
Venezuela, Bolivarian Republic of...  Venezuela, Bolivarian Republic of...     88
Viet Nam                              Viet Nam                                231
Yemen                                 Yemen                                    19
Zambia                                Zambia                                   12
Zimbabwe                              Zimbabwe                                 39
Name: Country, Length: 179, dtype: int64

In [ ]:

pc:

In [61]:

df.loc[:, 'LanguageWorkedWith']

Out[61]:

Respondent
1                          HTML/CSS;Java;JavaScript;Python
2                                      C++;HTML/CSS;Python
3                                                 HTML/CSS
4                                      C;C++;C#;Python;SQL
5              C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
                               ...                        
88377                        HTML/CSS;JavaScript;Other(s):
88601                                                  NaN
88802                                                  NaN
88816                                                  NaN
88863    Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...
Name: LanguageWorkedWith, Length: 88883, dtype: object

In [62]:

df.groupby('Country')['LanguageWorkedWith']

Out[62]:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc92ea4a460>

In [63]:

df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python'))

Out[63]:

Respondent
1         True
2         True
3        False
4         True
5         True
         ...  
88182    False
88212     True
88282    False
88377    False
88863    False
Name: LanguageWorkedWith, Length: 88751, dtype: object

In [64]:

pp = df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
pp

Out[64]:

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64

In [68]:

pp.name = 'Knows Python'

In [ ]:

Create a Dataframe of two series tc and pp

In [69]:

resultdf = pd.concat([tc, pp], axis=1)
resultdf

Out[69]:

	Total	Knows Python
United States	20949	10083
India	9061	3105
Germany	5866	2451
United Kingdom	5737	2384
Canada	3395	1558
...	...	...
Tonga	1	0
Timor-Leste	1	1
North Korea	1	0
Brunei Darussalam	1	0
Chad	1	0

179 rows × 2 columns

In [70]:

resultdf.loc['Pakistan']

Out[70]:

Total           923
Knows Python    251
Name: Pakistan, dtype: int64

In [ ]:

Percentage of people in each country knows Python?

In [71]:

resultdf['Percentage'] = (resultdf['Knows Python'] / resultdf['Total']) * 100
resultdf

Out[71]:

	Total	Knows Python	Percentage
United States	20949	10083	48.131176
India	9061	3105	34.267741
Germany	5866	2451	41.783157
United Kingdom	5737	2384	41.554820
Canada	3395	1558	45.891016
...	...	...	...
Tonga	1	0	0.000000
Timor-Leste	1	1	100.000000
North Korea	1	0	0.000000
Brunei Darussalam	1	0	0.000000
Chad	1	0	0.000000

179 rows × 3 columns

In [ ]:

resultdf.loc['Pakistan']

In [ ]:

Department of Data Science

Course: Tools and Techniques for Data Science

Instructor: Muhammad Arif Butt, Ph.D.

Lecture 3.17 (Pandas-09)

Aggregating and Grouping Dataframes.ipynb¶

Instructor: Muhammad Arif Butt, Ph.D.

Learning agenda of this notebook¶

1. Overview of Aggregation Functions and the agg() Method¶

a. Applying a Built-in Aggregation Function on Entire Dataframe Object¶

b. Applying a Built-in Aggregation Function on a Series Object¶

c. Applying a User-Defined/Lambda Function on a Series Object using the apply() Method¶

How to Compute the Minimum Temperature of Each City?¶

2. Doing it the Hard Way¶

a. Splitting the Dataframe¶

b. Applying the min() Function¶

c. Combining the Result¶

How to Compute the Minimum Temperature of Each City?¶

3. An Elegant Way¶

a. Step 1: Split Step¶

b. Step 2: Apply Step¶

b. Step 3: Combine Step¶

4. Practice GroupBy on Stack Overflow Survey Dataset¶

a. Understand the Data Set¶

Let us perform some basic statistical analysis on the Dataset¶

Let us try answering certain Questions

Question 1:¶

Question 2:¶

1. Overview of Aggregation Functions and the `agg()` Method¶

c. Applying a User-Defined/Lambda Function on a Series Object using the `apply()` Method¶

b. Applying the `min()` Function¶