agg()
methodgroupby
agg()
Method¶import pandas as pd
df = pd.read_csv('datasets/groupbydata2.csv')
df
date | city | temperature | humidity | |
---|---|---|---|---|
0 | 01/01/2022 | lahore | 8 | 60 |
1 | 02/01/2022 | lahore | 10 | 58 |
2 | 03/01/2022 | lahore | 5 | 51 |
3 | 04/01/2022 | lahore | 6 | 49 |
4 | 05/01/2022 | lahore | 12 | 54 |
5 | 01/01/2022 | karachi | 18 | 74 |
6 | 02/01/2022 | karachi | 10 | 71 |
7 | 03/01/2022 | karachi | 12 | 78 |
8 | 04/01/2022 | karachi | 15 | 76 |
9 | 05/01/2022 | karachi | 16 | 70 |
10 | 01/01/2022 | murree | -5 | 61 |
11 | 02/01/2022 | murree | -3 | 68 |
12 | 03/01/2022 | murree | -4 | 69 |
13 | 04/01/2022 | murree | -1 | 63 |
14 | 05/01/2022 | murree | -7 | 72 |
df.min()
date 01/01/2022 city karachi temperature -7 humidity 49 dtype: object
df.count()
date 15 city 15 temperature 15 humidity 15 dtype: int64
# Should be applied to numeric columns only, may raise a warning
df.median()
/var/folders/1t/g3ylw8h50cjdqmk5d6jh1qmm0000gn/T/ipykernel_3651/2026153972.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction. df.median()
temperature 8.0 humidity 68.0 dtype: float64
df.median(numeric_only=True)
temperature 8.0 humidity 68.0 dtype: float64
We can call the
agg()
method on the dataframe to apply multiple aggregation functions at a time, by passing theagg()
function a list of aggregation functions as strings.
df.agg(['min', 'max', 'count'])
date | city | temperature | humidity | |
---|---|---|---|---|
min | 01/01/2022 | karachi | -7 | 49 |
max | 05/01/2022 | murree | 18 | 78 |
count | 15 | 15 | 15 | 15 |
We can call the
describe()
method on the dataframe to get descriptive statistical measures on all its numeric columns.
df.describe()
temperature | humidity | |
---|---|---|
count | 15.000000 | 15.000000 |
mean | 6.133333 | 64.933333 |
std | 8.253715 | 9.153194 |
min | -7.000000 | 49.000000 |
25% | -2.000000 | 59.000000 |
50% | 8.000000 | 68.000000 |
75% | 12.000000 | 71.500000 |
max | 18.000000 | 78.000000 |
df['temperature'].min()
-7
df['temperature'].max()
18
df['temperature'].mean()
6.133333333333334
We can call the
agg()
method on a series to apply multiple aggregation functions at a time, by passing theagg()
function a list of aggregation functions as strings.
df['temperature'].agg(['min', 'max', 'mean', 'count'])
We can call the
describe()
method on the dataframe to get descriptive statistical measures on all its numeric columns.
df['temperature'].describe()
apply()
Method¶apply()
method before as well that is used to invoke function on values of Series and return a resulting series.df.temperature
0 8 1 10 2 5 3 6 4 12 5 18 6 10 7 12 8 15 9 16 10 -5 11 -3 12 -4 13 -1 14 -7 Name: temperature, dtype: int64
def ctof(x):
return x*9/5+32
df.temperature.apply(ctof)
0 46.4 1 50.0 2 41.0 3 42.8 4 53.6 5 64.4 6 50.0 7 53.6 8 59.0 9 60.8 10 23.0 11 26.6 12 24.8 13 30.2 14 19.4 Name: temperature, dtype: float64
df.temperature.apply(lambda x: x*9/5+32)
0 46.4 1 50.0 2 41.0 3 42.8 4 53.6 5 64.4 6 50.0 7 53.6 8 59.0 9 60.8 10 23.0 11 26.6 12 24.8 13 30.2 14 19.4 Name: temperature, dtype: float64
import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df
date | city | temperature | |
---|---|---|---|
0 | 01/01/2022 | lahore | 8 |
1 | 02/01/2022 | lahore | 10 |
2 | 03/01/2022 | lahore | 5 |
3 | 04/01/2022 | lahore | 6 |
4 | 05/01/2022 | lahore | 12 |
5 | 01/01/2022 | karachi | 18 |
6 | 02/01/2022 | karachi | 10 |
7 | 03/01/2022 | karachi | 12 |
8 | 04/01/2022 | karachi | 15 |
9 | 05/01/2022 | karachi | 16 |
10 | 01/01/2022 | murree | -5 |
11 | 02/01/2022 | murree | -3 |
12 | 03/01/2022 | murree | -4 |
13 | 04/01/2022 | murree | -1 |
14 | 05/01/2022 | murree | -7 |
df[]
subscript operatordf.loc
methoddf[df['city']=='karachi']
date | city | temperature | |
---|---|---|---|
5 | 01/01/2022 | karachi | 18 |
6 | 02/01/2022 | karachi | 10 |
7 | 03/01/2022 | karachi | 12 |
8 | 04/01/2022 | karachi | 15 |
9 | 05/01/2022 | karachi | 16 |
df[df['city']=='lahore']
df.loc[df.city=='lahore', :]
date | city | temperature | |
---|---|---|---|
0 | 01/01/2022 | lahore | 8 |
1 | 02/01/2022 | lahore | 10 |
2 | 03/01/2022 | lahore | 5 |
3 | 04/01/2022 | lahore | 6 |
4 | 05/01/2022 | lahore | 12 |
df[df['city']=='karachi']
df.loc[df.city=='karachi', :]
date | city | temperature | |
---|---|---|---|
5 | 01/01/2022 | karachi | 18 |
6 | 02/01/2022 | karachi | 10 |
7 | 03/01/2022 | karachi | 12 |
8 | 04/01/2022 | karachi | 15 |
9 | 05/01/2022 | karachi | 16 |
df[df['city']=='murree']
df.loc[df.city=='murree', :]
date | city | temperature | |
---|---|---|---|
10 | 01/01/2022 | murree | -5 |
11 | 02/01/2022 | murree | -3 |
12 | 03/01/2022 | murree | -4 |
13 | 04/01/2022 | murree | -1 |
14 | 05/01/2022 | murree | -7 |
Limitation:
- We have to repeat this process for every city separately.
- What if there are over 100 cities in the dataset?
min()
Function¶min()
function on the temperature column of all of the above dataframes separatelydf.loc[df.city=='lahore', :].temperature.min()
df.loc[df.city=='lahore', :].temperature.min()
df.loc[df.city=='karachi', :].temperature.min()
df.loc[df.city=='murree', :].temperature.min()
Limitation:
- We have to repeat this process for every city separately.
- What if there are over 100 cities in the dataset?
lhr = df.loc[df.city=='lahore', :].temperature.min()
kci = df.loc[df.city=='karachi', :].temperature.min()
murree = df.loc[df.city=='murree', :].temperature.min()
s = pd.Series(data=[lhr, kci, murree], index=['L_min', 'K_min', 'M_min'] )
s.name= 'Min Temperatures'
s
import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df
date | city | temperature | |
---|---|---|---|
0 | 01/01/2022 | lahore | 8 |
1 | 02/01/2022 | lahore | 10 |
2 | 03/01/2022 | lahore | 5 |
3 | 04/01/2022 | lahore | 6 |
4 | 05/01/2022 | lahore | 12 |
5 | 01/01/2022 | karachi | 18 |
6 | 02/01/2022 | karachi | 10 |
7 | 03/01/2022 | karachi | 12 |
8 | 04/01/2022 | karachi | 15 |
9 | 05/01/2022 | karachi | 16 |
10 | 01/01/2022 | murree | -5 |
11 | 02/01/2022 | murree | -3 |
12 | 03/01/2022 | murree | -4 |
13 | 04/01/2022 | murree | -1 |
14 | 05/01/2022 | murree | -7 |
groupby()
method on the city
column of the dataframe.by
argument passed to the groupby()
methoddfgb = df.groupby('city')
dfgb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92ad60d90>
- Since this is an iterable, so let us iterate :)
for mydf in dfgb:
print(mydf)
('karachi', date city temperature 5 01/01/2022 karachi 18 6 02/01/2022 karachi 10 7 03/01/2022 karachi 12 8 04/01/2022 karachi 15 9 05/01/2022 karachi 16) ('lahore', date city temperature 0 01/01/2022 lahore 8 1 02/01/2022 lahore 10 2 03/01/2022 lahore 5 3 04/01/2022 lahore 6 4 05/01/2022 lahore 12) ('murree', date city temperature 10 01/01/2022 murree -5 11 02/01/2022 murree -3 12 03/01/2022 murree -4 13 04/01/2022 murree -1 14 05/01/2022 murree -7)
- To display indices of every group in the dataframe, use
groups
attribute ofDataFrameGroupBy
object.- Returns a Dictionary object (PrettyDict) with keys as the group value and value as list of corresponding row indices
dfgb.groups # df.groupby('city').groups
{'karachi': [5, 6, 7, 8, 9], 'lahore': [0, 1, 2, 3, 4], 'murree': [10, 11, 12, 13, 14]}
- To display records of a specific group, use
get_group()
method onDataFrameGroupBy
object.- Construct and return DataFrame from
DataFrameGroupBy
object with provided name.
# Display DataFrame of a specific group from groupby object by providing the specific group value
dfgb.get_group('murree') # df.groupby('city').get_group('karachi')
date | city | temperature | |
---|---|---|---|
10 | 01/01/2022 | murree | -5 |
11 | 02/01/2022 | murree | -3 |
12 | 03/01/2022 | murree | -4 |
13 | 04/01/2022 | murree | -1 |
14 | 05/01/2022 | murree | -7 |
- To find the size of each group, use
size()
method of DataFrameGroupBy object.- It return a series containing number of rows in each each group of the DataFrameGroupBy object as a Series
dfgb.size() #df.groupby('city').size()
city karachi 5 lahore 5 murree 5 dtype: int64
After understanding the
groupby()
method let us move to step 2, and that isApplying a Function
Let us first apply aggregate function on a specific column of DataFrameGroupBy
object, which is a SeriesGroupBy
object
df
date | city | temperature | |
---|---|---|---|
0 | 01/01/2022 | lahore | 8 |
1 | 02/01/2022 | lahore | 10 |
2 | 03/01/2022 | lahore | 5 |
3 | 04/01/2022 | lahore | 6 |
4 | 05/01/2022 | lahore | 12 |
5 | 01/01/2022 | karachi | 18 |
6 | 02/01/2022 | karachi | 10 |
7 | 03/01/2022 | karachi | 12 |
8 | 04/01/2022 | karachi | 15 |
9 | 05/01/2022 | karachi | 16 |
10 | 01/01/2022 | murree | -5 |
11 | 02/01/2022 | murree | -3 |
12 | 03/01/2022 | murree | -4 |
13 | 04/01/2022 | murree | -1 |
14 | 05/01/2022 | murree | -7 |
df.groupby('city')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92ade70d0>
df.groupby('city').get_group('lahore')
date | city | temperature | |
---|---|---|---|
0 | 01/01/2022 | lahore | 8 |
1 | 02/01/2022 | lahore | 10 |
2 | 03/01/2022 | lahore | 5 |
3 | 04/01/2022 | lahore | 6 |
4 | 05/01/2022 | lahore | 12 |
df.groupby('city').get_group('lahore').temperature.min()
5
df.groupby('city').get_group('lahore').temperature.min()
5
df.groupby('city').get_group('karachi').temperature.min()
10
df.groupby('city').get_group('murree').temperature.min()
-7
kci = df.groupby('city').get_group('karachi').temperature.min()
lhr = df.groupby('city').get_group('lahore').temperature.min()
murree = df.groupby('city').get_group('murree').temperature.min()
s1 = pd.Series(data=[kci, lhr, murree], index=['K_min', 'L_min', 'M_min'] )
s1.name= 'Min Temperatures'
s1
K_min 10 L_min 5 M_min -7 Name: Min Temperatures, dtype: int64
- Let us perform the
apply + combine
steps in one go, by applying themin()
function on the temperature series of all the dataframes inside the DataFrameGroupBy object.- This saves us from the hassle of applying
min()
method explicitly as done above
df.groupby('city')
df.groupby('city').temperature
df.groupby('city').temperature.min()
- We can also apply
agg()
method on the temperature series of all the dataframes inside the DataFrameGroupBy object
df.groupby('city').temperature.agg(['min', 'max', 'sum', 'mean'])
-Note that we have got a dataframe this time
Visit to Download Data: https://insights.stackoverflow.com/survey/
import pandas as pd
df = pd.read_csv('datasets/so_survey_subset.csv', index_col='Respondent')
df.shape
(88883, 9)
df.head()
MainBranch | Hobbyist | Country | YearsCode | ConvertedComp | LanguageWorkedWith | SocialMedia | Age | Gender | |
---|---|---|---|---|---|---|---|---|---|
Respondent | |||||||||
1 | I am a student who is learning to code | Yes | United Kingdom | 4 | NaN | HTML/CSS;Java;JavaScript;Python | 14.0 | Man | |
2 | I am a student who is learning to code | No | Bosnia and Herzegovina | NaN | NaN | C++;HTML/CSS;Python | 19.0 | Man | |
3 | I am not primarily a developer, but I write co... | Yes | Thailand | 3 | 8820.0 | HTML/CSS | 28.0 | Man | |
4 | I am a developer by profession | No | United States | 3 | 61000.0 | C;C++;C#;Python;SQL | 22.0 | Man | |
5 | I am a developer by profession | Yes | Ukraine | 16 | NaN | C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | 30.0 | Man |
df.loc[df['Country']=='Pakistan', :]
import pandas as pd
schema = pd.read_csv('datasets/so_survey_subset_schema.csv', index_col='Column')
schema
QuestionText | |
---|---|
Column | |
Respondent | Randomized respondent ID number (not in order ... |
MainBranch | Which of the following options best describes ... |
Hobbyist | Do you code as a hobby? |
Country | In which country do you currently reside? |
YearsCode | Including any education, how many years have y... |
ConvertedComp | Salary converted to annual USD salaries using ... |
LanguageWorkedWith | Which of the following programming, scripting,... |
SocialMedia | What social media site do you use the most? |
Age | What is your age (in years)? If you prefer not... |
Gender | Which of the following do you currently identi... |
schema.loc['Hobbyist']
QuestionText Do you code as a hobby? Name: Hobbyist, dtype: object
df['Hobbyist']
schema.loc['Country']
df['Country']
schema.loc['ConvertedComp']
df['ConvertedComp']
schema.loc['LanguageWorkedWith']
!cat datasets/so_survey_subset_schema.csv
df['LanguageWorkedWith']
schema.loc['SocialMedia']
df['SocialMedia']
df
# Returns the count of non-NA values for a series object.
df['Hobbyist'].count()
# Returns a Series containing counts of unique rows.
df['Hobbyist'].value_counts()
# Returns the count of non-NA values for a series object.
df['Country'].count()
# Returns a Series containing counts of unique rows.
df['Country'].value_counts()
# To get the count of countries whose developers participated in the survey
df['Country'].value_counts().count()
# Returns the count of non-NA values for a series object.
df['ConvertedComp'].count()
# Returns a Series containing counts of unique rows.
df['ConvertedComp'].value_counts()
df['ConvertedComp'].mean()
df['ConvertedComp'].median()
df.describe()
List the most popular SocialMedia web site for every Country
Let us first do the easy task, and that is to list the most popular SocialMedia website of a single country (lets say Pakistan)
df
MainBranch | Hobbyist | Country | YearsCode | ConvertedComp | LanguageWorkedWith | SocialMedia | Age | Gender | |
---|---|---|---|---|---|---|---|---|---|
Respondent | |||||||||
1 | I am a student who is learning to code | Yes | United Kingdom | 4 | NaN | HTML/CSS;Java;JavaScript;Python | 14.0 | Man | |
2 | I am a student who is learning to code | No | Bosnia and Herzegovina | NaN | NaN | C++;HTML/CSS;Python | 19.0 | Man | |
3 | I am not primarily a developer, but I write co... | Yes | Thailand | 3 | 8820.0 | HTML/CSS | 28.0 | Man | |
4 | I am a developer by profession | No | United States | 3 | 61000.0 | C;C++;C#;Python;SQL | 22.0 | Man | |
5 | I am a developer by profession | Yes | Ukraine | 16 | NaN | C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | 30.0 | Man | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88377 | NaN | Yes | Canada | NaN | NaN | HTML/CSS;JavaScript;Other(s): | YouTube | NaN | Man |
88601 | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88802 | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88816 | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88863 | NaN | Yes | Spain | 8 | NaN | Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... | 18.0 | Man |
88883 rows × 9 columns
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()
WhatsApp 266 Facebook 232 YouTube 182 LinkedIn 71 Twitter 58 Instagram 41 Reddit 28 I don't use social media 23 Snapchat 5 Hello 1 VK ВКонта́кте 1 Name: SocialMedia, dtype: int64
df.loc[df.Country =='Pakistan', :]
df.loc[df.Country =='Pakistan', 'SocialMedia'].head(10)
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts(normalize=True)
df.loc[df.Country =='China', 'SocialMedia'].value_counts()
WeChat 微信 403 YouTube 53 Weibo 新浪微博 42 I don't use social media 27 Twitter 27 Reddit 12 LinkedIn 11 Facebook 8 Instagram 7 Youku Tudou 优酷 7 WhatsApp 3 VK ВКонта́кте 1 Name: SocialMedia, dtype: int64
df.groupby('Country')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92e94b250>
df.groupby('Country').get_group("Pakistan")
MainBranch | Hobbyist | Country | YearsCode | ConvertedComp | LanguageWorkedWith | SocialMedia | Age | Gender | |
---|---|---|---|---|---|---|---|---|---|
Respondent | |||||||||
84 | I am a developer by profession | No | Pakistan | 3 | 3468.0 | C;C++;C#;Java;Kotlin;PHP;SQL | 26.0 | Man | |
119 | I am a developer by profession | No | Pakistan | 10 | NaN | C;C++;C#;HTML/CSS;Java;JavaScript;SQL | 28.0 | Man | |
298 | I am a developer by profession | Yes | Pakistan | 4 | NaN | HTML/CSS;JavaScript;PHP;SQL;Other(s): | 23.0 | Man | |
299 | I am a developer by profession | Yes | Pakistan | 19 | NaN | Assembly;C;C++;Java;Python;SQL | 25.0 | Man | |
311 | I am a developer by profession | No | Pakistan | 5 | 2600.0 | Assembly;C;C++;C#;HTML/CSS;Java;Python;Scala;SQL | 24.0 | Man | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88862 | I am a student who is learning to code | Yes | Pakistan | 3 | NaN | Java | 21.0 | Man | |
5439 | NaN | Yes | Pakistan | 2 | NaN | NaN | 24.0 | Woman | |
39117 | NaN | Yes | Pakistan | 4 | NaN | C;C++;C#;HTML/CSS;Java;JavaScript;SQL | 22.0 | Man | |
60066 | NaN | Yes | Pakistan | 4 | NaN | Assembly;C++;C#;HTML/CSS;Java;PHP;Python;SQL | YouTube | 20.0 | Man |
88182 | NaN | Yes | Pakistan | 1 | NaN | HTML/CSS;Java;JavaScript | NaN | Man |
923 rows × 9 columns
df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia']
Respondent 84 WhatsApp 119 Facebook 298 LinkedIn 299 Facebook 311 LinkedIn ... 88862 WhatsApp 5439 Instagram 39117 WhatsApp 60066 YouTube 88182 Twitter Name: SocialMedia, Length: 923, dtype: object
df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia'].value_counts()
WhatsApp 266 Facebook 232 YouTube 182 LinkedIn 71 Twitter 58 Instagram 41 Reddit 28 I don't use social media 23 Snapchat 5 Hello 1 VK ВКонта́кте 1 Name: SocialMedia, dtype: int64
df.groupby('Country')['SocialMedia'].value_counts().head(60)
Country SocialMedia Afghanistan Facebook 15 YouTube 9 I don't use social media 6 WhatsApp 4 Instagram 1 LinkedIn 1 Twitter 1 Albania WhatsApp 18 Facebook 16 Instagram 13 YouTube 10 Twitter 8 LinkedIn 7 Reddit 6 I don't use social media 4 Snapchat 1 WeChat 微信 1 Algeria YouTube 42 Facebook 41 Twitter 14 LinkedIn 9 I don't use social media 7 Instagram 7 Reddit 2 WhatsApp 2 Andorra Facebook 2 I don't use social media 2 Reddit 1 Snapchat 1 YouTube 1 Angola Twitter 2 Facebook 1 YouTube 1 Antigua and Barbuda YouTube 2 Facebook 1 Instagram 1 Twitter 1 VK ВКонта́кте 1 Argentina WhatsApp 202 Twitter 80 YouTube 62 Instagram 53 Reddit 50 Facebook 45 I don't use social media 22 LinkedIn 16 Armenia Facebook 38 YouTube 23 LinkedIn 5 Instagram 3 Reddit 3 Twitter 3 WhatsApp 3 VK ВКонта́кте 2 I don't use social media 1 Australia Reddit 407 Facebook 320 Twitter 283 YouTube 242 I don't use social media 156 Name: SocialMedia, dtype: int64
df.groupby('Country')['SocialMedia'].value_counts().head(50)
df.groupby('Country')['SocialMedia'].value_counts().head(50)
What percentage of people in each country knows Python programming?
tc = Total count of people from each country who participated in the survey?
pc = Python People: Count of people from each country who knows Python
tc (option 1):
df
MainBranch | Hobbyist | Country | YearsCode | ConvertedComp | LanguageWorkedWith | SocialMedia | Age | Gender | |
---|---|---|---|---|---|---|---|---|---|
Respondent | |||||||||
1 | I am a student who is learning to code | Yes | United Kingdom | 4 | NaN | HTML/CSS;Java;JavaScript;Python | 14.0 | Man | |
2 | I am a student who is learning to code | No | Bosnia and Herzegovina | NaN | NaN | C++;HTML/CSS;Python | 19.0 | Man | |
3 | I am not primarily a developer, but I write co... | Yes | Thailand | 3 | 8820.0 | HTML/CSS | 28.0 | Man | |
4 | I am a developer by profession | No | United States | 3 | 61000.0 | C;C++;C#;Python;SQL | 22.0 | Man | |
5 | I am a developer by profession | Yes | Ukraine | 16 | NaN | C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | 30.0 | Man | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88377 | NaN | Yes | Canada | NaN | NaN | HTML/CSS;JavaScript;Other(s): | YouTube | NaN | Man |
88601 | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88802 | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88816 | NaN | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88863 | NaN | Yes | Spain | 8 | NaN | Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... | 18.0 | Man |
88883 rows × 9 columns
df.loc[:, 'Country']
Respondent 1 United Kingdom 2 Bosnia and Herzegovina 3 Thailand 4 United States 5 Ukraine ... 88377 Canada 88601 NaN 88802 NaN 88816 NaN 88863 Spain Name: Country, Length: 88883, dtype: object
tc = df['Country'].value_counts()
tc.name = 'Total'
tc
United States 20949 India 9061 Germany 5866 United Kingdom 5737 Canada 3395 ... Tonga 1 Timor-Leste 1 North Korea 1 Brunei Darussalam 1 Chad 1 Name: Total, Length: 179, dtype: int64
tc (option 2):
dfgb = df.groupby('Country')
dfgb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc92ea07b20>
df.groupby('Country')['Country']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc92ea07220>
df.groupby('Country')['Country'].apply(lambda x: x.value_counts())
Country Afghanistan Afghanistan 44 Albania Albania 86 Algeria Algeria 134 Andorra Andorra 7 Angola Angola 5 ... Venezuela, Bolivarian Republic of... Venezuela, Bolivarian Republic of... 88 Viet Nam Viet Nam 231 Yemen Yemen 19 Zambia Zambia 12 Zimbabwe Zimbabwe 39 Name: Country, Length: 179, dtype: int64
pc:
df.loc[:, 'LanguageWorkedWith']
Respondent 1 HTML/CSS;Java;JavaScript;Python 2 C++;HTML/CSS;Python 3 HTML/CSS 4 C;C++;C#;Python;SQL 5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA ... 88377 HTML/CSS;JavaScript;Other(s): 88601 NaN 88802 NaN 88816 NaN 88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript... Name: LanguageWorkedWith, Length: 88883, dtype: object
df.groupby('Country')['LanguageWorkedWith']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc92ea4a460>
df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python'))
Respondent 1 True 2 True 3 False 4 True 5 True ... 88182 False 88212 True 88282 False 88377 False 88863 False Name: LanguageWorkedWith, Length: 88751, dtype: object
pp = df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
pp
Country Afghanistan 8 Albania 23 Algeria 40 Andorra 0 Angola 2 .. Venezuela, Bolivarian Republic of... 28 Viet Nam 78 Yemen 3 Zambia 4 Zimbabwe 14 Name: LanguageWorkedWith, Length: 179, dtype: int64
pp.name = 'Knows Python'
Create a Dataframe of two series tc and pp
resultdf = pd.concat([tc, pp], axis=1)
resultdf
Total | Knows Python | |
---|---|---|
United States | 20949 | 10083 |
India | 9061 | 3105 |
Germany | 5866 | 2451 |
United Kingdom | 5737 | 2384 |
Canada | 3395 | 1558 |
... | ... | ... |
Tonga | 1 | 0 |
Timor-Leste | 1 | 1 |
North Korea | 1 | 0 |
Brunei Darussalam | 1 | 0 |
Chad | 1 | 0 |
179 rows × 2 columns
resultdf.loc['Pakistan']
Total 923 Knows Python 251 Name: Pakistan, dtype: int64
Percentage of people in each country knows Python?
resultdf['Percentage'] = (resultdf['Knows Python'] / resultdf['Total']) * 100
resultdf
Total | Knows Python | Percentage | |
---|---|---|---|
United States | 20949 | 10083 | 48.131176 |
India | 9061 | 3105 | 34.267741 |
Germany | 5866 | 2451 | 41.783157 |
United Kingdom | 5737 | 2384 | 41.554820 |
Canada | 3395 | 1558 | 45.891016 |
... | ... | ... | ... |
Tonga | 1 | 0 | 0.000000 |
Timor-Leste | 1 | 1 | 100.000000 |
North Korea | 1 | 0 | 0.000000 |
Brunei Darussalam | 1 | 0 | 0.000000 |
Chad | 1 | 0 | 0.000000 |
179 rows × 3 columns
resultdf.loc['Pakistan']