by Fedor Iskhakov, ANU
Description: Introduction into DataFrames, grouping and data merging.
Collection of modules (libraries) used in scientific numerical computations:
NumPy
is widely-used scientific computing package for implements fast array processing — vectorizationSciPy
is a collection of functions that perform common scientific operations (optimization, root finding, interpolation, numerical integration, etc.)Pandas
is data manipulation package with special data types and methodsNumba
is just in time (JIT) compiler for a subset of Python and NumPy functionsMatplotlib
serves for making figures and plotsSeries
, DataFrame
)scikit-learn
and statsmodels
, which are built on top of pandasA DataFrame
combines multiple ‘columns’ of data into a
two-dimensional object, similar to a spreadsheet
import pandas as pd
data = pd.read_csv('./_static/data/recent-grads.csv')
# help(data) # more help on dataframe object and its methods
# Preview of the dataset
data.head(n=10)
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 | 6 | 2418 | NUCLEAR ENGINEERING | 2573.0 | 2200.0 | 373.0 | Engineering | 0.144967 | 17 | 1857 | ... | 264 | 1449 | 400 | 0.177226 | 65000 | 50000 | 102000 | 1142 | 657 | 244 |
6 | 7 | 6202 | ACTUARIAL SCIENCE | 3777.0 | 2110.0 | 1667.0 | Business | 0.441356 | 51 | 2912 | ... | 296 | 2482 | 308 | 0.095652 | 62000 | 53000 | 72000 | 1768 | 314 | 259 |
7 | 8 | 5001 | ASTRONOMY AND ASTROPHYSICS | 1792.0 | 832.0 | 960.0 | Physical Sciences | 0.535714 | 10 | 1526 | ... | 553 | 827 | 33 | 0.021167 | 62000 | 31500 | 109000 | 972 | 500 | 220 |
8 | 9 | 2414 | MECHANICAL ENGINEERING | 91227.0 | 80320.0 | 10907.0 | Engineering | 0.119559 | 1029 | 76442 | ... | 13101 | 54639 | 4650 | 0.057342 | 60000 | 48000 | 70000 | 52844 | 16384 | 3253 |
9 | 10 | 2408 | ELECTRICAL ENGINEERING | 81527.0 | 65511.0 | 16016.0 | Engineering | 0.196450 | 631 | 61928 | ... | 12695 | 41413 | 3895 | 0.059174 | 60000 | 45000 | 72000 | 45829 | 10874 | 3170 |
10 rows × 21 columns
The Economic Guide To Picking A College Major
https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/
Data dictionary available at
https://github.com/fivethirtyeight/data/tree/master/college-majors
# Info on the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 173 entries, 0 to 172 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Rank 173 non-null int64 1 Major_code 173 non-null int64 2 Major 173 non-null object 3 Total 172 non-null float64 4 Men 172 non-null float64 5 Women 172 non-null float64 6 Major_category 173 non-null object 7 ShareWomen 172 non-null float64 8 Sample_size 173 non-null int64 9 Employed 173 non-null int64 10 Full_time 173 non-null int64 11 Part_time 173 non-null int64 12 Full_time_year_round 173 non-null int64 13 Unemployed 173 non-null int64 14 Unemployment_rate 173 non-null float64 15 Median 173 non-null int64 16 P25th 173 non-null int64 17 P75th 173 non-null int64 18 College_jobs 173 non-null int64 19 Non_college_jobs 173 non-null int64 20 Low_wage_jobs 173 non-null int64 dtypes: float64(5), int64(14), object(2) memory usage: 28.5+ KB
This returns a Series
object
data1 = data['Major']
print('Type of data1 is ',type(data1))
data1.head()
Type of data1 is <class 'pandas.core.series.Series'>
0 PETROLEUM ENGINEERING 1 MINING AND MINERAL ENGINEERING 2 METALLURGICAL ENGINEERING 3 NAVAL ARCHITECTURE AND MARINE ENGINEERING 4 CHEMICAL ENGINEERING Name: Major, dtype: object
This returns a DataFrame
object again
data2 = data[['Major', 'ShareWomen']]
print('Type of data2 is ',type(data2))
data2.head()
Type of data2 is <class 'pandas.core.frame.DataFrame'>
Major | ShareWomen | |
---|---|---|
0 | PETROLEUM ENGINEERING | 0.120564 |
1 | MINING AND MINERAL ENGINEERING | 0.101852 |
2 | METALLURGICAL ENGINEERING | 0.153037 |
3 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 0.107313 |
4 | CHEMICAL ENGINEERING | 0.341631 |
data['Employment rate'] = data['Employed'] / data['Total']
data[['Total', 'Employed', 'Employment rate']].head()
Total | Employed | Employment rate | |
---|---|---|---|
0 | 2339.0 | 1976 | 0.844805 |
1 | 756.0 | 640 | 0.846561 |
2 | 856.0 | 648 | 0.757009 |
3 | 1258.0 | 758 | 0.602544 |
4 | 32260.0 | 25694 | 0.796466 |
data['Unemployment_rate'].mean() * 100
6.819083091329481
.describe()
returns useful summary statistics
data['Unemployment_rate'].describe()
count 173.000000 mean 0.068191 std 0.030331 min 0.000000 25% 0.050306 50% 0.067961 75% 0.087557 max 0.177226 Name: Unemployment_rate, dtype: float64
Pandas also provides a simple way to generate matplotlib plots
import matplotlib.pyplot as plt
%matplotlib inline
data.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(10, 8), color='red')
plt.xlabel('Share of women')
plt.ylabel('Median salary')
Text(0, 0.5, 'Median salary')
We can use integer slicing to select rows as follows
data[:3]
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 | 0.844805 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 | 0.846561 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 | 0.757009 |
3 rows × 22 columns
First we will sort our values by a column in the dataframe
data.sort_values(by='ShareWomen', ascending=False)[:3]
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
164 | 165 | 2307 | EARLY CHILDHOOD EDUCATION | 37589.0 | 1167.0 | 36422.0 | Education | 0.968954 | 342 | 32551 | ... | 20748 | 1360 | 0.040105 | 28000 | 21000 | 35000 | 23515 | 7705 | 2868 | 0.865971 |
163 | 164 | 6102 | COMMUNICATION DISORDERS SCIENCES AND SERVICES | 38279.0 | 1225.0 | 37054.0 | Health | 0.967998 | 95 | 29763 | ... | 14460 | 1487 | 0.047584 | 28000 | 20000 | 40000 | 19957 | 9404 | 5125 | 0.777528 |
51 | 52 | 6104 | MEDICAL ASSISTING SERVICES | 11123.0 | 803.0 | 10320.0 | Health | 0.927807 | 67 | 9168 | ... | 4290 | 407 | 0.042507 | 42000 | 30000 | 65000 | 2091 | 6948 | 1270 | 0.824238 |
3 rows × 22 columns
Another way to select rows is to use row labels, i.e. set a row index
Similar to the column labels, we can add row labels (the index)
data.set_index('Major_code').head()
Rank | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_code | |||||||||||||||||||||
2419 | 1 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | 1849 | ... | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 | 0.844805 |
2416 | 2 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | 556 | ... | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 | 0.846561 |
2415 | 3 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | 558 | ... | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 | 0.757009 |
2417 | 4 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | 1069 | ... | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 | 0.602544 |
2405 | 5 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | 23170 | ... | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 | 0.796466 |
5 rows × 21 columns
Note: we haven’t actually changed the DataFrame data
Need to overwrite data
with the new copy
data.head()
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 | 0.844805 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 | 0.846561 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 | 0.757009 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 | 0.602544 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 | 0.796466 |
5 rows × 22 columns
data = data.set_index('Major_code')
# Could also use data.set_index('Major_code', inplace=True)
data.head()
Rank | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_code | |||||||||||||||||||||
2419 | 1 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | 1849 | ... | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 | 0.844805 |
2416 | 2 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | 556 | ... | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 | 0.846561 |
2415 | 3 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | 558 | ... | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 | 0.757009 |
2417 | 4 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | 1069 | ... | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 | 0.602544 |
2405 | 5 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | 23170 | ... | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 | 0.796466 |
5 rows × 21 columns
loc
¶Using Major_code
variable values as labels for rows
data.loc[2405]
Rank 5 Major CHEMICAL ENGINEERING Total 32260 Men 21239 Women 11021 Major_category Engineering ShareWomen 0.341631 Sample_size 289 Employed 25694 Full_time 23170 Part_time 5180 Full_time_year_round 16697 Unemployed 1672 Unemployment_rate 0.0610977 Median 65000 P25th 50000 P75th 75000 College_jobs 18314 Non_college_jobs 4440 Low_wage_jobs 972 Employment rate 0.796466 Name: 2405, dtype: object
code_list = [6102, 5001]
data.loc[code_list]
Rank | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_code | |||||||||||||||||||||
6102 | 164 | COMMUNICATION DISORDERS SCIENCES AND SERVICES | 38279.0 | 1225.0 | 37054.0 | Health | 0.967998 | 95 | 29763 | 19975 | ... | 14460 | 1487 | 0.047584 | 28000 | 20000 | 40000 | 19957 | 9404 | 5125 | 0.777528 |
5001 | 8 | ASTRONOMY AND ASTROPHYSICS | 1792.0 | 832.0 | 960.0 | Physical Sciences | 0.535714 | 10 | 1526 | 1085 | ... | 827 | 33 | 0.021167 | 62000 | 31500 | 109000 | 972 | 500 | 220 | 0.851562 |
2 rows × 21 columns
Recommended for efficient selecting and filtering
data.sort_index(inplace=True)
data.head()
Rank | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_code | |||||||||||||||||||||
1100 | 65 | GENERAL AGRICULTURE | 10399.0 | 6053.0 | 4346.0 | Agriculture & Natural Resources | 0.417925 | 158 | 8884 | 7589 | ... | 5888 | 178 | 0.019642 | 40000 | 30000 | 50000 | 2418 | 4717 | 839 | 0.854313 |
1101 | 64 | AGRICULTURE PRODUCTION AND MANAGEMENT | 14240.0 | 9658.0 | 4582.0 | Agriculture & Natural Resources | 0.321770 | 273 | 12323 | 11119 | ... | 9093 | 649 | 0.050031 | 40000 | 25000 | 50000 | 1925 | 6221 | 1362 | 0.865379 |
1102 | 72 | AGRICULTURAL ECONOMICS | 2439.0 | 1749.0 | 690.0 | Agriculture & Natural Resources | 0.282903 | 44 | 2174 | 1819 | ... | 1528 | 182 | 0.077250 | 40000 | 27000 | 54000 | 535 | 893 | 94 | 0.891349 |
1103 | 153 | ANIMAL SCIENCES | 21573.0 | 5347.0 | 16226.0 | Agriculture & Natural Resources | 0.752144 | 255 | 17112 | 14479 | ... | 10824 | 917 | 0.050862 | 30000 | 22000 | 40000 | 5443 | 9571 | 2125 | 0.793214 |
1104 | 22 | FOOD SCIENCE | NaN | NaN | NaN | Agriculture & Natural Resources | NaN | 36 | 3149 | 2558 | ... | 1735 | 338 | 0.096931 | 53000 | 32000 | 70000 | 1183 | 1274 | 485 | NaN |
5 rows × 21 columns
Alternatively, we can filter our dataframe (select rows) using boolean conditions
selection = data['Major_category'] == 'Arts'
data[selection].head()
Rank | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_code | |||||||||||||||||||||
6000 | 150 | FINE ARTS | 74440.0 | 24786.0 | 49654.0 | Arts | 0.667034 | 623 | 59679 | 42764 | ... | 31877 | 5486 | 0.084186 | 30500 | 21000 | 41000 | 20792 | 32725 | 11880 | 0.801706 |
6001 | 167 | DRAMA AND THEATER ARTS | 43249.0 | 14440.0 | 28809.0 | Arts | 0.666119 | 357 | 36165 | 25147 | ... | 16891 | 3040 | 0.077541 | 27000 | 19200 | 35000 | 6994 | 25313 | 11068 | 0.836204 |
6002 | 147 | MUSIC | 60633.0 | 29909.0 | 30724.0 | Arts | 0.506721 | 419 | 47662 | 29010 | ... | 21425 | 3918 | 0.075960 | 31000 | 22300 | 42000 | 13752 | 28786 | 9286 | 0.786074 |
6003 | 154 | VISUAL AND PERFORMING ARTS | 16250.0 | 4133.0 | 12117.0 | Arts | 0.745662 | 132 | 12870 | 8447 | ... | 6322 | 1465 | 0.102197 | 30000 | 22000 | 40000 | 3849 | 7635 | 2840 | 0.792000 |
6004 | 96 | COMMERCIAL ART AND GRAPHIC DESIGN | 103480.0 | 32041.0 | 71439.0 | Arts | 0.690365 | 1186 | 83483 | 67448 | ... | 52243 | 8947 | 0.096798 | 35000 | 25000 | 45000 | 37389 | 38119 | 14839 | 0.806755 |
5 rows × 21 columns
selection = data['Major_category'] == 'Arts'
data[selection & (data['Total'] > 100000)].head()
Rank | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | ... | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_code | |||||||||||||||||||||
6004 | 96 | COMMERCIAL ART AND GRAPHIC DESIGN | 103480.0 | 32041.0 | 71439.0 | Arts | 0.690365 | 1186 | 83483 | 67448 | ... | 52243 | 8947 | 0.096798 | 35000 | 25000 | 45000 | 37389 | 38119 | 14839 | 0.806755 |
1 rows × 21 columns
We might want to summarize our data by grouping it by major categories
To do this, we will use the .groupby()
function
grouped = data.groupby('Major_category')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9ac0802990>
print(type(grouped))
print(type(grouped.groups))
grouped.groups
<class 'pandas.core.groupby.generic.DataFrameGroupBy'> <class 'dict'>
{'Agriculture & Natural Resources': Int64Index([1100, 1101, 1102, 1103, 1104, 1105, 1106, 1199, 1302, 1303], dtype='int64', name='Major_code'), 'Arts': Int64Index([6000, 6001, 6002, 6003, 6004, 6005, 6007, 6099], dtype='int64', name='Major_code'), 'Biology & Life Science': Int64Index([1301, 3600, 3601, 3602, 3603, 3604, 3605, 3606, 3607, 3608, 3609, 3611, 3699, 4006], dtype='int64', name='Major_code'), 'Business': Int64Index([6200, 6201, 6202, 6203, 6204, 6205, 6206, 6207, 6209, 6210, 6211, 6212, 6299], dtype='int64', name='Major_code'), 'Communications & Journalism': Int64Index([1901, 1902, 1903, 1904], dtype='int64', name='Major_code'), 'Computers & Mathematics': Int64Index([2001, 2100, 2101, 2102, 2105, 2106, 2107, 3700, 3701, 3702, 4005], dtype='int64', name='Major_code'), 'Education': Int64Index([2300, 2301, 2303, 2304, 2305, 2306, 2307, 2308, 2309, 2310, 2311, 2312, 2313, 2314, 2399, 3501], dtype='int64', name='Major_code'), 'Engineering': Int64Index([1401, 2400, 2401, 2402, 2403, 2404, 2405, 2406, 2407, 2408, 2409, 2410, 2411, 2412, 2413, 2414, 2415, 2416, 2417, 2418, 2419, 2499, 2500, 2501, 2502, 2503, 2504, 2599, 5008], dtype='int64', name='Major_code'), 'Health': Int64Index([4002, 6100, 6102, 6103, 6104, 6105, 6106, 6107, 6108, 6109, 6110, 6199], dtype='int64', name='Major_code'), 'Humanities & Liberal Arts': Int64Index([1501, 2601, 2602, 2603, 3301, 3302, 3401, 3402, 4001, 4801, 4901, 5502, 6006, 6402, 6403], dtype='int64', name='Major_code'), 'Industrial Arts & Consumer Services': Int64Index([2201, 2901, 3801, 4101, 5601, 5701, 5901], dtype='int64', name='Major_code'), 'Interdisciplinary': Int64Index([4000], dtype='int64', name='Major_code'), 'Law & Public Policy': Int64Index([3201, 3202, 5301, 5401, 5402], dtype='int64', name='Major_code'), 'Physical Sciences': Int64Index([5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5098, 5102], dtype='int64', name='Major_code'), 'Psychology & Social Work': Int64Index([5200, 5201, 5202, 5203, 5205, 5206, 5299, 5403, 5404], dtype='int64', name='Major_code'), 'Social Science': Int64Index([4007, 5500, 5501, 5503, 5504, 5505, 5506, 5507, 5599], dtype='int64', name='Major_code')}
To return an aggregated dataframe, we need to specify the function we would like pandas to use to aggregate our groups
A list of built-in aggregatation functions https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
grouped.mean()
Rank | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | Employment rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Major_category | |||||||||||||||||||
Agriculture & Natural Resources | 101.500000 | 8402.222222 | 4484.111111 | 3918.111111 | 0.405267 | 110.400000 | 6694.300000 | 5814.300000 | 1659.100000 | 4362.600000 | 382.400000 | 0.056328 | 36900.000000 | 25400.000000 | 48010.000000 | 1986.000000 | 3449.100000 | 789.900000 | 0.860603 |
Arts | 131.125000 | 44641.250000 | 16798.750000 | 27842.500000 | 0.603658 | 407.500000 | 36014.250000 | 25971.625000 | 14348.875000 | 19138.875000 | 3528.500000 | 0.090173 | 33062.500000 | 21962.500000 | 43662.500000 | 11848.125000 | 20465.000000 | 7514.500000 | 0.815670 |
Biology & Life Science | 95.357143 | 32418.714286 | 13208.500000 | 19210.214286 | 0.587193 | 165.500000 | 21628.357143 | 17169.785714 | 8338.285714 | 11843.000000 | 1632.428571 | 0.060918 | 36421.428571 | 26614.285714 | 46085.714286 | 10802.357143 | 9084.428571 | 3053.000000 | 0.700543 |
Business | 55.846154 | 100182.769231 | 51373.230769 | 48809.538462 | 0.483198 | 1192.692308 | 83749.384615 | 76066.923077 | 15148.923077 | 60801.923077 | 6144.384615 | 0.071064 | 43538.461538 | 33461.538462 | 54846.153846 | 11426.000000 | 38197.692308 | 9752.923077 | 0.829638 |
Communications & Journalism | 104.000000 | 98150.250000 | 32980.250000 | 65170.000000 | 0.658384 | 1127.000000 | 82665.000000 | 68332.500000 | 22454.250000 | 53557.000000 | 6713.000000 | 0.075538 | 34500.000000 | 26250.000000 | 44975.000000 | 21639.000000 | 43248.000000 | 12398.750000 | 0.844534 |
Computers & Mathematics | 57.000000 | 27182.545455 | 18975.000000 | 8207.545455 | 0.311772 | 260.000000 | 21626.727273 | 18867.727273 | 4842.727273 | 14468.727273 | 1670.272727 | 0.084256 | 42745.454545 | 29290.909091 | 58090.909091 | 12532.636364 | 6769.363636 | 1466.909091 | 0.799617 |
Education | 130.375000 | 34945.562500 | 6470.375000 | 28475.187500 | 0.748507 | 296.375000 | 29989.937500 | 24878.687500 | 7537.062500 | 18001.937500 | 1560.562500 | 0.051702 | 32350.000000 | 26590.625000 | 38562.500000 | 21169.562500 | 7610.062500 | 2554.375000 | 0.856697 |
Engineering | 22.620690 | 18537.344828 | 14079.551724 | 4457.793103 | 0.238889 | 169.862069 | 14495.586207 | 13167.827586 | 2935.724138 | 9963.862069 | 1028.172414 | 0.063334 | 57382.758621 | 41555.172414 | 70448.275862 | 9302.310345 | 3530.448276 | 864.793103 | 0.787311 |
Health | 96.500000 | 38602.500000 | 6293.083333 | 32309.416667 | 0.795152 | 326.166667 | 31012.250000 | 24568.250000 | 9549.333333 | 19034.833333 | 1851.083333 | 0.065920 | 36825.000000 | 26166.666667 | 50250.000000 | 20453.416667 | 9208.000000 | 2605.833333 | 0.758078 |
Humanities & Liberal Arts | 135.066667 | 47564.533333 | 18189.733333 | 29374.800000 | 0.631790 | 356.000000 | 36274.533333 | 27795.933333 | 14268.666667 | 19704.066667 | 3406.733333 | 0.081008 | 31913.333333 | 23493.333333 | 42073.333333 | 12843.333333 | 18435.466667 | 6282.666667 | 0.767983 |
Industrial Arts & Consumer Services | 105.142857 | 32827.428571 | 14825.857143 | 18001.571429 | 0.349523 | 309.285714 | 27006.142857 | 21626.142857 | 8731.714286 | 16311.285714 | 1646.571429 | 0.048071 | 36342.857143 | 26771.428571 | 45142.857143 | 8171.428571 | 14945.714286 | 3798.571429 | 0.715442 |
Interdisciplinary | 110.000000 | 12296.000000 | 2817.000000 | 9479.000000 | 0.770901 | 128.000000 | 9821.000000 | 8032.000000 | 3173.000000 | 6234.000000 | 749.000000 | 0.070861 | 35000.000000 | 25000.000000 | 44000.000000 | 5176.000000 | 3903.000000 | 1061.000000 | 0.798715 |
Law & Public Policy | 64.600000 | 35821.400000 | 18225.800000 | 17595.600000 | 0.483649 | 387.000000 | 28958.000000 | 25388.000000 | 7642.600000 | 20090.800000 | 2699.000000 | 0.090805 | 42200.000000 | 32640.000000 | 55000.000000 | 5844.200000 | 20004.800000 | 4144.000000 | 0.770304 |
Physical Sciences | 67.600000 | 18547.900000 | 9539.000000 | 9008.900000 | 0.508683 | 113.700000 | 13923.100000 | 11285.200000 | 4344.400000 | 8563.500000 | 788.000000 | 0.046511 | 41890.000000 | 28350.000000 | 57290.000000 | 7655.200000 | 4946.900000 | 1407.800000 | 0.776080 |
Psychology & Social Work | 143.000000 | 53445.222222 | 10901.666667 | 42543.555556 | 0.794397 | 353.333333 | 42260.444444 | 32111.111111 | 15332.444444 | 24233.888889 | 3699.111111 | 0.072065 | 30100.000000 | 25333.333333 | 38777.777778 | 18256.111111 | 18818.444444 | 6249.555556 | 0.795633 |
Social Science | 91.666667 | 58885.111111 | 28537.111111 | 30348.000000 | 0.553962 | 509.000000 | 44610.333333 | 38571.222222 | 13507.666667 | 28357.666667 | 4775.000000 | 0.095729 | 37344.444444 | 28355.555556 | 50111.111111 | 12662.222222 | 21138.444444 | 6020.000000 | 0.770296 |
grouped['Median'].mean()
Major_category Agriculture & Natural Resources 36900.000000 Arts 33062.500000 Biology & Life Science 36421.428571 Business 43538.461538 Communications & Journalism 34500.000000 Computers & Mathematics 42745.454545 Education 32350.000000 Engineering 57382.758621 Health 36825.000000 Humanities & Liberal Arts 31913.333333 Industrial Arts & Consumer Services 36342.857143 Interdisciplinary 35000.000000 Law & Public Policy 42200.000000 Physical Sciences 41890.000000 Psychology & Social Work 30100.000000 Social Science 37344.444444 Name: Median, dtype: float64
grouped['Median'].agg(['mean', 'median', 'std'])
mean | median | std | |
---|---|---|---|
Major_category | |||
Agriculture & Natural Resources | 36900.000000 | 35000 | 6935.416354 |
Arts | 33062.500000 | 30750 | 7223.164621 |
Biology & Life Science | 36421.428571 | 36300 | 4528.912006 |
Business | 43538.461538 | 40000 | 7774.052832 |
Communications & Journalism | 34500.000000 | 35000 | 1000.000000 |
Computers & Mathematics | 42745.454545 | 45000 | 5108.691346 |
Education | 32350.000000 | 32750 | 3892.728263 |
Engineering | 57382.758621 | 57000 | 13626.079747 |
Health | 36825.000000 | 35000 | 5776.460854 |
Humanities & Liberal Arts | 31913.333333 | 32000 | 3393.032076 |
Industrial Arts & Consumer Services | 36342.857143 | 35000 | 7290.829204 |
Interdisciplinary | 35000.000000 | 35000 | NaN |
Law & Public Policy | 42200.000000 | 36000 | 9066.421565 |
Physical Sciences | 41890.000000 | 39500 | 8251.659766 |
Psychology & Social Work | 30100.000000 | 30000 | 5381.914158 |
Social Science | 37344.444444 | 38000 | 4750.555523 |
grouped['Median'].mean().plot(kind='bar', figsize=(10, 8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f9b20472a10>
Simple example with fictitious data
raw_data_1 = {'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
# Assign each to a dataframe called data1, data2, data3
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
# Print data 1
data1
subject_id | first_name | last_name | |
---|---|---|---|
0 | 1 | Alex | Anderson |
1 | 2 | Amy | Ackerman |
2 | 3 | Allen | Ali |
3 | 4 | Alice | Aoni |
4 | 5 | Ayoung | Atiches |
# Print data 2
data2
subject_id | first_name | last_name | |
---|---|---|---|
0 | 4 | Billy | Bonder |
1 | 5 | Brian | Black |
2 | 6 | Bran | Balwner |
3 | 7 | Bryce | Brice |
4 | 8 | Betty | Btisan |
# Print data 3
data3
subject_id | test_id | |
---|---|---|
0 | 1 | 51 |
1 | 2 | 15 |
2 | 3 | 15 |
3 | 4 | 61 |
4 | 5 | 16 |
5 | 7 | 14 |
6 | 8 | 15 |
7 | 9 | 1 |
8 | 10 | 61 |
9 | 11 | 16 |
# Join the two dataframes along rows
data_all_rows = pd.concat([data1, data2])
data_all_rows
subject_id | first_name | last_name | |
---|---|---|---|
0 | 1 | Alex | Anderson |
1 | 2 | Amy | Ackerman |
2 | 3 | Allen | Ali |
3 | 4 | Alice | Aoni |
4 | 5 | Ayoung | Atiches |
0 | 4 | Billy | Bonder |
1 | 5 | Brian | Black |
2 | 6 | Bran | Balwner |
3 | 7 | Bryce | Brice |
4 | 8 | Betty | Btisan |
# Join the two dataframes along columns
data_all_col = pd.concat([data1, data2], axis = 1)
data_all_col
subject_id | first_name | last_name | subject_id | first_name | last_name | |
---|---|---|---|---|---|---|
0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
1 | 2 | Amy | Ackerman | 5 | Brian | Black |
2 | 3 | Allen | Ali | 6 | Bran | Balwner |
3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
# Merge data_all_rows and data3 along the subject_id value
pd.merge(data_all_rows, data3, on='subject_id')
subject_id | first_name | last_name | test_id | |
---|---|---|---|---|
0 | 1 | Alex | Anderson | 51 |
1 | 2 | Amy | Ackerman | 15 |
2 | 3 | Allen | Ali | 15 |
3 | 4 | Alice | Aoni | 61 |
4 | 4 | Billy | Bonder | 61 |
5 | 5 | Ayoung | Atiches | 16 |
6 | 5 | Brian | Black | 16 |
7 | 7 | Bryce | Brice | 14 |
8 | 8 | Betty | Btisan | 15 |
# Inner merge
pd.merge(data1, data2, on='subject_id', how='inner')
subject_id | first_name_x | last_name_x | first_name_y | last_name_y | |
---|---|---|---|---|---|
0 | 4 | Alice | Aoni | Billy | Bonder |
1 | 5 | Ayoung | Atiches | Brian | Black |
# Outer merge
pd.merge(data1, data2, on='subject_id', how='outer')
subject_id | first_name_x | last_name_x | first_name_y | last_name_y | |
---|---|---|---|---|---|
0 | 1 | Alex | Anderson | NaN | NaN |
1 | 2 | Amy | Ackerman | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN |
3 | 4 | Alice | Aoni | Billy | Bonder |
4 | 5 | Ayoung | Atiches | Brian | Black |
5 | 6 | NaN | NaN | Bran | Balwner |
6 | 7 | NaN | NaN | Bryce | Brice |
7 | 8 | NaN | NaN | Betty | Btisan |