This project uses a cleaned dataset on college majors, sourced from the American Community Survey and available at this GitHub repo.
Each row in the dataset represents a different college major and contains information on gender diversity, employment rates, median salaries, and more.
This project focuses on exploring the data quickly through visualizations, using pandas plotting within the Jupyter notebook interface.
Some of the plots employed throughout the project:
Observations from the visualizations are summarized to answer the following questions on the dataset.
Do students in more popular majors make more money?
Do students that majored in subjects that were majority female make more money?
Is there any link between the number of full-time employees and median salary?
How many majors are predominantly male?
How many majors are predominantly female?
Which category of majors have the most students?
How do percentages of women compare between the top 10 and bottom 10 ranked majors?
How do unemployment rates from top 10 and bottom 10 ranked majors compare?
How does the number of men compare with the number of women in each category of majors?
import pandas as pd
import matplotlib
import numpy as np
# Setup Jupyter Notebook to display plots within
%matplotlib inline
#Dataset
csvfile = 'recent-grads.csv'
# Load into pandas dataframe
recent_grads = pd.read_csv(csvfile)
# column names
recent_grads.head(0)
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
---|
0 rows × 21 columns
# First row
#recent_grads.iloc[0]
#OR
recent_grads.head(1)
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 rows × 21 columns
# Look at how the data is structured by viewing a few rows at the top and bottom
# We could use df.head() and df.tail(), or let Jupyter render the dataframe by evaluating the variable itself
recent_grads
Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
5 | 6 | 2418 | NUCLEAR ENGINEERING | 2573.0 | 2200.0 | 373.0 | Engineering | 0.144967 | 17 | 1857 | ... | 264 | 1449 | 400 | 0.177226 | 65000 | 50000 | 102000 | 1142 | 657 | 244 |
6 | 7 | 6202 | ACTUARIAL SCIENCE | 3777.0 | 2110.0 | 1667.0 | Business | 0.441356 | 51 | 2912 | ... | 296 | 2482 | 308 | 0.095652 | 62000 | 53000 | 72000 | 1768 | 314 | 259 |
7 | 8 | 5001 | ASTRONOMY AND ASTROPHYSICS | 1792.0 | 832.0 | 960.0 | Physical Sciences | 0.535714 | 10 | 1526 | ... | 553 | 827 | 33 | 0.021167 | 62000 | 31500 | 109000 | 972 | 500 | 220 |
8 | 9 | 2414 | MECHANICAL ENGINEERING | 91227.0 | 80320.0 | 10907.0 | Engineering | 0.119559 | 1029 | 76442 | ... | 13101 | 54639 | 4650 | 0.057342 | 60000 | 48000 | 70000 | 52844 | 16384 | 3253 |
9 | 10 | 2408 | ELECTRICAL ENGINEERING | 81527.0 | 65511.0 | 16016.0 | Engineering | 0.196450 | 631 | 61928 | ... | 12695 | 41413 | 3895 | 0.059174 | 60000 | 45000 | 72000 | 45829 | 10874 | 3170 |
10 | 11 | 2407 | COMPUTER ENGINEERING | 41542.0 | 33258.0 | 8284.0 | Engineering | 0.199413 | 399 | 32506 | ... | 5146 | 23621 | 2275 | 0.065409 | 60000 | 45000 | 75000 | 23694 | 5721 | 980 |
11 | 12 | 2401 | AEROSPACE ENGINEERING | 15058.0 | 12953.0 | 2105.0 | Engineering | 0.139793 | 147 | 11391 | ... | 2724 | 8790 | 794 | 0.065162 | 60000 | 42000 | 70000 | 8184 | 2425 | 372 |
12 | 13 | 2404 | BIOMEDICAL ENGINEERING | 14955.0 | 8407.0 | 6548.0 | Engineering | 0.437847 | 79 | 10047 | ... | 2694 | 5986 | 1019 | 0.092084 | 60000 | 36000 | 70000 | 6439 | 2471 | 789 |
13 | 14 | 5008 | MATERIALS SCIENCE | 4279.0 | 2949.0 | 1330.0 | Engineering | 0.310820 | 22 | 3307 | ... | 878 | 1967 | 78 | 0.023043 | 60000 | 39000 | 65000 | 2626 | 391 | 81 |
14 | 15 | 2409 | ENGINEERING MECHANICS PHYSICS AND SCIENCE | 4321.0 | 3526.0 | 795.0 | Engineering | 0.183985 | 30 | 3608 | ... | 811 | 2004 | 23 | 0.006334 | 58000 | 25000 | 74000 | 2439 | 947 | 263 |
15 | 16 | 2402 | BIOLOGICAL ENGINEERING | 8925.0 | 6062.0 | 2863.0 | Engineering | 0.320784 | 55 | 6170 | ... | 1983 | 3413 | 589 | 0.087143 | 57100 | 40000 | 76000 | 3603 | 1595 | 524 |
16 | 17 | 2412 | INDUSTRIAL AND MANUFACTURING ENGINEERING | 18968.0 | 12453.0 | 6515.0 | Engineering | 0.343473 | 183 | 15604 | ... | 2243 | 11326 | 699 | 0.042876 | 57000 | 37900 | 67000 | 8306 | 3235 | 640 |
17 | 18 | 2400 | GENERAL ENGINEERING | 61152.0 | 45683.0 | 15469.0 | Engineering | 0.252960 | 425 | 44931 | ... | 7199 | 33540 | 2859 | 0.059824 | 56000 | 36000 | 69000 | 26898 | 11734 | 3192 |
18 | 19 | 2403 | ARCHITECTURAL ENGINEERING | 2825.0 | 1835.0 | 990.0 | Engineering | 0.350442 | 26 | 2575 | ... | 343 | 1848 | 170 | 0.061931 | 54000 | 38000 | 65000 | 1665 | 649 | 137 |
19 | 20 | 3201 | COURT REPORTING | 1148.0 | 877.0 | 271.0 | Law & Public Policy | 0.236063 | 14 | 930 | ... | 223 | 808 | 11 | 0.011690 | 54000 | 50000 | 54000 | 402 | 528 | 144 |
20 | 21 | 2102 | COMPUTER SCIENCE | 128319.0 | 99743.0 | 28576.0 | Computers & Mathematics | 0.222695 | 1196 | 102087 | ... | 18726 | 70932 | 6884 | 0.063173 | 53000 | 39000 | 70000 | 68622 | 25667 | 5144 |
21 | 22 | 1104 | FOOD SCIENCE | NaN | NaN | NaN | Agriculture & Natural Resources | NaN | 36 | 3149 | ... | 1121 | 1735 | 338 | 0.096931 | 53000 | 32000 | 70000 | 1183 | 1274 | 485 |
22 | 23 | 2502 | ELECTRICAL ENGINEERING TECHNOLOGY | 11565.0 | 8181.0 | 3384.0 | Engineering | 0.292607 | 97 | 8587 | ... | 1873 | 5681 | 824 | 0.087557 | 52000 | 35000 | 60000 | 5126 | 2686 | 696 |
23 | 24 | 2413 | MATERIALS ENGINEERING AND MATERIALS SCIENCE | 2993.0 | 2020.0 | 973.0 | Engineering | 0.325092 | 22 | 2449 | ... | 1040 | 1151 | 70 | 0.027789 | 52000 | 35000 | 62000 | 1911 | 305 | 70 |
24 | 25 | 6212 | MANAGEMENT INFORMATION SYSTEMS AND STATISTICS | 18713.0 | 13496.0 | 5217.0 | Business | 0.278790 | 278 | 16413 | ... | 2420 | 13017 | 1015 | 0.058240 | 51000 | 38000 | 60000 | 6342 | 5741 | 708 |
25 | 26 | 2406 | CIVIL ENGINEERING | 53153.0 | 41081.0 | 12072.0 | Engineering | 0.227118 | 565 | 43041 | ... | 10080 | 29196 | 3270 | 0.070610 | 50000 | 40000 | 60000 | 28526 | 9356 | 2899 |
26 | 27 | 5601 | CONSTRUCTION SERVICES | 18498.0 | 16820.0 | 1678.0 | Industrial Arts & Consumer Services | 0.090713 | 295 | 16318 | ... | 1751 | 12313 | 1042 | 0.060023 | 50000 | 36000 | 60000 | 3275 | 5351 | 703 |
27 | 28 | 6204 | OPERATIONS LOGISTICS AND E-COMMERCE | 11732.0 | 7921.0 | 3811.0 | Business | 0.324838 | 156 | 10027 | ... | 1183 | 7724 | 504 | 0.047859 | 50000 | 40000 | 60000 | 1466 | 3629 | 285 |
28 | 29 | 2499 | MISCELLANEOUS ENGINEERING | 9133.0 | 7398.0 | 1735.0 | Engineering | 0.189970 | 118 | 7428 | ... | 1662 | 5476 | 597 | 0.074393 | 50000 | 39000 | 65000 | 3445 | 2426 | 365 |
29 | 30 | 5402 | PUBLIC POLICY | 5978.0 | 2639.0 | 3339.0 | Law & Public Policy | 0.558548 | 55 | 4547 | ... | 1306 | 2776 | 670 | 0.128426 | 50000 | 35000 | 70000 | 1550 | 1871 | 340 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
143 | 144 | 1105 | PLANT SCIENCE AND AGRONOMY | 7416.0 | 4897.0 | 2519.0 | Agriculture & Natural Resources | 0.339671 | 110 | 6594 | ... | 1246 | 4522 | 314 | 0.045455 | 32000 | 22900 | 40000 | 2089 | 3545 | 1231 |
144 | 145 | 2308 | SCIENCE AND COMPUTER TEACHER EDUCATION | 6483.0 | 2049.0 | 4434.0 | Education | 0.683943 | 59 | 5362 | ... | 1227 | 3247 | 266 | 0.047264 | 32000 | 28000 | 39000 | 4214 | 1106 | 591 |
145 | 146 | 5200 | PSYCHOLOGY | 393735.0 | 86648.0 | 307087.0 | Psychology & Social Work | 0.779933 | 2584 | 307933 | ... | 115172 | 174438 | 28169 | 0.083811 | 31500 | 24000 | 41000 | 125148 | 141860 | 48207 |
146 | 147 | 6002 | MUSIC | 60633.0 | 29909.0 | 30724.0 | Arts | 0.506721 | 419 | 47662 | ... | 24943 | 21425 | 3918 | 0.075960 | 31000 | 22300 | 42000 | 13752 | 28786 | 9286 |
147 | 148 | 2306 | PHYSICAL AND HEALTH EDUCATION TEACHING | 28213.0 | 15670.0 | 12543.0 | Education | 0.444582 | 259 | 23794 | ... | 7230 | 13651 | 1920 | 0.074667 | 31000 | 24000 | 40000 | 12777 | 9328 | 2042 |
148 | 149 | 6006 | ART HISTORY AND CRITICISM | 21030.0 | 3240.0 | 17790.0 | Humanities & Liberal Arts | 0.845934 | 204 | 17579 | ... | 6140 | 9965 | 1128 | 0.060298 | 31000 | 23000 | 40000 | 5139 | 9738 | 3426 |
149 | 150 | 6000 | FINE ARTS | 74440.0 | 24786.0 | 49654.0 | Arts | 0.667034 | 623 | 59679 | ... | 23656 | 31877 | 5486 | 0.084186 | 30500 | 21000 | 41000 | 20792 | 32725 | 11880 |
150 | 151 | 2901 | FAMILY AND CONSUMER SCIENCES | 58001.0 | 5166.0 | 52835.0 | Industrial Arts & Consumer Services | 0.910933 | 518 | 46624 | ... | 15872 | 26906 | 3355 | 0.067128 | 30000 | 22900 | 40000 | 20985 | 20133 | 5248 |
151 | 152 | 5404 | SOCIAL WORK | 53552.0 | 5137.0 | 48415.0 | Psychology & Social Work | 0.904075 | 374 | 45038 | ... | 13481 | 27588 | 3329 | 0.068828 | 30000 | 25000 | 35000 | 27449 | 14416 | 4344 |
152 | 153 | 1103 | ANIMAL SCIENCES | 21573.0 | 5347.0 | 16226.0 | Agriculture & Natural Resources | 0.752144 | 255 | 17112 | ... | 5353 | 10824 | 917 | 0.050862 | 30000 | 22000 | 40000 | 5443 | 9571 | 2125 |
153 | 154 | 6003 | VISUAL AND PERFORMING ARTS | 16250.0 | 4133.0 | 12117.0 | Arts | 0.745662 | 132 | 12870 | ... | 6253 | 6322 | 1465 | 0.102197 | 30000 | 22000 | 40000 | 3849 | 7635 | 2840 |
154 | 155 | 2312 | TEACHER EDUCATION: MULTIPLE LEVELS | 14443.0 | 2734.0 | 11709.0 | Education | 0.810704 | 142 | 13076 | ... | 2214 | 8457 | 496 | 0.036546 | 30000 | 24000 | 37000 | 10766 | 1949 | 722 |
155 | 156 | 5299 | MISCELLANEOUS PSYCHOLOGY | 9628.0 | 1936.0 | 7692.0 | Psychology & Social Work | 0.798920 | 60 | 7653 | ... | 3221 | 3838 | 419 | 0.051908 | 30000 | 20800 | 40000 | 2960 | 3948 | 1650 |
156 | 157 | 5403 | HUMAN SERVICES AND COMMUNITY ORGANIZATION | 9374.0 | 885.0 | 8489.0 | Psychology & Social Work | 0.905590 | 89 | 8294 | ... | 2405 | 5061 | 326 | 0.037819 | 30000 | 24000 | 35000 | 2878 | 4595 | 724 |
157 | 158 | 3402 | HUMANITIES | 6652.0 | 2013.0 | 4639.0 | Humanities & Liberal Arts | 0.697384 | 49 | 5052 | ... | 2225 | 2661 | 372 | 0.068584 | 30000 | 20000 | 49000 | 1168 | 3354 | 1141 |
158 | 159 | 4901 | THEOLOGY AND RELIGIOUS VOCATIONS | 30207.0 | 18616.0 | 11591.0 | Humanities & Liberal Arts | 0.383719 | 310 | 24202 | ... | 8767 | 13944 | 1617 | 0.062628 | 29000 | 22000 | 38000 | 9927 | 12037 | 3304 |
159 | 160 | 6007 | STUDIO ARTS | 16977.0 | 4754.0 | 12223.0 | Arts | 0.719974 | 182 | 13908 | ... | 5673 | 7413 | 1368 | 0.089552 | 29000 | 19200 | 38300 | 3948 | 8707 | 3586 |
160 | 161 | 2201 | COSMETOLOGY SERVICES AND CULINARY ARTS | 10510.0 | 4364.0 | 6146.0 | Industrial Arts & Consumer Services | 0.584776 | 117 | 8650 | ... | 2064 | 5949 | 510 | 0.055677 | 29000 | 20000 | 36000 | 563 | 7384 | 3163 |
161 | 162 | 1199 | MISCELLANEOUS AGRICULTURE | 1488.0 | 404.0 | 1084.0 | Agriculture & Natural Resources | 0.728495 | 24 | 1290 | ... | 335 | 936 | 82 | 0.059767 | 29000 | 23000 | 42100 | 483 | 626 | 31 |
162 | 163 | 5502 | ANTHROPOLOGY AND ARCHEOLOGY | 38844.0 | 11376.0 | 27468.0 | Humanities & Liberal Arts | 0.707136 | 247 | 29633 | ... | 14515 | 13232 | 3395 | 0.102792 | 28000 | 20000 | 38000 | 9805 | 16693 | 6866 |
163 | 164 | 6102 | COMMUNICATION DISORDERS SCIENCES AND SERVICES | 38279.0 | 1225.0 | 37054.0 | Health | 0.967998 | 95 | 29763 | ... | 13862 | 14460 | 1487 | 0.047584 | 28000 | 20000 | 40000 | 19957 | 9404 | 5125 |
164 | 165 | 2307 | EARLY CHILDHOOD EDUCATION | 37589.0 | 1167.0 | 36422.0 | Education | 0.968954 | 342 | 32551 | ... | 7001 | 20748 | 1360 | 0.040105 | 28000 | 21000 | 35000 | 23515 | 7705 | 2868 |
165 | 166 | 2603 | OTHER FOREIGN LANGUAGES | 11204.0 | 3472.0 | 7732.0 | Humanities & Liberal Arts | 0.690111 | 56 | 7052 | ... | 3685 | 3214 | 846 | 0.107116 | 27500 | 22900 | 38000 | 2326 | 3703 | 1115 |
166 | 167 | 6001 | DRAMA AND THEATER ARTS | 43249.0 | 14440.0 | 28809.0 | Arts | 0.666119 | 357 | 36165 | ... | 15994 | 16891 | 3040 | 0.077541 | 27000 | 19200 | 35000 | 6994 | 25313 | 11068 |
167 | 168 | 3302 | COMPOSITION AND RHETORIC | 18953.0 | 7022.0 | 11931.0 | Humanities & Liberal Arts | 0.629505 | 151 | 15053 | ... | 6612 | 7832 | 1340 | 0.081742 | 27000 | 20000 | 35000 | 4855 | 8100 | 3466 |
168 | 169 | 3609 | ZOOLOGY | 8409.0 | 3050.0 | 5359.0 | Biology & Life Science | 0.637293 | 47 | 6259 | ... | 2190 | 3602 | 304 | 0.046320 | 26000 | 20000 | 39000 | 2771 | 2947 | 743 |
169 | 170 | 5201 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 522.0 | 2332.0 | Psychology & Social Work | 0.817099 | 7 | 2125 | ... | 572 | 1211 | 148 | 0.065112 | 25000 | 24000 | 34000 | 1488 | 615 | 82 |
170 | 171 | 5202 | CLINICAL PSYCHOLOGY | 2838.0 | 568.0 | 2270.0 | Psychology & Social Work | 0.799859 | 13 | 2101 | ... | 648 | 1293 | 368 | 0.149048 | 25000 | 25000 | 40000 | 986 | 870 | 622 |
171 | 172 | 5203 | COUNSELING PSYCHOLOGY | 4626.0 | 931.0 | 3695.0 | Psychology & Social Work | 0.798746 | 21 | 3777 | ... | 965 | 2738 | 214 | 0.053621 | 23400 | 19200 | 26000 | 2403 | 1245 | 308 |
172 | 173 | 3501 | LIBRARY SCIENCE | 1098.0 | 134.0 | 964.0 | Education | 0.877960 | 2 | 742 | ... | 237 | 410 | 87 | 0.104946 | 22000 | 20000 | 22000 | 288 | 338 | 192 |
173 rows × 21 columns
# column types
recent_grads.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Major                   173 non-null object
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
Major_category          173 non-null object
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB
# Let's see how many columns in the dataset are numeric and what they are
numeric_cols = recent_grads.select_dtypes(include=[np.number]).columns
#print (numeric_cols)
recent_grads[numeric_cols].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 19 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64
dtypes: float64(5), int64(14)
memory usage: 25.8 KB
# Summary statistics for the numeric columns
recent_grads.describe()
Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 173.000000 | 173.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 | 173.000000 |
mean | 87.000000 | 3879.815029 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 356.080925 | 31192.763006 | 26029.306358 | 8832.398844 | 19694.427746 | 2416.329480 | 0.068191 | 40151.445087 | 29501.445087 | 51494.219653 | 12322.635838 | 13284.497110 | 3859.017341 |
std | 50.084928 | 1687.753140 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 618.361022 | 50675.002241 | 42869.655092 | 14648.179473 | 33160.941514 | 4112.803148 | 0.030331 | 11470.181802 | 9166.005235 | 14906.279740 | 21299.868863 | 23789.655363 | 6944.998579 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.000000 | 2403.000000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 39.000000 | 3608.000000 | 3154.000000 | 1030.000000 | 2453.000000 | 304.000000 | 0.050306 | 33000.000000 | 24000.000000 | 42000.000000 | 1675.000000 | 1591.000000 | 340.000000 |
50% | 87.000000 | 3608.000000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 130.000000 | 11797.000000 | 10048.000000 | 3299.000000 | 7413.000000 | 893.000000 | 0.067961 | 36000.000000 | 27000.000000 | 47000.000000 | 4390.000000 | 4595.000000 | 1231.000000 |
75% | 130.000000 | 5503.000000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 338.000000 | 31433.000000 | 25147.000000 | 9948.000000 | 16891.000000 | 2393.000000 | 0.087557 | 45000.000000 | 33000.000000 | 60000.000000 | 14444.000000 | 11783.000000 | 3466.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.000000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
# Number of rows
raw_data_count = recent_grads.shape[0]
print (raw_data_count)
173
# Drop rows containing missing values
recent_grads = recent_grads.dropna(axis=0)
cleaned_data_count = recent_grads.shape[0]
print (cleaned_data_count)
172
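Before dropping rows, it can help to see which ones actually contain missing values. The following is a minimal sketch, assuming a small stand-in frame named `sample` (a hypothetical name) built from three rows copied from the table above; in the notebook the same calls would be applied to `recent_grads`.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table above; FOOD SCIENCE is the row with NaN values.
# `sample` is an illustrative stand-in for the notebook's `recent_grads`.
sample = pd.DataFrame({
    "Major": ["COMPUTER SCIENCE", "FOOD SCIENCE", "ELECTRICAL ENGINEERING TECHNOLOGY"],
    "Total": [128319.0, np.nan, 11565.0],
    "Men": [99743.0, np.nan, 8181.0],
    "Women": [28576.0, np.nan, 3384.0],
    "ShareWomen": [0.222695, np.nan, 0.292607],
    "Median": [53000, 53000, 52000],
})

# Count missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Select the rows that dropna(axis=0) would remove.
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing["Major"].tolist())

# Drop them and confirm how many rows remain.
cleaned = sample.dropna(axis=0)
print(len(sample), "->", len(cleaned))
```

Inspecting the affected rows first makes it clear that only one major is lost, and that it is lost because its `Total`, `Men`, `Women` and `ShareWomen` values are missing.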
recent_grads.describe()
Rank | Major_code | Total | Men | Women | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.00000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 | 172.000000 |
mean | 87.377907 | 3895.953488 | 39370.081395 | 16723.406977 | 22646.674419 | 0.522223 | 357.941860 | 31355.80814 | 26165.767442 | 8877.232558 | 19798.843023 | 2428.412791 | 0.068024 | 40076.744186 | 29486.918605 | 51386.627907 | 12387.401163 | 13354.325581 | 3878.633721 |
std | 49.983181 | 1679.240095 | 63483.491009 | 28122.433474 | 41057.330740 | 0.231205 | 619.680419 | 50777.42865 | 42957.122320 | 14679.038729 | 33229.227514 | 4121.730452 | 0.030340 | 11461.388773 | 9190.769927 | 14882.278650 | 21344.967522 | 23841.326605 | 6960.467621 |
min | 1.000000 | 1100.000000 | 124.000000 | 119.000000 | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 111.000000 | 0.000000 | 111.000000 | 0.000000 | 0.000000 | 22000.000000 | 18500.000000 | 22000.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 44.750000 | 2403.750000 | 4549.750000 | 2177.500000 | 1778.250000 | 0.336026 | 42.000000 | 3734.75000 | 3181.000000 | 1013.750000 | 2474.750000 | 299.500000 | 0.050261 | 33000.000000 | 24000.000000 | 41750.000000 | 1744.750000 | 1594.000000 | 336.750000 |
50% | 87.500000 | 3608.500000 | 15104.000000 | 5434.000000 | 8386.500000 | 0.534024 | 131.000000 | 12031.50000 | 10073.500000 | 3332.500000 | 7436.500000 | 905.000000 | 0.067544 | 36000.000000 | 27000.000000 | 47000.000000 | 4467.500000 | 4603.500000 | 1238.500000 |
75% | 130.250000 | 5503.250000 | 38909.750000 | 14631.000000 | 22553.750000 | 0.703299 | 339.000000 | 31701.25000 | 25447.250000 | 9981.000000 | 17674.750000 | 2397.000000 | 0.087247 | 45000.000000 | 33250.000000 | 58500.000000 | 14595.750000 | 11791.750000 | 3496.000000 |
max | 173.000000 | 6403.000000 | 393735.000000 | 173809.000000 | 307087.000000 | 0.968954 | 4212.000000 | 307933.00000 | 251540.000000 | 115172.000000 | 199897.000000 | 28169.000000 | 0.177226 | 110000.000000 | 95000.000000 | 125000.000000 | 151643.000000 | 148395.000000 | 48207.000000 |
Scatter plots of the values from two columns are useful for spotting possible correlations between them.
ax = recent_grads.plot(x='Sample_size', y='Median', kind='scatter', xlim=0)
ax.set_title('Sample_size vs. Median')
<matplotlib.text.Text at 0x7f945e118da0>
recent_grads.plot(x='Sample_size',y='Unemployment_rate',kind='scatter',title='Sample_size vs. Unemployment_rate', xlim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f946016e358>
recent_grads.plot(x='Full_time',y='Median',kind='scatter',title='Full_time vs. Median', xlim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f9460170cf8>
recent_grads.plot(x='ShareWomen',y='Unemployment_rate',kind='scatter',title='ShareWomen vs. Unemployment_rate', xlim=0, ylim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945e012278>
recent_grads.plot(x='Men',y='Median',kind='scatter',title='Men vs. Median', xlim=0, ylim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945e012668>
recent_grads.plot(x='Women',y='Median',kind='scatter',title='Women vs. Median', xlim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945df4d518>
There is no clear indication that Median earnings are connected with the number of Men or Women in a major. Similarly, there is no significant connection between Median earnings and full-time employment. Nor is there a connection between majors that are majority female and the unemployment rate.
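A visual impression from a scatter plot can be cross-checked with a correlation coefficient. The sketch below assumes a small stand-in frame named `sample` (hypothetical) holding seven rows copied from the tables above, so the printed values describe only this sample and not the full dataset. Pearson correlation also captures only linear association.

```python
import pandas as pd

# Seven rows copied from the tables above (an illustrative sample, not the full dataset).
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "CHEMICAL ENGINEERING", "MECHANICAL ENGINEERING",
              "COMPUTER SCIENCE", "PSYCHOLOGY", "MUSIC", "LIBRARY SCIENCE"],
    "Men": [2057.0, 21239.0, 80320.0, 99743.0, 86648.0, 29909.0, 134.0],
    "Women": [282.0, 11021.0, 10907.0, 28576.0, 307087.0, 30724.0, 964.0],
    "Median": [110000, 65000, 60000, 53000, 31500, 31000, 22000],
})

# Pearson correlation of each count column with Median earnings.
# Values near 0 suggest no linear relationship; values near +1 or -1 suggest a strong one.
corr_with_median = sample[["Men", "Women", "Median"]].corr()["Median"].drop("Median")
print(corr_with_median)
```

On the full data, the same expression could be run on `recent_grads` with whichever column pairs were plotted.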
Let's explore more questions that scatter plots lend themselves to.
# Sort by `Total` (number of people with the major) and keep the corresponding `Median` earnings
recent_grads_total_median = recent_grads[['Major', 'Total','Median']]
recent_grads_total_median
sorted_recent_grads_total_median = recent_grads_total_median.sort_values('Total',ascending=False)
sorted_recent_grads_total_median
Major | Total | Median | |
---|---|---|---|
145 | PSYCHOLOGY | 393735.0 | 31500 |
76 | BUSINESS MANAGEMENT AND ADMINISTRATION | 329927.0 | 38000 |
123 | BIOLOGY | 280709.0 | 33400 |
57 | GENERAL BUSINESS | 234590.0 | 40000 |
93 | COMMUNICATIONS | 213996.0 | 35000 |
34 | NURSING | 209394.0 | 48000 |
77 | MARKETING AND MARKETING RESEARCH | 205211.0 | 38000 |
40 | ACCOUNTING | 198633.0 | 45000 |
137 | ENGLISH LANGUAGE AND LITERATURE | 194673.0 | 32000 |
78 | POLITICAL SCIENCE AND GOVERNMENT | 182621.0 | 38000 |
35 | FINANCE | 174506.0 | 47000 |
138 | ELEMENTARY EDUCATION | 170862.0 | 32000 |
94 | CRIMINAL JUSTICE AND FIRE PROTECTION | 152824.0 | 35000 |
113 | GENERAL EDUCATION | 143718.0 | 34000 |
114 | HISTORY | 141951.0 | 34000 |
36 | ECONOMICS | 139247.0 | 47000 |
20 | COMPUTER SCIENCE | 128319.0 | 53000 |
139 | PHYSICAL FITNESS PARKS RECREATION AND LEISURE | 125074.0 | 32000 |
124 | SOCIOLOGY | 115433.0 | 33000 |
95 | COMMERCIAL ART AND GRAPHIC DESIGN | 103480.0 | 35000 |
8 | MECHANICAL ENGINEERING | 91227.0 | 60000 |
9 | ELECTRICAL ENGINEERING | 81527.0 | 60000 |
149 | FINE ARTS | 74440.0 | 30500 |
96 | JOURNALISM | 72619.0 | 35000 |
41 | MATHEMATICS | 72397.0 | 45000 |
140 | LIBERAL ARTS | 71369.0 | 32000 |
74 | CHEMISTRY | 66530.0 | 39000 |
97 | MULTI-DISCIPLINARY OR GENERAL SCIENCE | 62052.0 | 35000 |
17 | GENERAL ENGINEERING | 61152.0 | 56000 |
146 | MUSIC | 60633.0 | 31000 |
... | ... | ... | ... |
70 | INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY | 3014.0 | 40000 |
23 | MATERIALS ENGINEERING AND MATERIALS SCIENCE | 2993.0 | 52000 |
50 | ENGINEERING AND INDUSTRIAL MANAGEMENT | 2906.0 | 44000 |
169 | EDUCATIONAL PSYCHOLOGY | 2854.0 | 25000 |
170 | CLINICAL PSYCHOLOGY | 2838.0 | 25000 |
18 | ARCHITECTURAL ENGINEERING | 2825.0 | 54000 |
5 | NUCLEAR ENGINEERING | 2573.0 | 65000 |
71 | AGRICULTURAL ECONOMICS | 2439.0 | 40000 |
75 | ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLO... | 2435.0 | 38400 |
49 | OCEANOGRAPHY | 2418.0 | 44700 |
0 | PETROLEUM ENGINEERING | 2339.0 | 110000 |
39 | NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL ... | 2116.0 | 46000 |
90 | GEOSCIENCES | 1978.0 | 36000 |
7 | ASTRONOMY AND ASTROPHYSICS | 1792.0 | 62000 |
48 | PHARMACOLOGY | 1762.0 | 45000 |
161 | MISCELLANEOUS AGRICULTURE | 1488.0 | 29000 |
72 | PHYSICAL SCIENCES | 1436.0 | 40000 |
91 | SOCIAL PSYCHOLOGY | 1386.0 | 36000 |
83 | BOTANY | 1329.0 | 37000 |
3 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 70000 |
19 | COURT REPORTING | 1148.0 | 54000 |
172 | LIBRARY SCIENCE | 1098.0 | 22000 |
2 | METALLURGICAL ENGINEERING | 856.0 | 73000 |
55 | SCHOOL STUDENT COUNSELING | 818.0 | 41000 |
120 | EDUCATIONAL ADMINISTRATION AND SUPERVISION | 804.0 | 34000 |
1 | MINING AND MINERAL ENGINEERING | 756.0 | 75000 |
33 | GEOLOGICAL AND GEOPHYSICAL ENGINEERING | 720.0 | 50000 |
112 | SOIL SCIENCE | 685.0 | 35000 |
52 | MATHEMATICS AND COMPUTER SCIENCE | 609.0 | 42000 |
73 | MILITARY TECHNOLOGIES | 124.0 | 40000 |
172 rows × 3 columns
sorted_recent_grads_total_median.Median.describe()
count       172.000000
mean      40076.744186
std       11461.388773
min       22000.000000
25%       33000.000000
50%       36000.000000
75%       45000.000000
max      110000.000000
Name: Median, dtype: float64
# Display the entire rows with the `minimum` and `maximum` values of `Total`
print(sorted_recent_grads_total_median.loc[sorted_recent_grads_total_median.Total.idxmax()])
print('\n')
print(sorted_recent_grads_total_median.loc[sorted_recent_grads_total_median.Total.idxmin()])
Major     PSYCHOLOGY
Total         393735
Median         31500
Name: 145, dtype: object

Major     MILITARY TECHNOLOGIES
Total                       124
Median                    40000
Name: 73, dtype: object
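The lookup above relies on `idxmax()` and `idxmin()` returning index labels rather than row positions, which is why they are paired with `.loc`. A minimal sketch of that behavior, assuming a three-row stand-in frame named `frame` (hypothetical) that keeps the original index labels shown in the table above:

```python
import pandas as pd

# Three rows copied from the sorted table above, keeping their original index labels.
frame = pd.DataFrame(
    {
        "Major": ["PETROLEUM ENGINEERING", "MILITARY TECHNOLOGIES", "PSYCHOLOGY"],
        "Total": [2339.0, 124.0, 393735.0],
        "Median": [110000, 40000, 31500],
    },
    index=[0, 73, 145],
)

# Sorting reorders the rows but keeps each row's index label.
by_total = frame.sort_values("Total", ascending=False)

max_label = by_total["Total"].idxmax()   # label of the row with the largest Total
min_label = by_total["Total"].idxmin()   # label of the row with the smallest Total
print(max_label, by_total.loc[max_label, "Major"])   # 145 PSYCHOLOGY
print(min_label, by_total.loc[min_label, "Major"])   # 73 MILITARY TECHNOLOGIES

# Because the frame is sorted in descending order, position-based access gives the same rows.
print(by_total.iloc[0]["Major"], "|", by_total.iloc[-1]["Major"])
```

Passing those labels to `.iloc` instead of `.loc` would treat 145 and 73 as positions, which is not what is intended here.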
sorted_recent_grads_total_median.plot(x='Total',y='Median',kind='scatter',title='Popular Majors(Total) vs. Median', xlim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945df306a0>
The scatter plot does not show median earnings rising with Total, so more popular majors do not appear to pay more. The extremes of the data point the same way: PSYCHOLOGY, the most popular major, has median earnings of 31500, while MILITARY TECHNOLOGIES, the least popular major, has higher median earnings of 40000.
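Comparing a single pair of majors is only anecdotal, so a slightly broader check is to compare the ten most popular majors with the ten least popular ones. The sketch below assumes a stand-in frame named `sample` (hypothetical) built from the twenty `Total` and `Median` pairs visible at the top and bottom of the sorted table above; on the full data the same `nlargest` and `nsmallest` calls could be applied to `recent_grads`.

```python
import pandas as pd

# (Total, Median) pairs copied from the sorted table above:
# the ten most popular majors followed by the ten least popular ones.
totals = [393735, 329927, 280709, 234590, 213996, 209394, 205211, 198633, 194673, 182621,
          1148, 1098, 856, 818, 804, 756, 720, 685, 609, 124]
medians = [31500, 38000, 33400, 40000, 35000, 48000, 38000, 45000, 32000, 38000,
           54000, 22000, 73000, 41000, 34000, 75000, 50000, 35000, 42000, 40000]
sample = pd.DataFrame({"Total": totals, "Median": medians})

# Pick the ten largest and ten smallest majors by Total.
most_popular = sample.nlargest(10, "Total")
least_popular = sample.nsmallest(10, "Total")

# Compare the typical (median) earnings of the two groups.
print(most_popular["Median"].median())    # 38000.0
print(least_popular["Median"].median())   # 41500.0
```

For these twenty rows, the least popular majors have the higher typical earnings, which is consistent with the scatter plot showing no upward trend.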
# Sort by `ShareWomen` (women as a share of `Total`) and keep the corresponding `Median` earnings
recent_grads_ShareWomen_Median = recent_grads[['Major', 'ShareWomen','Median']]
sorted_recent_grads_ShareWomen_Median = recent_grads_ShareWomen_Median.sort_values('ShareWomen',ascending=False)
sorted_recent_grads_ShareWomen_Median
Major | ShareWomen | Median | |
---|---|---|---|
164 | EARLY CHILDHOOD EDUCATION | 0.968954 | 28000 |
163 | COMMUNICATION DISORDERS SCIENCES AND SERVICES | 0.967998 | 28000 |
51 | MEDICAL ASSISTING SERVICES | 0.927807 | 42000 |
138 | ELEMENTARY EDUCATION | 0.923745 | 32000 |
150 | FAMILY AND CONSUMER SCIENCES | 0.910933 | 30000 |
100 | SPECIAL NEEDS EDUCATION | 0.906677 | 35000 |
156 | HUMAN SERVICES AND COMMUNITY ORGANIZATION | 0.905590 | 30000 |
151 | SOCIAL WORK | 0.904075 | 30000 |
34 | NURSING | 0.896019 | 48000 |
88 | MISCELLANEOUS HEALTH MEDICAL PROFESSIONS | 0.881294 | 36000 |
172 | LIBRARY SCIENCE | 0.877960 | 22000 |
128 | LANGUAGE AND DRAMA EDUCATION | 0.877228 | 33000 |
103 | NUTRITION SCIENCES | 0.864456 | 35000 |
55 | SCHOOL STUDENT COUNSELING | 0.854523 | 41000 |
148 | ART HISTORY AND CRITICISM | 0.845934 | 31000 |
169 | EDUCATIONAL PSYCHOLOGY | 0.817099 | 25000 |
113 | GENERAL EDUCATION | 0.812877 | 34000 |
154 | TEACHER EDUCATION: MULTIPLE LEVELS | 0.810704 | 30000 |
170 | CLINICAL PSYCHOLOGY | 0.799859 | 25000 |
155 | MISCELLANEOUS PSYCHOLOGY | 0.798920 | 30000 |
171 | COUNSELING PSYCHOLOGY | 0.798746 | 23400 |
118 | COMMUNITY AND PUBLIC HEALTH | 0.792095 | 34000 |
145 | PSYCHOLOGY | 0.779933 | 31500 |
134 | GENERAL MEDICAL AND HEALTH SERVICES | 0.774577 | 32400 |
109 | MULTI/INTERDISCIPLINARY STUDIES | 0.770901 | 35000 |
104 | HEALTH AND MEDICAL ADMINISTRATIVE SERVICES | 0.764427 | 35000 |
131 | INTERDISCIPLINARY SOCIAL SCIENCES | 0.764320 | 33000 |
98 | ADVERTISING AND PUBLIC RELATIONS | 0.758060 | 35000 |
44 | MEDICAL TECHNOLOGIES TECHNICIANS | 0.753927 | 45000 |
152 | ANIMAL SCIENCES | 0.752144 | 30000 |
... | ... | ... | ... |
53 | COMPUTER PROGRAMMING AND DATA PROCESSING | 0.269194 | 41300 |
42 | COMPUTER AND INFORMATION SYSTEMS | 0.253583 | 45000 |
17 | GENERAL ENGINEERING | 0.252960 | 56000 |
31 | ENGINEERING TECHNOLOGIES | 0.251389 | 50000 |
38 | INDUSTRIAL PRODUCTION TECHNOLOGIES | 0.249190 | 46000 |
45 | INFORMATION SCIENCES | 0.244103 | 45000 |
19 | COURT REPORTING | 0.236063 | 54000 |
75 | ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLO... | 0.232444 | 38400 |
25 | CIVIL ENGINEERING | 0.227118 | 50000 |
20 | COMPUTER SCIENCE | 0.222695 | 53000 |
65 | MISCELLANEOUS ENGINEERING TECHNOLOGIES | 0.200023 | 40000 |
10 | COMPUTER ENGINEERING | 0.199413 | 60000 |
9 | ELECTRICAL ENGINEERING | 0.196450 | 60000 |
28 | MISCELLANEOUS ENGINEERING | 0.189970 | 50000 |
14 | ENGINEERING MECHANICS PHYSICS AND SCIENCE | 0.183985 | 58000 |
81 | COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY | 0.180883 | 37500 |
52 | MATHEMATICS AND COMPUTER SCIENCE | 0.178982 | 42000 |
50 | ENGINEERING AND INDUSTRIAL MANAGEMENT | 0.174123 | 44000 |
2 | METALLURGICAL ENGINEERING | 0.153037 | 73000 |
5 | NUCLEAR ENGINEERING | 0.144967 | 65000 |
11 | AEROSPACE ENGINEERING | 0.139793 | 60000 |
111 | FORESTRY | 0.125035 | 35000 |
106 | TRANSPORTATION SCIENCES AND TECHNOLOGIES | 0.124950 | 35000 |
0 | PETROLEUM ENGINEERING | 0.120564 | 110000 |
8 | MECHANICAL ENGINEERING | 0.119559 | 60000 |
3 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 0.107313 | 70000 |
1 | MINING AND MINERAL ENGINEERING | 0.101852 | 75000 |
26 | CONSTRUCTION SERVICES | 0.090713 | 50000 |
66 | MECHANICAL ENGINEERING RELATED TECHNOLOGIES | 0.077453 | 40000 |
73 | MILITARY TECHNOLOGIES | 0.000000 | 40000 |
172 rows × 3 columns
# Find rows with the `minimum` and `maximum` values of `ShareWomen`
print(sorted_recent_grads_ShareWomen_Median.loc[sorted_recent_grads_ShareWomen_Median.ShareWomen.idxmax()])
print('\n')
print(sorted_recent_grads_ShareWomen_Median.loc[sorted_recent_grads_ShareWomen_Median.ShareWomen.idxmin()])
Major         EARLY CHILDHOOD EDUCATION
ShareWomen                     0.968954
Median                            28000
Name: 164, dtype: object

Major         MILITARY TECHNOLOGIES
ShareWomen                        0
Median                        40000
Name: 73, dtype: object
sorted_recent_grads_ShareWomen_Median.plot(x='ShareWomen',y='Median',kind='scatter',title='Majority Female(ShareWomen) vs. Median', xlim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945df48cf8>
The scatter plot of ShareWomen against Median earnings suggests the answer is no: students who major in majority-female subjects do not make more money. The trend is downward, so the higher the share of women in a major, the lower the median earnings tend to be.
For example, those who majored in EARLY CHILDHOOD EDUCATION (about 97% women) had median earnings of 28,000, less than the 40,000 for MILITARY TECHNOLOGIES (no women).
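The downward trend can also be quantified with a correlation coefficient. The sketch below is only an illustration: it uses five rows copied from the outputs above rather than the full dataset, so the exact value is not meaningful, only the sign. On the real data the equivalent call would presumably be `recent_grads['ShareWomen'].corr(recent_grads['Median'])`.

```python
import pandas as pd

# Five rows copied from the tables above (a hand-picked sample, NOT the full dataset)
sample = pd.DataFrame({
    'Major': ['EARLY CHILDHOOD EDUCATION', 'COMPUTER ENGINEERING', 'FORESTRY',
              'PETROLEUM ENGINEERING', 'MILITARY TECHNOLOGIES'],
    'ShareWomen': [0.968954, 0.199413, 0.125035, 0.120564, 0.000000],
    'Median': [28000, 60000, 35000, 110000, 40000],
})

# Pearson correlation between the share of women and median earnings
r = sample['ShareWomen'].corr(sample['Median'])
print('Pearson r:', round(r, 3))
```

A negative coefficient agrees with the downward trend in the scatter plot, but it does not by itself say anything about causes.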
# Sort `Full_time` employed and get the corresponding `Median` earnings info
recent_grads_FT_median = recent_grads[['Full_time','Median']]
sorted_recent_grads_FT_median = recent_grads_FT_median.sort_values('Full_time',ascending=False)
sorted_recent_grads_FT_median
| | Full_time | Median |
---|---|---|
76 | 251540 | 38000 |
145 | 233205 | 31500 |
57 | 171385 | 40000 |
77 | 156668 | 38000 |
40 | 151967 | 45000 |
34 | 151191 | 48000 |
93 | 147335 | 35000 |
123 | 144512 | 33400 |
35 | 137921 | 47000 |
138 | 123177 | 32000 |
78 | 117709 | 38000 |
137 | 114386 | 32000 |
94 | 109970 | 35000 |
113 | 98408 | 34000 |
36 | 96567 | 47000 |
20 | 91485 | 53000 |
114 | 84681 | 34000 |
139 | 77428 | 32000 |
124 | 73475 | 33000 |
8 | 71298 | 60000 |
95 | 67448 | 35000 |
9 | 55450 | 60000 |
96 | 51411 | 35000 |
41 | 46399 | 45000 |
140 | 43401 | 32000 |
149 | 42764 | 30500 |
17 | 41235 | 56000 |
74 | 39509 | 39000 |
98 | 38815 | 35000 |
25 | 38302 | 50000 |
... | ... | ... |
32 | 2049 | 50000 |
5 | 2038 | 65000 |
50 | 1992 | 44000 |
49 | 1931 | 44700 |
0 | 1849 | 110000 |
169 | 1848 | 25000 |
71 | 1819 | 40000 |
67 | 1787 | 40000 |
170 | 1724 | 25000 |
23 | 1658 | 52000 |
70 | 1644 | 40000 |
39 | 1392 | 46000 |
90 | 1264 | 36000 |
161 | 1098 | 29000 |
7 | 1085 | 62000 |
3 | 1069 | 70000 |
83 | 946 | 37000 |
91 | 828 | 36000 |
19 | 808 | 54000 |
72 | 768 | 40000 |
120 | 733 | 34000 |
48 | 657 | 45000 |
55 | 595 | 41000 |
172 | 593 | 22000 |
52 | 584 | 42000 |
2 | 558 | 73000 |
1 | 556 | 75000 |
33 | 524 | 50000 |
112 | 488 | 35000 |
73 | 111 | 40000 |
172 rows × 2 columns
# Find rows with the `minimum` and `maximum` values of `Full_time` and their `Median` earnings
print(sorted_recent_grads_FT_median.loc[sorted_recent_grads_FT_median.Full_time.idxmax()])
print('\n')
print(sorted_recent_grads_FT_median.loc[sorted_recent_grads_FT_median.Full_time.idxmin()])
Full_time    251540
Median        38000
Name: 76, dtype: int64

Full_time      111
Median       40000
Name: 73, dtype: int64
sorted_recent_grads_FT_median.plot(x='Full_time',y='Median', kind='scatter',title='Number of FT Employees vs. Median',xlim=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945de61550>
Most median earnings fall in the 20,000-80,000 range, and those points cluster at the low end of the full-time employee counts. As the number of full-time employees increases, median salaries do not rise. The sorted data above shows the same thing: the major with the most full-time employees (251,540) has a median of 38,000, while the one with the fewest (111) has a median of 40,000.
Histograms are useful here because they show the distribution of values in a column. A histogram groups values into bins and counts the values in each bin, which lets us visually estimate the percentage of values that fall into a range of bins.
recent_grads['Sample_size'].plot.hist(bins=18)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945dddd048>
# Let's create the same number of bins (18), with the column's min and max values in mind.
# Use bracket notation: pandas does not create a column through a new attribute name.
recent_grads['Sample_size_bins'] = pd.cut(x=recent_grads['Sample_size'], bins=[0,250,500,750,1000,1250,1500,1750,2000,2250,2500,2750,3000,3250,3500,3750,4000,4250,4500])
# This gets us the count of values in each bin
recent_grads.Sample_size_bins.value_counts()
(0, 250]        118
(250, 500]       23
(500, 750]        9
(1000, 1250]      6
(1250, 1500]      4
(2500, 2750]      3
(1500, 1750]      2
(750, 1000]       2
(2000, 2250]      2
(2250, 2500]      2
(4000, 4250]      1
(1750, 2000]      0
(2750, 3000]      0
(3000, 3250]      0
(3250, 3500]      0
(3500, 3750]      0
(3750, 4000]      0
(4250, 4500]      0
Name: Sample_size, dtype: int64
I split the Sample_size column into 18 bins and stored the result in a new column, Sample_size_bins (I based the bin edges on the min and max values found in the summary statistics of the numeric columns).
Each value from the Sample_size column is assigned to its bin, so the value counts of the new column give the number of values in each bin, which is what the histogram computes as well. This helps back up observations from the histogram, which uses the same number of bins.
As the histogram shows, the sample sizes are concentrated in the leftmost bins: 118 of the 172 majors (about 69%) have a sample size of 250 or less.
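To make the link between `pd.cut` and a histogram concrete, here is a minimal sketch on a handful of made-up values (the `values` and `edges` below are hypothetical, not taken from the dataset). It counts the values per bin once with `pd.cut` plus `value_counts`, and once with `np.histogram`, which is what a histogram plot computes under the hood.

```python
import numpy as np
import pandas as pd

# Hypothetical sample sizes and bin edges, for illustration only
values = pd.Series([5, 120, 260, 300, 480, 700, 1100])
edges = [0, 250, 500, 750, 1000, 1250]

# Bin with pd.cut, then count the values per bin in bin order
binned = pd.cut(values, bins=edges)
cut_counts = binned.value_counts().sort_index().tolist()

# Count with np.histogram using the same edges
hist_counts, _ = np.histogram(values, bins=edges)
hist_counts = hist_counts.tolist()

print(cut_counts)
print(hist_counts)
```

One caveat: `pd.cut` intervals are closed on the right by default, while `np.histogram` bins are closed on the left (except the last one), so the two counts can differ for values that sit exactly on a bin edge.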
Let's do the same for other columns.
recent_grads.Median.plot.hist(bins=12)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d01ee10>
recent_grads['Median_bins'] = pd.cut(x=recent_grads.Median, bins=[20000,28000,36000,44000,52000,60000,68000,76000,84000,92000,100000,108000,116000])
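Apart from one outlier, median earnings for recent grads do not exceed 80K. The outlier sits in the 99K-119K bin: the 110K median of PETROLEUM ENGINEERING. About 58% of the Majors have median earnings in the lower 20K-40K range, about 34% fall between 40K and 60K, and only about 8% are in the 58K-79K bin.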
recent_grads.Median_bins.value_counts()
(28000, 36000]      76
(36000, 44000]      36
(44000, 52000]      28
(52000, 60000]      13
(20000, 28000]      11
(60000, 68000]       4
(68000, 76000]       3
(108000, 116000]     1
(100000, 108000]     0
(92000, 100000]      0
(84000, 92000]       0
(76000, 84000]       0
Name: Median, dtype: int64
recent_grads.Employed.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f945cf27630>
recent_grads['Employed_bins'] = pd.cut(x=recent_grads.Employed, bins=[0,30000,60000,90000,120000,150000,180000,210000,240000,270000,300000,330000])
recent_grads.Employed_bins.value_counts()
(0, 30000]          126
(30000, 60000]       22
(90000, 120000]       6
(120000, 150000]      5
(60000, 90000]        4
(180000, 210000]      3
(150000, 180000]      3
(300000, 330000]      1
(270000, 300000]      1
(240000, 270000]      0
(210000, 240000]      0
Name: Employed, dtype: int64
The number employed ranges from 0 to a little over 300K. Over 70% of the values (126 majors) fall within the 0-30K bin.
recent_grads.Full_time.plot.hist(bins=21)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945dc1b390>
recent_grads['Full_time_bins'] = pd.cut(x=recent_grads.Full_time, bins=[100,12100,24100,36100,48100,60100,72100,84100,96100,108100,120100,132100,144100,156100,168100,180100,192100,204100,216100,228100,240100,252100])
recent_grads.Full_time_bins.value_counts()
(100, 12100]        99
(12100, 24100]      29
(24100, 36100]      12
(36100, 48100]       9
(144100, 156100]     4
(108100, 120100]     3
(48100, 60100]       2
(60100, 72100]       2
(72100, 84100]       2
(84100, 96100]       2
(96100, 108100]      2
(240100, 252100]     1
(228100, 240100]     1
(132100, 144100]     1
(156100, 168100]     1
(168100, 180100]     1
(120100, 132100]     1
(180100, 192100]     0
(192100, 204100]     0
(204100, 216100]     0
(216100, 228100]     0
Name: Full_time, dtype: int64
Around 87% of the Majors (149 of 172) have fewer than about 48K full-time employees.
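The percentage above can be reproduced from the bin counts with a cumulative sum. This is a small sketch that re-types the counts from the `Full_time_bins` output, reordered by bin edge; the variable names are illustrative only.

```python
import pandas as pd

# Counts per Full_time bin, copied from the output above and ordered by bin edge
# (first bin is (100, 12100], last bin is (240100, 252100])
counts = pd.Series([99, 29, 12, 9, 2, 2, 2, 2, 2, 3, 1, 1, 4, 1, 1, 0, 0, 0, 0, 1, 1])

# Running share of majors at or below each bin's upper edge
cum_share = counts.cumsum() / counts.sum()

# Share of majors in the first four bins, i.e. up to 48,100 full-time employees
share_up_to_48100 = cum_share.iloc[3]
print(round(share_up_to_48100, 3))
```

On the DataFrame itself, something like `(recent_grads.Full_time <= 48100).mean()` should give the same share without any binning.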
recent_grads.Men.plot.hist(bins=5)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945ce540f0>
recent_grads['Men_bins'] = pd.cut(x=recent_grads.Men,bins=[0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,110000,120000,130000,140000,150000,160000,170000,180000,190000])
recent_grads.Men_bins.value_counts()
(0, 10000]          111
(10000, 20000]       23
(20000, 30000]       13
(30000, 40000]        6
(80000, 90000]        4
(70000, 80000]        3
(90000, 100000]       3
(110000, 120000]      2
(60000, 70000]        2
(40000, 50000]        2
(170000, 180000]      1
(130000, 140000]      1
(50000, 60000]        1
(100000, 110000]      0
(120000, 130000]      0
(140000, 150000]      0
(150000, 160000]      0
(160000, 170000]      0
(180000, 190000]      0
Name: Men, dtype: int64
recent_grads.Women.plot.hist(bins=5)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945cec2898>
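With 5 bins, most majors fall in the first bin of each histogram: under roughly 40K men and under roughly 60K women. The wider first bin for women hints that there are more women graduates than men, which the column totals confirm.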
print('Women:', recent_grads.Women.sum())
print('Men:', recent_grads.Men.sum())
Women: 3895228.0
Men: 2876426.0
recent_grads.ShareWomen.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f945cd529b0>
Only three Majors have less than 10% women. Almost 56% of the Majors are predominantly (over 50%) women. About 1.3% of the Majors are upwards of 80% women.
recent_grads.Unemployment_rate.plot.hist(bins=9)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945cd41cf8>
Around 70% of Majors have unemployment rate between 4-10%.
Around 16% of Majors have unemployment rate below 4%
Around 10% of Majors have unemployment rate between 10-12%
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f945d958470>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d892908>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f945d862400>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d81d6d8>]], dtype=object)
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f945d8a18d0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d4c06a0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d6c3198>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f945d67d4e0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d64c198>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d6070f0>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f945d5d2208>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d588ef0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f945d55b160>]], dtype=object)
Bar plots are convenient for categorical data, and 10 bars is not too many for one plot. Each bar's length is proportional to the value it represents, which makes it easy to spot the categories with the smallest or largest values.
recent_grads[:10].plot.barh(x='Major',y='ShareWomen', title='Percentages of Women in top 10 Majors',legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d754b38>
recent_grads[-10:].plot.barh(x='Major',y='ShareWomen', title='Percentages of Women in bottom 10 Majors',legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d34ec88>
Almost all top ranked Majors are predominantly men. The only top ranked Major in which Women are a little over 50% is ASTRONOMY AND ASTROPHYSICS.
All of the lowest ranked Majors have at least 60% women. 3 Majors have 80-100% women:
- COMMUNICATION DISORDERS SCIENCES AND SERVICES
- EARLY CHILDHOOD EDUCATION
- LIBRARY SCIENCE
recent_grads[:10].plot.barh(x='Major',y='Unemployment_rate', title='Unemployment rate in top 10 Majors',legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d348080>
recent_grads[-10:].plot.barh(x='Major',y='Unemployment_rate', title='Unemployment rate in bottom 10 Majors',legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d2569b0>
Among the top ranked Majors, these have the highest unemployment rates, ranging between 10-18%:
NUCLEAR ENGINEERING
ACTUARIAL SCIENCE
MINING AND MINERAL ENGINEERING
An interesting observation is that NUCLEAR ENGINEERING has more than 80% men and has the highest unemployment rate across the top and bottom ranked majors.
Among the lowest 10 ranked Majors, these have unemployment rates over 10%:
CLINICAL PSYCHOLOGY
OTHER FOREIGN LANGUAGES
LIBRARY SCIENCE
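The top 10 versus bottom 10 comparisons above could also be summarized numerically. The helper below, `compare_top_bottom`, is a hypothetical name introduced for this sketch, and the `toy` frame holds made-up values; it assumes the rows are already sorted by rank, as they are in `recent_grads`.

```python
import pandas as pd

def compare_top_bottom(df, column, n=10):
    """Return the mean of `column` over the first and last `n` rows of `df`."""
    return pd.Series({
        'top': df[column].head(n).mean(),
        'bottom': df[column].tail(n).mean(),
    })

# Made-up values for illustration, already ordered by rank
toy = pd.DataFrame({
    'Rank': [1, 2, 3, 4, 5, 6],
    'ShareWomen': [0.10, 0.20, 0.30, 0.60, 0.70, 0.80],
})

summary = compare_top_bottom(toy, 'ShareWomen', n=3)
print(summary)
```

With the real data, a call such as `compare_top_bottom(recent_grads, 'Unemployment_rate')` would put one number next to each of the two bar plots.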
1. Compare the number of Men with the number of Women in each category of Majors and visualize the results with a grouped bar plot
# Aggregate to sum up men per Major Category
# Create a dictionary for count of men in each Major Category
major_men = {}
major_cats = recent_grads.Major_category.unique()
major_cats
array(['Engineering', 'Business', 'Physical Sciences', 'Law & Public Policy', 'Computers & Mathematics', 'Industrial Arts & Consumer Services', 'Arts', 'Health', 'Social Science', 'Biology & Life Science', 'Education', 'Agriculture & Natural Resources', 'Humanities & Liberal Arts', 'Psychology & Social Work', 'Communications & Journalism', 'Interdisciplinary'], dtype=object)
# Select rows belonging to specific Major category
for m in major_cats:
    mcat_rows = recent_grads[recent_grads['Major_category'] == m]
    # Calculate the sum of Men
    total_men = mcat_rows['Men'].sum()
    # Put it in the dictionary using Major category as key
    major_men[m] = total_men
major_men
{'Agriculture & Natural Resources': 40357.0, 'Arts': 134390.0, 'Biology & Life Science': 184919.0, 'Business': 667852.0, 'Communications & Journalism': 131921.0, 'Computers & Mathematics': 208725.0, 'Education': 103526.0, 'Engineering': 408307.0, 'Health': 75517.0, 'Humanities & Liberal Arts': 272846.0, 'Industrial Arts & Consumer Services': 103781.0, 'Interdisciplinary': 2817.0, 'Law & Public Policy': 91129.0, 'Physical Sciences': 95390.0, 'Psychology & Social Work': 98115.0, 'Social Science': 256834.0}
# Convert `major_men` dictionary to a series object; don't sort values
m_series = pd.Series(major_men)
# Create a dataframe from the series `m_series`
m_series_df = pd.DataFrame(m_series,columns = ['total_men'])
m_series_df
| | total_men |
---|---|
Agriculture & Natural Resources | 40357.0 |
Arts | 134390.0 |
Biology & Life Science | 184919.0 |
Business | 667852.0 |
Communications & Journalism | 131921.0 |
Computers & Mathematics | 208725.0 |
Education | 103526.0 |
Engineering | 408307.0 |
Health | 75517.0 |
Humanities & Liberal Arts | 272846.0 |
Industrial Arts & Consumer Services | 103781.0 |
Interdisciplinary | 2817.0 |
Law & Public Policy | 91129.0 |
Physical Sciences | 95390.0 |
Psychology & Social Work | 98115.0 |
Social Science | 256834.0 |
# Aggregate to sum up women per Major Category
# Create a dictionary for count of women in each Major Category
major_women = {}
# Select rows belonging to specific Major category
for m in major_cats:
    mcat_rows = recent_grads[recent_grads['Major_category'] == m]
    # Calculate the sum of Women
    total_women = mcat_rows['Women'].sum()
    # Put it in the dictionary using Major category as key
    major_women[m] = total_women
major_women
{'Agriculture & Natural Resources': 35263.0, 'Arts': 222740.0, 'Biology & Life Science': 268943.0, 'Business': 634524.0, 'Communications & Journalism': 260680.0, 'Computers & Mathematics': 90283.0, 'Education': 455603.0, 'Engineering': 129276.0, 'Health': 387713.0, 'Humanities & Liberal Arts': 440622.0, 'Industrial Arts & Consumer Services': 126011.0, 'Interdisciplinary': 9479.0, 'Law & Public Policy': 87978.0, 'Physical Sciences': 90089.0, 'Psychology & Social Work': 382892.0, 'Social Science': 273132.0}
# Create a series object from major_women dictionary
# Then, add it as a new column named `total_women` to the m_series_df DataFrame
w_series = pd.Series(major_women)
w_series
Agriculture & Natural Resources         35263.0
Arts                                   222740.0
Biology & Life Science                 268943.0
Business                               634524.0
Communications & Journalism            260680.0
Computers & Mathematics                 90283.0
Education                              455603.0
Engineering                            129276.0
Health                                 387713.0
Humanities & Liberal Arts              440622.0
Industrial Arts & Consumer Services    126011.0
Interdisciplinary                        9479.0
Law & Public Policy                     87978.0
Physical Sciences                       90089.0
Psychology & Social Work               382892.0
Social Science                         273132.0
dtype: float64
# Add the series as a new column to DataFrame `m_series_df`
m_series_df['total_women'] = w_series
m_series_df
| | total_men | total_women |
---|---|---|
Agriculture & Natural Resources | 40357.0 | 35263.0 |
Arts | 134390.0 | 222740.0 |
Biology & Life Science | 184919.0 | 268943.0 |
Business | 667852.0 | 634524.0 |
Communications & Journalism | 131921.0 | 260680.0 |
Computers & Mathematics | 208725.0 | 90283.0 |
Education | 103526.0 | 455603.0 |
Engineering | 408307.0 | 129276.0 |
Health | 75517.0 | 387713.0 |
Humanities & Liberal Arts | 272846.0 | 440622.0 |
Industrial Arts & Consumer Services | 103781.0 | 126011.0 |
Interdisciplinary | 2817.0 | 9479.0 |
Law & Public Policy | 91129.0 | 87978.0 |
Physical Sciences | 95390.0 | 90089.0 |
Psychology & Social Work | 98115.0 | 382892.0 |
Social Science | 256834.0 | 273132.0 |
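The two loops above can probably be collapsed into a single `groupby` aggregation. The sketch below runs on a made-up mini-frame (`toy`, hypothetical values) that only borrows the column names from `recent_grads`; it is meant as an alternative formulation, not as the method used in this notebook.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Education', 'Education'],
    'Men': [100.0, 50.0, 10.0, 20.0],
    'Women': [20.0, 10.0, 90.0, 60.0],
})

# Sum Men and Women per category in one step
totals = toy.groupby('Major_category')[['Men', 'Women']].sum()
totals.columns = ['total_men', 'total_women']
print(totals)
```

Applied to `recent_grads`, the same two lines should produce a frame equivalent to `m_series_df`, ready for `plot.barh`.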
# Plot the dataframe
m_series_df.plot.barh(figsize=(8,8),title='Men vs Women in Major categories')
<matplotlib.axes._subplots.AxesSubplot at 0x7f945cd445c0>
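Women outnumber men in 10 of the 16 Major categories. The female majority is largest in these categories:
Education
Health
Psychology & Social Work
Business is by far the most popular category for both men and women. The gender gap is comparatively small in these categories:
Business
Social Science
Physical Sciences
Law & Public Policy
Agriculture & Natural Resources
Men dominate these Major categories:
Engineering
Computers & Mathematics
Men also exceed Women (but not by as much as in the 2 categories noted above) in:
Business
Law & Public Policy
Physical Sciences
Agriculture & Natural Resources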
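The counts behind these observations can be checked directly from the per-category totals. The sketch below re-types the totals from the table above; the `gap` column (absolute difference divided by the category total) is one possible way to measure the gender gap and is an assumption of this example, not something defined in the notebook.

```python
import pandas as pd

# Per-category totals copied from the table above
men = {
    'Agriculture & Natural Resources': 40357.0, 'Arts': 134390.0,
    'Biology & Life Science': 184919.0, 'Business': 667852.0,
    'Communications & Journalism': 131921.0, 'Computers & Mathematics': 208725.0,
    'Education': 103526.0, 'Engineering': 408307.0, 'Health': 75517.0,
    'Humanities & Liberal Arts': 272846.0,
    'Industrial Arts & Consumer Services': 103781.0, 'Interdisciplinary': 2817.0,
    'Law & Public Policy': 91129.0, 'Physical Sciences': 95390.0,
    'Psychology & Social Work': 98115.0, 'Social Science': 256834.0,
}
women = {
    'Agriculture & Natural Resources': 35263.0, 'Arts': 222740.0,
    'Biology & Life Science': 268943.0, 'Business': 634524.0,
    'Communications & Journalism': 260680.0, 'Computers & Mathematics': 90283.0,
    'Education': 455603.0, 'Engineering': 129276.0, 'Health': 387713.0,
    'Humanities & Liberal Arts': 440622.0,
    'Industrial Arts & Consumer Services': 126011.0, 'Interdisciplinary': 9479.0,
    'Law & Public Policy': 87978.0, 'Physical Sciences': 90089.0,
    'Psychology & Social Work': 382892.0, 'Social Science': 273132.0,
}
totals = pd.DataFrame({'total_men': pd.Series(men), 'total_women': pd.Series(women)})

# Number of categories in which women outnumber men
n_female_majority = int((totals['total_women'] > totals['total_men']).sum())

# Relative gender gap per category (an assumed metric for this sketch)
totals['gap'] = ((totals['total_men'] - totals['total_women']).abs()
                 / (totals['total_men'] + totals['total_women']))
smallest_gap = totals['gap'].nsmallest(5).index.tolist()

print(n_female_majority)
print(smallest_gap)
```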
2. Explore the distributions of median salaries and unemployment rate and visualize the results with a box plot
recent_grads.Unemployment_rate.describe()
count    172.000000
mean       0.068024
std        0.030340
min        0.000000
25%        0.050261
50%        0.067544
75%        0.087247
max        0.177226
Name: Unemployment_rate, dtype: float64
recent_grads.boxplot(column='Unemployment_rate', vert=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d5e0828>
recent_grads['Median'].describe()
count       172.000000
mean      40076.744186
std       11461.388773
min       22000.000000
25%       33000.000000
50%       36000.000000
75%       45000.000000
max      110000.000000
Name: Median, dtype: float64
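The quartiles in this summary determine where a box plot draws its whiskers and which points it marks as outliers. The sketch below works through that arithmetic with the values copied from the `describe()` output, assuming the default whisker rule of 1.5 times the interquartile range; the variable names are illustrative.

```python
# Values copied from the describe() output for the Median column
q1, median, q3 = 33000.0, 36000.0, 45000.0
mean, minimum, maximum = 40076.744186, 22000.0, 110000.0

# Interquartile range and the 1.5 * IQR fences (assumed default whisker rule)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print('IQR:', iqr)
print('fences:', lower_fence, upper_fence)
print('max beyond upper fence:', maximum > upper_fence)
print('min beyond lower fence:', minimum < lower_fence)
print('mean above median:', mean > median)
```

Under that rule, any median salary above 63,000 is drawn as an individual point, which is why the 110,000 maximum shows up well to the right of the box, while nothing falls below the lower fence.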
recent_grads.boxplot(column='Median', vert=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945d1331d0>
The Median column confirms the right skew: the mean (about 40K) is greater than the median (36K).
Some of the scatter plots had dense points, and the same data can be looked at using a hexbin plot. It groups the points into hexagonal bins and colors each bin by its count, which can be more informative when points overlap.
recent_grads.plot.hexbin(x='ShareWomen', y='Unemployment_rate', title='Hexagonal bin plot for ShareWomen & Unemployment_rate',gridsize=15)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945bdff9e8>
recent_grads.plot.hexbin(x='Full_time',y='Median',title='Hexagonal bin plot for Full_time & Median',gridsize=15)
<matplotlib.axes._subplots.AxesSubplot at 0x7f945bdb2208>
The gender gap is smallest in these Major categories:
Business
Social Science
Physical Sciences
Law & Public Policy
Agriculture & Natural Resources
Most median salaries are in the lower 20K-40K range
Popularity of a Major does not translate to higher salaries
There is no clear connection between the share of women in a Major and its unemployment rate
ASTRONOMY AND ASTROPHYSICS stands out as the one top ranked Major that is predominantly female, with an unemployment rate under 3%
NUCLEAR ENGINEERING stands out as one of the top 10 ranked Majors that is predominantly male (more than 80% men) and has the highest unemployment rate across the top and bottom 10 ranked Majors
These Majors are not only among the lowest 10 ranked Majors, but also have unemployment rates over 10%:
CLINICAL PSYCHOLOGY
OTHER FOREIGN LANGUAGES
LIBRARY SCIENCE
Almost all top 10 ranked Majors are predominantly male; ASTRONOMY AND ASTROPHYSICS is the only exception
All bottom 10 ranked Majors are predominantly female
There are more women grads than men overall: about 3.9 million women versus about 2.9 million men