We will begin by introducing the `Series`, `DataFrame`, and `Index` classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.
A DataFrame is composed of one or more Series. The names of the Series form the column names, and the row labels form the Index.
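For example, here is a minimal, hypothetical sketch (the column names and values are made up for illustration, not taken from the meteorites data) showing how Series combine into a DataFrame:

import pandas as pd

# Each Series becomes a column: the dictionary keys become the column names,
# and the shared row labels become the Index of the DataFrame.
prices = pd.Series([1.0, 2.5, 3.0], index=['a', 'b', 'c'])
counts = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
example = pd.DataFrame({'price': prices, 'count': counts})
example.columns  # Index(['price', 'count'], dtype='object')
example.index    # Index(['a', 'b', 'c'], dtype='object')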
import pandas as pd
meteorites = pd.read_csv('../data/Meteorite_Landings.csv', nrows=5)
meteorites
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Aachen | 1 | Valid | L5 | 21 | Fell | 01/01/1880 12:00:00 AM | 50.77500 | 6.08333 | (50.775, 6.08333) |
1 | Aarhus | 2 | Valid | H6 | 720 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.18333, 10.23333) |
2 | Abee | 6 | Valid | EH4 | 107000 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.00000 | (54.21667, -113.0) |
3 | Acapulco | 10 | Valid | Acapulcoite | 1914 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.90000 | (16.88333, -99.9) |
4 | Achiras | 370 | Valid | L6 | 780 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95000 | (-33.16667, -64.95) |
Source: NASA's Open Data Portal
meteorites.name
0      Aachen
1      Aarhus
2        Abee
3    Acapulco
4     Achiras
Name: name, dtype: object
meteorites.columns
Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year', 'reclat', 'reclong', 'GeoLocation'], dtype='object')
meteorites.index
RangeIndex(start=0, stop=5, step=1)
import pandas as pd
meteorites = pd.read_csv('../data/Meteorite_Landings.csv')
Tip: There are many parameters to this function for handling some initial processing while reading in the file – be sure to check out the documentation.
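For example, a minimal sketch (the parameter choices below are illustrative, not taken from the original material) that reads only a subset of the columns and rows:

# Keep only a few columns and parse only the first 1,000 data rows
meteorites_preview = pd.read_csv(
    '../data/Meteorite_Landings.csv',
    usecols=['name', 'recclass', 'mass (g)', 'fall', 'year'],
    nrows=1_000,
)
meteorites_preview.head()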
Collect the data from NASA's Open Data Portal using the Socrata Open Data API (SODA) with the `requests` library:
import requests

# Request up to 50,000 records from the SODA endpoint
response = requests.get(
    'https://data.nasa.gov/docs/legacy/meteorite_landings/gh4g-9sfh.json',
    params={'$limit': 50_000}
)

if response.ok:
    payload = response.json()
else:
    print(f'Request was not successful and returned code: {response.status_code}.')
    payload = None
Create the DataFrame with the resulting payload:
import pandas as pd
df = pd.DataFrame(payload)
df.head(3)
name | id | nametype | recclass | mass | fall | year | reclat | reclong | geolocation | :@computed_region_cbhk_fwbd | :@computed_region_nnqa_25f4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aachen | 1 | Valid | L5 | 21 | Fell | 1880-01-01T00:00:00.000 | 50.775000 | 6.083330 | {'latitude': '50.775', 'longitude': '6.08333'} | NaN | NaN |
1 | Aarhus | 2 | Valid | H6 | 720 | Fell | 1951-01-01T00:00:00.000 | 56.183330 | 10.233330 | {'latitude': '56.18333', 'longitude': '10.23333'} | NaN | NaN |
2 | Abee | 6 | Valid | EH4 | 107000 | Fell | 1952-01-01T00:00:00.000 | 54.216670 | -113.000000 | {'latitude': '54.21667', 'longitude': '-113.0'} | NaN | NaN |
Tip: `df.to_csv('data.csv')` writes this data to a new file called `data.csv`.
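If you don't want the row labels written out as an extra column, passing `index=False` is a common option (a small sketch, not shown in the original):

df.to_csv('data.csv', index=False)  # omit the RangeIndex when writing the file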
Now that we have some data, we need to perform an initial inspection of it. This gives us information on what the data looks like, how many rows/columns there are, and how much data we have.
Let's inspect the `meteorites` data.
meteorites.shape
(45716, 10)
meteorites.columns
Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year', 'reclat', 'reclong', 'GeoLocation'], dtype='object')
meteorites.dtypes
name            object
id               int64
nametype        object
recclass        object
mass (g)       float64
fall            object
year            object
reclat         float64
reclong        float64
GeoLocation     object
dtype: object
meteorites.head()
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 01/01/1880 12:00:00 AM | 50.77500 | 6.08333 | (50.775, 6.08333) |
1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.18333, 10.23333) |
2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.00000 | (54.21667, -113.0) |
3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.90000 | (16.88333, -99.9) |
4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95000 | (-33.16667, -64.95) |
Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:
meteorites.tail()
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
45711 | Zillah 002 | 31356 | Valid | Eucrite | 172.0 | Found | 01/01/1990 12:00:00 AM | 29.03700 | 17.01850 | (29.037, 17.0185) |
45712 | Zinder | 30409 | Valid | Pallasite, ungrouped | 46.0 | Found | 01/01/1999 12:00:00 AM | 13.78333 | 8.96667 | (13.78333, 8.96667) |
45713 | Zlin | 30410 | Valid | H4 | 3.3 | Found | 01/01/1939 12:00:00 AM | 49.25000 | 17.66667 | (49.25, 17.66667) |
45714 | Zubkovsky | 31357 | Valid | L6 | 2167.0 | Found | 01/01/2003 12:00:00 AM | 49.78917 | 41.50460 | (49.78917, 41.5046) |
45715 | Zulu Queen | 30414 | Valid | L3.7 | 200.0 | Found | 01/01/1976 12:00:00 AM | 33.98333 | -115.68333 | (33.98333, -115.68333) |
meteorites.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   name         45716 non-null  object
 1   id           45716 non-null  int64
 2   nametype     45716 non-null  object
 3   recclass     45716 non-null  object
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object
 6   year         45425 non-null  object
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object
dtypes: float64(3), int64(1), object(6)
memory usage: 3.5+ MB
Exercise 1.1: Create a DataFrame by reading in the `2019_Yellow_Taxi_Trip_Data.csv` file. Examine the first 5 rows.

# Complete this exercise in the workbook.ipynb file
# Click on `Exercise 1.1` above to open the workbook.ipynb file
# WARNING: if you complete the exercise here, your cell numbers
# for the rest of the training might not match the slides

Exercise 1.2

# Complete this exercise in the workbook.ipynb file
# Click on `Exercise 1.2` above to open the workbook.ipynb file
# WARNING: if you complete the exercise here, your cell numbers
# for the rest of the training might not match the slides
A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.
We can select columns as attributes if their names would be valid Python variables:
meteorites.name
0            Aachen
1            Aarhus
2              Abee
3          Acapulco
4           Achiras
            ...
45711    Zillah 002
45712        Zinder
45713          Zlin
45714     Zubkovsky
45715    Zulu Queen
Name: name, Length: 45716, dtype: object
If they aren't, we have to select them as keys. However, we can select multiple columns at once this way:
meteorites[['name', 'mass (g)']]
name | mass (g) | |
---|---|---|
0 | Aachen | 21.0 |
1 | Aarhus | 720.0 |
2 | Abee | 107000.0 |
3 | Acapulco | 1914.0 |
4 | Achiras | 780.0 |
... | ... | ... |
45711 | Zillah 002 | 172.0 |
45712 | Zinder | 46.0 |
45713 | Zlin | 3.3 |
45714 | Zubkovsky | 2167.0 |
45715 | Zulu Queen | 200.0 |
45716 rows × 2 columns
meteorites[100:104]
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
100 | Benton | 5026 | Valid | LL6 | 2840.0 | Fell | 01/01/1949 12:00:00 AM | 45.95000 | -67.55000 | (45.95, -67.55) |
101 | Berduc | 48975 | Valid | L6 | 270.0 | Fell | 01/01/2008 12:00:00 AM | -31.91000 | -58.32833 | (-31.91, -58.32833) |
102 | Béréba | 5028 | Valid | Eucrite-mmict | 18000.0 | Fell | 01/01/1924 12:00:00 AM | 11.65000 | -3.65000 | (11.65, -3.65) |
103 | Berlanguillas | 5029 | Valid | L6 | 1440.0 | Fell | 01/01/1811 12:00:00 AM | 41.68333 | -3.80000 | (41.68333, -3.8) |
We use `iloc[]` to select rows and columns by their position:
meteorites.iloc[100:104, [0, 3, 4, 6]]
name | recclass | mass (g) | year | |
---|---|---|---|---|
100 | Benton | LL6 | 2840.0 | 01/01/1949 12:00:00 AM |
101 | Berduc | L6 | 270.0 | 01/01/2008 12:00:00 AM |
102 | Béréba | Eucrite-mmict | 18000.0 | 01/01/1924 12:00:00 AM |
103 | Berlanguillas | L6 | 1440.0 | 01/01/1811 12:00:00 AM |
We use `loc[]` to select by name:
meteorites.loc[100:104, 'mass (g)':'year']
mass (g) | fall | year | |
---|---|---|---|
100 | 2840.0 | Fell | 01/01/1949 12:00:00 AM |
101 | 270.0 | Fell | 01/01/2008 12:00:00 AM |
102 | 18000.0 | Fell | 01/01/1924 12:00:00 AM |
103 | 1440.0 | Fell | 01/01/1811 12:00:00 AM |
104 | 960.0 | Fell | 01/01/2004 12:00:00 AM |
A Boolean mask is an array-like structure of Boolean values – it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`).
Here's an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (i.e., they were not observed falling):
(meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')
0        False
1        False
2        False
3        False
4        False
         ...
45711     True
45712    False
45713    False
45714     True
45715     True
Length: 45716, dtype: bool
Important: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
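For instance, a minimal sketch using `|` (or) and `~` (not) on the same data – the conditions here are chosen only to illustrate the syntax:

# Meteorites that either were observed falling or do not have a 'Valid' nametype
(meteorites.fall == 'Fell') | ~(meteorites.nametype == 'Valid')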
We can use a Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:
meteorites[(meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')]
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
29 | Allende | 2278 | Valid | CV3 | 2000000.0 | Fell | 01/01/1969 12:00:00 AM | 26.96667 | -105.31667 | (26.96667, -105.31667) |
419 | Jilin | 12171 | Valid | H5 | 4000000.0 | Fell | 01/01/1976 12:00:00 AM | 44.05000 | 126.16667 | (44.05, 126.16667) |
506 | Kunya-Urgench | 12379 | Valid | H5 | 1100000.0 | Fell | 01/01/1998 12:00:00 AM | 42.25000 | 59.20000 | (42.25, 59.2) |
707 | Norton County | 17922 | Valid | Aubrite | 1100000.0 | Fell | 01/01/1948 12:00:00 AM | 39.68333 | -99.86667 | (39.68333, -99.86667) |
920 | Sikhote-Alin | 23593 | Valid | Iron, IIAB | 23000000.0 | Fell | 01/01/1947 12:00:00 AM | 46.16000 | 134.65333 | (46.16, 134.65333) |
Tip: Boolean masks can be used with `loc[]` and `iloc[]`.
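For example, a sketch combining the mask from above with `loc[]` to also restrict the columns (the column choice is illustrative):

# Same mass and fall conditions, returning only the name and mass columns
meteorites.loc[
    (meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell'),
    ['name', 'mass (g)']
]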
An alternative to this is the `query()` method:
meteorites.query("`mass (g)` > 1e6 and fall == 'Fell'")
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
29 | Allende | 2278 | Valid | CV3 | 2000000.0 | Fell | 01/01/1969 12:00:00 AM | 26.96667 | -105.31667 | (26.96667, -105.31667) |
419 | Jilin | 12171 | Valid | H5 | 4000000.0 | Fell | 01/01/1976 12:00:00 AM | 44.05000 | 126.16667 | (44.05, 126.16667) |
506 | Kunya-Urgench | 12379 | Valid | H5 | 1100000.0 | Fell | 01/01/1998 12:00:00 AM | 42.25000 | 59.20000 | (42.25, 59.2) |
707 | Norton County | 17922 | Valid | Aubrite | 1100000.0 | Fell | 01/01/1948 12:00:00 AM | 39.68333 | -99.86667 | (39.68333, -99.86667) |
920 | Sikhote-Alin | 23593 | Valid | Iron, IIAB | 23000000.0 | Fell | 01/01/1947 12:00:00 AM | 46.16000 | 134.65333 | (46.16, 134.65333) |
Tip: Here, we can use both logical operators and bitwise operators.
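For example, the same query written with bitwise operators should return the same rows (a sketch):

meteorites.query("(`mass (g)` > 1e6) & (fall == 'Fell')")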
In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the `meteorites` data by calculating summary statistics.
meteorites.fall.value_counts()
fall
Found    44609
Fell      1107
Name: count, dtype: int64
The Meteoritical Society states that a "relict meteorite is dominantly composed of secondary minerals, and is thought to have once been a meteorite." What proportion of the data are relict meteorites? Let's verify these were all found versus observed falling:
meteorites.value_counts(subset=['nametype', 'fall'], normalize=True)
nametype  fall
Valid     Found    0.974145
          Fell     0.024215
Relict    Found    0.001641
Name: proportion, dtype: float64
meteorites['mass (g)'].mean()
np.float64(13278.078548601512)
Important: The mean isn't always the best measure of central tendency. If there are outliers in the distribution, the mean will be skewed. Here, the mean is being pulled higher by some very heavy meteorites – the distribution is right-skewed.
Taking a look at some quantiles at the extremes of the distribution shows that the mean is between the 95th and 99th percentile of the distribution, so it isn't a good measure of central tendency here:
meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])
0.01        0.44
0.05        1.10
0.50       32.60
0.95     4000.00
0.99    50600.00
Name: mass (g), dtype: float64
A better measure in this case is the median (50th percentile), since it is robust to outliers:
meteorites['mass (g)'].median()
np.float64(32.6)
meteorites['mass (g)'].max()
np.float64(60000000.0)
Let's extract the information on this meteorite:
meteorites.loc[meteorites['mass (g)'].idxmax()]
name                             Hoba
id                              11890
nametype                        Valid
recclass                    Iron, IVB
mass (g)                   60000000.0
fall                            Found
year           01/01/1920 12:00:00 AM
reclat                      -19.58333
reclong                      17.91667
GeoLocation     (-19.58333, 17.91667)
Name: 16392, dtype: object
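As an aside, `nlargest()` is an alternative way to pull the heaviest meteorite(s) as DataFrame rows rather than a Series (a sketch, not shown in the original):

meteorites.nlargest(1, 'mass (g)')  # the single row with the largest mass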
meteorites.recclass.nunique()
466
Some examples:
meteorites.recclass.unique()[:14]
array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L', 'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an'], dtype=object)
Note: All fields preceded with "rec" are the values recommended by The Meteoritical Society. Check out this Wikipedia article for some information on meteorite classes.
We can get common summary statistics for all columns at once. By default, this will only include numeric columns, but here, we will summarize everything together:
meteorites.describe(include='all')
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
count | 45716 | 45716.000000 | 45716 | 45716 | 4.558500e+04 | 45716 | 45425 | 38401.000000 | 38401.000000 | 38401 |
unique | 45716 | NaN | 2 | 466 | NaN | 2 | 266 | NaN | NaN | 17100 |
top | Aachen | NaN | Valid | L6 | NaN | Found | 01/01/2003 12:00:00 AM | NaN | NaN | (0.0, 0.0) |
freq | 1 | NaN | 45641 | 8285 | NaN | 44609 | 3323 | NaN | NaN | 6214 |
mean | NaN | 26889.735104 | NaN | NaN | 1.327808e+04 | NaN | NaN | -39.122580 | 61.074319 | NaN |
std | NaN | 16860.683030 | NaN | NaN | 5.749889e+05 | NaN | NaN | 46.378511 | 80.647298 | NaN |
min | NaN | 1.000000 | NaN | NaN | 0.000000e+00 | NaN | NaN | -87.366670 | -165.433330 | NaN |
25% | NaN | 12688.750000 | NaN | NaN | 7.200000e+00 | NaN | NaN | -76.714240 | 0.000000 | NaN |
50% | NaN | 24261.500000 | NaN | NaN | 3.260000e+01 | NaN | NaN | -71.500000 | 35.666670 | NaN |
75% | NaN | 40656.750000 | NaN | NaN | 2.026000e+02 | NaN | NaN | 0.000000 | 157.166670 | NaN |
max | NaN | 57458.000000 | NaN | NaN | 6.000000e+07 | NaN | NaN | 81.166670 | 354.473330 | NaN |
Important: `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass (g)` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).
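If the mixed output is hard to read, one option (a sketch, not part of the original) is to summarize the non-numeric columns separately:

meteorites.describe(include=['object'])  # categorical summary statistics only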
Exercise 1.3: Using the data in the `2019_Yellow_Taxi_Trip_Data.csv` file, calculate summary statistics for the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` columns.

# Complete this exercise in the workbook.ipynb file
# Click on `Exercise 1.3` above to open the workbook.ipynb file
Exercise 1.4: Isolate the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` for the longest trip by distance (`trip_distance`).

# Complete this exercise in the workbook.ipynb file
# Click on `Exercise 1.4` above to open the workbook.ipynb file