Authors: Yury Kashnitsky, and Maxim Keremet. Translated and edited by Artem Trunov, and Aditya Soni. mlcourse.ai is powered by OpenDataScience (ods.ai) © 2017—2021
Before working on the assignment, it's worth going through the corresponding course material:
For discussions, please stick to [ODS Slack](https://opendatascience.slack.com/), channel #mlcourse_ai_eng, pinned thread #a2_bonus. If you are sure that something is not 100% correct, please leave your feedback via the webform mentioned above ↑
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
import pandas as pd
# if seaborn is not yet installed, run `pip install seaborn` in terminal
import seaborn as sns
sns.set()
# sharper plots
%config InlineBackend.figure_format = 'retina'
Put the data into the `../../_static/data/assignment2` folder, or change the path below according to your location.

Consider the following terms we use:
Reading data into memory and creating a Pandas `DataFrame` object

(This may take a while, be patient.)
We are not going to read in the whole dataset. In order to reduce memory footprint, we instead load only needed columns and cast them to suitable data types.
dtype = {
"DayOfWeek": np.uint8,
"DayofMonth": np.uint8,
"Month": np.uint8,
"Cancelled": np.uint8,
"Year": np.uint16,
"FlightNum": np.uint16,
"Distance": np.uint16,
"UniqueCarrier": str,
"CancellationCode": str,
"Origin": str,
"Dest": str,
"ArrDelay": np.float16,
"DepDelay": np.float16,
"CarrierDelay": np.float16,
"WeatherDelay": np.float16,
"NASDelay": np.float16,
"SecurityDelay": np.float16,
"LateAircraftDelay": np.float16,
"DepTime": np.float16,
}
PATH_TO_DATA = Path("data")
# change the path if needed
flights_df = pd.read_csv(
PATH_TO_DATA / "flights_2008.csv.bz2", usecols=dtype.keys(), dtype=dtype
)
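To get a feel for how much the explicit `dtype` dict saves, here is a small self-contained sketch with a toy month-like column (illustrative numbers, not the flights data):

```python
import numpy as np
import pandas as pd

n = 1_000_000
# A toy month-like column: small integers stored with the default dtype
months = pd.Series(np.random.randint(1, 13, size=n))       # int64 on most platforms
months_small = months.astype(np.uint8)                     # downcast, as in the dtype dict above

print(months.memory_usage(deep=True) / 2**20, "MB")        # ~7.6 MB
print(months_small.memory_usage(deep=True) / 2**20, "MB")  # ~0.95 MB, roughly 8x smaller
```

The same logic applies to every column in the `dtype` dict: picking the narrowest type that still fits the data shrinks the in-memory footprint several-fold.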
Check the number of rows and columns and print column names.
print(flights_df.shape)
print(flights_df.columns)
(7009728, 19)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier',
       'FlightNum', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
       'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')
Print first 5 rows of the dataset.
flights_df.head(5)
Year | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | FlightNum | ArrDelay | DepDelay | Origin | Dest | Distance | Cancelled | CancellationCode | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008 | 1 | 3 | 4 | 2003.0 | WN | 335 | -14.0 | 8.0 | IAD | TPA | 810 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2008 | 1 | 3 | 4 | 754.0 | WN | 3231 | 2.0 | 19.0 | IAD | TPA | 810 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2008 | 1 | 3 | 4 | 628.0 | WN | 448 | 14.0 | 8.0 | IND | BWI | 515 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2008 | 1 | 3 | 4 | 926.0 | WN | 1746 | -6.0 | -4.0 | IND | BWI | 515 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2008 | 1 | 3 | 4 | 1829.0 | WN | 3920 | 34.0 | 34.0 | IND | BWI | 515 | 0 | NaN | 2.0 | 0.0 | 0.0 | 0.0 | 32.0 |
Transpose the frame to see all features at once.
flights_df.head(5).T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Year | 2008 | 2008 | 2008 | 2008 | 2008 |
Month | 1 | 1 | 1 | 1 | 1 |
DayofMonth | 3 | 3 | 3 | 3 | 3 |
DayOfWeek | 4 | 4 | 4 | 4 | 4 |
DepTime | 2003.0 | 754.0 | 628.0 | 926.0 | 1829.0 |
UniqueCarrier | WN | WN | WN | WN | WN |
FlightNum | 335 | 3231 | 448 | 1746 | 3920 |
ArrDelay | -14.0 | 2.0 | 14.0 | -6.0 | 34.0 |
DepDelay | 8.0 | 19.0 | 8.0 | -4.0 | 34.0 |
Origin | IAD | IAD | IND | IND | IND |
Dest | TPA | TPA | BWI | BWI | BWI |
Distance | 810 | 810 | 515 | 515 | 515 |
Cancelled | 0 | 0 | 0 | 0 | 0 |
CancellationCode | NaN | NaN | NaN | NaN | NaN |
CarrierDelay | NaN | NaN | NaN | NaN | 2.0 |
WeatherDelay | NaN | NaN | NaN | NaN | 0.0 |
NASDelay | NaN | NaN | NaN | NaN | 0.0 |
SecurityDelay | NaN | NaN | NaN | NaN | 0.0 |
LateAircraftDelay | NaN | NaN | NaN | NaN | 32.0 |
Examine data types of all features and total dataframe size in memory.
flights_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009728 entries, 0 to 7009727
Data columns (total 19 columns):
 #   Column             Dtype
---  ------             -----
 0   Year               uint16
 1   Month              uint8
 2   DayofMonth         uint8
 3   DayOfWeek          uint8
 4   DepTime            float16
 5   UniqueCarrier      object
 6   FlightNum          uint16
 7   ArrDelay           float16
 8   DepDelay           float16
 9   Origin             object
 10  Dest               object
 11  Distance           uint16
 12  Cancelled          uint8
 13  CancellationCode   object
 14  CarrierDelay       float16
 15  WeatherDelay       float16
 16  NASDelay           float16
 17  SecurityDelay      float16
 18  LateAircraftDelay  float16
dtypes: float16(8), object(4), uint16(3), uint8(4)
memory usage: 387.7+ MB
Get basic statistics of each feature.
flights_df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Year | 7009728.0 | 2008.000000 | 0.000000 | 2008.0 | 2008.0 | 2008.0 | 2008.0 | 2008.0 |
Month | 7009728.0 | 6.375130 | 3.406737 | 1.0 | 3.0 | 6.0 | 9.0 | 12.0 |
DayofMonth | 7009728.0 | 15.728015 | 8.797068 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
DayOfWeek | 7009728.0 | 3.924182 | 1.988259 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
DepTime | 6873482.0 | NaN | NaN | 1.0 | 928.0 | 1325.0 | 1728.0 | 2400.0 |
FlightNum | 7009728.0 | 2224.200105 | 1961.715999 | 1.0 | 622.0 | 1571.0 | 3518.0 | 9743.0 |
ArrDelay | 6855029.0 | NaN | NaN | -519.0 | -10.0 | -2.0 | 12.0 | 2460.0 |
DepDelay | 6873482.0 | NaN | NaN | -534.0 | -4.0 | -1.0 | 8.0 | 2468.0 |
Distance | 7009728.0 | 726.387029 | 562.101803 | 11.0 | 325.0 | 581.0 | 954.0 | 4962.0 |
Cancelled | 7009728.0 | 0.019606 | 0.138643 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
CarrierDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 16.0 | 2436.0 |
WeatherDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1352.0 |
NASDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 6.0 | 21.0 | 1357.0 |
SecurityDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 392.0 |
LateAircraftDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 26.0 | 1316.0 |
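Note that `mean` and `std` come out as `NaN` for several columns above: with `float16`, summing millions of values overflows half precision (whose maximum is about 65504). A minimal NumPy sketch of the effect:

```python
import numpy as np

x = np.full(100_000, 100.0, dtype=np.float16)   # true sum is 10,000,000
print(np.sum(x))                                # float16 accumulator overflows -> inf
print(np.sum(x, dtype=np.float64))              # 10000000.0 with a wider accumulator
```

This is the price of the aggressive downcasting done at load time; casting a column to `float32`/`float64` before aggregating restores sensible statistics.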
Count unique Carriers and plot their relative share of flights:
flights_df["UniqueCarrier"].nunique()
20
flights_df.groupby("UniqueCarrier").size().plot(kind="bar")
We can also group by one or several categories to compute aggregated statistics.
For example, let's find the top-3 flight numbers with the largest total distance traveled in 2008.
flights_df.groupby(["UniqueCarrier", "FlightNum"])["Distance"].sum().sort_values(
ascending=False
).iloc[:3]
UniqueCarrier  FlightNum
CO             15           1796244.0
               14           1796244.0
UA             52           1789722.0
Name: Distance, dtype: float64
Another way:
# string aggregation names avoid the deprecated NumPy-function form
flights_df.groupby(["UniqueCarrier", "FlightNum"]).agg(
    {"Distance": ["mean", "sum", "count"], "Cancelled": "sum"}
).sort_values(("Distance", "sum"), ascending=False).iloc[0:3]
Distance | Cancelled | ||||
---|---|---|---|---|---|
mean | sum | count | sum | ||
UniqueCarrier | FlightNum | ||||
CO | 15 | 4962.000000 | 1796244.0 | 362 | 0 |
14 | 4962.000000 | 1796244.0 | 362 | 0 | |
UA | 52 | 2465.181818 | 1789722.0 | 726 | 8 |
Number of flights by days of week and months:
pd.crosstab(flights_df["Month"], flights_df["DayOfWeek"])
DayOfWeek | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Month | |||||||
1 | 80807 | 97298 | 100080 | 102043 | 81940 | 67178 | 76419 |
2 | 81504 | 79700 | 80587 | 82158 | 102726 | 66462 | 76099 |
3 | 103210 | 81159 | 82307 | 82831 | 82936 | 86153 | 97494 |
4 | 82463 | 100785 | 102586 | 82799 | 82964 | 68304 | 78225 |
5 | 80626 | 79884 | 81264 | 102572 | 102878 | 84493 | 74576 |
6 | 104168 | 82160 | 82902 | 83617 | 83930 | 72322 | 99566 |
7 | 84095 | 103429 | 103315 | 105035 | 79349 | 72219 | 80489 |
8 | 82983 | 80895 | 81773 | 82625 | 103878 | 86155 | 93970 |
9 | 94300 | 91533 | 74057 | 75589 | 75881 | 58343 | 71205 |
10 | 75131 | 72195 | 91900 | 94123 | 93894 | 58168 | 70794 |
11 | 74214 | 72443 | 73653 | 68071 | 70484 | 76031 | 88376 |
12 | 92700 | 90568 | 85241 | 70761 | 74306 | 61708 | 69674 |
It can also be handy to color such tables in order to easily notice outliers:
plt.imshow(
pd.crosstab(flights_df["Month"], flights_df["DayOfWeek"]),
cmap="seismic",
interpolation="none",
)
Flight distance histogram:
flights_df.hist("Distance", bins=20)
Plotting the number of flights by date.
flights_df["Date"] = pd.to_datetime(
flights_df.rename(columns={"DayofMonth": "Day"})[["Year", "Month", "Day"]]
)
num_flights_by_date = flights_df.groupby("Date").size()
num_flights_by_date.plot()
Do you see a weekly pattern above? And below?
num_flights_by_date.rolling(window=7).mean().plot()
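The 7-day rolling mean above smooths out the weekly cycle. To see why, here is a tiny synthetic sketch (an artificial series with period 7, not the flights data):

```python
import pandas as pd

# 1000 flights on weekdays, 700 on two "weekend" days, repeated for 8 weeks
daily = pd.Series([1000, 1000, 1000, 1000, 1000, 700, 700] * 8)
smoothed = daily.rolling(window=7).mean()

# After the first 6 NaNs, every 7-day window covers exactly one full cycle,
# so the smoothed series is perfectly flat:
print(smoothed.dropna().round(2).unique())  # [914.29]
```

Any remaining variation after a period-matched rolling mean is therefore a real trend, not the weekly cycle.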
We'll need a new column in our dataset, the departure hour; let's create it.
As we can see, `DepTime` ranges from 1 to 2400 (it is given in the `hhmm` format, check the column description again). We'll take the departure hour to be `DepTime // 100` (divide by 100 and apply the `floor` function). However, this leaves us with both hour 0 and hour 24. Hour 24 sounds strange, so we'll set it to 0 instead (a typical imperfection of real data; you can check that it affects only 521 rows, which is not a big deal). The values of the new column `DepHour` will thus range from 0 to 23. There are some missing values; for now we won't fill them in, just ignore them.
flights_df["DepHour"] = flights_df["DepTime"] // 100
# assign back rather than calling replace(..., inplace=True) on a column slice
flights_df["DepHour"] = flights_df["DepHour"].replace(24, 0)
flights_df["DepHour"].describe()
count    6873482.0
mean           NaN
std            0.0
min            0.0
25%            9.0
50%           13.0
75%           17.0
max           23.0
Name: DepHour, dtype: float64
flights_df["UniqueCarrier"].nunique()
20
cancelled_df = flights_df["Cancelled"].value_counts()
cancelled_df[0] - cancelled_df[1]
6734860
target_df = flights_df.groupby("Dest").agg(
{"ArrDelay": "max", "DepDelay": "max"})
target_df.sort_values("DepDelay", ascending=False).iloc[:1]
ArrDelay | DepDelay | |
---|---|---|
Dest | ||
MSP | 2460.0 | 2468.0 |
target_df.sort_values("ArrDelay", ascending=False).iloc[:1]
ArrDelay | DepDelay | |
---|---|---|
Dest | ||
MSP | 2460.0 | 2468.0 |
flights_df.groupby("Cancelled")["UniqueCarrier"].value_counts()[1].iloc[:3]
# flights_df[flights_df["Cancelled"] == 1]["UniqueCarrier"].mode()
UniqueCarrier
MQ    18331
AA    17440
OO    12436
Name: UniqueCarrier, dtype: int64
flights_df.groupby("DepHour")["DepHour"].count(
).sort_values(ascending=False).iloc[:5]
DepHour
8.0     449224
6.0     446737
11.0    444702
7.0     436700
13.0    435917
Name: DepHour, dtype: int64
flights_df.groupby("DepHour")["Cancelled"].mean().sort_values().head()
DepHour
3.0     0.000000
13.0    0.000108
12.0    0.000125
9.0     0.000129
10.0    0.000132
Name: Cancelled, dtype: float64
cancelled_hours = set(
flights_df[flights_df["Cancelled"] == 1].groupby(
"DepHour")["DepHour"].count().index
)
all_hours = set(np.arange(0.0, 24.0))
all_hours.difference(cancelled_hours)
{3.0}
Hint: Consider only completed flights.
flights_df[flights_df["Cancelled"] == 0]["DepHour"].value_counts().sort_values(
ascending=False
).iloc[:3]
# flights_df[(flights_df["Cancelled"] == 0)].groupby(["DepHour"]).size().idxmax()
8.0     449145
6.0     446657
11.0    444643
Name: DepHour, dtype: int64
Hint: Consider only completed flights.
flights_df[flights_df["Cancelled"] == 0].groupby("DepHour")["DepDelay"].mean()
DepHour
0.0      70.750000
1.0     111.812500
2.0     122.562500
3.0      95.812500
4.0       3.375000
5.0      -4.253906
6.0      -1.629883
7.0       0.102539
8.0       1.760742
9.0       3.753906
10.0      5.460938
11.0      6.683594
12.0      7.902344
13.0      8.812500
14.0     10.351562
15.0     11.875000
16.0     11.882812
17.0     13.210938
18.0     15.343750
19.0     16.875000
20.0     22.781250
21.0     23.031250
22.0     38.031250
23.0     60.000000
Name: DepDelay, dtype: float16
ax = flights_df[(flights_df["Cancelled"] == 0) & (flights_df["UniqueCarrier"] == "MQ")][
"DepHour"
].hist(bins=24)
ax.set_xticks(np.arange(0, 24))
Which of the listed below is not in your top-10 list?
flights_df[flights_df["Cancelled"] == 0][
"UniqueCarrier"
].value_counts().sort_values(ascending=False).iloc[:10]
WN    1189365
AA     587445
OO     554723
MQ     472362
US     447007
DL     445118
UA     438974
XE     364518
NW     344746
CO     294753
Name: UniqueCarrier, dtype: int64
cancellation_map = {
"A": "Carrier",
"B": "Weather",
"C": "National Air System",
"D": "Security",
}
flights_df[flights_df["Cancelled"] == 1]["CancellationCode"].map(
cancellation_map
).value_counts().plot(kind="barh")
(Take a look at 'Origin' and 'Dest' features. Consider A->B and B->A directions as different routes)
routes = flights_df["Origin"] + "-" + flights_df["Dest"]
routes.value_counts().iloc[:3]
# flights_df[["Origin", "Dest"]].groupby(["Origin", "Dest"]).size().idxmax()
SFO-LAX    13788
LAX-SFO    13390
OGG-HNL    12383
dtype: int64
Hint: consider only positive delays
target_df = flights_df[flights_df["DepDelay"] > 0].copy()
target_df["Route"] = target_df["Origin"] + "-" + target_df["Dest"]
five_routes = target_df["Route"].value_counts().iloc[:5].index
print(five_routes)
# unfinished
Index(['LAX-SFO', 'DAL-HOU', 'SFO-LAX', 'ORD-LGA', 'HOU-DAL'], dtype='object')
sum(target_df[target_df["Route"].isin(five_routes)]["WeatherDelay"] > 0)
668
ax = flights_df["DepHour"].hist(bins=24)
ax.set_title("Number of flights by departure hour")
Choose all correct statements:
Hint: Look for official meteorological winter months for the Northern Hemisphere.
flights_df.columns
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier', 'FlightNum', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'Date', 'DepHour'], dtype='object')
flights_df["DayOfWeek"].value_counts().sort_index(ascending=False).plot(
kind="barh", title="Number of flights by the day of the week"
)
flights_df["Month"].value_counts().sort_index(ascending=False).plot(
    kind="barh", title="Number of flights by month"
)
Choose all correct statements:
target_df = flights_df[flights_df["Cancelled"] == 1].copy()
target_df["CancellationReason"] = target_df["CancellationCode"].map(
cancellation_map)
target_df.groupby(["Month", "CancellationReason"]).size().unstack(0).plot(
kind="barh", figsize=(5, 12)
)
Reminder on Cancellation codes:
A - Carrier
B - Weather
C - National Air System
D - Security
target_df.groupby(["CancellationReason", "Month"]).size().sort_values(
ascending=False
).iloc[:5]
CancellationReason  Month
Weather             12       10042
                    2        10034
                    3         7520
Carrier             4         7312
Weather             1         6711
dtype: int64
target_df[target_df["CancellationReason"] == "Carrier"][
"UniqueCarrier"
].value_counts().iloc[:3]
AA    9609
WN    6554
UA    6352
Name: UniqueCarrier, dtype: int64
(Boxplots can be helpful in this exercise, and it might also be a good idea to remove outliers to build nicer graphs. You can exclude delay values above the corresponding 0.95 percentile.)
target_df = flights_df.copy()
target_df = target_df[(target_df["ArrDelay"] >= 0) &
(target_df["DepDelay"] >= 0)]
flights_df.shape, target_df.shape
((7009728, 21), (2320755, 21))
percentiles = target_df[["ArrDelay", "DepDelay"]].quantile(q=0.95)
target_df = target_df[
(target_df["ArrDelay"] <= percentiles["ArrDelay"])
& (target_df["DepDelay"] <= percentiles["DepDelay"])
]
target_df.shape
(2188884, 21)
melted_df = target_df[["UniqueCarrier", "ArrDelay", "DepDelay"]].melt(
id_vars="UniqueCarrier"
)
sns.boxplot(data=melted_df, y="UniqueCarrier", x="value", hue="variable")
sns.set(rc={"figure.figsize": (20, 20)}, font_scale=1.5, style="whitegrid")
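The `melt` call reshapes the two wide delay columns into long `variable`/`value` format, which is what seaborn's `hue` argument expects. A minimal synthetic sketch of that reshape (toy rows, not the flights frame):

```python
import pandas as pd

wide = pd.DataFrame({
    "UniqueCarrier": ["AA", "WN"],
    "ArrDelay": [5.0, -2.0],
    "DepDelay": [7.0, 0.0],
})
tidy = wide.melt(id_vars="UniqueCarrier")
print(tidy)
# Four rows: one per (carrier, delay-type) pair, with columns
# UniqueCarrier / variable / value
```

Each original row yields one long-format row per melted column, so the boxplot can split on `variable` to show arrival and departure delays side by side per carrier.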