Authors: Yury Kashnitsky, and Maxim Keremet. Translated and edited by Artem Trunov, and Aditya Soni. mlcourse.ai is powered by OpenDataScience (ods.ai) © 2017—2021
Before working on the assignment, it's worth going through the corresponding course material:
For discussions, please stick to [ODS Slack](https://opendatascience.slack.com/), channel #mlcourse_ai_eng, pinned thread #a2_bonus. If you are sure that something is not 100% correct, please leave your feedback via the webform mentioned above ↑
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
import pandas as pd
# if seaborn is not yet installed, run `pip install seaborn` in terminal
import seaborn as sns
sns.set()
# sharper plots
%config InlineBackend.figure_format = 'retina'
Put the data into the `../../_static/data/assignment2` folder, or change the path below according to your location.

Consider the following terms we use:
Reading data into memory and creating a Pandas `DataFrame` object

(This may take a while, be patient.)
We are not going to read in the whole dataset. In order to reduce memory footprint, we instead load only needed columns and cast them to suitable data types.
dtype = {
"DayOfWeek": np.uint8,
"DayofMonth": np.uint8,
"Month": np.uint8,
"Cancelled": np.uint8,
"Year": np.uint16,
"FlightNum": np.uint16,
"Distance": np.uint16,
"UniqueCarrier": str,
"CancellationCode": str,
"Origin": str,
"Dest": str,
"ArrDelay": np.float16,
"DepDelay": np.float16,
"CarrierDelay": np.float16,
"WeatherDelay": np.float16,
"NASDelay": np.float16,
"SecurityDelay": np.float16,
"LateAircraftDelay": np.float16,
"DepTime": np.float16,
}
PATH_TO_DATA = Path("data")
# change the path if needed
flights_df = pd.read_csv(
PATH_TO_DATA / "flights_2008.csv.bz2", usecols=dtype.keys(), dtype=dtype
)
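To get a feel for how much the explicit `dtype` dict saves, here is a small self-contained sketch with a toy month-like column (illustrative numbers, not the flights data):

```python
import numpy as np
import pandas as pd

n = 1_000_000
# A toy month-like column: small integers stored with the default dtype
months = pd.Series(np.random.randint(1, 13, size=n))       # int64 on most platforms
months_small = months.astype(np.uint8)                     # downcast, as in the dtype dict above

print(months.memory_usage(deep=True) / 2**20, "MB")        # ~7.6 MB
print(months_small.memory_usage(deep=True) / 2**20, "MB")  # ~0.95 MB, roughly 8x smaller
```

The same logic applies to every column in the `dtype` dict: picking the narrowest type that still fits the data shrinks the in-memory footprint several-fold.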
Check the number of rows and columns and print column names.
print(flights_df.shape)
print(flights_df.columns)
(7009728, 19)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier',
       'FlightNum', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
       'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')
Print first 5 rows of the dataset.
flights_df.head(5)
Year | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | FlightNum | ArrDelay | DepDelay | Origin | Dest | Distance | Cancelled | CancellationCode | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008 | 1 | 3 | 4 | 2003.0 | WN | 335 | -14.0 | 8.0 | IAD | TPA | 810 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2008 | 1 | 3 | 4 | 754.0 | WN | 3231 | 2.0 | 19.0 | IAD | TPA | 810 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2008 | 1 | 3 | 4 | 628.0 | WN | 448 | 14.0 | 8.0 | IND | BWI | 515 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2008 | 1 | 3 | 4 | 926.0 | WN | 1746 | -6.0 | -4.0 | IND | BWI | 515 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2008 | 1 | 3 | 4 | 1829.0 | WN | 3920 | 34.0 | 34.0 | IND | BWI | 515 | 0 | NaN | 2.0 | 0.0 | 0.0 | 0.0 | 32.0 |
Transpose the frame to see all features at once.
flights_df.head(5).T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Year | 2008 | 2008 | 2008 | 2008 | 2008 |
Month | 1 | 1 | 1 | 1 | 1 |
DayofMonth | 3 | 3 | 3 | 3 | 3 |
DayOfWeek | 4 | 4 | 4 | 4 | 4 |
DepTime | 2003.0 | 754.0 | 628.0 | 926.0 | 1829.0 |
UniqueCarrier | WN | WN | WN | WN | WN |
FlightNum | 335 | 3231 | 448 | 1746 | 3920 |
ArrDelay | -14.0 | 2.0 | 14.0 | -6.0 | 34.0 |
DepDelay | 8.0 | 19.0 | 8.0 | -4.0 | 34.0 |
Origin | IAD | IAD | IND | IND | IND |
Dest | TPA | TPA | BWI | BWI | BWI |
Distance | 810 | 810 | 515 | 515 | 515 |
Cancelled | 0 | 0 | 0 | 0 | 0 |
CancellationCode | NaN | NaN | NaN | NaN | NaN |
CarrierDelay | NaN | NaN | NaN | NaN | 2.0 |
WeatherDelay | NaN | NaN | NaN | NaN | 0.0 |
NASDelay | NaN | NaN | NaN | NaN | 0.0 |
SecurityDelay | NaN | NaN | NaN | NaN | 0.0 |
LateAircraftDelay | NaN | NaN | NaN | NaN | 32.0 |
Examine data types of all features and total dataframe size in memory.
flights_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009728 entries, 0 to 7009727
Data columns (total 19 columns):
 #   Column             Dtype
---  ------             -----
 0   Year               uint16
 1   Month              uint8
 2   DayofMonth         uint8
 3   DayOfWeek          uint8
 4   DepTime            float16
 5   UniqueCarrier      object
 6   FlightNum          uint16
 7   ArrDelay           float16
 8   DepDelay           float16
 9   Origin             object
 10  Dest               object
 11  Distance           uint16
 12  Cancelled          uint8
 13  CancellationCode   object
 14  CarrierDelay       float16
 15  WeatherDelay       float16
 16  NASDelay           float16
 17  SecurityDelay      float16
 18  LateAircraftDelay  float16
dtypes: float16(8), object(4), uint16(3), uint8(4)
memory usage: 387.7+ MB
Get basic statistics of each feature.
flights_df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Year | 7009728.0 | 2008.000000 | 0.000000 | 2008.0 | 2008.0 | 2008.0 | 2008.0 | 2008.0 |
Month | 7009728.0 | 6.375130 | 3.406737 | 1.0 | 3.0 | 6.0 | 9.0 | 12.0 |
DayofMonth | 7009728.0 | 15.728015 | 8.797068 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
DayOfWeek | 7009728.0 | 3.924182 | 1.988259 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
DepTime | 6873482.0 | NaN | NaN | 1.0 | 928.0 | 1325.0 | 1728.0 | 2400.0 |
FlightNum | 7009728.0 | 2224.200105 | 1961.715999 | 1.0 | 622.0 | 1571.0 | 3518.0 | 9743.0 |
ArrDelay | 6855029.0 | NaN | NaN | -519.0 | -10.0 | -2.0 | 12.0 | 2460.0 |
DepDelay | 6873482.0 | NaN | NaN | -534.0 | -4.0 | -1.0 | 8.0 | 2468.0 |
Distance | 7009728.0 | 726.387029 | 562.101803 | 11.0 | 325.0 | 581.0 | 954.0 | 4962.0 |
Cancelled | 7009728.0 | 0.019606 | 0.138643 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
CarrierDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 16.0 | 2436.0 |
WeatherDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1352.0 |
NASDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 6.0 | 21.0 | 1357.0 |
SecurityDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 392.0 |
LateAircraftDelay | 1524735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 26.0 | 1316.0 |
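Note that `mean` and `std` come out as `NaN` for several columns above: with `float16`, summing millions of values overflows half precision (whose maximum is about 65504). A minimal NumPy sketch of the effect:

```python
import numpy as np

x = np.full(100_000, 100.0, dtype=np.float16)   # true sum is 10,000,000
print(np.sum(x))                                # float16 accumulator overflows -> inf
print(np.sum(x, dtype=np.float64))              # 10000000.0 with a wider accumulator
```

This is the price of the aggressive downcasting done at load time; casting a column to `float32`/`float64` before aggregating restores sensible statistics.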
Count unique Carriers and plot their relative share of flights:
flights_df["UniqueCarrier"].nunique()
20
flights_df.groupby("UniqueCarrier").size().plot(kind="bar")
We can also group by one or several categories to compute aggregated statistics.
For example, let's find the top-3 flight numbers with the largest total distance traveled in 2008.
flights_df.groupby(["UniqueCarrier", "FlightNum"])["Distance"].sum().sort_values(
ascending=False
).iloc[:3]
UniqueCarrier  FlightNum
CO             15           1796244.0
               14           1796244.0
UA             52           1789722.0
Name: Distance, dtype: float64
Another way:
# string aggregation names avoid the deprecated NumPy-function form
flights_df.groupby(["UniqueCarrier", "FlightNum"]).agg(
    {"Distance": ["mean", "sum", "count"], "Cancelled": "sum"}
).sort_values(("Distance", "sum"), ascending=False).iloc[0:3]
Distance | Cancelled | ||||
---|---|---|---|---|---|
mean | sum | count | sum | ||
UniqueCarrier | FlightNum | ||||
CO | 15 | 4962.000000 | 1796244.0 | 362 | 0 |
14 | 4962.000000 | 1796244.0 | 362 | 0 | |
UA | 52 | 2465.181818 | 1789722.0 | 726 | 8 |
Number of flights by days of week and months:
pd.crosstab(flights_df["Month"], flights_df["DayOfWeek"])
DayOfWeek | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Month | |||||||
1 | 80807 | 97298 | 100080 | 102043 | 81940 | 67178 | 76419 |
2 | 81504 | 79700 | 80587 | 82158 | 102726 | 66462 | 76099 |
3 | 103210 | 81159 | 82307 | 82831 | 82936 | 86153 | 97494 |
4 | 82463 | 100785 | 102586 | 82799 | 82964 | 68304 | 78225 |
5 | 80626 | 79884 | 81264 | 102572 | 102878 | 84493 | 74576 |
6 | 104168 | 82160 | 82902 | 83617 | 83930 | 72322 | 99566 |
7 | 84095 | 103429 | 103315 | 105035 | 79349 | 72219 | 80489 |
8 | 82983 | 80895 | 81773 | 82625 | 103878 | 86155 | 93970 |
9 | 94300 | 91533 | 74057 | 75589 | 75881 | 58343 | 71205 |
10 | 75131 | 72195 | 91900 | 94123 | 93894 | 58168 | 70794 |
11 | 74214 | 72443 | 73653 | 68071 | 70484 | 76031 | 88376 |
12 | 92700 | 90568 | 85241 | 70761 | 74306 | 61708 | 69674 |
It can also be handy to color such tables in order to easily notice outliers:
plt.imshow(
pd.crosstab(flights_df["Month"], flights_df["DayOfWeek"]),
cmap="seismic",
interpolation="none",
)
Flight distance histogram:
flights_df.hist("Distance", bins=20)
Plotting the number of flights by date.
flights_df["Date"] = pd.to_datetime(
flights_df.rename(columns={"DayofMonth": "Day"})[["Year", "Month", "Day"]]
)
num_flights_by_date = flights_df.groupby("Date").size()
num_flights_by_date.plot()
Do you see a weekly pattern above? And below?
num_flights_by_date.rolling(window=7).mean().plot()
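The 7-day rolling mean above smooths out the weekly cycle. To see why, here is a tiny synthetic sketch (an artificial series with period 7, not the flights data):

```python
import pandas as pd

# 1000 flights on weekdays, 700 on two "weekend" days, repeated for 8 weeks
daily = pd.Series([1000, 1000, 1000, 1000, 1000, 700, 700] * 8)
smoothed = daily.rolling(window=7).mean()

# After the first 6 NaNs, every 7-day window covers exactly one full cycle,
# so the smoothed series is perfectly flat:
print(smoothed.dropna().round(2).unique())  # [914.29]
```

Any remaining variation after a period-matched rolling mean is therefore a real trend, not the weekly cycle.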
We'll need a new column in our dataset, the departure hour; let's create it.
As we can see, `DepTime` ranges from 1 to 2400 (it is given in the `hhmm` format, check the column description again). We'll take the departure hour to be `DepTime // 100` (divide by 100 and apply the `floor` function). However, this leaves us with both hour 0 and hour 24. Hour 24 sounds strange, so we'll set it to 0 instead (a typical imperfection of real data; you can check that it affects only 521 rows, which is not a big deal). The values of the new column `DepHour` will thus range from 0 to 23. There are some missing values; for now we won't fill them in, just ignore them.
flights_df["DepHour"] = flights_df["DepTime"] // 100
# assign back rather than calling replace(..., inplace=True) on a column slice
flights_df["DepHour"] = flights_df["DepHour"].replace(24, 0)
flights_df["DepHour"].describe()
count    6873482.0
mean           NaN
std            0.0
min            0.0
25%            9.0
50%           13.0
75%           17.0
max           23.0
Name: DepHour, dtype: float64
flights_df["UniqueCarrier"].nunique()
20
cancelled_df = flights_df["Cancelled"].value_counts()
cancelled_df[0] - cancelled_df[1]
6734860
target_df = flights_df.groupby("Dest").agg(
{"ArrDelay": "max", "DepDelay": "max"})
target_df.sort_values("DepDelay", ascending=False).iloc[:1]
ArrDelay | DepDelay | |
---|---|---|
Dest | ||
MSP | 2460.0 | 2468.0 |
target_df.sort_values("ArrDelay", ascending=False).iloc[:1]
ArrDelay | DepDelay | |
---|---|---|
Dest | ||
MSP | 2460.0 | 2468.0 |
flights_df.groupby("Cancelled")["UniqueCarrier"].value_counts()[1].iloc[:3]
# flights_df[flights_df["Cancelled"] == 1]["UniqueCarrier"].mode()
UniqueCarrier
MQ    18331
AA    17440
OO    12436
Name: UniqueCarrier, dtype: int64
flights_df.groupby("DepHour")["DepHour"].count(
).sort_values(ascending=False).iloc[:5]
DepHour
8.0     449224
6.0     446737
11.0    444702
7.0     436700
13.0    435917
Name: DepHour, dtype: int64
flights_df.groupby("DepHour")["Cancelled"].mean().sort_values().head()
DepHour
3.0     0.000000
13.0    0.000108
12.0    0.000125
9.0     0.000129
10.0    0.000132
Name: Cancelled, dtype: float64
cancelled_hours = set(
flights_df[flights_df["Cancelled"] == 1].groupby(
"DepHour")["DepHour"].count().index
)
all_hours = set(np.arange(0.0, 24.0))
all_hours.difference(cancelled_hours)
{3.0}
Hint: Consider only completed flights.
flights_df[flights_df["Cancelled"] == 0]["DepHour"].value_counts().sort_values(
ascending=False
).iloc[:3]
# flights_df[(flights_df["Cancelled"] == 0)].groupby(["DepHour"]).size().idxmax()
8.0     449145
6.0     446657
11.0    444643
Name: DepHour, dtype: int64
Hint: Consider only completed flights.
flights_df[flights_df["Cancelled"] == 0].groupby("DepHour")["DepDelay"].mean()
DepHour
0.0      70.750000
1.0     111.812500
2.0     122.562500
3.0      95.812500
4.0       3.375000
5.0      -4.253906
6.0      -1.629883
7.0       0.102539
8.0       1.760742
9.0       3.753906
10.0      5.460938
11.0      6.683594
12.0      7.902344
13.0      8.812500
14.0     10.351562
15.0     11.875000
16.0     11.882812
17.0     13.210938
18.0     15.343750
19.0     16.875000
20.0     22.781250
21.0     23.031250
22.0     38.031250
23.0     60.000000
Name: DepDelay, dtype: float16
ax = flights_df[(flights_df["Cancelled"] == 0) & (flights_df["UniqueCarrier"] == "MQ")][
"DepHour"
].hist(bins=24)
ax.set_xticks(np.arange(0, 24))
Which of the listed below is not in your top-10 list?
flights_df[flights_df["Cancelled"] == 0][
"UniqueCarrier"
].value_counts().sort_values(ascending=False).iloc[:10]
WN    1189365
AA     587445
OO     554723
MQ     472362
US     447007
DL     445118
UA     438974
XE     364518
NW     344746
CO     294753
Name: UniqueCarrier, dtype: int64
cancellation_map = {
"A": "Carrier",
"B": "Weather",
"C": "National Air System",
"D": "Security",
}
flights_df[flights_df["Cancelled"] == 1]["CancellationCode"].map(
cancellation_map
).value_counts().plot(kind="barh")
(Take a look at 'Origin' and 'Dest' features. Consider A->B and B->A directions as different routes)
routes = flights_df["Origin"] + "-" + flights_df["Dest"]
routes.value_counts().iloc[:3]
# flights_df[["Origin", "Dest"]].groupby(["Origin", "Dest"]).size().idxmax()
SFO-LAX    13788
LAX-SFO    13390
OGG-HNL    12383
dtype: int64
Hint: consider only positive delays
target_df = flights_df[flights_df["DepDelay"] > 0].copy()
target_df["Route"] = target_df["Origin"] + "-" + target_df["Dest"]
five_routes = target_df["Route"].value_counts().iloc[:5].index
print(five_routes)
# unfinished
Index(['LAX-SFO', 'DAL-HOU', 'SFO-LAX', 'ORD-LGA', 'HOU-DAL'], dtype='object')
sum(target_df[target_df["Route"].isin(five_routes)]["WeatherDelay"] > 0)
668
ax = flights_df["DepHour"].hist(bins=24)
ax.set_title("Number of flights by departure hour")
Choose all correct statements:
Hint: Look for official meteorological winter months for the Northern Hemisphere.
flights_df.columns
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier', 'FlightNum', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'Date', 'DepHour'], dtype='object')
flights_df["DayOfWeek"].value_counts().sort_index(ascending=False).plot(
kind="barh", title="Number of flights by the day of the week"
)
flights_df["Month"].value_counts().sort_index(ascending=False).plot(
    kind="barh", title="Number of flights by month"
)
Choose all correct statements:
target_df = flights_df[flights_df["Cancelled"] == 1].copy()
target_df["CancellationReason"] = target_df["CancellationCode"].map(
cancellation_map)
target_df.groupby(["Month", "CancellationReason"]).size().unstack(0).plot(
kind="barh", figsize=(5, 12)
)
Reminder on Cancellation codes:
A - Carrier
B - Weather
C - National Air System
D - Security
target_df.groupby(["CancellationReason", "Month"]).size().sort_values(
ascending=False
).iloc[:5]
CancellationReason  Month
Weather             12       10042
                    2        10034
                    3         7520
Carrier             4         7312
Weather             1         6711
dtype: int64
target_df[target_df["CancellationReason"] == "Carrier"][
"UniqueCarrier"
].value_counts().iloc[:3]
AA    9609
WN    6554
UA    6352
Name: UniqueCarrier, dtype: int64
(Boxplots can be helpful in this exercise, and it might also be a good idea to remove outliers to build nicer graphs. You can exclude delay values above the corresponding 0.95 percentile.)
target_df = flights_df.copy()
target_df = target_df[(target_df["ArrDelay"] >= 0) &
(target_df["DepDelay"] >= 0)]
flights_df.shape, target_df.shape
((7009728, 21), (2320755, 21))
percentiles = target_df[["ArrDelay", "DepDelay"]].quantile(q=0.95)
target_df = target_df[
(target_df["ArrDelay"] <= percentiles["ArrDelay"])
& (target_df["DepDelay"] <= percentiles["DepDelay"])
]
target_df.shape
(2188884, 21)
melted_df = target_df[["UniqueCarrier", "ArrDelay", "DepDelay"]].melt(
id_vars="UniqueCarrier"
)
sns.boxplot(data=melted_df, y="UniqueCarrier", x="value", hue="variable")
sns.set(rc={"figure.figsize": (20, 20)}, font_scale=1.5, style="whitegrid")
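The `melt` call reshapes the two wide delay columns into long `variable`/`value` format, which is what seaborn's `hue` argument expects. A minimal synthetic sketch of that reshape (toy rows, not the flights frame):

```python
import pandas as pd

wide = pd.DataFrame({
    "UniqueCarrier": ["AA", "WN"],
    "ArrDelay": [5.0, -2.0],
    "DepDelay": [7.0, 0.0],
})
tidy = wide.melt(id_vars="UniqueCarrier")
print(tidy)
# Four rows: one per (carrier, delay-type) pair, with columns
# UniqueCarrier / variable / value
```

Each original row yields one long-format row per melted column, so the boxplot can split on `variable` to show arrival and departure delays side by side per carrier.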