Air pollution is widely regarded as the world's greatest environmental health threat.
Among the 30 most polluted cities in the world, 21 are located in India, as reported by the Swiss organisation IQAir (based on PM2.5 concentration).
In this analysis we examine the pollution levels of Indian cities over the last 7 years.
The aim is to analyze trends, pinpoint possible causes, and derive insights from a large amount of granular data on the concentration of air pollutants.
Finally, we propose solutions based on the insights drawn from the data.
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import pycountry
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
!pip install chart_studio
import chart_studio.plotly as py
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
import folium
from folium import Choropleth, Circle, Marker
from folium import plugins
from folium.plugins import HeatMap, MarkerCluster
!pip install bar_chart_race
import bar_chart_race as bcr
from IPython.display import HTML
plt.rcParams['figure.figsize'] = 8, 5
plt.style.use("fivethirtyeight")
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import warnings
warnings.filterwarnings('ignore')
import datetime
import missingno as msno
I start by importing the dataset into the Jupyter notebook. After checking its dimensions and composition, I check for mixed-type data, missing values, and duplicates.
pwd = os.getcwd()
Imported_data = pd.read_csv( pwd + "/city_data.csv")
Imported_data
City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ahmedabad | 1/1/2015 | NaN | NaN | 0.92 | 18.22 | 17.15 | NaN | 0.92 | 27.64 | 133.36 | 0 | 0.02 | 0 | NaN | NaN |
1 | Ahmedabad | 2/1/2015 | NaN | NaN | 0.97 | 15.69 | 16.46 | NaN | 0.97 | 24.55 | 34.06 | 3.68 | 5.5 | 3.77 | NaN | NaN |
2 | Ahmedabad | 3/1/2015 | NaN | NaN | 17.4 | 19.3 | 29.7 | NaN | 17.4 | 29.07 | 30.7 | 6.8 | 16.4 | 2.25 | NaN | NaN |
3 | Ahmedabad | 4/1/2015 | NaN | NaN | 1.7 | 18.48 | 17.97 | NaN | 1.7 | 18.59 | 36.08 | 4.43 | 10.14 | 1 | NaN | NaN |
4 | Ahmedabad | 5/1/2015 | NaN | NaN | 22.1 | 21.42 | 37.76 | NaN | 22.1 | 39.33 | 39.31 | 7.01 | 18.89 | 2.78 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54169 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
54170 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
54171 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
54172 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
54173 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
54174 rows × 16 columns
We can see in the above dataframe that the rows at the end contain only null values, so the first step is to get rid of them.
Imported_data = Imported_data[Imported_data['City'].notna()]
Imported_data
City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ahmedabad | 1/1/2015 | NaN | NaN | 0.92 | 18.22 | 17.15 | NaN | 0.92 | 27.64 | 133.36 | 0 | 0.02 | 0 | NaN | NaN |
1 | Ahmedabad | 2/1/2015 | NaN | NaN | 0.97 | 15.69 | 16.46 | NaN | 0.97 | 24.55 | 34.06 | 3.68 | 5.5 | 3.77 | NaN | NaN |
2 | Ahmedabad | 3/1/2015 | NaN | NaN | 17.4 | 19.3 | 29.7 | NaN | 17.4 | 29.07 | 30.7 | 6.8 | 16.4 | 2.25 | NaN | NaN |
3 | Ahmedabad | 4/1/2015 | NaN | NaN | 1.7 | 18.48 | 17.97 | NaN | 1.7 | 18.59 | 36.08 | 4.43 | 10.14 | 1 | NaN | NaN |
4 | Ahmedabad | 5/1/2015 | NaN | NaN | 22.1 | 21.42 | 37.76 | NaN | 22.1 | 39.33 | 39.31 | 7.01 | 18.89 | 2.78 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53263 | Visakhapatnam | 6/2/2023 | 130.17 | 369.34 | 118.16 | 63.8 | 130 | 8.13 | 1.5 | 24.67 | 13.77 | 12.09 | 44.85 | 4.61 | NaN | NaN |
53264 | Visakhapatnam | 7/2/2023 | None | None | None | None | None | None | None | None | None | None | None | None | NaN | NaN |
53265 | Visakhapatnam | 8/2/2023 | 52.6 | 154.07 | 30.98 | 31.22 | 39.93 | 5.77 | 1.32 | 19.69 | 15 | 3.46 | 10.85 | 0.98 | NaN | NaN |
53266 | Visakhapatnam | 9/2/2023 | 57.74 | 147.25 | 29.47 | 31.34 | 40.67 | 5.84 | 1.07 | 26.98 | 13.99 | 3.65 | 21.75 | 0.8 | NaN | NaN |
53267 | Visakhapatnam | 10/2/2023 | 60.09 | 153.13 | 30.36 | 32.69 | 42.08 | 4.54 | 0.98 | 28.22 | 14.05 | 2.99 | 24.62 | 0.96 | NaN | NaN |
53268 rows × 16 columns
Imported_data = Imported_data.replace('None', np.nan)
pollutant_cols = ['PM2.5', 'PM10', 'NO2', 'CO', 'SO2', 'O3',
                  'Benzene', 'Toluene', 'Xylene', 'NOx', 'NO', 'NH3']
for col in pollutant_cols:
    Imported_data[col] = pd.to_numeric(Imported_data[col])
Imported_data['Date'] = pd.to_datetime(Imported_data['Date'])
print(f"The available data is between {Imported_data['Date'].min()} and {Imported_data['Date'].max()}")
The available data is between 2015-01-01 00:00:00 and 2023-12-01 00:00:00
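One caveat worth flagging: the `Date` strings look day-first (the consecutive Visakhapatnam rows above run 6/2/2023 through 10/2/2023, i.e. early February), and `pd.to_datetime` defaults to month-first for ambiguous strings. A minimal standalone sketch of parsing such strings with an explicit format (the day-first assumption is mine, inferred from the rows above, not confirmed by the source):

```python
import pandas as pd

# Toy day-first date strings in the same shape as the Date column above
dates = pd.Series(["1/1/2015", "25/3/2020", "10/2/2023"])

# An explicit format removes the day/month ambiguity entirely;
# dayfirst=True would be a looser alternative.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

If the dates really are day-first, skipping this step would silently swap days and months for every date where the day is 12 or less.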
Imported_data_modified = Imported_data.copy()
To visually understand the proportion of rows with missing information, the missingno library can be used. missingno is a third-party Python library for visualizing missing data in datasets.
Imported_data_modified.drop(columns= ["AQI" , "AQI_Bucket"] , inplace= True)
plt.style.use('seaborn-white')
msno.matrix(Imported_data_modified);
We can clearly see that Xylene data is largely missing.
Missing information indicates the installed sensor's inability to measure that pollutant's concentration.
To dig a little deeper, we can quantify the missing information for each pollutant with a gradient-shaded table, built by summing the null values and converting the counts into percentages.
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print(f"Your selected dataframe has {df.shape[1]} columns.\n"
          f"There are {mis_val_table_ren_columns.shape[0]} columns that have missing values.")
    return mis_val_table_ren_columns
missing_values = missing_values_table(Imported_data_modified)
missing_values.style.background_gradient(cmap = 'YlOrBr')
Your selected dataframe has 14 columns. There are 12 columns that have missing values.
Missing Values | % of Total Values | |
---|---|---|
Xylene | 33578 | 63.0 |
Toluene | 17023 | 32.0 |
NH3 | 13809 | 25.9 |
PM10 | 13056 | 24.5 |
Benzene | 9112 | 17.1 |
O3 | 6798 | 12.8 |
PM2.5 | 6762 | 12.7 |
NOx | 5781 | 10.9 |
NO2 | 5677 | 10.7 |
NO | 5307 | 10.0 |
SO2 | 5163 | 9.7 |
CO | 3244 | 6.1 |
As expected, Xylene readings are largely missing. Reassuringly, the share of missing values for the major pollutants PM2.5 and PM10 is comparatively low, so their averages should still be representative.
Now we can proceed to analyze the dataset.
For correlation analysis, I use pandas to create a correlation matrix of the numerical variables in the dataset, which indicates the level of interdependence between them.
The volatile organic compounds Benzene, Toluene and Xylene are grouped together and represented as 'BTX'.
Imported_data_modified['BTX'] = Imported_data_modified['Benzene'] + Imported_data_modified['Toluene'] + Imported_data_modified['Xylene']
Imported_data_modified['Particulate_Matter'] = Imported_data_modified['PM2.5'] + Imported_data_modified['PM10']
Imported_data_modified['Nitrogen Oxides'] = Imported_data_modified['NO'] + Imported_data_modified['NO2'] + Imported_data_modified['NOx']
# Keep the individual pollutant columns (they are needed for the city-level
# analysis below); compute the correlation on the grouped variables only.
grouped_corr = Imported_data_modified.drop(
    columns=['Benzene', 'Toluene', 'Xylene', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx']
).corr(numeric_only=True)
plt.figure(figsize=(5, 4))
sns.heatmap(grouped_corr, cmap='seismic', annot=True);
The heatmap shows only moderate relationships (e.g. BTX with other pollutants, particulate matter with nitrogen oxides). It is important to note that there are no strong relationships (i.e. no coefficient above 0.75).
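The "no coefficient above 0.75" claim can also be checked programmatically by extracting the strong pairs from the correlation matrix. A self-contained sketch on synthetic data (the column names, values, and threshold here are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "CO": base + rng.normal(scale=0.1, size=200),   # deliberately tied to SO2
    "SO2": base + rng.normal(scale=0.1, size=200),
    "O3": rng.normal(size=200),                     # independent noise
})

corr = df.corr()
# Keep the upper triangle only, so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong = upper.stack()[lambda s: s.abs() > 0.75]
print(strong)
```

On the real dataframe the same two lines applied to `grouped_corr` would list every pair crossing the 0.75 mark, or an empty Series if none do.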
I choose visual tools, charts and graphs, for the analysis, as they help identify trends in pollutant concentrations.
It is also worth testing the hypothesis that air pollution is associated with the type of climate and the specificity of particular months.
Let's now analyse the data to see what patterns and insights we can uncover.
pollutants = ['PM2.5', 'PM10', 'NO2', 'CO', 'SO2', 'O3']
filtered_city_day = Imported_data_modified[Imported_data_modified['Date'] <= '2023-01-01']
filtered_city_day.set_index('Date', inplace=True)
axes = filtered_city_day[pollutants].plot(marker='.', alpha=0.5, linestyle='None',
                                          figsize=(16, 30), subplots=True)
for ax in axes:
    ax.set_xlabel('Years')
    ax.set_ylabel('ug/m3')
To dig a little deeper, I created subplots to analyze the distribution of pollutants over a roughly five-year period, using matplotlib's `plt.subplots()` function.
This makes the seasonal effect easier to see.
It also lets us test the *hypothesis* that the Covid-19 pandemic lockdown (which started on 25 March 2020) had no effect on air pollution.
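One hedged way to make the seasonal signal explicit before plotting is to resample the daily series to monthly means. A standalone sketch with synthetic PM2.5 values (the winter-peak shape and levels are fabricated purely for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily PM2.5 with a winter peak (values are purely illustrative)
idx = pd.date_range("2018-01-01", "2019-12-31", freq="D")
# Cosine peaking around mid-January (day-of-year 15), bottoming in July
winter_boost = 60 * np.cos(2 * np.pi * (idx.dayofyear - 15) / 365)
pm25 = pd.Series(100 + winter_boost, index=idx)

# Monthly means smooth out daily noise and expose the seasonal cycle
monthly = pm25.resample("M").mean()
print(monthly.idxmax().strftime("%B"), monthly.idxmin().strftime("%B"))
```

Applied to the real daily data, `resample("M").mean()` would give one point per month per pollutant, which is essentially what the `Year_Month` grouping below computes by hand.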
Imported_data_modified['NH3'] = pd.to_numeric(Imported_data_modified['NH3'])
corr_with_PM10 = Imported_data_modified.corr(numeric_only=True).PM10.sort_values(ascending=False)
metrics = corr_with_PM10[corr_with_PM10 > 0.01].index
Imported_data_modified['Year_Month'] = Imported_data_modified.Date.apply(lambda x: x.strftime('%Y-%m'))
Imported_data_modified = Imported_data_modified[Imported_data_modified['Year_Month'] <= '2023-01']
Imported_data_modified = Imported_data_modified[Imported_data_modified['Year_Month'] >= '2017-10']
df = Imported_data_modified.groupby(['Year_Month']).sum(numeric_only=True).reset_index()
sns.set_style('ticks')
fig, ax_ = plt.subplots(len(metrics), 1, figsize=(20, 40))
fig.tight_layout(pad=4)
for i, col in enumerate(metrics):
    x = df['Year_Month']
    y = df[col]
    ax_[i].plot_date(x, y, label=col, linestyle="--")
    ax_[i].set_xticklabels(df['Year_Month'], rotation=85)
    ax_[i].legend()
The trend is clearly visible: pollutant concentrations are lowest in the summer months (April to August) and highest in the winter months (November to February).
There is also a notable dip after the 25 March 2020 lockdown for all pollutants, so the hypothesis that the Covid-19 pandemic lockdown had no effect on air pollution is false. The decrease in particulate matter and gaseous pollutants (CO and NO2) clearly reflects the impact of halted industrial and vehicular activity during the lockdown.
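The lockdown effect can also be quantified with a simple before/after mean comparison around the cutoff. A minimal sketch on toy data (only the 25 March 2020 cutoff comes from the text; the NO2 levels here are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", "2020-06-30", freq="D")

# Toy NO2 series that drops after the lockdown begins (levels are invented)
no2 = np.where(dates < pd.Timestamp("2020-03-25"),
               rng.normal(45, 5, len(dates)),
               rng.normal(25, 5, len(dates)))
df = pd.DataFrame({"Date": dates, "NO2": no2})

cutoff = pd.Timestamp("2020-03-25")  # lockdown start, from the text above
before = df.loc[df["Date"] < cutoff, "NO2"].mean()
after = df.loc[df["Date"] >= cutoff, "NO2"].mean()
print(f"Mean NO2 before: {before:.1f}, after: {after:.1f}")
```

A proper analysis would also control for the seasonal cycle (March-to-April levels fall every year), for example by comparing against the same window in 2018-2019.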
To analyze pollution levels across cities, let's start by listing all 26 cities present in this dataset, checking for duplicates caused by inconsistent naming.
Imported_data_modified['City'].unique()
array(['Ahmedabad', 'Aizawl', 'Amaravati', 'Amritsar', 'Bengaluru', 'Bhopal', 'Brajrajnagar', 'Chandigarh', 'Chennai', 'Coimbatore', 'Delhi', 'Ernakulam', 'Gurugram', 'Guwahati', 'Hyderabad', 'Jaipur', 'Jorapokhar', 'Kochi', 'Kolkata', 'Lucknow', 'Mumbai', 'Patna', 'Shillong', 'Talcher', 'Thiruvananthapuram', 'Visakhapatnam'], dtype=object)
The proportion of pollutants in each city can be plotted side by side using a treemap.
# errors='ignore' because AQI and AQI_Bucket were already dropped earlier;
# Year_Month is excluded so the string column is not summed per city
df = Imported_data_modified.drop(
    columns=['Date', 'AQI_Bucket', 'AQI', 'Year_Month'], errors='ignore'
).groupby('City').sum(numeric_only=True).reset_index()
melted = pd.melt(df, id_vars='City')
fig = px.treemap(melted, path=['City', 'variable'], values='value',
                 title='Cities and the proportion of pollutants in each')
fig.show()
North Indian cities such as Delhi and Gurugram top this visualization.
In cities like Ahmedabad and Patna, rapid urbanisation and industrialisation are the main reasons for high pollution levels.
Let's dig a little deeper to find the cities that reported the maximum concentrations of individual pollutants such as PM2.5, PM10 and SO2.
def max_polluted_city(pollutant):
    x1 = Imported_data_modified[[pollutant, 'City']].groupby(['City']).mean() \
        .sort_values(by=pollutant, ascending=False).reset_index()
    x1[pollutant] = round(x1[pollutant], 2)
    return x1[:10].style.background_gradient(cmap='coolwarm')

from IPython.display import display_html

def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html()  # Styler.render() is deprecated; to_html() replaces it
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)
pm2_5 = max_polluted_city('PM2.5')
pm10 = max_polluted_city('PM10')
no2 = max_polluted_city('NO2')
so2 = max_polluted_city('SO2')
co = max_polluted_city('CO')
display_side_by_side(pm2_5,pm10,no2,so2,co)
City | PM2.5 | |
---|---|---|
0 | Delhi | 115.98 |
1 | Gurugram | 114.78 |
2 | Patna | 112.45 |
3 | Lucknow | 96.22 |
4 | Mumbai | 69.63 |
5 | Jorapokhar | 64.37 |
6 | Guwahati | 63.04 |
7 | Ahmedabad | 60.59 |
8 | Jaipur | 59.62 |
9 | Kolkata | 59.36 |
City | PM10 | |
---|---|---|
0 | Delhi | 227.93 |
1 | Gurugram | 208.87 |
2 | Patna | 165.06 |
3 | Lucknow | 146.86 |
4 | Jorapokhar | 132.79 |
5 | Jaipur | 129.82 |
6 | Ahmedabad | 128.35 |
7 | Talcher | 121.68 |
8 | Bhopal | 117.49 |
9 | Guwahati | 117.43 |
City | NO2 | |
---|---|---|
0 | Ahmedabad | 57.29 |
1 | Delhi | 48.35 |
2 | Jaipur | 42.22 |
3 | Visakhapatnam | 36.47 |
4 | Lucknow | 35.99 |
5 | Kolkata | 34.65 |
6 | Patna | 34.32 |
7 | Hyderabad | 31.18 |
8 | Bengaluru | 29.05 |
9 | Coimbatore | 26.80 |
City | SO2 | |
---|---|---|
0 | Ahmedabad | 45.75 |
1 | Jorapokhar | 33.45 |
2 | Talcher | 27.24 |
3 | Bhopal | 18.15 |
4 | Mumbai | 17.25 |
5 | Guwahati | 17.07 |
6 | Brajrajnagar | 16.40 |
7 | Patna | 15.91 |
8 | Visakhapatnam | 14.90 |
9 | Delhi | 14.34 |
City | CO | |
---|---|---|
0 | Ahmedabad | 13.68 |
1 | Jorapokhar | 9.53 |
2 | Lucknow | 4.49 |
3 | Bengaluru | 3.82 |
4 | Delhi | 1.71 |
5 | Talcher | 1.55 |
6 | Brajrajnagar | 1.42 |
7 | Patna | 1.36 |
8 | Gurugram | 1.15 |
9 | Kochi | 1.11 |
As expected, the same cities have the highest mean values.
According to cpcb.nic.in, PM2.5 and PM10 are the most crucial pollutants for determining air quality.
The permissible limits for the annual averages are 40 ug/m3 for PM2.5 and 60 ug/m3 for PM10. As the tables above show, PM2.5 and PM10 levels in Indian cities are significantly higher than the permissible limits.
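Given the city means above and the CPCB limits, the exceedance ratio per city is a one-line computation. A sketch using three of the reported means (the limits and values are copied from the text and tables above):

```python
import pandas as pd

# CPCB annual-mean limits cited above: PM2.5 40 ug/m3, PM10 60 ug/m3
LIMITS = {"PM2.5": 40, "PM10": 60}

# A few of the city means reported in the tables above
means = pd.DataFrame(
    {"PM2.5": [115.98, 114.78, 112.45], "PM10": [227.93, 208.87, 165.06]},
    index=["Delhi", "Gurugram", "Patna"],
)

# Ratio of observed annual mean to the permissible limit, per pollutant
exceedance = means / pd.Series(LIMITS)
print(exceedance.round(1))
```

Delhi's PM10 mean works out to roughly 3.8 times the limit, which is what grounds the "over three times" claim in the conclusions below.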
Status of Indian cities:
These figures clearly show that air pollution is a silent crisis in India; it is an emergency.
It could well be a reason for the abrupt rise in diseases such as lung cancer in these cities.
Air pollution is associated with the type of climate and the specificity of particular months. In other words, pollution shows a seasonal effect: pollutant concentrations are lowest in the summer months (April to August) and highest in the winter months (November to February).
Urbanized North Indian cities such as Delhi and Gurugram are the most seriously affected, as low winter temperatures coupled with stubble burning (after harvest) in the adjoining states lead to a significant rise in pollutant concentrations.
There was a notable dip in pollutant concentrations during the Covid-19 pandemic lockdown.
The average particulate matter level in some polluted Indian cities is over three times the maximum permissible limit.
The government should double down on its efforts during the winter season to control pollution.
The notable fall in pollutant concentrations during the Covid-19 pandemic showed that reducing the use of polluting vehicles and industries can bring immediate relief from severe air pollution. Therefore, promoting public transport and curbing polluting industries could cut pollution levels substantially.
People should also be informed that pollutant levels are two to three times above the safe limit, and educated about the impact on their health.