Seeing Through Data
© Explore AI
Carbon dioxide (CO2) is a colourless, odourless and non-poisonous gas formed by combustion of carbon and in the respiration of living organisms and is considered a greenhouse gas.
CO2 emissions from the burning of fossil fuels are the primary cause of global warming, which is one of the biggest threats facing humanity today. Although plenty of other gases are emitted on this earth, including methane, nitrous oxide, and CFCs, none compares to the emission of CO2, and we as humans are mostly to blame for this. This analysis focuses on CO2 emissions and their effect on the world we live in, as well as some key factors and statistics that may play a role in the emission of CO2 globally.
The world as we know it is becoming more modernized by the year, and with this all the more POLLUTED.
According to official UN data:
1. Over 3 BILLION of the world’s 8 billion people are affected by degrading ecosystems
2. Pollution is responsible for some 9 MILLION premature deaths each year
3. Over 1 million plant and animal species risk extinction
4. 200 million people could be displaced EACH YEAR by climate disruption by 2050.
Our work is a continuation of the analysis done by Benjamin from Minneapolis, Minnesota, United States, on Kaggle. The results of his analysis include:
The dataset used broadly categorizes all emitters (transportation, lifestyle, industry, etc.) into one total amount for each energy type.
This notebook takes the analysis further and builds several machine learning models that can accurately predict CO2 emissions based on several parameters.
Warning: We are not climate scientists, so some things may be inaccurate. This is simply a study of a subject we are interested in, allowing us to go deeper into the subject while at the same time improving our graphing skills. All our sources are at the bottom of the notebook.
⚡ Description: Importing Packages ⚡ |
---|
In this section we import the libraries used throughout our analysis and modelling, which let us call functions that are not part of the core Python language, and briefly discuss them. |
# Libraries for Analysis
import numpy as np
import pandas as pd
# Mute warnings
import warnings
warnings.filterwarnings('ignore')
⚡ Description: Loading the data ⚡ |
---|
In this section we will be loading the data from the CSV and EXCEL files into Pandas DataFrames. |
# Load Base Data
df = pd.read_csv("data/Our_CO2emission_Clean_Data.csv")
# View first 5 rows of Loaded Base Data
df.head()
Unnamed: 0 | Country | e_type | Year | e_con | e_prod | GDP | Population | ei_capita | ei_gdp | CO2_emission | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | World | all | 1988 | 345.56 | 347.41 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 21163.84 |
1 | 1 | World | coal | 1988 | 96.87 | 98.48 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 8930.92 |
2 | 2 | World | nat_gas | 1988 | 71.01 | 71.85 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 3571.68 |
3 | 3 | World | pet/oth | 1988 | 133.45 | 132.49 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 8661.24 |
4 | 4 | World | nuclear | 1988 | 19.23 | 19.23 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 0.00 |
# Drop Unnamed Column
df = df.drop('Unnamed: 0', axis=1)
Column descriptions:
It will also be exciting to see how we can enrich the dataset with extra features. Hence, we will be adding the following datasets:
From REUTERS, "Carbon emissions dip in 2009, to jump in 2010 - report": “The real surprise was that we were expecting a bigger dip due to the financial crisis in terms of fossil fuel emissions,” said Pep Canadell, executive director of the Global Carbon Project and one of the co-authors of the study published in the latest issue of the journal Nature Geoscience. Scientists say rising levels of CO2, the main greenhouse gas, from burning fossil fuels and deforestation is heating up the planet.
So we had BURNING FOSSIL FUEL covered, but not the impact of DEFORESTATION...
Then, from "Measuring Carbon Emissions from Tropical Deforestation: An Overview": tropical deforestation contributes about 20% of annual global greenhouse gas (GHG) emissions, and reducing it will be necessary to avoid dangerous climate change. China and the US are the world’s number one and two emitters, but numbers three and four are Indonesia and Brazil, with ~80% and ~70% of their emissions respectively from deforestation.
# Load Population Growth (Rate of population change)
pop_df = pd.read_csv("data/Population_Growth_from_world_Bank_Integrate.csv")
# Load Population Density per Country
den_df = pd.read_excel('data/Population_Density_per_country_data.xls')
# Load Manufacturing GDP Contribution (GDP splits)
mgdp_df = pd.read_excel('data/GDP_split_Manufacturing_contribution_data_per_Country.xls')
# Load Agri GDP Contribution (GDP splits)
agdp_df = pd.read_excel('data/GDP_split_Agricultural_contribution_data_per_Country.xls')
# Load Deforestation Impact per country
forest_df = pd.read_csv("data/Deforestation_data.csv") # Forest area (% of land area)
land_df = pd.read_csv("data/Land_Area_Data.csv") # Land Area (sq. km)
# Compute emission per capita (CO2 emission divided by population)
df['emission_per_cap'] = df['CO2_emission']/df['Population']
# Check
pop_df.head(2)
Country Name | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | -1.246457 | -0.063879 | 1.816830 | 3.898739 | 5.446052 | 6.048669 | 5.644930 | 4.610156 | 3.531110 | ... | 0.503385 | 0.583290 | 0.590508 | 0.541048 | 0.502860 | 0.471874 | 0.459266 | 0.437415 | 0.428017 | NaN |
1 | Africa Eastern and Southern | 2.987172 | 2.956405 | 2.913059 | 2.871078 | 2.832013 | 2.791294 | 2.751374 | 2.710420 | 2.673851 | ... | 2.763426 | 2.761496 | 2.750400 | 2.732598 | 2.712218 | 2.690902 | 2.665620 | 2.636666 | 2.605427 | NaN |
2 rows × 35 columns
# Check
forest_df.head(1)
Country Name | Indicator Name | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | Forest area (% of land area) | NaN | NaN | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | ... | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | 2.333333 | NaN |
1 rows × 36 columns
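The World Bank tables checked above are in wide format (one column per year), while our base data has one row per country, energy type and year. As a minimal sketch of how such a table could be lined up with the base data, the example below reshapes a toy wide table with `melt` and joins it with `merge`. The frames `wide` and `base` and all their values are made up for illustration, and the join uses exact name matching, which is stricter than the substring lookups used later in this notebook.

```python
import pandas as pd

# Toy wide-format table (values are invented for illustration).
wide = pd.DataFrame({
    "Country Name": ["Aruba", "Myanmar"],
    "1988": [-1.2, 1.9],
    "1989": [-0.1, 1.8],
})

# Toy base table with one row per country and year.
base = pd.DataFrame({
    "Country": ["Myanmar", "Myanmar", "Aruba"],
    "Year": [1988, 1989, 1989],
})

# Wide -> long: one row per (country, year) pair.
long_df = wide.melt(id_vars="Country Name", var_name="Year", value_name="pop_growth")
long_df["Year"] = long_df["Year"].astype(int)  # year labels are strings in the wide table
long_df = long_df.rename(columns={"Country Name": "Country"})

# A left join keeps every row of base; unmatched pairs would receive NaN.
merged = base.merge(long_df, on=["Country", "Year"], how="left")
print(merged)
```

A join like this does the lookup for all rows at once, but it only finds a value when the country name is spelled identically in both tables.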
OBSERVATION: After careful study, we observed that several countries have either been integrated into another or have had their names changed, which tends to result in lots of missing values. See the table below.
Country Name in Data | Current Name | Replacement Name | Special Case |
---|---|---|---|
Burma | Myanmar | Myanmar | |
Congo-Brazzaville | Republic of the Congo | Congo, Rep. | |
Congo-Kinshasa | Democratic Republic of the Congo | Congo, Dem. Rep. | |
Côte d’Ivoire | --- | Cote d'Ivoire | |
Guadeloupe | overseas département and overseas region of FRANCE | DROP | |
Laos | Lao People's Democratic Republic | Lao PDR | |
Macau | special administrative region CHINA | Macao SAR, China | |
Martinique | Island and overseas territorial collectivity of FRANCE | DROP | |
North Korea | Korea, Dem. People's Rep. | Korea, Dem. People's Rep. | Lump Together North & South Korea or DROP |
Reunion | Réunion La Réunion (French) | DROP | |
Saint Kitts and Nevis | Federation of Saint Christopher and Nevis | St. Kitts and Nevis | |
Saint Lucia | --- | St. Lucia | |
Saint Vincent/Grenadines | --- | St. Vincent and the Grenadines | |
South Korea | Republic of Korea | --- | Lump Together North & South Korea or DROP |
Taiwan | Republic of China (ROC) | DROP | |
The Bahamas | --- | Bahamas, The | |
Kyrgyzstan | Kyrgyz Republic | Kyrgyz Republic | |
Slovakia | Slovak Republic | Slovak Republic | |
Palestinian Territories | Israel | West Bank and Gaza | |
So let's proceed to making these changes...
'''
These countries have been merged into major countries already present
in our dataset and won't be present in the datasets supplying the additional
features, hence we delete them.
'''
# Drop Rows of Countries: Guadeloupe, Martinique, Reunion, Taiwan
df = df[~df.Country.isin(['Guadeloupe', 'Martinique', 'Reunion', 'Taiwan'])]
# Declare Replacement Names as Dict
replace_values = {
    'Burma': 'Myanmar',
    'Congo-Brazzaville': 'Congo, Rep.',
    'Congo-Kinshasa': 'Congo, Dem. Rep.',
    "Côte d’Ivoire": "Cote d'Ivoire",
    'Laos': 'Lao PDR',
    'Macau': 'Macao SAR, China',
    'Saint Kitts and Nevis': 'St. Kitts and Nevis',
    'Saint Lucia': 'St. Lucia',
    'Saint Vincent/Grenadines': 'St. Vincent and the Grenadines',
    'The Bahamas': 'Bahamas, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Slovakia': 'Slovak Republic',
    'Palestinian Territories': 'West Bank and Gaza',
}
# Apply Replacement Names
df = df.replace({"Country": replace_values})
# Check
df[df['Country'] == 'Myanmar'].head(1)
Country | e_type | Year | e_con | e_prod | GDP | Population | ei_capita | ei_gdp | CO2_emission | emission_per_cap | |
---|---|---|---|---|---|---|---|---|---|---|---|
150 | Myanmar | all | 1988 | 0.09 | 0.08 | 26.35 | 40085.6 | 2.13 | 3.23 | 4.83 | 0.00012 |
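After renaming, it can help to list which country names still have no counterpart in a reference table. The sketch below does this with a set difference on two toy frames; `emissions` and `reference` are hypothetical stand-ins for `df` and `pop_df`, and the names in them are chosen only for illustration. It checks for exact matches, so it may flag names that a substring lookup would still find.

```python
import pandas as pd

# Hypothetical country names; in the notebook the real frames are df and pop_df.
emissions = pd.DataFrame({"Country": ["Myanmar", "Lao PDR", "North Korea"]})
reference = pd.DataFrame(
    {"Country Name": ["Myanmar", "Lao PDR", "Korea, Dem. People's Rep."]}
)

# Names in the emissions table with no exact counterpart in the reference table.
unmatched = sorted(set(emissions["Country"]) - set(reference["Country Name"]))
print(unmatched)
```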
⚡ Description: Exploratory data analysis ⚡ |
---|
In this section, we will be integrating and engineering features from areas that may prove viable in our analysis of CO2 emissions. |
Hence let's proceed to Integrating additional features.
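# Define a function that integrates population growth
def add_pop_growth(row):
    # Literal substring match on the country name; the first match is used
    val = pop_df.loc[pop_df["Country Name"].str.contains(row['Country'], regex=False), str(row['Year'])]
    # np.nan replaces np.NaN, which was removed in NumPy 2.0
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan
# Applying Function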
df['pop_growth'] = df.apply(add_pop_growth, axis=1)
# Check for missing values
df[df["pop_growth"].isnull()]["Country"].unique()
array(['North Korea', 'South Korea', 'New Zealand', 'Kuwait', 'Eritrea'], dtype=object)
North Korea & South Korea need to be collapsed into one country called Korea, Dem. People's Rep., while for 'New Zealand', 'Kuwait' and 'Eritrea' the missing values will have to be dealt with conventionally.
Country | Years with Missing Values | Action |
---|---|---|
New Zealand | 1991 only | Fill with mean/median |
Kuwait | 1992 to 1995 | Fill with mean/median |
Eritrea | 2012 to 2019 | Fill with mean/median |
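The "fill with mean/median" action in the table can be sketched as follows. This is only one possible reading of it, assuming the gaps are filled with each country's own median; the frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are invented for illustration).
toy = pd.DataFrame({
    "Country": ["Kuwait"] * 4 + ["Eritrea"] * 3,
    "Year": [1991, 1992, 1993, 1996, 2010, 2011, 2012],
    "pop_growth": [2.0, np.nan, np.nan, 4.0, 1.0, 3.0, np.nan],
})

# Median of each country's observed values, broadcast back to every row.
country_median = toy.groupby("Country")["pop_growth"].transform("median")

# Only the missing entries are replaced; observed values stay untouched.
toy["pop_growth_filled"] = toy["pop_growth"].fillna(country_median)
print(toy)
```

Filling per country rather than with a global median keeps a country's gaps close to its own level; swapping `"median"` for `"mean"` gives the mean-based variant.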
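# Define a function that integrates population density
def add_pop_den(row):
    # Literal substring match on the country name; the first match is used
    val = den_df.loc[den_df["Country Name"].str.contains(row['Country'], regex=False), str(row['Year'])]
    # np.nan replaces np.NaN, which was removed in NumPy 2.0
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan
# Applying Function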
df['pop_density'] = df.apply(add_pop_den, axis=1)
# Check for missing values
df[df["pop_density"].isnull()]["Country"].unique()
array(['Luxembourg', 'North Korea', 'South Korea', 'Kuwait', 'Kosovo'], dtype=object)
So we see a couple of additional missing values in the population density column, for the countries 'Luxembourg', 'North Korea', 'South Korea', 'Kuwait' and 'Kosovo'.
We will treat these later.
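The lookup functions above match country names with `str.contains` and take the first hit, which has two consequences worth keeping in mind. The sketch below illustrates them on a hypothetical list of names (we have not checked which of these occur in the actual files): a short name can match a longer one, and a name that is spelled differently in the reference table matches nothing and ends up as a missing value.

```python
import pandas as pd

# Hypothetical reference names, chosen only to illustrate the matching behaviour.
names = pd.Series(["Nigeria", "Niger", "Korea, Dem. People's Rep."])

# Substring match: "Niger" also matches "Nigeria", and "Nigeria" comes first.
substring_hits = names[names.str.contains("Niger", regex=False)].tolist()

# Exact match: only the intended row.
exact_hits = names[names == "Niger"].tolist()

# A name that is absent from the reference list matches nothing.
korea_hits = names[names.str.contains("North Korea", regex=False)].tolist()

print(substring_hits, exact_hits, korea_hits)
```

The empty result for 'North Korea' is consistent with that country showing up in the missing-value lists, since the reference tables use a different spelling.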
The calculation of a country's GDP encompasses all private and public consumption, government outlays, investments, additions to private inventories, paid-in construction costs, and the foreign balance of trade.
We will be focusing on just the GDP contributions of the manufacturing and agricultural industries per country per year.
# Define a function that integrates the manufacturing GDP contribution
def add_gdp_manu(row):
    val = mgdp_df.loc[mgdp_df["Country Name"].str.contains(row['Country'], regex=False), str(row['Year'])]
    # np.nan replaces np.NaN, which was removed in NumPy 2.0
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan
# Define a function that integrates the agriculture GDP contribution
def add_gdp_agri(row):
    val = agdp_df.loc[agdp_df["Country Name"].str.contains(row['Country'], regex=False), str(row['Year'])]
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan
# Applying Functions Respectively
df['manuf_GDP'] = df.apply(add_gdp_manu, axis=1)
df['agri_GDP'] = df.apply(add_gdp_agri, axis=1)
It's also worth noting that manuf_GDP and agri_GDP are percentage contributions to the overall GDP, hence we have to convert them to absolute values for computation.
df['Manuf_GDP'] = (df['manuf_GDP']/100)*df['GDP']
df['Agric_GDP'] = (df['agri_GDP']/100)*df['GDP']
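As a worked example of this conversion, the sketch below uses the World GDP for 1988 from the base data (42106.6) together with an agricultural share of 5.305 %. The share is an assumption on our part: it is the value that reproduces the Agric_GDP of 2233.75513 shown for that row in the final table, not a figure read from the source file.

```python
import pandas as pd

# GDP from the World 1988 row; the 5.305 % share is assumed for illustration.
toy = pd.DataFrame({"GDP": [42106.6], "agri_GDP": [5.305]})

# Percentage share -> absolute contribution, in the same unit as GDP.
toy["Agric_GDP"] = (toy["agri_GDP"] / 100) * toy["GDP"]
print(toy)
```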
# Define functions that integrate forest area (% of land area) and land area (sq. km)
def add_forest(row):
    val = forest_df.loc[forest_df["Country Name"].str.contains(row['Country'], regex=False), str(row['Year'])]
    # np.nan replaces np.NaN, which was removed in NumPy 2.0
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan
def add_land(row):
    val = land_df.loc[land_df["Country Name"].str.contains(row['Country'], regex=False), str(row['Year'])]
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan
# Applying Functions Respectively
df['Forest'] = df.apply(add_forest, axis=1)
df['Land'] = df.apply(add_land, axis=1)
# Get exact forest area in sq. km (forest share of land area times land area)
df['Deforestation'] = (df['Forest']/100)*df['Land']
# Note: we removed all land data before 1990 to avoid errors, since forest data starts from 1990
# Drop redundant Forest & Land Columns
df = df.drop(['Forest', 'Land'], axis=1)
The forest data begins in 1990, hence we will see missing values across all countries for the years 1988 and 1989, and maybe a few others within the dataset.
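Note that the column named Deforestation, as computed above, holds the forest area itself (forest share times land area), not a loss of forest. If a measure of change were wanted, one option is the year-over-year difference per country. The sketch below shows this on a toy table with one row per country and year; the frame `forest`, its column names and its values are all invented for illustration.

```python
import pandas as pd

# Toy table with one row per country and year (values are invented).
forest = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [1990, 1991, 1992, 1990, 1991],
    "forest_pct": [50.0, 49.0, 47.5, 10.0, 10.0],     # forest area, % of land area
    "land_sqkm": [1000.0, 1000.0, 1000.0, 200.0, 200.0],
})

# Absolute forest area, the same computation as for the Deforestation column.
forest["forest_sqkm"] = (forest["forest_pct"] / 100) * forest["land_sqkm"]

# Year-over-year change within each country; negative values mean forest loss.
forest = forest.sort_values(["Country", "Year"])
forest["forest_change_sqkm"] = forest.groupby("Country")["forest_sqkm"].diff()
print(forest)
```

The first year of each country has no previous value, so its change is NaN. Our base data repeats each country and year once per energy type, so such a difference would have to be computed on the per-country table before it is joined.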
The emission_per_cap column refers to the per-capita (per-person) emission for each country per energy type.
# Adding the emission_per_cap column
df['emission_per_cap'] = df['CO2_emission']/df['Population']
"""
Let's Reposition our Target variable
CO2 Emission to the End of our Dataframe
"""
# Separate Other Features From Target Variable
others = df.drop(['CO2_emission', 'emission_per_cap'], axis=1)
co = df[['emission_per_cap', 'CO2_emission']]
# Delete df
del df
# concat both Tables into fresh df
df = pd.concat([others, co], axis=1)
# Dropping % version of agric & Manufac GDP
df = df.drop(['manuf_GDP', 'agri_GDP'], axis=1)
df.head()
Country | e_type | Year | e_con | e_prod | GDP | Population | ei_capita | ei_gdp | pop_growth | pop_density | Manuf_GDP | Agric_GDP | Deforestation | emission_per_cap | CO2_emission | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | World | all | 1988 | 345.56 | 347.41 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 1.77 | 39.285 | NaN | 2233.75513 | NaN | 0.004295 | 21163.84 |
1 | World | coal | 1988 | 96.87 | 98.48 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 1.77 | 39.285 | NaN | 2233.75513 | NaN | 0.001812 | 8930.92 |
2 | World | nat_gas | 1988 | 71.01 | 71.85 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 1.77 | 39.285 | NaN | 2233.75513 | NaN | 0.000725 | 3571.68 |
3 | World | pet/oth | 1988 | 133.45 | 132.49 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 1.77 | 39.285 | NaN | 2233.75513 | NaN | 0.001758 | 8661.24 |
4 | World | nuclear | 1988 | 19.23 | 19.23 | 42106.6 | 4927545.08 | 70.13 | 8.21 | 1.77 | 39.285 | NaN | 2233.75513 | NaN | 0.000000 | 0.00 |
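Moving the target columns to the end, as done above by splitting and concatenating, can also be sketched as a plain column reordering. The frame `toy` below is a hypothetical stand-in with only a few of the real columns.

```python
import pandas as pd

# Toy frame with the target columns in the middle (values taken from the World 1988 row).
toy = pd.DataFrame({
    "Country": ["World"],
    "CO2_emission": [21163.84],
    "Year": [1988],
    "emission_per_cap": [0.004295],
})

# Keep all other columns in their current order, then append the targets.
targets = ["emission_per_cap", "CO2_emission"]
features = [col for col in toy.columns if col not in targets]
toy = toy[features + targets]
print(list(toy.columns))
```

Selecting columns by a list keeps the same DataFrame index and avoids deleting and rebuilding the frame.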
"""
Extract & Save Data as CSV
Uncomment to run
"""
# df.to_csv('data/Our_CO2emission_Analysis_Data.csv')
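The base file loaded at the start of this notebook carried an `Unnamed: 0` column that had to be dropped. A column like this typically appears when a DataFrame is saved with its index and read back. The sketch below shows the round trip on a toy frame using an in-memory buffer instead of a file; passing `index=False` to `to_csv` is one way to avoid the extra column.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Country": ["World"], "Year": [1988]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```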