CO2 Emission Analysis and Prediction

PART 1: Data Integration & Cleaning Notebook

AiGlass

Seeing Through Data

Overview

Carbon dioxide (CO2) is a colourless, odourless and non-poisonous gas formed by combustion of carbon and in the respiration of living organisms and is considered a greenhouse gas.

CO2 emissions from the burning of fossil fuels are the primary cause of global warming, one of the biggest threats facing humanity today. Although other gases are also emitted, including methane, nitrous oxide, and CFCs, none compare to CO2, and we as humans are mostly to blame for this. In this analysis we focus on CO2 emissions and their effect on the world we live in, as well as some key factors and statistics that may play a role in CO2 emissions globally.

The world as we know it is becoming more modernized by the year and, with this, all the more POLLUTED.

According to official UN data:

1. Over 3 BILLION of the world’s 8 billion people are affected by degrading ecosystems
2. Pollution is responsible for some 9 MILLION premature deaths each year
3. Over 1 million plant and animal species risk extinction
4. 200 million people could be displaced EACH YEAR by climate disruption by 2050.

Our work is a continuation of the analysis done by Benjamin from Minneapolis, Minnesota, United States on Kaggle. The results of his analysis include:

CO2 Emission has been increasing throughout the time period.
Coal and Petroleum/other liquids have been the dominant energy source for this time period.
CO2 Emission has been increasing 1.71% yearly on average, and has overall increased by 68.14% over the entire time period.
As of 2019, the average CO2 emission emitted was 10.98 (MMtonnes CO2) for the year.
The top CO2 emitters over the entire time period have been China and the United States, each emitting roughly 4x or more the amount of any other country.
Throughout the time period, China and India have increased their CO2 emissions more than any other country.
Throughout the time period, former Soviet republics have had the largest decrease in CO2 emission; the United Kingdom and Germany have also decreased their emissions somewhat.
Generally speaking, the larger the population, the more CO2 the country will be likely to emit.
The larger the GDP, the more likely the country will have a high CO2 emission.
The larger the Energy Consumption of a country, the larger the CO2 emission.
A high or low Energy Intensity by GDP or Energy Intensity per capita isn't necessarily predictive of a large CO2 emission, but generally speaking the lower it is the better (more energy conserved means less CO2 emitted).

The dataset broadly categorizes all emitters (transportation, lifestyle, industry, etc.) into one total amount for each energy type.

This notebook looks to further that analysis and to build several Machine Learning models that can accurately predict CO2 emission based on several parameters.

Warning: We are not climate scientists, so some things may be inaccurate. This is simply a study of a subject we are interested in, allowing us to go deeper into the subject while at the same time improving our graphing skills. All our sources are at the bottom of the notebook.

Table of Contents

1. Importing Packages

2. Loading Data

3. Integrating Additional Data

4. Extract Integrated Data

1. Importing Packages

Back to Table of Contents

⚡ Description: Importing Packages ⚡
In this section we import the libraries used throughout our analysis and modelling, which let us call functions that are not part of the core Python language, and briefly discuss them.

In [1]:

# Libraries for Analysis
import numpy as np
import pandas as pd

# Mute warnings
import warnings
warnings.filterwarnings('ignore')

2. Loading the Data

Back to Table of Contents

⚡ Description: Loading the data ⚡
In this section we will be loading the data from the CSV and Excel files into pandas DataFrames.

In [2]:

# Load Base Data
df = pd.read_csv("data/Our_CO2emission_Clean_Data.csv")

In [3]:

# View first 5 rows of Loaded Base Data
df.head()

Out[3]:

	Unnamed: 0	Country	e_type	Year	e_con	e_prod	GDP	Population	ei_capita	ei_gdp	CO2_emission
0	0	World	all	1988	345.56	347.41	42106.6	4927545.08	70.13	8.21	21163.84
1	1	World	coal	1988	96.87	98.48	42106.6	4927545.08	70.13	8.21	8930.92
2	2	World	nat_gas	1988	71.01	71.85	42106.6	4927545.08	70.13	8.21	3571.68
3	3	World	pet/oth	1988	133.45	132.49	42106.6	4927545.08	70.13	8.21	8661.24
4	4	World	nuclear	1988	19.23	19.23	42106.6	4927545.08	70.13	8.21	0.00

In [4]:

# Drop the leftover index column ('Unnamed: 0')
df = df.drop('Unnamed: 0', axis=1)

Column descriptions (the short column names shown in the output above are given in parentheses where they differ):

Country - Country in question
Energy_type (e_type) - Type of energy source
Year - Year the data was recorded
Energy_consumption (e_con) - Amount of consumption for the specific energy source, measured (quad Btu)
Energy_production (e_prod) - Amount of production for the specific energy source, measured (quad Btu)
GDP - Country's GDP at purchasing power parities, measured (Billion 2015$ PPP)
Population - Population of specific country, measured (Mperson)
Energy_intensity_per_capita (ei_capita) - Energy intensity is a measure of the energy inefficiency of an economy. It is calculated as units of energy per unit of capita (capita = individual person), measured (MMBtu/person)
Energy_intensity_by_GDP (ei_gdp) - Energy intensity is a measure of the energy inefficiency of an economy. It is calculated as units of energy per unit of GDP, measured (1000 Btu/2015$ GDP PPP)
CO2_emission - The amount of CO2 emitted, measured (MMtonnes CO2)

It will also be exciting to see how we can enrich the dataset with extra features. Hence, we will be adding the following datasets:

Rate of population change - To see whether a change in the population of a place results in a change in CO2 emission, and to what extent
Population density - Does the density of a population have any effect on CO2 emission?
GDP splits - For example, % for agriculture vs manufacturing. Hypothetically, a GDP increase due to agricultural/green activities should counteract the direct correlation between rising GDP and CO2 emission
Rate of Deforestation - A result of our research into why world CO2 emission dipped in 2009 and rose suddenly in 2010 while energy type, population, and GDP were constant.

From REUTERS, "Carbon emissions dip in 2009, to jump in 2010 - report":

“The real surprise was that we were expecting a bigger dip due to the financial crisis in terms of fossil fuel emissions,” said Pep Canadell, executive director of the Global Carbon Project and one of the co-authors of the study published in the latest issue of the journal Nature Geoscience. Scientists say rising levels of CO2, the main greenhouse gas, from burning fossil fuels and deforestation is heating up the planet.

So we had BURNING FOSSIL FUEL covered, but not the impact of DEFORESTATION.

From "Measuring Carbon Emissions from Tropical Deforestation: An Overview": Tropical deforestation contributes about 20% of annual global greenhouse gas (GHG) emissions and reducing it will be necessary to avoid dangerous climate change. China and the US are the world’s number one and two emitters, but numbers three and four are Indonesia and Brazil, with ~80% and ~70% of their emissions respectively from deforestation.

Emission per Capita - To probe the theory that a unit increase in population directly increases CO2 emission, we opted to add a column representing per capita emission for each country per energy type. This will be plotted against CO2 emission, and the resulting graph compared with the graph of the highest-emitting countries/populations. The idea is that if the two comparisons agree, our hypothesis that an increase in population is directly proportional to an increase in CO2 emission is supported; if not, it will be modified with an extra clause.

In [5]:

# Load Population Growth (Rate of population change)
pop_df = pd.read_csv("data/Population_Growth_from_world_Bank_Integrate.csv")
# Load Population Density per Country
den_df = pd.read_excel('data/Population_Density_per_country_data.xls')
# Load Manufacturing GDP Contribution (GDP splits)
mgdp_df = pd.read_excel('data/GDP_split_Manufacturing_contribution_data_per_Country.xls')
# Load Agri GDP Contribution (GDP splits)
agdp_df = pd.read_excel('data/GDP_split_Agricultural_contribution_data_per_Country.xls')
# Load Deforestation Impact per country 
forest_df = pd.read_csv("data/Deforestation_data.csv") # Forest area (% of land area)
land_df = pd.read_csv("data/Land_Area_Data.csv") # Land Area (sq. km)
# Compute Emission per Capita (derived from existing columns, not loaded; recomputed in section 3.5)
df['emission_per_cap'] = df['CO2_emission']/df['Population']

In [6]:

# Check
pop_df.head(2)

Out[6]:

	Country Name	1988	1989	1990	1991	1992	1993	1994	1995	1996	...	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
0	Aruba	-1.246457	-0.063879	1.816830	3.898739	5.446052	6.048669	5.644930	4.610156	3.531110	...	0.503385	0.583290	0.590508	0.541048	0.502860	0.471874	0.459266	0.437415	0.428017	NaN
1	Africa Eastern and Southern	2.987172	2.956405	2.913059	2.871078	2.832013	2.791294	2.751374	2.710420	2.673851	...	2.763426	2.761496	2.750400	2.732598	2.712218	2.690902	2.665620	2.636666	2.605427	NaN

2 rows × 35 columns

In [7]:

# Check
forest_df.head(1)

Out[7]:

	Country Name	Indicator Name	1988	1989	1990	1991	1992	1993	1994	1995	...	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
0	Aruba	Forest area (% of land area)	NaN	NaN	2.333333	2.333333	2.333333	2.333333	2.333333	2.333333	...	2.333333	2.333333	2.333333	2.333333	2.333333	2.333333	2.333333	2.333333	2.333333	NaN

1 rows × 36 columns

OBSERVATION: After careful study, we observed that several countries have either been integrated into another country or had their names changed, which tends to produce lots of missing values. See the table below.

Country Name in Data	Current Name	Replacement Name	Special Case
Burma	Myanmar	Myanmar
Congo-Brazzaville	Republic of the Congo	Congo, Rep.
Congo-Kinshasa	Democratic Republic of the Congo	Congo, Dem. Rep.
Côte d’Ivoire	---	Cote d'Ivoire
Guadeloupe	overseas département and overseas region of FRANCE	DROP
Laos	Lao People's Democratic Republic	Lao PDR
Macau	special administrative region CHINA	Macao SAR, China
Martinique	Island and overseas territorial collectivity of FRANCE	DROP
North Korea	Korea, Dem. People's Rep.	Korea, Dem. People's Rep.	Lump Together North & South Korea or DROP
Reunion	Réunion La Réunion (French)	DROP
Saint Kitts and Nevis	Federation of Saint Christopher and Nevis	St. Kitts and Nevis
Saint Lucia	---	St. Lucia
Saint Vincent/Grenadines	---	St. Vincent and the Grenadines
South Korea	Republic of Korea	---	Lump Together North & South Korea or DROP
Taiwan	Republic of China (ROC)	DROP
The Bahamas	---	Bahamas, The
Kyrgyzstan	Kyrgyz Republic	Kyrgyz Republic
Slovakia	Slovak Republic	Slovak Republic
Palestinian Territories	Israel	West Bank and Gaza
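
Mismatches like the ones in the table above can also be listed programmatically rather than by inspection. The following is a minimal sketch, assuming an exact string comparison between the two name columns; the helper name unmatched_countries and the miniature Series are hypothetical stand-ins for df['Country'] and pop_df['Country Name'].

```python
import pandas as pd


def unmatched_countries(base_names, reference_names):
    """Return the sorted names in base_names with no exact match in reference_names."""
    return sorted(set(base_names) - set(reference_names))


# Hypothetical miniature stand-ins for df['Country'] and pop_df['Country Name']
base = pd.Series(["Burma", "Laos", "Aruba", "Burma"])
reference = pd.Series(["Myanmar", "Lao PDR", "Aruba"])

missing = unmatched_countries(base, reference)
print(missing)
```

An exact comparison is stricter than the substring matching used by the lookup functions later in this notebook, so it may also flag names that those functions would still find.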

So let's proceed to making these changes...

In [8]:

# These territories have been merged into major countries already present
# in our dataset and won't be present in the datasets supplying the
# additional features, hence we delete them.
# Drop rows of countries: Guadeloupe, Martinique, Reunion, Taiwan
df = df[~df['Country'].isin(['Guadeloupe', 'Martinique', 'Reunion', 'Taiwan'])]

# Declare replacement names as dict
replace_values = {'Burma': 'Myanmar',
                  'Congo-Brazzaville': 'Congo, Rep.',
                  'Congo-Kinshasa': 'Congo, Dem. Rep.',
                  "Côte d’Ivoire": "Cote d'Ivoire",
                  'Laos': 'Lao PDR',
                  'Macau': 'Macao SAR, China',
                  'Saint Kitts and Nevis': 'St. Kitts and Nevis',
                  'Saint Lucia': 'St. Lucia',
                  'Saint Vincent/Grenadines': 'St. Vincent and the Grenadines',
                  'The Bahamas': 'Bahamas, The',
                  'Kyrgyzstan': 'Kyrgyz Republic',
                  'Slovakia': 'Slovak Republic',
                  'Palestinian Territories': 'West Bank and Gaza'
                 }
# Apply replacement names to the Country column only
df = df.replace({"Country": replace_values})

In [9]:

# Check
df[df['Country'] == 'Myanmar'].head(1)

Out[9]:

	Country	e_type	Year	e_con	e_prod	GDP	Population	ei_capita	ei_gdp	CO2_emission	emission_per_cap
150	Myanmar	all	1988	0.09	0.08	26.35	40085.6	2.13	3.23	4.83	0.00012

3. Integrating Additional Data

Back to Table of Contents

⚡ Description: Integrating additional data ⚡
In this section, we will be integrating and engineering features from areas that may prove viable in our analysis of CO2 emission.

Hence let's proceed to integrating additional features.

3.1 Integrating Population Growth to Base DF

In [10]:

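# Defining function that integrates pop_growth
def add_pop_growth(row):
    # Substring (regex) match on the country name; the first matching row is used
    val = pop_df.loc[pop_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    # np.nan replaces the np.NaN alias, which NumPy 2.0 removed
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan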

In [11]:

# Applying Function
df['pop_growth'] = df.apply(add_pop_growth, axis=1)
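
The row-wise lookup above calls str.contains once per row of df, which can be slow on large frames and matches any name that contains the search string. As a hedged alternative (not what this notebook runs), the sketch below reshapes a wide table to long form with melt and left-joins it on exact Country and Year. The helper name attach_indicator and the toy frames are hypothetical; the Aruba values are copied from the preview of pop_df.

```python
import pandas as pd


def attach_indicator(base, wide, new_col):
    """Left-join a wide table (one column per year) onto base by Country and Year."""
    # Year columns are the ones whose labels are purely digits, e.g. "1988"
    year_cols = [c for c in wide.columns if str(c).isdigit()]
    long = wide.melt(id_vars="Country Name", value_vars=year_cols,
                     var_name="Year", value_name=new_col)
    long["Year"] = long["Year"].astype(int)
    long = long.rename(columns={"Country Name": "Country"})
    long[new_col] = long[new_col].round(3)
    # how="left" keeps every row of base; unmatched rows get NaN
    return base.merge(long, on=["Country", "Year"], how="left")


# Hypothetical toy frames; "Atlantis" stands in for a name with no match
base = pd.DataFrame({"Country": ["Aruba", "Aruba", "Atlantis"],
                     "Year": [1988, 1989, 1988]})
wide = pd.DataFrame({"Country Name": ["Aruba"],
                     "1988": [-1.246457], "1989": [-0.063879]})

out = attach_indicator(base, wide, "pop_growth")
print(out)
```

Because the join is exact, any names that currently match only as substrings would need entries in replace_values before switching to this approach.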

In [12]:

# Check for missing values
df[df["pop_growth"].isnull()]["Country"].unique()

Out[12]:

array(['North Korea', 'South Korea', 'New Zealand', 'Kuwait', 'Eritrea'],
      dtype=object)

North Korea & South Korea need to be collapsed into one country called Korea, Dem. People's Rep., while for 'New Zealand', 'Kuwait' and 'Eritrea' the missing values will have to be dealt with conventionally.

Country	Years with Missing Values	Action
New Zealand	1991 only	Fill with mean/median
Kuwait	1992 to 1995	Fill with mean/median
Eritrea	2012 to 2019	Fill with mean/median
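
The "Fill with mean/median" action in the table above could be carried out per country, so that each gap is filled from that country's own observed years. This is a minimal sketch under that assumption; the toy values and the pop_growth_filled column name are hypothetical, and only the gap pattern mirrors the table.

```python
import numpy as np
import pandas as pd

# Hypothetical values; "A" and "B" stand in for real countries
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year": [1990, 1991, 1992, 1993] * 2,
    "pop_growth": [1.0, np.nan, 3.0, 2.0, 0.5, 0.5, np.nan, 2.0],
})

# Median of each country's observed values, broadcast back to every row
country_median = toy.groupby("Country")["pop_growth"].transform("median")
# Only the missing entries are replaced; observed values stay untouched
toy["pop_growth_filled"] = toy["pop_growth"].fillna(country_median)
print(toy)
```

Swapping "median" for "mean" gives the mean-based variant. For a single-year gap, interpolating between the neighbouring years may be a reasonable alternative.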

3.2 Integrating Population Density to Modified DF

In [13]:

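# Defining function that integrates pop_density
def add_pop_den(row):
    # Substring (regex) match on the country name; the first matching row is used
    val = den_df.loc[den_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    # np.nan replaces the np.NaN alias, which NumPy 2.0 removed
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan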

In [14]:

# Applying Function
df['pop_density'] = df.apply(add_pop_den, axis=1)

In [15]:

# Check for missing values
df[df["pop_density"].isnull()]["Country"].unique()

Out[15]:

array(['Luxembourg', 'North Korea', 'South Korea', 'Kuwait', 'Kosovo'],
      dtype=object)

So we see a couple of additional missing values in the population density column, for the countries 'Luxembourg', 'North Korea', 'South Korea', 'Kuwait' and 'Kosovo'.

We will look to treat these later.

3.3 Integrating GDP Split (Agric & Manuf) to Modified DF

The calculation of a country's GDP encompasses all private and public consumption, government outlays, investments, additions to private inventories, paid-in construction costs, and the foreign balance of trade.

We will be focusing on just the GDP contributions of the Manufacturing & Agricultural industries per country per year.

In [16]:

# Defining function that integrates Manufacturing GDP contribution (% of GDP)
def add_gdp_manu(row):
    # Substring (regex) match on the country name; the first matching row is used
    val = mgdp_df.loc[mgdp_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    # np.nan replaces the np.NaN alias, which NumPy 2.0 removed
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan

# Defining function that integrates Agriculture GDP contribution (% of GDP)
def add_gdp_agri(row):
    val = agdp_df.loc[agdp_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan

In [17]:

# Applying Function Respectively
df['manuf_GDP'] = df.apply(add_gdp_manu, axis=1)
df['agri_GDP'] = df.apply(add_gdp_agri, axis=1)

It's also worth noting that manuf_GDP & agri_GDP are percentage contributions to the overall GDP, hence we'll have to convert them to absolute values for computation.

In [18]:

df['Manuf_GDP'] = (df['manuf_GDP']/100)*df['GDP']
df['Agric_GDP'] = (df['agri_GDP']/100)*df['GDP']
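
As a worked check of the conversion above, the World row for 1988 can be reproduced by hand. The GDP figure is taken from the data preview; the 5.305% agricultural share is an assumption, back-calculated from the Agric_GDP value in the final table of this notebook, since the percentage column itself is never displayed.

```python
gdp = 42106.6           # World GDP, 1988 (Billion 2015$ PPP), from the data preview
agri_share_pct = 5.305  # assumed: back-calculated from the displayed Agric_GDP value

# Same formula as the cell above: percentage share -> absolute contribution
agric_gdp = (agri_share_pct / 100) * gdp
print(round(agric_gdp, 5))
```

The result should agree with the Agric_GDP value of 2233.75513 shown for World in 1988, up to floating-point rounding.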

3.4 Integrating Deforestation Data to Modified DF

In [19]:

# Defining functions that integrate Forest Area (% of land area) & Land Area (sq. km) data
def add_forest(row):
    # Substring (regex) match on the country name; the first matching row is used
    val = forest_df.loc[forest_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    # np.nan replaces the np.NaN alias, which NumPy 2.0 removed
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan

def add_land(row):
    val = land_df.loc[land_df["Country Name"].str.contains(row['Country']), str(row['Year'])]
    return round(float(val.iloc[0]), 3) if len(val) > 0 else np.nan

In [20]:

# Applying Function Respectively
df['Forest'] = df.apply(add_forest, axis=1)
df['Land'] = df.apply(add_land, axis=1)
# Get exact forest area in sq. km (Forest is % of land area, Land is in sq. km)
df['Deforestation'] = (df['Forest']/100)*df['Land']

# Note: we cleaned all land info before 1990 to avoid errors, since Forest data starts from 1990
# Drop redundant Forest & Land columns
df = df.drop(['Forest', 'Land'], axis=1)

The forest data begins in 1990, hence we will see missing values across all countries for the years 1988 & 1989, and possibly a few others within the dataset.
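
Note that the Deforestation column computed above holds forest area (forest % of land area times land area), not a rate of loss. If a year-over-year change is wanted later, one possible approach is a per-country difference. This is only a sketch under that assumption: the toy values and the forest_change column are hypothetical, and the real df would first need reducing to one row per Country and Year, since it holds one row per energy type.

```python
import pandas as pd

# Hypothetical forest areas (sq. km); one row per country and year
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 1991, 1992, 1990, 1991, 1992],
    "forest_area": [1000.0, 990.0, 975.0, 500.0, 500.0, 510.0],
})

toy = toy.sort_values(["Country", "Year"]).reset_index(drop=True)
# Difference from the previous year within each country;
# negative values mean forest was lost, the first year of each country is NaN
toy["forest_change"] = toy.groupby("Country")["forest_area"].diff()
print(toy)
```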

3.5 Extracting Emission per Capita

This refers to the per capita/person emission for each country per energy type

In [21]:

# Adding the emission_per_cap column
df['emission_per_cap'] = df['CO2_emission']/df['Population']
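
As a quick sanity check, the same division can be applied by hand to a few rows whose CO2_emission and Population values appear in the outputs of this notebook. The labels and the rows list are only illustrative scaffolding; the result is expressed in CO2_emission units per Population unit.

```python
rows = [
    # (label, CO2_emission, Population) copied from the displayed outputs
    ("World, all, 1988", 21163.84, 4927545.08),
    ("World, coal, 1988", 8930.92, 4927545.08),
    ("Myanmar, all, 1988", 4.83, 40085.6),
]

# Same formula as the cell above, applied row by row
per_cap = {label: co2 / pop for label, co2, pop in rows}
for label, value in per_cap.items():
    print(f"{label}: {value:.6f}")
```

The printed values should line up with the emission_per_cap column shown for the same rows, up to display rounding.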

In [22]:

# Let's reposition our target variable
# CO2 Emission to the end of our DataFrame
target_cols = ['emission_per_cap', 'CO2_emission']
# Separate other features from the target variable
feature_cols = [col for col in df.columns if col not in target_cols]
# Reorder: features first, then emission_per_cap and CO2_emission
df = df[feature_cols + target_cols]
# Dropping % version of agric & manufac GDP
df = df.drop(['manuf_GDP', 'agri_GDP'], axis=1)

In [23]:

df.head()

Out[23]:

	Country	e_type	Year	e_con	e_prod	GDP	Population	ei_capita	ei_gdp	pop_growth	pop_density	Manuf_GDP	Agric_GDP	Deforestation	emission_per_cap	CO2_emission
0	World	all	1988	345.56	347.41	42106.6	4927545.08	70.13	8.21	1.77	39.285	NaN	2233.75513	NaN	0.004295	21163.84
1	World	coal	1988	96.87	98.48	42106.6	4927545.08	70.13	8.21	1.77	39.285	NaN	2233.75513	NaN	0.001812	8930.92
2	World	nat_gas	1988	71.01	71.85	42106.6	4927545.08	70.13	8.21	1.77	39.285	NaN	2233.75513	NaN	0.000725	3571.68
3	World	pet/oth	1988	133.45	132.49	42106.6	4927545.08	70.13	8.21	1.77	39.285	NaN	2233.75513	NaN	0.001758	8661.24
4	World	nuclear	1988	19.23	19.23	42106.6	4927545.08	70.13	8.21	1.77	39.285	NaN	2233.75513	NaN	0.000000	0.00

4. Extract Integrated Data

Back to Table of Contents

In [24]:

# Extract & save data as CSV
# Uncomment the line below to run

# df.to_csv('data/Our_CO2emission_Analysis_Data.csv')
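
A side note on the 'Unnamed: 0' column dropped in section 2: pandas writes the index as an unnamed first column unless index=False is passed to to_csv. The sketch below illustrates this with a hypothetical toy frame and an in-memory round trip instead of a real file; whether to pass index=False here depends on what Part 2 expects.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({"Country": ["World", "Myanmar"], "Year": [1988, 1988]})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```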

Kindly proceed to Notebook PART 2 for further feature engineering and exploratory analysis