AIM: To practice linear regression using Python code.
DATA: The dataset contains yearly averages of the monthly mean carbon dioxide concentration, globally averaged over marine surface sites, for the span 1980-2020.
Data Source: National Oceanic and Atmospheric Administration (NOAA) https://gml.noaa.gov/ccgg/trends/global.html
# Import the necessary Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn import metrics
import sklearn
import statistics
print("Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression")
Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression
#Read the Dataset
df=pd.read_csv('global-atm-co2.csv')
# Inspect the basics of the dataset
print(df.head(10)) # display the first 10 entries
print(df.shape) # display the dimensions of the dataset (rows and columns)
print(df.columns.values) # display the column names
df.info() # display data types and memory usage
   year  average_co2_concentrations
0  1980                  338.911667
1  1981                  340.105000
2  1982                  340.856667
3  1983                  342.530833
4  1984                  344.074167
5  1985                  345.544167
6  1986                  346.965833
7  1987                  348.674167
8  1988                  351.159167
9  1989                  352.782500
(42, 2)
['year' 'average_co2_concentrations']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 2 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   year                        42 non-null     int64
 1   average_co2_concentrations  42 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 800.0 bytes
# Scatter plot: plot the yearly average_co2_concentrations variable against year
df.plot.scatter(x="year", y="average_co2_concentrations")
plt.xlabel('Year')
plt.ylabel('Global Average CO2 Concentrations (ppm)')
plt.title('Yearly Global Average CO2 Concentrations in parts per million (ppm)')
plt.show()
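Let us fit a straight line to the data. The equation of a line is y = b0 + b1*x, where b0 is the y-intercept and b1 is the slope. The least-squares fit chooses b0 and b1 so that the sum of squared vertical distances between the data points and the line is as small as possible.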
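To make the line equation concrete before applying it to the CO2 data, here is a minimal sketch of the closed-form least-squares estimates on a tiny made-up dataset. The arrays x_toy and y_toy are hypothetical and are not part of the NOAA data; the points are chosen to lie exactly on y = 1 + 2x so the expected answer is known in advance. The result is cross-checked against np.polyfit, which for degree 1 returns the slope followed by the intercept.

```python
import numpy as np

# Toy data (hypothetical, not from the NOAA dataset): points exactly on y = 1 + 2x
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 1.0 + 2.0 * x_toy

# Closed-form least-squares estimates of slope and intercept
x_mean, y_mean = x_toy.mean(), y_toy.mean()
ss_xy = np.sum((x_toy - x_mean) * (y_toy - y_mean))  # cross-deviation of x and y
ss_xx = np.sum((x_toy - x_mean) ** 2)                # squared deviation about x
b1_toy = ss_xy / ss_xx
b0_toy = y_mean - b1_toy * x_mean

# Cross-check with NumPy's degree-1 polynomial fit: returns [slope, intercept]
slope_np, intercept_np = np.polyfit(x_toy, y_toy, 1)

print("closed form:", b0_toy, b1_toy)
print("np.polyfit :", intercept_np, slope_np)
```

Both approaches should agree up to floating-point rounding, which is a quick sanity check worth running before trusting a hand-written formula on real data.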
# Use NumPy to convert the DataFrame columns to NumPy arrays, which are used in the steps that follow.
x = df['year'].to_numpy()
y = df['average_co2_concentrations'].to_numpy()
n = np.size(x) # number of observations/points
# Function: calculate the regression coefficients, where b0 is the y-intercept and b1 is the slope of the regression line b0 + b1*x
def estimate_coef(x, y):
    n = np.size(x)  # number of observations/points
    # mean of the x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation of x and y, and squared deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

# Function: plot the scatter plot and the regression line given by the estimated coefficients
def plot_regression_line(x, y, b):
    # plot the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plot the regression line
    plt.plot(x, y_pred, color="g")
    # label and render the figure
    plt.xlabel('Year')
    plt.ylabel('Global Average CO2 Concentrations (ppm)')
    plt.title('Yearly Global Average CO2 Concentrations in parts per million (ppm) and Linear Regression')
    plt.show()

# Function: calculate the RMSE (root mean squared error) and its normalized value
def rmse(x, y, b):
    # predicted response vector
    predict = b[0] + b[1]*x
    mse = sklearn.metrics.mean_squared_error(y, predict)
    root_mse = math.sqrt(mse)  # RMSE value
    nrmse = root_mse/statistics.mean(y)  # RMSE normalized by the mean of y
    return (root_mse, nrmse)

# Function: call the functions above in order
def main(x, y):
    # Estimate the regression coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients of the line y = b0 + b1*x are:\nb0 = {} \nb1 = {}".format(b[0], b[1]))
    # Calculate the RMSE and the normalized RMSE
    residual_error = rmse(x, y, b)
    print("RMSE VALUE is", residual_error[0])
    print("Normalized RMSE VALUE is", residual_error[1])
    # Plot the regression line
    plot_regression_line(x, y, b)

# Call the main function
if __name__ == "__main__":
    main(x, y)
#EoF
Estimated coefficients of the line y = b0 + b1*x are:
b0 = -3291.4041041406904
b1 = 1.8315106823366656
RMSE VALUE is 2.1644921537661395
Normalized RMSE VALUE is 0.005810202214397372
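As a way of reading these numbers, the sketch below plugs the reported coefficients back into y = b0 + b1*x and compares the fitted value for 1980 with the observed value from the dataset preview. The helper name predict_co2 is hypothetical and introduced only for illustration. The slope b1 can be read as an average rise of roughly 1.83 ppm per year over the fitted period.

```python
# Coefficients copied from the printed output above
b0 = -3291.4041041406904
b1 = 1.8315106823366656

def predict_co2(year):
    """Evaluate the fitted line y = b0 + b1*x at a given year (hypothetical helper)."""
    return b0 + b1 * year

fitted_1980 = predict_co2(1980)
observed_1980 = 338.911667  # first row of the dataset preview
residual_1980 = observed_1980 - fitted_1980  # positive: the point lies above the line

print("fitted 1980  :", round(fitted_1980, 2))
print("residual 1980:", round(residual_1980, 2))
```

The intercept b0 on its own has no physical meaning here, because it is the value of the line at year 0, far outside the range of the data.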
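Root Mean Square Error (RMSE) is the square root of the mean of the squared residuals (prediction errors). A residual is the vertical distance between a data point and the regression line. For a least-squares line with an intercept the residuals average to zero, so the RMSE equals their standard deviation and measures how spread out they are: the smaller the RMSE, the more tightly the data cluster around the line of best fit. The RMSE is in the same units as y (here ppm); the normalized RMSE divides it by the mean of y, which gives a unitless value of about 0.58% for this fit.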
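The RMSE definition can also be written directly in NumPy. The following is a minimal sketch on made-up numbers, not the notebook's own rmse function; the name rmse_numpy and the toy arrays are assumptions chosen so that every residual is +0.5 or -0.5 and the result is easy to verify by hand.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Square root of the mean squared residual (hypothetical helper)."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Toy values (hypothetical): each prediction is off by exactly 0.5
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

value = rmse_numpy(y_true, y_pred)   # sqrt(mean(0.25)) = 0.5
nrmse = value / y_true.mean()        # normalized by the mean of the observations

print("RMSE:", value)
print("Normalized RMSE:", nrmse)
```

Normalizing by the mean is only one convention; dividing by the range or the standard deviation of y is also common, so it helps to state which one is used when reporting the value.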