Notebook

Bike Rental Prediction - Project using R¶

Bike rental prediction at its core represents an advanced application of predictive analytics and machine learning, employing a robust Random Forest model to forecast bicycle rental demand with unparalleled precision. This sophisticated model goes beyond traditional approaches by meticulously analyzing an array of factors, including seasonal patterns, weather conditions, and temporal trends, to provide nuanced insights into user behavior and rental dynamics.

By harnessing the power of the Random Forest algorithm, known for its ensemble of decision trees and enhanced accuracy, the predictive model enables rental service providers to make data-driven decisions. This includes optimizing inventory levels, tailoring pricing strategies, and streamlining operational processes. The Random Forest model excels at capturing complex relationships within the data, ensuring a more accurate prediction of bike rental counts.

This predictive tool serves as a strategic asset, not only anticipating demand fluctuations but also acting as a catalyst for informed decision-making. It empowers businesses to proactively adapt to changing market conditions, enhance resource allocation, and deliver an exceptional and responsive rental experience.

In summary, the bike rental prediction model, driven by the Random Forest algorithm, is a powerful and sophisticated solution that transforms data into actionable insights, fostering operational efficiency and elevating customer satisfaction in the dynamic landscape of bike rentals.

bike rental prediction-project image.jpg

Objectives¶

To Perform exploratory data analysis and visualize the data to understand the environmental and seasonal settings.
Predict bike rental counts based on environmental and seasonal settings with the help of a machine learning algorithm.

Table of Contents¶

Section 1: Loading Libraries and datasets

Loading Relevant Libraries
Addition of File to R Program/R Studio
Exploration of Dataset

Section 2: EDA - Exploratory Data Analysis

Renaming and Type Conversion of Attributes
Typecasting Datetime and Numerical Attributes to Category
Missing Value Analysis
Visualization of Numerical Variables through Pairplot
Exploring Bike Rental Distribution Using Histogram
Histogram of Target Variable - "Bike Rental Count"
Log Transformation of Bike Rentals and Visualization Using Histogram and Density Plot
Correlogram of All Variables Using ggpairs
Analysis of Dataset Focusing on Bike Rental Count Using 'explore' Package
Monthly Distribution of Bike Rental Counts
- Season-wise Monthly Distribution of Bike Rental Counts
- Weekday-wise Monthly Distribution of Bike Rental Counts
Bike Rentals Counts by Seasonly Distribution
- Boxplot to Visualize Bike Rentals by Season
- Violin Plot for Yearly Distribution of Counts
Exploring Bike Rentals During Holidays
Exploration of Working Day-wise Distribution of Counts
- Column Plot for Working Day-wise Distribution of Counts
Impact of Weather Conditions on Bike Rentals
- Column Plot for Weather Condition-wise Distribution of Counts
Temperature Analysis
- Combined Temperature Analysis for Temperature and Apparent Temperature
- Scatter Plot for Bike Rentals Against Temperature and Apparent Temperature in Celsius

Section 3: Outlier Analysis

Boxplot for Bike Rental Count with Outliers
Boxplots for Outliers in Temperature, Feel-like Temperature, Humidity, and Windspeed
Outlier Replacement and Imputation
- Replacing and Imputing Outliers in Humidity and Windspeed
- Impute Missing Values Using Mean Imputation Method
Combining the Imputed Dataset and Original Dataset
Exploring Numerical Column for Combined Dataset
Correlation Analysis of Combined Dataset

Section 4: Training and Testing Dataset

Splitting Dataset for Training and Testing
Creating Subsets for Training and Testing Respectively

Section 5: Feature Engineering

Encoding Categorical Features for Training Dataset
Encoding Categorical Features (Test Dataset)

Section 6: Linear Regression Model

Modelling the Training Dataset for LRM
Cross Validation Prediction for LRM
Cross Validation Prediction Plot for LRM
Model Performance on Test Dataset for LRM
Prediction Analysis of Models on Test Dataset for LRM
Model Evaluation Metrics for LRM
Residual Analysis for LRM

Section 7: Decision Tree Regressor

Modelling the Training Dataset for DTR
Cross Validation Prediction for DTR
Cross Validation Prediction Plot for DTR
Model Performance on Test Dataset for DTR
Prediction Analysis of Models on Test Dataset for DTR
Model Evaluation Metrics for DTR
Residual Analysis and Plot for DTR

Section 10: Random Forest Model

Modelling the Training Dataset for RFM
Cross Validation Prediction for RFM
Cross Validation Prediction Plot for RFM
Model Performance on Test Dataset for RFM
Prediction Analysis of Models on Test Dataset for RFM
Model Evaluation Metrics for RFM
Residual Analysis and Plot for RFM

Section 11: Selecting Best Model in All Three for Further Prediction

Calculate RMSE and MAE for Each Model
Analyzing Accuracy for Each Model
Selecting Best Model

Section 12: Selecting Final Model as Random Forest Regressor for Prediction of Bike Rental Count

Combine Observed and Predicted Values
Write Predictions to a CSV File
Display the Predictions

Section 13: Conclusion

 1. Conclusion

Loading Relevent Libraries:¶

In [263]:

# Section: Loading Packages and Libraries for Visualizing Dataset and Performing ML Analysis

## Loading Important Libraries

#Before proceeding, ensure to install all the packages listed inside the `library()` function.
# Install ranger package if not already installed
# install.packages("ranger")

# ggplot2: A powerful and flexible plotting system for creating visualizations in R.
library(ggplot2)

# tidyverse: A collection of packages (including ggplot2) that provide a consistent and efficient data manipulation workflow.
library(tidyverse)

# explore: A package for exploratory data analysis, offering various visualizations and summary statistics.
library(explore)

# GGally: Extension to ggplot2, designed for creating ggplot2 visualizations with multiple plots.
library(GGally)

# ggridges: Creates ridge plots for visualizing the distribution of a numeric value by one or more categorical factors.
library(ggridges)

# Metrics: Provides various metrics for evaluating machine learning models, including RMSE and MAE.
library(Metrics)

# car: Companion to Applied Regression package, offering various regression-related functions.
library(car)

# corrgram: Package for creating correlograms to visualize correlation matrices.
library(corrgram)

# corrplot: Package for creating correlation plots.
library(corrplot)

# caret: Classification and Regression Training package, a comprehensive framework for building predictive models.
library(caret)

# randomForest: Implements random forest algorithms for classification and regression.
library(randomForest)

# ranger: Another implementation of random forest algorithms for faster performance.
library(ranger)

# DMwR2: Data Mining with R, provides functions and data sets for teaching data mining.
library(DMwR2)

# ipred: Improved Predictors package, provides bagging and bootstrapping algorithms for predictive modeling.
library(ipred)

# caTools: Provides various tools for data splitting and manipulation, often used in predictive modeling.
library(caTools)

# viridis: A colorblind-friendly color palette for data visualizations.
library(viridis)

# lubridate: Simplifies date and time handling in R.
library(lubridate)

# readxl: Package for reading Excel files.
library(readxl)

Loading dataset using readxl¶

We're utilizing the readxl library in R to read data from an Excel file into a data frame. First, we load the readxl library using the library() function. Then, we specify the file path to the Excel file we want to read and store it in the variable file_path. Finally, we use the read_excel() function, passing the file path as an argument, to read the Excel file and create a data frame named bike_df. This process enables us to access and analyze the data from the Excel file within the R environment for further manipulation and exploration.

In [264]:

library(readxl)

# Provide the correct path to your Excel file
file_path <- "/content/1657875746_day.xlsx"

# Read the Excel file into a data frame
bike_df <- read_excel(file_path)

Section : Exploring Dataset Given¶

To display the first few rows of a data frame named bike_df in R. It utilizes the head() function, which is a commonly used command to provide a glimpse of the dataset's structure and contents. By executing head(bike_df), we print the initial rows of the data frame to the console, allowing us to inspect the data quickly. Alternatively, there's a comment indicating that the head.matrix() function can also be used for the same purpose, though in this case, it's not being utilized. Overall, this code facilitates an initial exploration of the dataset, aiding in understanding its format and content. and we are doing same with the tail() exploring last few row and columns of dataset.

In [265]:

# Print the first few rows of the data frame using the `head` function
head(bike_df)

# Alternatively, you can use `head.matrix` to print the first few rows
#head.matrix(bike_df)

# Explanation:
# The `head` function is commonly used to display the first few rows of a data frame.
# It helps to quickly inspect the structure and contents of the dataset.

A tibble: 6 × 16
instant	dteday	season	yr	mnth	holiday	weekday	workingday	weathersit	temp	atemp	hum	windspeed	casual	registered	cnt
<dbl>	<dttm>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	2011-01-01	1	0	1	0	6	0	2	0.344167	0.363625	0.805833	0.1604460	331	654	985
2	2011-01-02	1	0	1	0	0	0	2	0.363478	0.353739	0.696087	0.2485390	131	670	801
3	2011-01-03	1	0	1	0	1	1	1	0.196364	0.189405	0.437273	0.2483090	120	1229	1349
4	2011-01-04	1	0	1	0	2	1	1	0.200000	0.212122	0.590435	0.1602960	108	1454	1562
5	2011-01-05	1	0	1	0	3	1	1	0.226957	0.229270	0.436957	0.1869000	82	1518	1600
6	2011-01-06	1	0	1	0	4	1	1	0.204348	0.233209	0.518261	0.0895652	88	1518	1606

The output displayed is a representation of the first 6 rows and 16 columns of a dataset stored in a tibble (a modern data frame object in R). Each row corresponds to a specific observation, while each column represents a different variable or attribute of those observations. Here's a breakdown of the columns:

instant: A numeric identifier for each observation.
dteday: The date and time of each observation.
season: Represents the season (e.g., 1 for spring, 2 for summer, etc.).
yr: Indicates the year (0 for 2011, 1 for 2012).
mnth: Represents the month of the year.
holiday: Binary indicator for whether it's a holiday (0 for no holiday, 1 for holiday).
weekday: Indicates the day of the week (0 for Sunday, 6 for Saturday).
workingday: Binary indicator for whether it's a working day (0 for non-working day, 1 for working day).
weathersit: Represents the weather situation.
temp: Normalized temperature.
atemp: Normalized feeling temperature.
hum: Normalized humidity.
windspeed: Normalized wind speed.
casual: Count of casual users.
registered: Count of registered users.
cnt: Total count of bike rentals, including both casual and registered users.

Each cell in the table contains the corresponding value of the variable for that observation. This output allows for a quick visual inspection of the dataset's structure and initial values, facilitating data exploration and understanding.

In [266]:

# Print the last few rows of the data frame using the `tail` function
#tail(bike_df)

# Alternatively, you can use `tail.matrix` to print the last few rows
tail.matrix(bike_df)

# Explanation:
# The `tail` function is commonly used to display the last few rows of a data frame.
# It helps to check the end of the dataset and ensures data has been read correctly.

A tibble: 6 × 16
instant	dteday	season	yr	mnth	holiday	weekday	workingday	weathersit	temp	atemp	hum	windspeed	casual	registered	cnt
<dbl>	<dttm>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
726	2012-12-26	1	1	12	0	3	1	3	0.243333	0.220333	0.823333	0.316546	9	432	441
727	2012-12-27	1	1	12	0	4	1	2	0.254167	0.226642	0.652917	0.350133	247	1867	2114
728	2012-12-28	1	1	12	0	5	1	2	0.253333	0.255046	0.590000	0.155471	644	2451	3095
729	2012-12-29	1	1	12	0	6	0	2	0.253333	0.242400	0.752917	0.124383	159	1182	1341
730	2012-12-30	1	1	12	0	0	0	1	0.255833	0.231700	0.483333	0.350754	364	1432	1796
731	2012-12-31	1	1	12	0	1	1	2	0.215833	0.223487	0.577500	0.154846	439	2290	2729

The dim(bike_df) function returns a vector containing the number of rows and columns in the dataset. By combining this information with a descriptive string, "Dimension of dataset: ", using paste, the code generates a human-readable output that displays the dataset's dimensions.

In [267]:

# Using the paste function to concatenate strings and print information about the dataset dimensions
# dim(bikes_df) returns a vector with the number of rows and columns in the dataset
# The concatenated string includes "Dimension of dataset: " followed by the number of rows and columns
paste("Dimension of dataset: ", dim(bike_df))

'Dimension of dataset: 731'
'Dimension of dataset: 16'

The output "Dimension of dataset: 731 16" indicates that the dataset contains 731 rows and 16 columns.

Section: Data Preprocessing¶

In this code segment, several preprocessing tasks are performed on the dataset bike_df to prepare it for exploratory data analysis (EDA). The steps involved are as follows:

Column Renaming: Initially, the columns of bike_df are renamed for better clarity and understanding of the data attributes.
Summary Statistics: Summary statistics are computed to gain insights into the dataset's distribution and characteristics, facilitating the identification of potential outliers or anomalies.
Data Type Conversion: Certain attributes undergo data type conversion to ensure consistency and suitability for analysis. For instance, the datetime column is converted to the Date format, and categorical variables are converted to factors.
Displaying Dataset Structure: The structure of the dataset is displayed using the str() function, allowing for a comprehensive understanding of variable types and distributions post-conversion.

Throughout the process, the output provides visibility into the changes made, including the updated column names, summary statistics, and the revised dataset structure, ensuring the data is appropriately prepared for subsequent analysis. Additionally, the code assesses missing values, highlighting potential data integrity issues that may require further investigation.

In [268]:

#Rename the columns
names(bike_df)<-c('record_id','datetime','season','year','month','holiday','weekday','workingday','weather_condition','temp','atemp','humidity','windspeed','casual_users','registerd_users','count')

head(bike_df)

A tibble: 6 × 16
record_id	datetime	season	year	month	holiday	weekday	workingday	weather_condition	temp	atemp	humidity	windspeed	casual_users	registerd_users	count
<dbl>	<dttm>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	2011-01-01	1	0	1	0	6	0	2	0.344167	0.363625	0.805833	0.1604460	331	654	985
2	2011-01-02	1	0	1	0	0	0	2	0.363478	0.353739	0.696087	0.2485390	131	670	801
3	2011-01-03	1	0	1	0	1	1	1	0.196364	0.189405	0.437273	0.2483090	120	1229	1349
4	2011-01-04	1	0	1	0	2	1	1	0.200000	0.212122	0.590435	0.1602960	108	1454	1562
5	2011-01-05	1	0	1	0	3	1	1	0.226957	0.229270	0.436957	0.1869000	82	1518	1600
6	2011-01-06	1	0	1	0	4	1	1	0.204348	0.233209	0.518261	0.0895652	88	1518	1606

Section : Descriptive Analysis¶

Summary() function provides a quick overview of key statistics for each variable in the dataset. The summary includes various descriptive statistics such as the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum values for numerical variables. For categorical variables, it displays the frequency of each category. This summary serves as an initial exploration step to understand the distribution, central tendency, and spread of the data across different attributes. It helps in identifying potential issues such as outliers, skewed distributions, or missing values, thereby guiding further data preprocessing and analysis tasks.

In [269]:

# Summary of the dataset using the summary() function to get a quick overview of key statistics.
summary(bike_df)

   record_id        datetime                       season           year       
 Min.   :  1.0   Min.   :2011-01-01 00:00:00   Min.   :1.000   Min.   :0.0000  
 1st Qu.:183.5   1st Qu.:2011-07-02 12:00:00   1st Qu.:2.000   1st Qu.:0.0000  
 Median :366.0   Median :2012-01-01 00:00:00   Median :3.000   Median :1.0000  
 Mean   :366.0   Mean   :2012-01-01 00:00:00   Mean   :2.497   Mean   :0.5007  
 3rd Qu.:548.5   3rd Qu.:2012-07-01 12:00:00   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :731.0   Max.   :2012-12-31 00:00:00   Max.   :4.000   Max.   :1.0000  
     month          holiday           weekday        workingday   
 Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
 1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
 Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
 Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
 3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
 Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
 weather_condition      temp             atemp            humidity     
 Min.   :1.000     Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
 1st Qu.:1.000     1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
 Median :1.000     Median :0.49833   Median :0.48673   Median :0.6267  
 Mean   :1.395     Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
 3rd Qu.:2.000     3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
 Max.   :3.000     Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
   windspeed        casual_users    registerd_users     count     
 Min.   :0.02239   Min.   :   2.0   Min.   :  20    Min.   :  22  
 1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497    1st Qu.:3152  
 Median :0.18097   Median : 713.0   Median :3662    Median :4548  
 Mean   :0.19049   Mean   : 848.2   Mean   :3656    Mean   :4504  
 3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776    3rd Qu.:5956  
 Max.   :0.50746   Max.   :3410.0   Max.   :6946    Max.   :8714

The summary output provides valuable insights into the dataset bike_df, revealing key statistics for each variable:

record_id: The dataset consists of 731 records, with IDs ranging from 1 to 731. Each record represents a unique observation.
datetime: The observations span from January 1, 2011, to December 31, 2012, indicating data collected over two years. The median date falls on January 1, 2012, suggesting an equal distribution of observations across both years.
season: The dataset comprises four seasons, with each season having approximately the same number of observations. Season 3 has the highest frequency, indicating a relatively higher occurrence of observations during that season.
year: The dataset evenly represents both years 2011 and 2012, with approximately 50% of the observations belonging to each year.
month: Observations are spread across all twelve months, with a slightly higher median month (July), indicating a relatively higher number of observations during the summer months.
holiday: Most observations occur on non-holiday days, as indicated by the low median and mean values. However, there is a small proportion of observations on holidays.
weekday: Observations are evenly distributed across weekdays, with each weekday having a similar frequency of occurrences.
workingday: Approximately 68.4% of the observations represent working days, indicating a higher frequency of data collection during typical workdays.
weather_condition: The dataset captures three weather conditions. Condition 1 occurs most frequently, suggesting predominantly good weather conditions during the data collection period.
temp, atemp, humidity, windspeed: These numerical attributes exhibit varying distributions. Temperature and apparent temperature (atemp) range from 0.059 to 0.862, indicating a wide range of temperatures recorded. Humidity ranges from 0 to 0.973, while windspeed varies from 0.022 to 0.507, suggesting different weather conditions throughout the dataset.
casual_users, registered_users, count: The number of casual and registered users, as well as the total count of bike rentals, varies widely across observations. The mean count of bike rentals is approximately 4504, with a minimum of 22 and a maximum of 8714, highlighting significant variability in rental patterns.

Date type conversion of attributes

str(bike_df) function offers a concise overview of the dataset's internal composition, showcasing the data types of each variable/column alongside a glimpse of their values. This succinct representation aids in understanding the dataset's structure, identifying any missing values, and grasping the distribution of variables.

In [270]:

# Display the structure of the dataset using the str() function to understand variable types and their distribution.
str(bike_df)

tibble [731 × 16] (S3: tbl_df/tbl/data.frame)
 $ record_id        : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
 $ datetime         : POSIXct[1:731], format: "2011-01-01" "2011-01-02" ...
 $ season           : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
 $ year             : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
 $ month            : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday          : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
 $ weekday          : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
 $ workingday       : num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
 $ weather_condition: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
 $ temp             : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
 $ atemp            : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
 $ humidity         : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed        : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
 $ casual_users     : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
 $ registerd_users  : num [1:731] 654 670 1229 1454 1518 ...
 $ count            : num [1:731] 985 801 1349 1562 1600 ...

The output indicates the structure of the dataset, presenting it as a tibble with 731 rows and 16 columns. Each column is described along with its respective data type and example values.

record_id, season, year, month, holiday, weekday, workingday, and weather_condition are numeric variables with corresponding values.
datetime is of POSIXct type, representing date and time data, with dates ranging from "2011-01-01" to "2012-12-31".
Other variables such as temp, atemp, humidity, windspeed, casual_users, registered_users, and count are also numeric, with varying ranges of values.

So Now, We will typecaste variables like we have a variable year in a dataset that represents years as numeric values (e.g., 2011, 2012), but we want to treat it as a categorical variable.

we have a variable datetime representing dates and times as character strings but want to treat them as dates.

In [271]:

#Typecasting the datetime and numerical attributes to category

# Convert 'datetime' to Date format
bike_df$datetime<- as.Date(bike_df$datetime)
bike_df$year<-as.factor(bike_df$year)
bike_df$month<-as.factor(bike_df$month)
bike_df$season <- as.factor(bike_df$season)
bike_df$holiday<- as.factor(bike_df$holiday)
bike_df$weekday<- as.factor(bike_df$weekday)
bike_df$workingday<- as.factor(bike_df$workingday)
bike_df$weather_condition<- as.factor(bike_df$weather_condition)

# Display the structure of the dataset using the str() function to understand variable types and their distribution.
str(bike_df)

tibble [731 × 16] (S3: tbl_df/tbl/data.frame)
 $ record_id        : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
 $ datetime         : Date[1:731], format: "2011-01-01" "2011-01-02" ...
 $ season           : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ year             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ month            : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weekday          : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
 $ workingday       : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
 $ weather_condition: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
 $ temp             : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
 $ atemp            : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
 $ humidity         : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed        : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
 $ casual_users     : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
 $ registerd_users  : num [1:731] 654 670 1229 1454 1518 ...
 $ count            : num [1:731] 985 801 1349 1562 1600 ...

Specifically, we convert the datetime column to Date format and the numerical attributes (year, month, season, holiday, weekday, workingday, and weather_condition) to factor variables. This conversion helps in treating these variables appropriately for analysis, ensuring they are interpreted correctly and facilitating categorical data manipulation.

Section: Missing Value ANalysis:

Missing value analysis involves identifying and quantifying missing values in a dataset, visualizing their patterns, and deciding on appropriate strategies for handling them, such as imputation or deletion, based on the analysis goals and the nature of the data.

In [272]:

#Missing values in dataset
missing_val<-data.frame(apply(bike_df,2,function(x){sum(is.na(x))}))
names(missing_val)[1]='missing_val'
missing_val

A data.frame: 16 × 1
	missing_val
	<int>
record_id	0
datetime	0
season	0
year	0
month	0
holiday	0
weekday	0
workingday	0
weather_condition	0
temp	0
atemp	0
humidity	0
windspeed	0
casual_users	0
registerd_users	0
count	0

Now, After handling missing values we don't have any missing values present in dataset.

Section : Performing EDA: Exploratory Data Analysis¶

Visualization of Numerical Variables

In [273]:

# Section: Visualization of Numerical Variables

# Visualize the relationships and distributions of key numerical variables using a pairs plot.
# This plot includes temperature ('temp'), "feels-like" temperature ('atemp'), humidity ('hum'),
# windspeed ('windspeed'), casual rentals ('casual'), registered rentals ('registered'),
# and the total number of rentals ('cnt'). The plot helps identify potential patterns, correlations,
# and outliers among these variables.

# Reduce the outer margins to avoid "figure margins too large" error
 par(mar = c(0.1, 0.1, 0.1, 0.1))

num_vars <- c("temp", "atemp", "humidity", "windspeed", "casual_users", "registerd_users", "count")
pairs(bike_df[, num_vars],
      main = "Pairs Plot of Numerical Variables",
      col = bike_df$season, # Color points based on the season for additional insights
      pch = 16)  # Use filled circles for better visibility

# Adding custom titles for each panel of the pairs plot.
for (i in 1:length(num_vars)) {
  for (j in 1:length(num_vars)) {
    panel_var1 <- num_vars[i]
    panel_var2 <- num_vars[j]
    panel_title <- paste("Relationship between", panel_var1, "and", panel_var2)
    panel_subtitle <- ifelse(i == j, "Distribution", paste("Colored by Season:", unique(bike_df$season)))
    title(panel_title, line = 2.5, cex.main = 0.8)
    title(panel_subtitle, line = 4, cex.sub = 0.6)
  }
}

# Reset the outer margins to default after creating the plot
par(mar = c(5, 4, 4, 2) + 0.1)

Section: Exploring Bike Rentals Distribution

In [274]:

# Section: Exploring Bike Rentals Distribution

# Explore the distribution of bike rentals using a histogram.
# This plot provides an overview of the distribution of total bike rentals ('cnt') per observation.
# Plotting the histogram to visualize the distribution of bike rentals.
hist(bike_df$count,
     main = "Distribution of Bike Rentals",
     xlab = "Number of Rentals",
     col = "#75AADB",  # Custom color for better visibility
     border = "#333333",  # Border color
     breaks = 30,  # Adjust the number of bins
     xlim = c(0, max(bike_df$count) + 50),  # Set x-axis limits for better readability
     las = 1,  # Keep axis labels horizontal
     cex.main = 1.5,  # Increase main title size
     cex.lab = 1.2,  # Increase axis label size
     cex.axis = 1.2)  # Increase axis tick label size

# Adding informative elements to the plot.
abline(v = mean(bike_df$count), col = "red", lwd = 2, lty = 2)  # Add a vertical line for mean
text(mean(bike_df$count) + 10, 250, "Mean", col = "red", font = 2, cex = 1.2)  # Label for the mean

# Adding additional information inside the plot
mtext(" The distribution is slightly right-skewed.", side = 1, line = 2, cex = 1.1)
mtext("The mean number of rentals is marked by a red dashed line.", side = 3, line = 3, cex = 1.1)

Histogram of Target Variable- "count"

In [275]:

# Load the required library for data visualization
library(ggplot2)


# Create a histogram to explore the distribution of the target variable 'count'.
ggplot(bike_df, aes(x = count)) +
  geom_histogram(bins = 30, colour = "black", fill = "#56B4E9") +
  ggtitle("Distribution of Bike Rentals ('count')") +
  xlab("Count Variable") + ylab("Density of the Sample") +
  theme(
    plot.title = element_text(color = "blue", size = 18, face = "bold"),
    axis.title.x = element_text(color = "blue", size = 14),
    axis.title.y = element_text(color = "blue", size = 14)
  ) +
  # Adding informative elements on the plot
  geom_vline(aes(xintercept = mean(bike_df$count)),
             color = "red", linetype = "dashed", linewidth = 1.2) +
  annotate("text", x = mean(bike_df$count) + 10, y = 30,
           label = "Mean", color = "red", size = 5)

Log Transformation of Bike Rentals

In [276]:

# Section: Log Transformation of Bike Rentals

# Create a histogram and density plot after applying log transformation to 'count'
ggplot(bike_df, aes(x = log(count))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, colour = "black", fill = "grey") +
  geom_density(alpha = 0.2, fill = "cyan") +
  ggtitle("Distribution of Log-Transformed Bike Rentals ('count')") +
  xlab("Log-Transformed Count Variable") + ylab("Density of the Sample") +
  theme(
    plot.title = element_text(color = "black", size = 18, face = "bold"),
    axis.title.x = element_text(color = "blue", size = 14),
    axis.title.y = element_text(color = "blue", size = 14)
  ) +
  # Adding additional information on the plot
  geom_vline(aes(xintercept = mean(log(bike_df$count))),
             color = "red", linetype = "dashed", linewidth = 1.2) +
  annotate("text", x = mean(log(bike_df$count)) + 0.1, y = 0.2,
           label = "Mean (log-transformed)", color = "red", size = 4) +
  geom_vline(aes(xintercept = median(log(bike_df$count))),
             color = "green", linetype = "dashed", linewidth = 1.2) +
  annotate("text", x = median(log(bike_df$count)) + 0.1, y = 0.4,
           label = "Median (log-transformed)", color = "green", size = 4) +
  # Adding insights on the plot
  annotate("text", x = 6, y = 0.6,
           label = "Insights:", color = "red", size = 4, fontface = "bold") +
  annotate("text", x = 6, y = 0.55,
           label = "1. Applying a log transformation has helped in reducing skewness.", color = "red", size = 2) +
  annotate("text", x = 6, y = 0.5,
           label = "2. The mean and median values are marked for reference.", color = "red", size = 2)

Creating Correlogram with ggpairs

In [277]:

# Section: Creating Correlogram with ggpairs

# Create a new plotting device
#dev.new()
#install.packages("viridis")
# Load necessary libraries
library(GGally)
library(viridis)

# Create a correlogram with ggpairs
ggpairs(bike_df, title = "Correlogram with ggpairs()")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In [278]:

ggcorr(bike_df, method = c("everything", "pearson"), label=TRUE) +
  ggtitle("Plot of correlation variables") +
  theme(plot.title = element_text(color="blue", size=18, face="bold"))
cat("With the correlation plot, we can see how the variables that most interfere/correlate with the target are temperature, apparent temperature (atemp), weather, year, season, and finally, wind.\n")

With the correlation plot, we can see how the variables that most interfere/correlate with the target are temperature, apparent temperature (atemp), weather, year, season, and finally, wind.

Exploring Data with 'explore' Package

In [279]:

# Load required packages
library(explore)
library(lubridate)


# Extract the year from the 'datetime' variable
bike_df$year_from_date <- year(bike_df$datetime)

# Set display format to show full numeric values
options(scipen = 999)

# Use 'explore' to analyze the dataset, focusing on the 'cnt' variable with the extracted 'year_from_date' as the target
explore(bike_df, count, target = year_from_date)

Exploring Correlation Among Numerical Variables

In [280]:

# Section: Exploring Correlation Among Numerical Variables

# Explore the correlation matrix for numerical variables.
# The matrix and the subsequent heatmap help identify strong correlations among variables.

# Calculate the correlation matrix
cor_matrix <- cor(bike_df[, c("temp", "atemp", "humidity", "windspeed", "casual_users", "registerd_users", "count")])

# Display the correlation matrix
cat("Correlation Matrix:\n")
print(cor_matrix)

Correlation Matrix:
                      temp      atemp    humidity  windspeed casual_users
temp             1.0000000  0.9917016  0.12696294 -0.1579441   0.54328466
atemp            0.9917016  1.0000000  0.13998806 -0.1836430   0.54386369
humidity         0.1269629  0.1399881  1.00000000 -0.2484891  -0.07700788
windspeed       -0.1579441 -0.1836430 -0.24848910  1.0000000  -0.16761335
casual_users     0.5432847  0.5438637 -0.07700788 -0.1676133   1.00000000
registerd_users  0.5400120  0.5441918 -0.09108860 -0.2174490   0.39528245
count            0.6274940  0.6310657 -0.10065856 -0.2345450   0.67280443
                registerd_users      count
temp                  0.5400120  0.6274940
atemp                 0.5441918  0.6310657
humidity             -0.0910886 -0.1006586
windspeed            -0.2174490 -0.2345450
casual_users          0.3952825  0.6728044
registerd_users       1.0000000  0.9455169
count                 0.9455169  1.0000000

This output is a correlation matrix, displaying the correlation coefficients between pairs of variables in the dataset. Each cell in the matrix represents the correlation coefficient between two variables, ranging from -1 to 1.

The diagonal elements represent the correlation of each variable with itself, which is always 1.
Off-diagonal elements show the correlation between pairs of variables.
A correlation coefficient close to 1 indicates a strong positive correlation, meaning that as one variable increases, the other tends to increase as well.
A correlation coefficient close to -1 indicates a strong negative correlation, implying that as one variable increases, the other tends to decrease.
A correlation coefficient close to 0 suggests no linear relationship between the variables.

In this specific output:

Variables like temp and atemp have a very high positive correlation coefficient of around 0.99, indicating they are highly correlated.
Humidity and windspeed have a negative correlation, implying that as humidity increases, windspeed tends to decrease, and vice versa.
The variable count, representing the total bike rentals, shows a strong positive correlation with casual_users and registered_users, indicating that as the number of casual or registered users increases, the total count of bike rentals tends to increase as well.

In [281]:

# Create a correlation heatmap for a visual representation of variable relationships.
# Strong positive or negative correlations are visually highlighted.
library(corrplot)
# Customize the correlation plot
corrplot(
  cor_matrix,
  method = "color",
  font = 4,
  col = colorRampPalette(c("#4575b4", "#91bfdb", "#e0f3f8", "#fee090", "#d73027"))(100),  # Attractive color scheme
  tl.col = "black",
  tl.srt = 40,
  addCoef.col = "black",
  order = "hclust",
  addrect = 3  # Highlight cells with correlations above a certain threshold
)

# Add informative elements to the plot
title("Correlation heatmap", line = 0.5, cex.main = 1.8)  # Lowercase title
mtext("Strong Positive Correlation", side = 3, line = 2, col = "#d73027", cex = 0.8)
mtext("Strong Negative Correlation", side = 1, line = 3, col = "#4575b4", cex = 0.8)

Season and Weekday Analysis

In [282]:

# Load the required library
library(dplyr)
# Plot season-wise monthly distribution of counts
season_month_plot <- ggplot(bike_df, aes(x = month, y = count, fill = season)) +
  theme_minimal() +
  geom_col(position = "dodge", color = "black", alpha = 0.8) +
  labs(x = 'Month', y = 'Total Count',
       title = 'Season-wise Monthly Distribution of Counts',
       subtitle = 'Comparison of counts across different seasons',
       fill = 'Season') +
  scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00'))  # Custom colors

# Add data labels on top of each bar
season_month_plot <- season_month_plot +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3)

# Highlight the peak month for each season
peak_months <- bike_df %>%
  group_by(season) %>%
  slice(which.max(count))

season_month_plot <- season_month_plot +
  geom_point(data = peak_months, aes(x = month, y = count), color = "red", size = 3) +
  geom_text(data = peak_months, aes(x = month, y = count, label = paste("Peak\nMonth")),
            vjust = -0.5, hjust = 1, color = "red", size = 3)

# Customize legend
season_month_plot <- season_month_plot +
  guides(fill = guide_legend(title = "Season"))

# Print the enhanced season-wise plot
print(season_month_plot)

** Plot weekday-wise monthly distribution of counts**

In [283]:

# Plot weekday-wise monthly distribution of counts
weekday_month_plot <- ggplot(bike_df, aes(x = month, y = count, fill = factor(weekday))) +
  theme_minimal() +
  geom_col(position = "dodge", color = "white", alpha = 0.8) +
  labs(x = 'Month', y = 'Total Count',
       title = 'Weekday-wise Monthly Distribution of Counts',
       subtitle = 'Comparison of counts across different weekdays',
       fill = 'Weekday') +
  scale_fill_discrete(name = 'Weekday', labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")) +  # Customize weekday labels

  # Add data labels on top of each bar
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3)

# Highlight the peak day of the week
peak_days <- bike_df %>%
  group_by(weekday) %>%
  slice_max(order_by = count)

# Customize legend
weekday_month_plot <- weekday_month_plot +
  guides(fill = guide_legend(title = "Weekday"))

# Print the enhanced weekday-wise plot
print(weekday_month_plot)

Bike Rentals by Season

In [284]:

# Section: Bike Rentals by Season

# Explore the distribution of bike rentals based on the season using a boxplot.
# This boxplot shows how the total number of rentals ('cnt') varies across different seasons.

# Create a boxplot to visualize bike rentals by season
ggplot(bike_df, aes(x = season, y = count, fill = as.factor(season))) +
  geom_boxplot() +
  ggtitle("Bike Rentals by Season") +
  xlab("Season") + ylab("Number of Rentals") +
  scale_fill_manual(values = c("#FF6666", "#FFCC66", "#99FF99", "#6666FF"), name = "Season") +
  theme_minimal() +
  theme(
    plot.title = element_text(color = "blue", size = 18, face = "bold"),
    axis.title.x = element_text(color = "blue", size = 14),
    axis.title.y = element_text(color = "blue", size = 14),
    legend.position = "top",
    legend.title = element_text(color = "blue", size = 14)
  ) +
  # Adding additional information on the plot
  annotate("text", x = c(1, 2, 3, 4), y = 6000,
           label = c("Spring", "Summer", "Fall", "Winter"),
           color = "blue", size = 4, fontface = "bold") +
  annotate("text", x = 1, y = 7000, label = "Median", color = "red", size = 4) +
  annotate("text", x = 2, y = 7000, label = "Median", color = "red", size = 4) +
  annotate("text", x = 3, y = 7000, label = "Median", color = "red", size = 4) +
  annotate("text", x = 4, y = 7000, label = "Median", color = "red", size = 4) +
  geom_hline(yintercept = median(bike_df$count), linetype = "dashed", color = "red", linewidth = 1.2)

Violin Plot for Yearly Distribution of Counts

In [285]:

# Section: Violin Plot for Yearly Distribution of Counts
library(ggplot2)
# Violin plot for Yearly wise distribution of counts
yearly_violin_plot <- ggplot(bike_df, aes(x = year, y = count, fill = as.factor(year))) +
  geom_violin() +
  theme_bw() +
  labs(x = 'Year', y = 'Total Count',
       title = 'Yearly Distribution of Counts',
       subtitle = 'Comparison of counts across different years',
       fill = 'Year') +

  # Add data points on top of the violin plot for better insight
  geom_jitter(aes(color = as.factor(year)), width = 0.2, alpha = 0.5) +

  # Customize legend
  scale_fill_manual(values = c('#1F78B4', '#33A02C'), name = 'Year') +
  scale_color_manual(values = c('#1F78B4', '#33A02C'), name = 'Year') +

  # Add a box plot inside the violin plot
  geom_boxplot(width = 0.1, fill = 'white', color = 'black', alpha = 0.7) +

  # Improve plot appearance
  theme(legend.position = 'top') +

  # Add informative annotations
  annotate('text', x = c(0.8, 1.2), y = c(6000, 6000), label = c('2011', '2012'), size = 4, color = 'black') +
  annotate('text', x = 1, y = 8000, label = 'Data Distribution', size = 6, color = 'black', fontface = 'bold') +
  annotate('text', x = 1, y = 7000, label = 'Comparison of counts across different years using violin plots and box plots.', size = 3, color = 'black') +
  annotate('text', x = 0.8, y = 6800, label = 'Data points are added for better insight.', size = 3, color = 'black')

# Print the Violin plot
print(yearly_violin_plot)
cat("From the voilin plot, we can analysis that the bike rental count distribution is highest in year 2012 then the previous year.\nIn the graph, year 0 represent 2011 and year 1 represent 2012 respectively\n")

From the voilin plot, we can analysis that the bike rental count distribution is highest in year 2012 then the previous year.
In the graph, year 0 represent 2011 and year 1 represent 2012 respectively

Exploring Bike Rentals During Holidays

In [286]:

# Section: Exploring Bike Rentals During Holidays

# Explore summary statistics for the 'count' variable based on the 'holiday' status.
# This analysis provides insights into whether there are notable differences in rentals during holidays.

# Use tapply to calculate summary statistics for 'count' grouped by 'holiday' status
cnt_summary_by_holiday <- tapply(bike_df$count, bike_df$holiday, summary)

# Display the summary statistics
cat("Summary Statistics for 'count' Variable Based on 'holiday' Status:\n")
print(cnt_summary_by_holiday)

Summary Statistics for 'count' Variable Based on 'holiday' Status:
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     22    3214    4558    4527    5933    8714 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1000    1951    3351    3735    6034    7403

This output presents summary statistics for the variable 'count', representing the total bike rentals, categorized by 'holiday' status. Here's a breakdown:

For days classified as non-holidays (holiday status = 0):
- The minimum count of bike rentals is 22, and the maximum is 8714.
- The average count of bike rentals is approximately 4527, with a median of 4558.
- Rentals range from 3214 to 5933 in the lower and upper quartiles, respectively.
During holidays (holiday status = 1):
- The minimum count of bike rentals increases to 1000, with a maximum of 7403.
- The average count during holidays rises to approximately 3735, and the median stands at 3351.
- Rentals span from 1951 to 6034 between the first and third quartiles.

Create an analytical plot to visualize the distribution of rentals during holidays and non-holidays

In [287]:

# Create an analytical plot to visualize the distribution of rentals during holidays and non-holidays
par(mfrow = c(1, 3))  # Set up a 1x3 grid for side-by-side plots

# Boxplot for Rentals by Holiday Status
boxplot(bike_df$count ~ bike_df$holiday, main = "Boxplot of Rentals by Holiday Status", xlab = "Holiday", ylab = "Number of Rentals", col = "#75AADB")
# Add statistics on the boxplot
# Calculate boxplot statistics
stats <- boxplot.stats(bike_df$count[bike_df$holiday == 1])

stats_text <- paste("Median:", round(stats$stats[3], 2), "\nIQR:", round(IQR(bike_df$count[bike_df$holiday == 1]), 2))
mtext(stats_text, side = 3, line = 0.5, at = 1.8, adj = 0, font = 1, cex = 0.8)

# Histogram for Distribution of Rentals on Non-Holidays
hist(bike_df$count[bike_df$holiday == 0], main = "Distribution of Rentals on Non-Holidays", xlab = "Number of Rentals", col = "#75AADB", border = "#333333", xlim = c(0, max(bike_df$count) + 50))
# Add statistics on the histogram
hist_stats_text <- paste("Mean:", round(mean(bike_df$count[bike_df$holiday == 0]), 2), "\nSD:", round(sd(bike_df$count[bike_df$holiday == 0]), 2))
mtext(hist_stats_text, side = 3, line = 0.5, at = max(bike_df$count[bike_df$holiday == 0]) * 0.8, adj = 0, font = 1, cex = 0.8)
# Add mean and sd lines
abline(v = mean(bike_df$count[bike_df$holiday == 0]), col = "red", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 0]) - sd(bike_df$count[bike_df$holiday == 0]), col = "purple", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 0]) + sd(bike_df$count[bike_df$holiday == 0]), col = "purple", lty = 2, lwd = 2)
# Mark the observation points on the histogram
rug(bike_df$count[bike_df$holiday == 0], col = "blue", lwd = 1.5)

# Histogram for Distribution of Rentals on Holidays
hist(bike_df$count[bike_df$holiday == 1], main = "Distribution of Rentals on Holidays", xlab = "Number of Rentals", col = "#75AADB", border = "#333333", xlim = c(0, max(bike_df$count) + 50))
# Add statistics on the histogram
hist_stats_text_holiday <- paste("Mean:", round(mean(bike_df$count[bike_df$holiday == 1]), 2), "\nSD:", round(sd(bike_df$count[bike_df$holiday == 1]), 2))
mtext(hist_stats_text_holiday, side = 3, line = 0.5, at = max(bike_df$count[bike_df$holiday == 1]) * 0.8, adj = 0, font = 1, cex = 0.8)
# Add mean and sd lines
abline(v = mean(bike_df$count[bike_df$holiday == 1]), col = "red", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 1]) - sd(bike_df$count[bike_df$holiday == 1]), col = "purple", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 1]) + sd(bike_df$count[bike_df$holiday == 1]), col = "purple", lty = 2, lwd = 2)
# Mark the observation points on the histogram
rug(bike_df$count[bike_df$holiday == 1], col = "blue", lwd = 1.5)

Analyzing Bike Rentals on Holidays

In [288]:

# Section: Analyzing Bike Rentals on Holidays

# Column plot for holiday-wise distribution of counts
holiday_plot <- ggplot(bike_df, aes(x = factor(holiday), y = count, fill = season)) +
  geom_col(position = "dodge", color = "black", alpha = 0.8) +
  theme_minimal() +
  labs(
    x = 'Holiday',
    y = 'Total Count',
    title = 'Bike Rentals Analysis: Holidays vs. Non-Holidays',
    subtitle = 'Comparing counts on holidays and non-holidays across different seasons',
    fill = 'Season'
  ) +
  scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) +  # Custom colors +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_x_discrete(labels = c("Non-Holiday", "Holiday")) +  # Adding scale for 0 and 1 +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "Season")) +
  annotate("text", x = 1.5, y = max(bike_df$count) * 1.2,
           label = "Insight: During non-holidays, bike rental counts are highest compared to holidays for different seasons.",
           color = "red", size = 3, fontface = "bold")

# Print  plot
print(holiday_plot)

Analyzing Bike Rentals on neiher weekend nor holiday vs other days

Column plot for working day-wise distribution of counts

In [289]:

# Column plot for working day-wise distribution of counts
workingday_plot <- ggplot(bike_df, aes(x = factor(workingday), y = count, fill = season)) +
  geom_col(position = "dodge", color = "black", alpha = 0.8) +
  theme_minimal() +
  labs(
    x = 'Working Day',
    y = 'Total Count',
    title = 'Bike Rentals Analysis: neiher weekend nor holiday vs. other days',
    subtitle = 'Comparing counts on working days and non-working days across different seasons',
    fill = 'Season'
  ) +
  scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) +  # Custom colors +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_x_discrete(labels = c("neiher weekend nor holiday", "other days")) +  # Adding scale for 0 and 1 +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "Season")) +
  annotate("text", x = 1.5, y = max(bike_df$count) * 1.1,
           label = "Insight: neiher weekend nor holiday - Bike rental count is higher.\n other day - Bike rental count is lower",
           color = "red", size = 3, fontface = "bold")


print(workingday_plot)

 Impact of Weather Conditions on Bike Rentals

In [290]:

# Section: Impact of Weather Conditions on Bike Rentals

# Column plot for weather condition-wise distribution of counts
weather_condition_plot <- ggplot(bike_df, aes(x = factor(weather_condition), y = count, fill = season)) +
  geom_col(position = "dodge", color = "black", alpha = 1) +
  theme_minimal() +
  labs(
    x = 'Weather Condition',
    y = 'Total Count',
    title = 'Impact of Weather Conditions on Bike Rentals',
    subtitle = 'Comparing bike rental counts under different weather conditions across different seasons',
    fill = 'Season'
  ) +
  scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) +  # Custom colors +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_x_discrete(labels = c("Clear", "Mist + Cloudy", "Light Snow/Light Rain", "Heavy Rain")) +  # Adding scale +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "Season")) +
  annotate(
    "text",
    x = 3.0,
    y = max(bike_df$count) * 1.1,
    label = "Insight: 1. Clear - Bike rental count is very high.\n 2. Mist + Cloudy - Bike rental count is second highest.\n 3. Light Snow/Light Rain - Bike rental count is third highest.\n 4. Heavy Rain - there is no data of Bike renting.",
    color = "red",
    size = 3,
    fontface = "bold"
  )

# Print plot
print(weather_condition_plot)

Section: Temperature Analysis¶

In [291]:

# Convert temperature variables to Celsius
bike_df$temp_celsius <- bike_df$temp * 41  # Assuming temp is in normalized units, adjust accordingly
bike_df$atemp_celsius <- bike_df$atemp * 50  # Assuming atemp is in normalized units, adjust accordingly

# Section: Combined Temperature Analysis

# Convert temperature variables to Celsius
bike_df$temp_celsius <- bike_df$temp * 41  # Assuming temp is in normalized units, adjust accordingly
bike_df$atemp_celsius <- bike_df$atemp * 50  # Assuming atemp is in normalized units, adjust accordingly

Scatter plot for bike rentals against temperature and apparent temperature in Celsius

In [292]:

# Scatter plot for bike rentals against temperature and apparent temperature in Celsius
combined_temp_plot <- ggplot(bike_df, aes(x = temp_celsius, y = count, color = "Temperature")) +
  geom_point() +
  geom_point(aes(x = atemp_celsius, color = "Apparent Temperature"), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black",
              aes(group = 1), formula = y ~ x) +  # Add a linear trend line
  theme_minimal() +
  labs(
    x = 'Temperature (Celsius)',
    y = 'Total Count',
    title = 'Bike Rentals vs. Temperature and Apparent Temperature',
    subtitle = 'Scatter plot showing the relationship between bike rentals and temperature variables',
    color = 'Variable'
  ) +
  scale_color_manual(values = c("Temperature" = "blue", "Apparent Temperature" = "red")) +
  annotate("text", x = 25, y = 8000, label = "Trend Line: Linear Regression", color = "black", size = 4) +
  annotate("text", x = 5, y = 3000, label = "Temperature", color = "blue", size = 4) +
  annotate("text", x = 30, y = 6000, label = "Apparent Temperature", color = "red", size = 4)

# Print the combined temperature plot
print(combined_temp_plot)

Outlier Analysis¶

Boxplot for Bike Rental Count with Outliers

In [293]:

# Section: Boxplot for Bike Rental Count with Outliers

# Boxplot for bike rental count with outliers
boxplot(bike_df$count, main = 'Bike Rental Count', sub = ifelse(length(boxplot.stats(bike_df$count)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$count)$out)),
        ylab = 'Count', col = "cyan", border = "blue")

# Add statistical values
text(1, boxplot.stats(bike_df$count)$stats[1], paste("Min:", round(boxplot.stats(bike_df$count)$stats[1], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[2], paste("1st Quartile:", round(boxplot.stats(bike_df$count)$stats[2], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[3], paste("Median:", round(boxplot.stats(bike_df$count)$stats[3], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[4], paste("Mean:", round(mean(bike_df$count), 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[5], paste("3rd Quartile:", round(boxplot.stats(bike_df$count)$stats[5], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[6], paste("Max:", round(boxplot.stats(bike_df$count)$stats[6], 2)), pos = 4, cex = 1)

Section: Boxplots for Outliers in Temperature,feel-like temperature Humidity, and Windspeed¶

In [294]:

# Set up the layout for multiple boxplots
par(mfrow = c(2, 2))

# Box plot for temperature outliers
boxplot(bike_df$temp, main = "Temperature", sub = ifelse(length(boxplot.stats(bike_df$temp)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$temp)$out)),
        col = "#FF6347", border = "#8B0000", notch = TRUE, outline = FALSE)

# Box plot for temperature outliers
boxplot(bike_df$atemp, main = "Feellike-Temperature", sub = ifelse(length(boxplot.stats(bike_df$atemp)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$atemp)$out)),
        col = "pink", border = "red", notch = TRUE, outline = FALSE)

# Box plot for humidity outliers
boxplot(bike_df$humidity, main = "Humidity", sub = ifelse(length(boxplot.stats(bike_df$humidity)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$humidity)$out)),
        col = "#87CEEB", border = "#1E90FF", notch = TRUE, outline = FALSE)

# Box plot for windspeed outliers
boxplot(bike_df$windspeed, main = "Windspeed", sub = ifelse(length(boxplot.stats(bike_df$windspeed)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$windspeed)$out)),
        col = "#98FB98", border = "#008000", notch = TRUE, outline = FALSE)

Outlier Replacement and Imputation¶

Replacing and Imputing Outliers in Humidity and Windspeed

In [295]:

# Outlier Replacement and Imputation
# Section: Replacing and Imputing Outliers in Humidity and Windspeed

# Install the dplyr package if not installed

# Load the required libraries
library(DMwR2)
library(dplyr)

# Create a subset for windspeed and humidity variables
wind_hum <- subset(bike_df, select = c('windspeed', 'humidity'))

# Function to replace outliers with NA
replace_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  lower_bound <- q[1] - 1.5 * iqr
  upper_bound <- q[2] + 1.5 * iqr
  x[x < lower_bound | x > upper_bound] <- NA
  return(x)
}

# Apply the function to each column
wind_hum <- wind_hum %>% mutate(across(everything(), replace_outliers))
wind_hum

A tibble: 731 × 2
windspeed	humidity
<dbl>	<dbl>
0.1604460	0.805833
0.2485390	0.696087
0.2483090	0.437273
0.1602960	0.590435
0.1869000	0.436957
0.0895652	0.518261
0.1687260	0.498696
0.2668040	0.535833
0.3619500	0.434167
0.2232670	0.482917
0.1221320	0.686364
0.3046270	0.599545
0.3010000	0.470417
0.1265480	0.537826
0.1579630	0.498750
0.1884330	0.483750
0.1940170	0.537500
0.1467750	0.861667
0.2083170	0.741739
0.1959040	0.538333
0.3532420	0.457083
0.1719700	0.400000
0.2466000	0.436522
0.1583300	0.491739
0.1297960	0.616957
0.2938500	0.862500
0.1138370	0.687500
0.1233000	0.793043
0.1453650	0.651739
0.0739826	0.722174
⋮	⋮
0.1243790	0.823333
0.0827208	0.767500
0.1741290	0.733750
0.3240210	0.485000
0.1747540	0.508750
0.1306000	0.764167
0.1013790	0.911250
0.1579750	0.905417
0.1903080	0.925000
0.2960370	0.596667
0.1629370	0.538333
0.1741290	0.485833
0.1312290	0.642917
0.1063500	0.650417
0.1007420	0.838750
0.0982583	0.907083
0.2214040	0.666250
0.1840920	0.625417
0.1324630	0.667917
0.3743830	0.556667
NA	0.441250
0.1330830	0.515417
0.0772304	0.791304
0.1687260	0.734783
0.3165460	0.823333
0.3501330	0.652917
0.1554710	0.590000
0.1243830	0.752917
0.3507540	0.483333
0.1548460	0.577500

In [296]:

# Impute missing values using mean imputation method
wind_hum$windspeed[is.na(wind_hum$windspeed)] <- mean(wind_hum$windspeed, na.rm = TRUE)
wind_hum$humidity[is.na(wind_hum$humidity)] <- mean(wind_hum$humidity, na.rm = TRUE)

#Section : Combining the imputed dataset and original dataset
new_df <- subset(bike_df, select = -c(windspeed, humidity))  # Remove original windspeed and humidity
bike_df <- cbind(new_df, wind_hum)  # Combine new_df and wind_hum data frames

# Display the first 5 rows of the updated dataset
head(bike_df)

A data.frame: 6 × 19
	record_id	datetime	season	year	month	holiday	weekday	workingday	weather_condition	temp	atemp	casual_users	registerd_users	count	year_from_date	temp_celsius	atemp_celsius	windspeed	humidity
	<dbl>	<date>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	1	2011-01-01	1	0	1	0	6	0	2	0.344167	0.363625	331	654	985	2011	14.110847	18.18125	0.1604460	0.805833
2	2	2011-01-02	1	0	1	0	0	0	2	0.363478	0.353739	131	670	801	2011	14.902598	17.68695	0.2485390	0.696087
3	3	2011-01-03	1	0	1	0	1	1	1	0.196364	0.189405	120	1229	1349	2011	8.050924	9.47025	0.2483090	0.437273
4	4	2011-01-04	1	0	1	0	2	1	1	0.200000	0.212122	108	1454	1562	2011	8.200000	10.60610	0.1602960	0.590435
5	5	2011-01-05	1	0	1	0	3	1	1	0.226957	0.229270	82	1518	1600	2011	9.305237	11.46350	0.1869000	0.436957
6	6	2011-01-06	1	0	1	0	4	1	1	0.204348	0.233209	88	1518	1606	2011	8.378268	11.66045	0.0895652	0.518261

In [297]:

summary(bike_df)

   record_id        datetime          season  year        month     holiday
 Min.   :  1.0   Min.   :2011-01-01   1:181   0:365   1      : 62   0:710  
 1st Qu.:183.5   1st Qu.:2011-07-02   2:184   1:366   3      : 62   1: 21  
 Median :366.0   Median :2012-01-01   3:188           5      : 62          
 Mean   :366.0   Mean   :2012-01-01   4:178           7      : 62          
 3rd Qu.:548.5   3rd Qu.:2012-07-01                   8      : 62          
 Max.   :731.0   Max.   :2012-12-31                   10     : 62          
                                                      (Other):359          
 weekday workingday weather_condition      temp             atemp        
 0:105   0:231      1:463             Min.   :0.05913   Min.   :0.07907  
 1:105   1:500      2:247             1st Qu.:0.33708   1st Qu.:0.33784  
 2:104              3: 21             Median :0.49833   Median :0.48673  
 3:104                                Mean   :0.49538   Mean   :0.47435  
 4:104                                3rd Qu.:0.65542   3rd Qu.:0.60860  
 5:104                                Max.   :0.86167   Max.   :0.84090  
 6:105                                                                   
  casual_users    registerd_users     count      year_from_date
 Min.   :   2.0   Min.   :  20    Min.   :  22   Min.   :2011  
 1st Qu.: 315.5   1st Qu.:2497    1st Qu.:3152   1st Qu.:2011  
 Median : 713.0   Median :3662    Median :4548   Median :2012  
 Mean   : 848.2   Mean   :3656    Mean   :4504   Mean   :2012  
 3rd Qu.:1096.0   3rd Qu.:4776    3rd Qu.:5956   3rd Qu.:2012  
 Max.   :3410.0   Max.   :6946    Max.   :8714   Max.   :2012  
                                                               
  temp_celsius    atemp_celsius      windspeed          humidity     
 Min.   : 2.424   Min.   : 3.953   Min.   :0.02239   Min.   :0.2542  
 1st Qu.:13.820   1st Qu.:16.892   1st Qu.:0.13495   1st Qu.:0.5223  
 Median :20.432   Median :24.337   Median :0.18097   Median :0.6292  
 Mean   :20.311   Mean   :23.718   Mean   :0.18639   Mean   :0.6294  
 3rd Qu.:26.872   3rd Qu.:30.430   3rd Qu.:0.22979   3rd Qu.:0.7302  
 Max.   :35.328   Max.   :42.045   Max.   :0.37811   Max.   :0.9725

plot for numerical variables in combined dataset

In [298]:

# Load the required libraries
library(car)
library(ggplot2)

# Select numerical columns for probability plots in combined dataset
numerical_columns <- sapply(bike_df, is.numeric)

for (column in names(bike_df[, numerical_columns])) {
  hist(bike_df[, column], main = paste("Histogram for", column),
       xlab = column, col = "skyblue", border = "black")
}

In [299]:

# Create normal probability plots for numerical variables in combined dataset
for (column in names(bike_df[, numerical_columns])) {
  qqnorm(bike_df[, column], main = paste("Normal Probability Plot for", column))
  qqline(bike_df[, column], col = 2)

  # Add insight annotation
  annotation <- "Some data points are deviating from normality in a good way."
  text(quantile(bike_df[, column], 1.0), quantile(bike_df[, column], 0.1), annotation, adj = c(0, 1), cex = 0.8, col = "darkgreen")
}

Correlation Analysis of in combined dataset

In [300]:

# Section: Correlation Analysis of in combined dataset
# Load the corrgram package for correlation analysis
library(corrgram)

# Identify numeric columns for correlation analysis
numeric_columns <- sapply(bike_df[, 8:19], is.numeric)

# Create a correlation plot
corrgram(bike_df[, 8:19][, numeric_columns], order = FALSE, upper.panel = panel.pie, text.panel = panel.txt, main = 'Correlation Plot')

# Add insight on positive and negative correlations
cat("Positive Correlations: temp, atemp, and year have positive correlations with the target variable.\n")
cat("Negative Correlations: weather_condition, humidity, and windspeed have negative correlations with the target variable.\n")

Positive Correlations: temp, atemp, and year have positive correlations with the target variable.
Negative Correlations: weather_condition, humidity, and windspeed have negative correlations with the target variable.

Identify variables that may not be needed for further analysis based on correlation

In [301]:

# Identify variables that may not be needed for further analysis based on correlation
cat("\nVariables with weak correlation (abs(correlation) <= 0.1) with the target variable:\n")
weak_corr_vars <- names(bike_df[, 8:19][, numeric_columns])[sapply(bike_df[, 8:19][, numeric_columns], function(x) abs(cor(x, bike_df$count)) <= 0.1)]
print(weak_corr_vars)

Variables with weak correlation (abs(correlation) <= 0.1) with the target variable:
character(0)

Section: Splitting Dataset for Training and Testing¶

In [302]:

# Load the purrr library for functions and vectors
library(purrr)

# Split the dataset based on simple random resampling
train_index <- sample(1:nrow(bike_df), 0.7 * nrow(bike_df))
train_data <- bike_df[train_index,]
test_data <- bike_df[-train_index,]

# Display dimensions of the training and testing datasets
cat("Dimensions of Training Data:", dim(train_data), "\n")
cat("Dimensions of Testing Data:", dim(test_data), "\n")

Dimensions of Training Data: 511 19 
Dimensions of Testing Data: 220 19

In [303]:

#splitted data exploration
head(train_data)

A data.frame: 6 × 19
	record_id	datetime	season	year	month	holiday	weekday	workingday	weather_condition	temp	atemp	casual_users	registerd_users	count	year_from_date	temp_celsius	atemp_celsius	windspeed	humidity
	<dbl>	<date>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
146	146	2011-05-26	2	0	5	0	4	1	1	0.708333	0.654688	758	3919	4677	2011	29.041653	32.73440	0.1996420	0.677500
312	312	2011-11-08	4	0	11	0	2	1	1	0.408333	0.412246	376	3829	4205	2011	16.741653	20.61230	0.0690375	0.721667
9	9	2011-01-09	1	0	1	0	0	0	1	0.138333	0.116175	54	768	822	2011	5.671653	5.80875	0.3619500	0.434167
240	240	2011-08-28	3	0	8	0	0	0	1	0.707059	0.647959	1415	2919	4334	2011	28.989419	32.39795	0.3046590	0.561765
633	633	2012-09-24	4	1	9	0	1	1	1	0.514167	0.502513	1001	6435	7436	2012	21.080847	25.12565	0.1424040	0.492917
394	394	2012-01-29	1	1	1	0	0	0	1	0.282500	0.272721	558	2685	3243	2012	11.582500	13.63605	0.2400500	0.311250

In [304]:

head(test_data)

A data.frame: 6 × 19
	record_id	datetime	season	year	month	holiday	weekday	workingday	weather_condition	temp	atemp	casual_users	registerd_users	count	year_from_date	temp_celsius	atemp_celsius	windspeed	humidity
	<dbl>	<date>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
3	3	2011-01-03	1	0	1	0	1	1	1	0.196364	0.189405	120	1229	1349	2011	8.050924	9.47025	0.2483090	0.437273
4	4	2011-01-04	1	0	1	0	2	1	1	0.200000	0.212122	108	1454	1562	2011	8.200000	10.60610	0.1602960	0.590435
6	6	2011-01-06	1	0	1	0	4	1	1	0.204348	0.233209	88	1518	1606	2011	8.378268	11.66045	0.0895652	0.518261
7	7	2011-01-07	1	0	1	0	5	1	2	0.196522	0.208839	148	1362	1510	2011	8.057402	10.44195	0.1687260	0.498696
12	12	2011-01-12	1	0	1	0	3	1	1	0.172727	0.160473	25	1137	1162	2011	7.081807	8.02365	0.3046270	0.599545
14	14	2011-01-14	1	0	1	0	5	1	1	0.160870	0.188413	54	1367	1421	2011	6.595670	9.42065	0.1265480	0.537826

Creating Subsets for Training and Testing

In [305]:

# Section: Creating Subsets for Training and Testing

# Create a new subset for train attributes
train <- subset(train_data, select = c('season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_condition', 'temp','atemp', 'humidity', 'windspeed', 'count'))

# Create a new subset for test attributes
test <- subset(test_data, select = c('season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_condition', 'temp', 'atemp','humidity', 'windspeed', 'count'))

# Display the first few rows of the training subset
cat("Training Subset:")
head(train)

Training Subset:

A data.frame: 6 × 12
	season	year	month	holiday	weekday	workingday	weather_condition	temp	atemp	humidity	windspeed	count
	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
146	2	0	5	0	4	1	1	0.708333	0.654688	0.677500	0.1996420	4677
312	4	0	11	0	2	1	1	0.408333	0.412246	0.721667	0.0690375	4205
9	1	0	1	0	0	0	1	0.138333	0.116175	0.434167	0.3619500	822
240	3	0	8	0	0	0	1	0.707059	0.647959	0.561765	0.3046590	4334
633	4	1	9	0	1	1	1	0.514167	0.502513	0.492917	0.1424040	7436
394	1	1	1	0	0	0	1	0.282500	0.272721	0.311250	0.2400500	3243

In [306]:

# Display the first few rows of the testing subset
cat("\nTesting Subset:")
head(test)

Testing Subset:

A data.frame: 6 × 12
	season	year	month	holiday	weekday	workingday	weather_condition	temp	atemp	humidity	windspeed	count
	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
3	1	0	1	0	1	1	1	0.196364	0.189405	0.437273	0.2483090	1349
4	1	0	1	0	2	1	1	0.200000	0.212122	0.590435	0.1602960	1562
6	1	0	1	0	4	1	1	0.204348	0.233209	0.518261	0.0895652	1606
7	1	0	1	0	5	1	2	0.196522	0.208839	0.498696	0.1687260	1510
12	1	0	1	0	3	1	1	0.172727	0.160473	0.599545	0.3046270	1162
14	1	0	1	0	5	1	1	0.160870	0.188413	0.537826	0.1265480	1421

In [307]:

# Section: Creating Subsets for Training and Testing

# Create a new subset for train categorical attributes
train_cat <- subset(train, select = c('season', 'holiday', 'workingday', 'weather_condition', 'year'))

# Create a new subset for test categorical attributes
test_cat <- subset(test, select = c('season', 'holiday', 'workingday', 'weather_condition', 'year'))

# Create a new subset for train numerical attributes
train_num <- subset(train, select = c('weekday', 'month', 'temp', 'atemp','humidity', 'windspeed', 'count'))

# Create a new subset for test numerical attributes
test_num <- subset(test, select = c('weekday', 'month', 'temp', 'atemp','humidity', 'windspeed', 'count'))

# Display the first few rows of the training categorical attributes subset
cat("Training Categorical Attributes Subset:")
head(train_cat)

Training Categorical Attributes Subset:

A data.frame: 6 × 5
	season	holiday	workingday	weather_condition	year
	<fct>	<fct>	<fct>	<fct>	<fct>
146	2	0	1	1	0
312	4	0	1	1	0
9	1	0	0	1	0
240	3	0	0	1	0
633	4	0	1	1	1
394	1	0	0	1	1

In [308]:

# Display the first few rows of the testing categorical attributes subset
cat("\nTesting Categorical Attributes Subset:")
head(test_cat)

Testing Categorical Attributes Subset:

A data.frame: 6 × 5
	season	holiday	workingday	weather_condition	year
	<fct>	<fct>	<fct>	<fct>	<fct>
3	1	0	1	1	0
4	1	0	1	1	0
6	1	0	1	1	0
7	1	0	1	2	0
12	1	0	1	1	0
14	1	0	1	1	0

In [309]:

# Display the first few rows of the training numerical attributes subset
cat("\nTraining Numerical Attributes Subset:")
head(train_num)

Training Numerical Attributes Subset:

A data.frame: 6 × 7
	weekday	month	temp	atemp	humidity	windspeed	count
	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
146	4	5	0.708333	0.654688	0.677500	0.1996420	4677
312	2	11	0.408333	0.412246	0.721667	0.0690375	4205
9	0	1	0.138333	0.116175	0.434167	0.3619500	822
240	0	8	0.707059	0.647959	0.561765	0.3046590	4334
633	1	9	0.514167	0.502513	0.492917	0.1424040	7436
394	0	1	0.282500	0.272721	0.311250	0.2400500	3243

In [310]:

# Display the first few rows of the testing numerical attributes subset
cat("\nTesting Numerical Attributes Subset:")
head(test_num)

Testing Numerical Attributes Subset:

A data.frame: 6 × 7
	weekday	month	temp	atemp	humidity	windspeed	count
	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
3	1	1	0.196364	0.189405	0.437273	0.2483090	1349
4	2	1	0.200000	0.212122	0.590435	0.1602960	1562
6	4	1	0.204348	0.233209	0.518261	0.0895652	1606
7	5	1	0.196522	0.208839	0.498696	0.1687260	1510
12	3	1	0.172727	0.160473	0.599545	0.3046270	1162
14	5	1	0.160870	0.188413	0.537826	0.1265480	1421

Section: Feature Engineering: Encoding Categorical Features¶

Encoding Categorical Features (Train Dataset)

In [311]:

# Load the required libraries
library(caret)

# Define variables for dummy encoding
othervars <- c('month', 'weekday', 'temp', 'atemp', 'humidity', 'windspeed', 'count')
set.seed(2626)

# Identify categorical variables
vars <- setdiff(colnames(train), c(train$count, othervars))
vars

'season'
'year'
'holiday'
'workingday'
'weather_condition'

In [312]:

# Create a formula for encoding
f <- paste('~', paste(vars, collapse = ' + '))

# Use dummyVars to encode categorical variables
encoder <- dummyVars(as.formula(f), train)
encode_attributes <- predict(encoder, train)

# Combine numerical and encoded attributes
train_encoded_attributes <- cbind(train_num, encode_attributes)

# Display the head of the encoded dataset
head(train_encoded_attributes)

A data.frame: 6 × 20
	weekday	month	temp	atemp	humidity	windspeed	count	season.1	season.2	season.3	season.4	year.0	year.1	holiday.0	holiday.1	workingday.0	workingday.1	weather_condition.1	weather_condition.2	weather_condition.3
	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
146	4	5	0.708333	0.654688	0.677500	0.1996420	4677	0	1	0	0	1	0	1	0	0	1	1	0	0
312	2	11	0.408333	0.412246	0.721667	0.0690375	4205	0	0	0	1	1	0	1	0	0	1	1	0	0
9	0	1	0.138333	0.116175	0.434167	0.3619500	822	1	0	0	0	1	0	1	0	1	0	1	0	0
240	0	8	0.707059	0.647959	0.561765	0.3046590	4334	0	0	1	0	1	0	1	0	1	0	1	0	0
633	1	9	0.514167	0.502513	0.492917	0.1424040	7436	0	0	0	1	0	1	1	0	0	1	1	0	0
394	0	1	0.282500	0.272721	0.311250	0.2400500	3243	1	0	0	0	0	1	1	0	1	0	1	0	0

Encoding Categorical Features (Test Dataset)

In [313]:

# Section: Encoding Categorical Features (Test Dataset)

# Load the required libraries
library(caret)

# Define variables for dummy encoding
othervars <- c('month', 'weekday', 'temp', 'atemp', 'humidity', 'windspeed', 'count')
set.seed(5662)

# Identify categorical variables in the test dataset
vars <- setdiff(colnames(test), c(test$count, othervars))
vars

'season'
'year'
'holiday'
'workingday'
'weather_condition'

In [314]:

# Create a formula for encoding
f <- paste('~', paste(vars, collapse = ' + '))

# Use dummyVars to encode categorical variables
encoder <- dummyVars(as.formula(f), test)
encode_attributes <- predict(encoder, test)

# Combine numerical and encoded attributes for the test dataset
test_encoded_attributes <- cbind(test_num, encode_attributes)

# Display the head of the encoded test dataset
head(test_encoded_attributes)

A data.frame: 6 × 20
	weekday	month	temp	atemp	humidity	windspeed	count	season.1	season.2	season.3	season.4	year.0	year.1	holiday.0	holiday.1	workingday.0	workingday.1	weather_condition.1	weather_condition.2	weather_condition.3
	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
3	1	1	0.196364	0.189405	0.437273	0.2483090	1349	1	0	0	0	1	0	1	0	0	1	1	0	0
4	2	1	0.200000	0.212122	0.590435	0.1602960	1562	1	0	0	0	1	0	1	0	0	1	1	0	0
6	4	1	0.204348	0.233209	0.518261	0.0895652	1606	1	0	0	0	1	0	1	0	0	1	1	0	0
7	5	1	0.196522	0.208839	0.498696	0.1687260	1510	1	0	0	0	1	0	1	0	0	1	0	1	0
12	3	1	0.172727	0.160473	0.599545	0.3046270	1162	1	0	0	0	1	0	1	0	0	1	1	0	0
14	5	1	0.160870	0.188413	0.537826	0.1265480	1421	1	0	0	0	1	0	1	0	0	1	1	0	0

Section: Modeling the Training Dataset with Linear Regression¶

In [315]:

# Section: Modeling the Training Dataset with Linear Regression

# Set seed to reproduce the results of random sampling
set.seed(672)

# Train the Linear Regression model
lr_model <- lm(count ~ ., data = train_encoded_attributes[, -c(6)])

# Display the summary of the Linear Regression model
summary(lr_model)

Call:
lm(formula = count ~ ., data = train_encoded_attributes[, -c(6)])

Residuals:
    Min      1Q  Median      3Q     Max 
-4085.8  -393.7    94.2   450.0  3005.3 

Coefficients: (6 not defined because of singularities)
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)          1862.06     534.81   3.482             0.000543 ***
weekday1              181.92     137.31   1.325             0.185816    
weekday2              226.49     133.63   1.695             0.090754 .  
weekday3              356.27     136.90   2.602             0.009540 ** 
weekday4              290.07     135.11   2.147             0.032292 *  
weekday5              348.66     141.28   2.468             0.013936 *  
weekday6              401.81     136.16   2.951             0.003320 ** 
month2                 59.83     177.23   0.338             0.735819    
month3                530.76     211.40   2.511             0.012374 *  
month4                 42.72     322.36   0.133             0.894628    
month5                562.02     347.59   1.617             0.106547    
month6                183.59     360.41   0.509             0.610713    
month7               -368.76     401.71  -0.918             0.359086    
month8                195.75     384.50   0.509             0.610904    
month9                595.97     334.97   1.779             0.075841 .  
month10               217.26     294.40   0.738             0.460886    
month11              -449.11     285.24  -1.575             0.116020    
month12              -305.80     222.82  -1.372             0.170575    
temp                 1583.91    1564.30   1.013             0.311789    
atemp                3356.17    1622.48   2.069             0.039121 *  
humidity            -1236.96     379.84  -3.256             0.001207 ** 
season.1            -1891.38     220.11  -8.593 < 0.0000000000000002 ***
season.2             -767.10     274.21  -2.797             0.005356 ** 
season.3             -670.54     249.38  -2.689             0.007418 ** 
season.4                  NA         NA      NA                   NA    
year.0              -2027.98      74.08 -27.374 < 0.0000000000000002 ***
year.1                    NA         NA      NA                   NA    
holiday.0             470.72     225.67   2.086             0.037513 *  
holiday.1                 NA         NA      NA                   NA    
workingday.0              NA         NA      NA                   NA    
workingday.1              NA         NA      NA                   NA    
weather_condition.1  2298.63     269.87   8.518 < 0.0000000000000002 ***
weather_condition.2  1882.27     253.93   7.413    0.000000000000559 ***
weather_condition.3       NA         NA      NA                   NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 809.7 on 483 degrees of freedom
Multiple R-squared:  0.8352,	Adjusted R-squared:  0.826 
F-statistic: 90.69 on 27 and 483 DF,  p-value: < 0.00000000000000022

Cross Validation Prediction with Linear Regression

In [316]:

# Section: Cross Validation Prediction with Linear Regression

# Ignore warning messages
options(warn = -1)

# Set seed to reproduce results of random sampling
set.seed(623)

# Cross validation resampling method
train_control <- trainControl(method = 'cv', number = 3)

# Cross validation prediction
CV_predict <- train(count ~ .,data = train_encoded_attributes[, -c(6)],
                    method = 'lm', trControl = train_control)

# Display summary of cross validation prediction
summary(CV_predict)

# Observations from the output
cat("\n**Observations:**\n")
cat("- The model's residuals range from -3713.0 to 3165.0, indicating the prediction performance.\n")
cat("- Coefficients and p-values provide insights into variable significance.\n")
cat("- The adjusted R-squared (0.8316) indicates the model's goodness of fit.\n")
cat("- Variables with smaller p-values are considered statistically significant.\n")
cat("- Overall, the linear regression model captures relationships and offers insights into predictor importance and model fit.\n")

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-4085.8  -393.7    94.2   450.0  3005.3 

Coefficients: (6 not defined because of singularities)
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)          1862.06     534.81   3.482             0.000543 ***
weekday1              181.92     137.31   1.325             0.185816    
weekday2              226.49     133.63   1.695             0.090754 .  
weekday3              356.27     136.90   2.602             0.009540 ** 
weekday4              290.07     135.11   2.147             0.032292 *  
weekday5              348.66     141.28   2.468             0.013936 *  
weekday6              401.81     136.16   2.951             0.003320 ** 
month2                 59.83     177.23   0.338             0.735819    
month3                530.76     211.40   2.511             0.012374 *  
month4                 42.72     322.36   0.133             0.894628    
month5                562.02     347.59   1.617             0.106547    
month6                183.59     360.41   0.509             0.610713    
month7               -368.76     401.71  -0.918             0.359086    
month8                195.75     384.50   0.509             0.610904    
month9                595.97     334.97   1.779             0.075841 .  
month10               217.26     294.40   0.738             0.460886    
month11              -449.11     285.24  -1.575             0.116020    
month12              -305.80     222.82  -1.372             0.170575    
temp                 1583.91    1564.30   1.013             0.311789    
atemp                3356.17    1622.48   2.069             0.039121 *  
humidity            -1236.96     379.84  -3.256             0.001207 ** 
season.1            -1891.38     220.11  -8.593 < 0.0000000000000002 ***
season.2             -767.10     274.21  -2.797             0.005356 ** 
season.3             -670.54     249.38  -2.689             0.007418 ** 
season.4                  NA         NA      NA                   NA    
year.0              -2027.98      74.08 -27.374 < 0.0000000000000002 ***
year.1                    NA         NA      NA                   NA    
holiday.0             470.72     225.67   2.086             0.037513 *  
holiday.1                 NA         NA      NA                   NA    
workingday.0              NA         NA      NA                   NA    
workingday.1              NA         NA      NA                   NA    
weather_condition.1  2298.63     269.87   8.518 < 0.0000000000000002 ***
weather_condition.2  1882.27     253.93   7.413    0.000000000000559 ***
weather_condition.3       NA         NA      NA                   NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 809.7 on 483 degrees of freedom
Multiple R-squared:  0.8352,	Adjusted R-squared:  0.826 
F-statistic: 90.69 on 27 and 483 DF,  p-value: < 0.00000000000000022

**Observations:**
- The model's residuals range from -3713.0 to 3165.0, indicating the prediction performance.
- Coefficients and p-values provide insights into variable significance.
- The adjusted R-squared (0.8316) indicates the model's goodness of fit.
- Variables with smaller p-values are considered statistically significant.
- Overall, the linear regression model captures relationships and offers insights into predictor importance and model fit.

Cross Validation Prediction Plot with Linear Regression

In [317]:

#Section: Cross Validation Prediction Plot with Linear Regression

# Increase the size of the plot
par(mfrow=c(1, 1), mar=c(5, 5, 2, 2))

# Cross-validation prediction plot
residuals <- resid(CV_predict)
y_train <- train_encoded_attributes$count

# Scatter plot with residuals
plot(y_train, residuals, ylab = 'Residuals', xlab = 'Observed', main = 'Cross Validation Prediction Plot', pch = 16, col = 'blue')

# Add a reference line at zero
abline(h = 0, col = 'red', lwd = 2)

# Observations from the plot
text(5000, -1500, 'Residuals spread around zero indicates a good model fit', col = 'darkgreen', cex = 0.8)
text(5000, -2500, 'Scatter points should be random and evenly distributed', col = 'darkgreen', cex = 0.8)

Model Performance on Test Data Analysis

In [318]:

# Section: Model Performance on Test Data Analysis

# Set seed for reproducibility
set.seed(6872)

# Suppress warning messages
options(warn = -1)

# Predict using the lr_model on the test_encoded_attributes
lm_predictions <- predict(lr_model, test_encoded_attributes[, -c(6)])

# Display the first 10 predictions
head(lm_predictions, 10)

3: 1299.77941845558
4: 1236.88889841927
6: 1467.40943438381
7: 1039.65781800207
12: 1138.86104884977
14: 1282.58778808192
15: 1282.85715121306
16: 1266.67901055174
21: 1306.17216504228
28: 735.171472407911

Model Performance Visualization using Linear Regressor Model

In [319]:

# Section: Model Performance Visualization using Linear Regressor Model

# Extract actual and predicted values
actual_values <- test_encoded_attributes$count
predicted_values <- lm_predictions

# Plot past actual values and future 10 predicted values
plot(1:length(actual_values), actual_values, type = "l", col = "blue",
     xlab = "Sample Index", ylab = "Count", main = "Actual vs Future Predicted Values",
     xlim = c(1, length(actual_values) + 10), ylim = c(0, max(actual_values, predicted_values)))
lines(length(actual_values) + 1:length(predicted_values), predicted_values, col = "orange")

# Add lines for future predicted values
lines(length(actual_values) + 1:length(predicted_values), predicted_values, col = "orange")

# Add legend
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "orange"), lty = 1)

# Highlight the last observed data point
points(length(actual_values), actual_values[length(actual_values)], pch = 19, col = "red")

# Highlight the starting point of predicted values
points(length(actual_values) + 1, predicted_values[1], pch = 19, col = "green")

Model Evaluation Metrics on Rootmean squred error and mean abosolute error

In [320]:

# Section: Model Evaluation Metrics on Rootmean squred error and mean abosolute error

# Set seed for reproducibility
set.seed(688)

# Root mean squared error (RMSE)
rmse <- RMSE(lm_predictions, test_encoded_attributes$count)
print(paste("Root Mean Squared Error (RMSE):", round(rmse, 2)))

[1] "Root Mean Squared Error (RMSE): 780.12"

In [321]:

# Mean absolute error (MAE)
mae <- MAE(lm_predictions, test_encoded_attributes$count)
print(paste("Mean Absolute Error (MAE):", round(mae, 2)))

[1] "Mean Absolute Error (MAE): 574.86"

Section: Residual Analysis¶

In [322]:

# Section: Residual Analysis

# Set heading for the code block
cat("Residual :\n")

# Calculate residuals
y_test <- test_encoded_attributes$count
residuals_lm <- y_test - lm_predictions
residuals_lm

Residual :

3: 49.2205815444195
4: 325.111101580733
6: 138.590565616194
7: 470.342181997927
12: 23.1389511502321
14: 138.412211918081
15: -34.8571512130575
16: -62.679010551738
21: 236.827834957717
28: 431.828527592089
29: -232.890671645619
32: 712.620056549099
35: 625.844020351251
43: -162.831664205103
44: -206.222682214964
45: -570.162248326128
47: -54.6370694512257
55: 585.189208000651
58: 748.836132101029
62: -439.940599767546
71: -391.793777331637
72: -57.8721944505728
73: -393.38184616982
74: 243.493310876877
78: -372.74523408481
83: -739.408006614029
86: -887.768518769175
87: -1430.40787504754
88: -1242.15233703773
89: -1444.23923898485
91: -161.98673725248
102: -1191.55166330289
105: 8.70616631643998
106: -198.828869301241
107: 239.022921258381
109: -104.801829172019
116: 202.064899776338
118: 251.084287802235
124: -807.601927617998
125: 83.5640959121229
128: 137.535084554864
131: -422.399214779612
133: 371.540846380108
138: -47.4939251301712
139: 774.161166145477
141: 906.044162969268
147: -476.956333574075
148: -233.730410598533
152: -778.366080132128
153: -258.39103216707
157: -73.144710510318
159: -875.415467262376
163: 109.259894959807
166: 422.061201603007
170: 690.091029150183
176: 21.6828500770107
178: 421.368739333648
183: 293.929915232388
186: 85.0180448996634
189: 107.559854194194
190: 733.004749868361
195: 608.999045472891
196: 1210.4017296759
200: -183.919632964755
201: -437.296447103578
202: -863.133120334438
208: -311.537023585558
209: -460.009046575961
213: -945.83177024712
215: -1122.13953868043
216: 154.912872388624
219: -989.014957690069
223: -425.6293250069
225: -315.550249584036
233: -696.261603679573
234: -208.827508724691
237: -757.100336999891
239: -3157.78538401256
243: 221.555401962891
244: -14.5580573282505
246: -745.265747843277
247: -27.4914751125261
249: 765.993301213178
252: -779.070488102796
253: 260.870654095488
256: -167.512540432482
260: 500.496256176883
261: 154.658110085235
263: -283.771456599316
269: -290.767230905685
273: -267.261930887607
287: -873.176862603896
290: -275.849291520084
294: -253.914901250935
297: -128.455127996663
301: 124.654693281982
310: 635.468093664307
313: 529.938030541639
315: 344.558153757214
318: 275.797380250249
320: 515.147233810764
324: 360.99624827932
326: 640.151316635396
333: -263.592968531213
334: 301.800392890331
337: 156.41677285905
339: 837.666000730862
341: -442.59835950076
349: 269.352685067794
350: 164.461308183487
353: 330.934461783085
362: 749.372272109863
363: 1157.90071776272
364: 1467.98690640908
367: -1307.82298682662
369: -369.907837631945
371: 8.55918786834718
378: -518.091423304065
385: -38.7777573699041
386: -1257.72303612483
401: 105.58598585211
407: 1107.71459367047
411: -61.8491639383283
416: -114.941450808233
417: 39.4967294311318
420: -357.640950927755
421: -1298.78763853063
428: -465.546768930522
430: -688.336962442095
432: -8.71670702825577
434: -113.956006637627
435: -508.765298729691
439: 564.137504405736
440: 676.771691713918
441: 62.1420466892359
444: 950.246303412723
449: -2342.78907414074
450: -54.4162145103355
451: -606.518699737649
454: -410.331161637057
456: 680.581532196295
457: 1311.84552659748
459: 841.165171701703
460: 105.251012939656
462: 720.417162852131
463: 739.42303871724
469: 497.294585808875
473: 68.2395879234646
474: -1043.68762152674
479: -1030.38749578207
482: -231.088925227908
486: 333.158104007808
490: -546.437404817963
494: -452.396677374296
495: -1495.54012456611
499: -529.841428788905
500: -3125.88939401505
504: 768.099563331344
511: -300.038330469503
514: -666.423963631024
515: -1466.07222101221
519: 1525.07552956185
520: 1318.78163076801
528: 221.88868713969
531: 607.140129047698
532: 972.468915888687
534: 812.615812791173
539: -1622.18336221372
541: -101.038562619199
543: 537.735433806048
546: -2386.81379640261
548: -1230.8909373952
559: 887.838020629179
564: -239.277419277173
566: -166.571481994887
567: 277.738851829201
568: 1044.8366288049
573: 5.59776742172471
574: -3.5519188075823
577: 701.666797760642
579: 529.08474932321
580: 65.1725150070952
581: 241.73349899031
582: -753.726284102079
590: -229.900838325531
604: -474.101603616766
608: 686.48261198465
612: -596.979822634375
613: -440.402434731111
614: -430.506774066108
617: -672.8629038165
621: 808.411862976863
623: 938.355858780112
624: 1446.060719299
626: 679.110182611739
631: 1157.79054765846
636: 76.4167029949094
638: 958.328671756043
639: -188.387496732876
641: 184.692685714625
642: 713.03133082913
643: 417.93423473025
646: -2113.75073099363
649: 763.343662661803
655: 860.9952208547
665: 1042.35625861153
667: -1440.38894606363
673: -52.7962317524443
674: -38.1811177374657
676: 588.868585576954
677: 199.7717480102
679: 364.351552437936
684: 236.345805157055
688: -332.214411927917
695: -2432.86291656963
697: -467.290546112472
703: 429.381252938119
709: -1479.1771356844
722: -1754.72530786945
726: -157.330967603271

In [323]:

# Create an informative residual plot
plot(y_test, residuals_lm, xlab = 'Observed Count', ylab = 'Residuals', main = 'Residual Plot',
     col = ifelse(residuals_lm >= 0, 'blue', 'red'), pch = 16, cex = 1.2)

# Add a reference line at y = 0
abline(h = 0, col = 'black', lty = 2, lw = 2)

# Add legend for positive and negative residuals
legend('topright', legend = c('Positive Residuals', 'Negative Residuals'), col = c('blue', 'red'), pch = 16)

# Interpretation:
cat("\nInterpretation:\n")
cat("The residual plot shows the difference between observed and predicted counts.\n")
cat("Positive residuals (in blue) indicate underpredictions, while negative residuals (in red) indicate overpredictions.\n")
cat("Overall, the model seems to perform well, with residuals centered around zero.\n")

Interpretation:
The residual plot shows the difference between observed and predicted counts.
Positive residuals (in blue) indicate underpredictions, while negative residuals (in red) indicate overpredictions.
Overall, the model seems to perform well, with residuals centered around zero.

Section: Decision Tree Regressor¶

In [324]:

# Section: Decision Tree Regressor

# Set heading for the code block
cat("Decision Tree Regressor:\n")

# Set seed for reproducibility
set.seed(568)

# Load the rpart library for decision trees
library(rpart)

# Set control parameters for rpart
rpart.control <- rpart.control(minbucket = 2, cp = 0.01, maxcompete = 3, maxsurrogate = 4,
                               usesurrogate = 2, xval = 3, surrogatestyle = 0, maxdepth = 10)

# Train the decision tree regressor model
dtr <- rpart(train_encoded_attributes$count ~ ., data = train_encoded_attributes[,-c(6)],
             control = rpart.control, method = 'anova', cp = 0.01)

# Display summary of the decision tree model
summary(dtr)

# Interpretation:
cat("\nInterpretation:\n")
cat("The decision tree regressor is trained with control parameters to control its performance.\n")
cat("The summary provides insights into the structure of the decision tree, including splits, terminal nodes, and node statistics.\n")

Decision Tree Regressor:
Call:
rpart(formula = train_encoded_attributes$count ~ ., data = train_encoded_attributes[, 
    -c(6)], method = "anova", control = rpart.control, cp = 0.01)
  n= 511 

          CP nsplit rel error    xerror       xstd
1 0.38806588      0 1.0000000 1.0017384 0.04827550
2 0.21219513      1 0.6119341 0.6456161 0.03328136
3 0.08560458      2 0.3997390 0.4196777 0.03247223
4 0.04320558      3 0.3141344 0.3413733 0.02918864
5 0.03642203      4 0.2709288 0.3147070 0.03012851
6 0.02057878      5 0.2345068 0.2899876 0.03228089
7 0.01286813      6 0.2139280 0.2853027 0.03243160
8 0.01163559      7 0.2010599 0.2786163 0.03319573
9 0.01000000      8 0.1894243 0.2709887 0.03002257

Variable importance
   atemp     temp    month   year.0   year.1 season.1 season.4 season.3 
      19       18       16       13       13       11        3        3 
humidity 
       3 

Node number 1: 511 observations,    complexity param=0.3880659
  mean=4525.638, MSE=3761088 
  left son=2 (213 obs) right son=3 (298 obs)
  Primary splits:
      temp     < 0.432174  to the left,  improve=0.3880659, (0 missing)
      atemp    < 0.4308565 to the left,  improve=0.3841343, (0 missing)
      season.1 < 0.5       to the right, improve=0.3307881, (0 missing)
      year.0   < 0.5       to the right, improve=0.3157981, (0 missing)
  Surrogate splits:
      atemp    < 0.427125  to the left,  agree=0.996, adj=0.991, (0 split)
      month    splits as  LLLRRRRRRRLL, agree=0.894, adj=0.746, (0 split)
      season.1 < 0.5       to the right, agree=0.789, adj=0.493, (0 split)
      season.3 < 0.5       to the left,  agree=0.661, adj=0.188, (0 split)

Node number 2: 213 observations,    complexity param=0.08560458
  mean=3096.653, MSE=2305870 
  left son=4 (109 obs) right son=5 (104 obs)
  Primary splits:
      year.1   < 0.5       to the left,  improve=0.3349786, (0 missing)
      year.0   < 0.5       to the right, improve=0.3349786, (0 missing)
      season.4 < 0.5       to the left,  improve=0.3221560, (0 missing)
      season.1 < 0.5       to the right, improve=0.2837440, (0 missing)
  Surrogate splits:
      year.0 < 0.5       to the right, agree=1.000, adj=1.000, (0 split)
      month  splits as  RRLL-----LRR, agree=0.582, adj=0.144, (0 split)
      atemp  < 0.2578895 to the left,  agree=0.582, adj=0.144, (0 split)
      temp   < 0.2766665 to the left,  agree=0.573, adj=0.125, (0 split)

Node number 3: 298 observations,    complexity param=0.2121951
  mean=5547.027, MSE=2298442 
  left son=6 (142 obs) right son=7 (156 obs)
  Primary splits:
      year.1              < 0.5       to the left,  improve=0.5954153, (0 missing)
      year.0              < 0.5       to the right, improve=0.5954153, (0 missing)
      humidity            < 0.834375  to the right, improve=0.1504638, (0 missing)
      weather_condition.3 < 0.5       to the right, improve=0.1310491, (0 missing)
  Surrogate splits:
      year.0   < 0.5       to the right, agree=1.000, adj=1.000, (0 split)
      humidity < 0.6954165 to the right, agree=0.581, adj=0.120, (0 split)
      month    splits as  -RRRRLRRRRLR, agree=0.544, adj=0.042, (0 split)
      atemp    < 0.6682585 to the right, agree=0.540, adj=0.035, (0 split)

Node number 4: 109 observations,    complexity param=0.03642203
  mean=2238.174, MSE=1099667 
  left son=8 (73 obs) right son=9 (36 obs)
  Primary splits:
      season.4 < 0.5       to the left,  improve=0.5839973, (0 missing)
      season.1 < 0.5       to the right, improve=0.5208747, (0 missing)
      month    splits as  LLLL-----RRR, improve=0.4553324, (0 missing)
      atemp    < 0.2443185 to the left,  improve=0.2553346, (0 missing)
  Surrogate splits:
      month    splits as  LLLL-----RRR, agree=0.927, adj=0.778, (0 split)
      season.1 < 0.5       to the right, agree=0.872, adj=0.611, (0 split)
      atemp    < 0.379725  to the left,  agree=0.679, adj=0.028, (0 split)

Node number 5: 104 observations,    complexity param=0.04320558
  mean=3996.404, MSE=1988094 
  left son=10 (58 obs) right son=11 (46 obs)
  Primary splits:
      season.1 < 0.5       to the right, improve=0.4016095, (0 missing)
      season.4 < 0.5       to the left,  improve=0.3570016, (0 missing)
      atemp    < 0.2862915 to the left,  improve=0.3428914, (0 missing)
      temp     < 0.2804165 to the left,  improve=0.3328967, (0 missing)
  Surrogate splits:
      season.4 < 0.5       to the left,  agree=0.933, adj=0.848, (0 split)
      month    splits as  LLLR-----RRR, agree=0.894, adj=0.761, (0 split)
      temp     < 0.295     to the left,  agree=0.702, adj=0.326, (0 split)
      atemp    < 0.296882  to the left,  agree=0.702, adj=0.326, (0 split)

Node number 6: 142 observations,    complexity param=0.01286813
  mean=4320.873, MSE=686675.4 
  left son=12 (7 obs) right son=13 (135 obs)
  Primary splits:
      humidity            < 0.9041665 to the right, improve=0.2536354, (0 missing)
      weather_condition.3 < 0.5       to the right, improve=0.2522281, (0 missing)
      atemp               < 0.4592605 to the left,  improve=0.1985392, (0 missing)
      temp                < 0.4708335 to the left,  improve=0.1877138, (0 missing)
  Surrogate splits:
      weather_condition.3 < 0.5       to the right, agree=0.979, adj=0.571, (0 split)
      month               splits as  -RRRRRRRRRRL, agree=0.958, adj=0.143, (0 split)

Node number 7: 156 observations,    complexity param=0.02057878
  mean=6663.141, MSE=1151324 
  left son=14 (4 obs) right son=15 (152 obs)
  Primary splits:
      humidity            < 0.8322915 to the right, improve=0.22020730, (0 missing)
      temp                < 0.4920835 to the left,  improve=0.08552324, (0 missing)
      weather_condition.1 < 0.5       to the left,  improve=0.08512799, (0 missing)
      atemp               < 0.47979   to the left,  improve=0.08466814, (0 missing)

Node number 8: 73 observations
  mean=1675.411, MSE=389698.5 

Node number 9: 36 observations
  mean=3379.333, MSE=594878.7 

Node number 10: 58 observations,    complexity param=0.01163559
  mean=3200.638, MSE=961525.5 
  left son=20 (18 obs) right son=21 (40 obs)
  Primary splits:
      temp     < 0.2575    to the left,  improve=0.4009904, (0 missing)
      atemp    < 0.294643  to the left,  improve=0.3939220, (0 missing)
      month    splits as  RRR--------L, improve=0.2544737, (0 missing)
      humidity < 0.6497195 to the right, improve=0.1667501, (0 missing)
  Surrogate splits:
      atemp   < 0.2607295 to the left,  agree=0.966, adj=0.889, (0 split)
      month   splits as  RRR--------L, agree=0.776, adj=0.278, (0 split)
      weekday splits as  RLRRRRR, agree=0.741, adj=0.167, (0 split)

Node number 11: 46 observations
  mean=4999.761, MSE=1477300 

Node number 12: 7 observations
  mean=2488.143, MSE=229917 

Node number 13: 135 observations
  mean=4415.904, MSE=527163.2 

Node number 14: 4 observations
  mean=3559.25, MSE=4341614 

Node number 15: 152 observations
  mean=6744.822, MSE=807167.1 

Node number 20: 18 observations
  mean=2275, MSE=413281.7 

Node number 21: 40 observations
  mean=3617.175, MSE=649169.6 


Interpretation:
The decision tree regressor is trained with control parameters to control its performance.
The summary provides insights into the structure of the decision tree, including splits, terminal nodes, and node statistics.

Visualizing Enhanced Decision Tree Model

In [325]:

# Section: Visualizing Enhanced Decision Tree Model

# Set heading for the code block
cat("Visualizing Enhanced Decision Tree Model:\n")

# Load the rpart.plot library for plotting the learned decision tree model
library(rpart.plot)

# Set custom color palette for the decision tree plot
custom_palette <- c("#FF0000", "#FF6347", "#FFA500", "#008000", "#0000FF", "#4B0082")

# Plot the learned decision tree model with enhanced visualization
# Use 'col' parameter to set the color palette
rpart.plot(dtr, col = custom_palette, shadow.col = "blue", nn = TRUE, roundint = FALSE,
           main = "Enhanced Decision Tree Model", branch.lty = 3, under = TRUE, box.col = "lightblue",
           split.cex = 1.5, compress = TRUE, extra = 101, varlen = 0, tweak = 2.0)

# Interpretation:
cat("\nInterpretation:\n")
cat("The enhanced visualization of the learned decision tree model improves aesthetics and clarity.\n")
cat("Custom color palette and styling options are applied to make the plot more attractive.\n")
cat("Understanding the decision tree structure becomes more engaging and informative.\n")

Visualizing Enhanced Decision Tree Model:

Interpretation:
The enhanced visualization of the learned decision tree model improves aesthetics and clarity.
Custom color palette and styling options are applied to make the plot more attractive.
Understanding the decision tree structure becomes more engaging and informative.

Cross Validation Prediction for Decision Tree Regressor

In [326]:

# Section: Cross Validation Prediction for Decision Tree Regressor

# Set heading for the code block
cat("Cross Validation Prediction for Decision Tree Regressor:\n")

# Ignore warning messages during execution
options(warn = -1)

# Load the caret package
library(caret)

# Set seed for reproducibility of results
set.seed(5769)

# Define cross-validation resampling method
train.control <- trainControl(method = 'CV', number = 3)

# Perform cross-validation prediction using the decision tree regressor
dtr_CV_predict <- train(count ~ ., data = train_encoded_attributes, method = 'rpart', trControl = train.control)

# Display the cross-validation results
dtr_CV_predict

Cross Validation Prediction for Decision Tree Regressor:

CART 

511 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 341, 340, 341 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE      
  0.08560458  1255.472  0.5801237   981.6119
  0.21219513  1380.823  0.4889382  1133.4253
  0.38806588  1762.949  0.2659245  1462.8221

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.08560458.

Cross-validation prediction plot for decision tree regression

In [327]:

# Section: Cross-validation prediction plot for decision tree regression

# Create a new color palette for the plot
plot_colors <- c('skyblue', 'darkred')

# Extract predicted values from the list
predicted_counts <- as.numeric(dtr_CV_predict$pred)

# Calculate residuals
observed_counts <- as.numeric(train_encoded_attributes$count)
residuals_dr <- resid(dtr_CV_predict)
residuals_dr

X146: 356.12676056338
X312: 1108.34741784038
X9: -2274.65258215962
X240: 13.1267605633802
X633: 772.858974358975
X394: 146.347417840376
X505: 1630.85897435897
X2: -2295.65258215962
X329: -304.652582159624
X725: -2083.65258215962
X555: -1991.14102564103
X396: 1412.34741784038
X230: -515.87323943662
X558: 600.858974358975
X258: -661.87323943662
X284: 242.12676056338
X92: -844.652582159624
X226: -500.87323943662
X523: 391.858974358975
X346: 213.347417840376
X60: -1245.65258215962
X533: 1038.85897435897
X436: 1814.34741784038
X59: -1650.65258215962
X323: 566.347417840376
X348: 643.347417840376
X358: -2085.65258215962
X282: 1190.12676056338
X210: -474.87323943662
X18: -2413.65258215962
X281: 1088.12676056338
X272: 518.12676056338
X63: -1152.65258215962
X48: -1845.87323943662
X609: 686.858974358975
X603: -610.141025641025
X236: 809.12676056338
X228: 404.12676056338
X227: 17.1267605633802
X661: 394.858974358975
X571: 928.858974358975
X429: 326.347417840376
X67: -963.652582159624
X576: -66.1410256410254
X458: -727.141025641025
X425: -1262.65258215962
X390: 1173.34741784038
X160: -405.87323943662
X177: 984.12676056338
X382: -161.652582159624
X108: -891.87323943662
X187: 308.12676056338
X510: 106.858974358975
X647: 2381.34741784038
X300: -1661.87323943662
X583: -1199.14102564103
X406: 734.347417840376
X76: -352.652582159624
X713: 2435.34741784038
X498: 765.858974358975
X154: 991.12676056338
X414: 1221.34741784038
X110: -376.87323943662
X727: -982.652582159624
X179: 327.12676056338
X620: 1103.85897435897
X547: -976.141025641025
X599: 342.858974358975
X637: 751.858974358975
X84: -886.652582159624
X706: 2278.34741784038
X302: -2469.65258215962
X484: 1123.34741784038
X393: 926.347417840376
X648: -271.141025641025
X513: -72.1410256410254
X488: -494.141025641025
X550: -3.14102564102541
X328: -1601.65258215962
X405: 733.347417840376
X115: -247.87323943662
X372: 1424.34741784038
X606: 376.858974358975
X90: -1411.65258215962
X245: 406.12676056338
X293: -125.87323943662
X625: 669.858974358975
X652: 4012.34741784038
X1: -2111.65258215962
X442: 1172.85897435897
X188: 271.12676056338
X477: -39.1410256410254
X64: -1019.65258215962
X594: 941.858974358975
X686: 2601.34741784038
X172: 514.12676056338
X54: -1179.65258215962
X130: 482.12676056338
X483: -430.141025641025
X384: 195.347417840376
X702: 1552.34741784038
X347: 426.347417840376
X366: -802.652582159624
X127: 393.12676056338
X173: 186.12676056338
X137: -197.87323943662
X121: -969.87323943662
X279: 444.12676056338
X292: -1896.87323943662
X715: 1950.34741784038
X596: 1201.85897435897
X77: -1081.87323943662
X556: -94.1410256410254
X685: 2348.34741784038
X654: -788.141025641025
X360: -1779.65258215962
X497: 366.858974358975
X259: 439.12676056338
X605: 253.858974358975
X569: 746.858974358975
X588: -877.141025641025
X731: -367.652582159624
X501: -1548.14102564103
X381: -798.652582159624
X598: -133.141025641025
X168: 523.12676056338
X23: -2110.65258215962
X540: 794.858974358975
X503: 720.858974358975
X206: -480.87323943662
X8: -2137.65258215962
X171: -310.87323943662
X24: -1680.65258215962
X145: 657.12676056338
X194: 21.1267605633802
X34: -1546.65258215962
X134: -911.87323943662
X5: -1496.65258215962
X255: 392.12676056338
X705: -934.141025641025
X570: 302.858974358975
X408: -1567.65258215962
X129: 41.1267605633802
X586: 870.858974358975
X493: -390.141025641025
X285: -1904.87323943662
X51: -1284.65258215962
X248: -969.87323943662
X354: 653.347417840376
X93: 152.347417840376
X464: -1494.14102564103
X270: -200.87323943662
X340: -1726.87323943662
X398: 664.347417840376
X592: 120.858974358975
X251: -2478.87323943662
X630: 1503.85897435897
X656: 797.858974358975
X162: 645.12676056338
X379: -603.652582159624
X289: 720.12676056338
X693: 813.347417840376
X412: -91.6525821596242
X400: -264.652582159624
X218: -26.8732394366198
X691: 2049.34741784038
X507: -2304.14102564103
X452: 2005.34741784038
X85: -600.652582159624
X618: 1563.85897435897
X552: -422.141025641025
X38: -1384.65258215962
X185: 1722.12676056338
X476: 626.858974358975
X529: -1691.14102564103
X402: 687.347417840376
X672: 2750.34741784038
X306: 1089.34741784038
X49: -1393.87323943662
X557: -373.141025641025
X388: -664.652582159624
X418: 1676.34741784038
X182: 1041.12676056338
X192: -234.87323943662
X699: 2226.34741784038
X508: -590.141025641025
X327: -1754.87323943662
X696: 1990.34741784038
X730: -1300.65258215962
X212: -18.8732394366198
X357: -887.652582159624
X332: -453.87323943662
X478: -2069.65258215962
X377: 1000.34741784038
X712: 2222.34741784038
X678: 2218.34741784038
X487: -923.141025641025
X395: 527.347417840376
X383: 279.347417840376
X615: -460.141025641025
X522: ⋯
X711: 161.858974358975
X724: 938.347417840376
X370: 796.858974358975
X17: -1555.87323943662
X721: 683.858974358975
X207: 93.3474178403758
X277: -962.652582159624
X184: 618.858974358975
X299: -1646.65258215962
X19: -745.141025641025
X447: -2590.14102564103
X657: -452.141025641025
X491: 97.3474178403758
X521: -611.652582159624
X94: -2536.14102564103
X509: -965.141025641025
X135: 359.347417840376
X345: 1021.12676056338
X448: -2000.65258215962
X250: 596.12676056338
X275: 2362.34741784038
X640: -1078.14102564103
X41: -975.652582159624
X723: 2469.34741784038
X698: -1249.87323943662
X551: -1365.14102564103
X708: 474.12676056338
X403: 5.12676056338023
X211: 1266.34741784038
X114: 630.347417840376
X356: -201.652582159624
X316: -1019.65258215962
X322: -163.652582159624
X660: -1755.65258215962
X180: 1029.85897435897
X338: -1053.87323943662
X111: -819.652582159624
X610: -436.141025641025
X468: 3439.34741784038
X317: 1492.85897435897
X241: -1461.65258215962
X304: -853.141025641025
X231: 166.858974358975
X524: -1625.65258215962
X710: 1072.85897435897
X40: 1069.85897435897
X136: -288.652582159624
X423: -2091.65258215962
X132: 366.12676056338
X278: 2162.34741784038
X174: 1488.34741784038
X542: -1350.65258215962
X530: -1205.65258215962
X268: 1188.85897435897
X222: -2115.65258215962
X422: 991.12676056338
X413: -1035.87323943662
X578: -1823.14102564103
X158: 218.12676056338
X224: 829.347417840376
X229: -625.652582159624
X565: -1473.65258215962
X664: 191.858974358975
X689: 877.347417840376
X526: 883.12676056338
X716: -1601.14102564103
X283: -242.141025641025
X336: -62.8732394366198
X352: 802.858974358975
X221: 225.347417840376
X475: 31.1267605633802
X203: -1179.87323943662
X197: -43.6525821596242
X99: 971.347417840376
X150: 464.12676056338
X151: -436.652582159624
X632: -1119.65258215962
X517: -1595.65258215962
X720: -65.1410256410254
X119: -904.652582159624
X622: 1102.12676056338
X443: -1239.14102564103
X496: -934.652582159624
X690: -357.652582159624
X113: -671.652582159624
X117: 1911.34741784038
X591: 874.858974358975
X650: -1934.65258215962
X701: -1635.65258215962
X95: -1169.65258215962
X308: 870.12676056338
X481: -1566.65258215962
X445: -57.1410256410254
X266: 840.858974358975
X156: 2889.34741784038
X373: -2473.65258215962
X585: -407.652582159624
X472: -433.141025641025
X374: 130.12676056338
X165: 137.12676056338
X589: -456.141025641025
X404: 1301.85897435897
X587: 1194.12676056338
X191: -1673.14102564103
X700: 305.858974358975
X584: 465.858974358975
X467: -1407.87323943662
X27: -1690.65258215962
X602: 325.347417840376
X683: 171.12676056338
X536: 1211.34741784038
X311: -1281.14102564103
X470: 524.12676056338
X325: -413.87323943662
X593: -714.87323943662
X344: 896.12676056338
X61: 2532.34741784038
X651: 2170.34741784038
X53: 2460.34741784038
X466: 80.1267605633802
X627: -1413.65258215962
X537: -667.652582159624
X427: -1119.65258215962
X365: 1056.85897435897
X518: -2084.14102564103
X453: 473.347417840376
X392: -125.87323943662
X155: 1284.34741784038
X30: -1281.65258215962
X140: 523.347417840376
X455: 339.12676056338
X465: -553.87323943662
X82: 861.858974358975
X670: -2114.14102564103
X331: 760.858974358975
X437: -1224.65258215962
X265: 711.858974358975
X220: -2000.65258215962
X424: -1564.14102564103
X335: 215.858974358975
X100: -1775.65258215962
X80: -24.1410256410254
X314: -1127.65258215962
X729: -206.141025641025
X663: 927.858974358975
X104: -2491.65258215962
X694: 1033.85897435897
X549: -1.65258215962422
X680: 859.347417840376
X644: -972.87323943662
X50: -860.652582159624
X611: 978.347417840376
X563: 835.858974358975
X98: 501.347417840376
X525: 3755.34741784038
X635: 825.347417840376
X96: -919.652582159624
X36: -2590.65258215962
X298: 427.12676056338
X675: 467.12676056338
X717: -816.141025641025
X42: -1833.65258215962
X68: -758.141025641025
X666: 1101.85897435897
X22: -304.141025641025
X120: 2514.34741784038
X204: -359.141025641025
X554: 545.12676056338
X262: -1989.65258215962
X309: -1111.65258215962
X79: 1242.34741784038
X37: -46.8732394366198
X516: 1574.12676056338
X307: -1617.87323943662
X242: -785.652582159624
X419: 468.858974358975
X489: 664.12676056338
X193: 265.12676056338
X662: -6641.14102564103
X342: 484.858974358975
X264: 1509.85897435897
X97: -2342.65258215962
X321: 670.12676056338
X305: -28.6525821596242
X257: 798.12676056338
X355: -1570.65258215962
X387: 1054.34741784038
X31: 671.858974358975
X527: 1426.85897435897
X75: 287.12676056338
X267: 21.8589743589746
X658: 699.12676056338
X103: 234.347417840376
X351: -632.141025641025
X692: -127.141025641025
X707: 340.12676056338
X634: 981.12676056338
X361: -394.141025641025
X56: 2536.34741784038
X20: 725.12676056338

In [328]:

# Cross-validation prediction plot
plot(observed_counts, residuals_dr,
     xlab = 'Observed Counts', ylab = 'Residuals',
     main = 'Cross Validation Residual Plot', col = plot_colors[1], pch = 16)

# Add a horizontal line at y = 0 for reference
abline(h = 0, col = 'gray', lty = 2)

# Add a smooth line to visualize trends
lines(lowess(y_train, residuals_dr), col = plot_colors[2], lwd = 2)

# Add informative labels
text(4000, 1200, 'Residuals should be randomly scattered around zero.',
     adj = c(0, 0), col = 'darkgreen', font = 1)
text(4000, -800, 'A clear pattern indicates a model misfit.',
     adj = c(0, 0), col = 'darkred', font = 1)
text(4000, 400, 'Lowess smoother line to identify trends.',
     adj = c(0, 0), col = plot_colors[2], font = 1)
text(4000, 200, 'Plot show some finite varience b/w them & for some not.',
     adj = c(0, 0), col = plot_colors[2], font = 1)
# Add legend
legend('topright', legend = c('Residuals', 'Lowess Smoother'),
       col = plot_colors, pch = 16, lwd = 2, cex = 0.8)

Model Performance on the Test Dataset

In [329]:

# Section: Model Performance on the Test Dataset

# Set seed for reproducibility
set.seed(7882)

# Predict using the trained decision tree regression model
dtr_predictions <- predict(dtr, test_encoded_attributes[,-c(6)])

# Display the first 10 predictions
head(dtr_predictions, 20)

3: 1675.41095890411
4: 1675.41095890411
6: 1675.41095890411
7: 1675.41095890411
12: 1675.41095890411
14: 1675.41095890411
15: 1675.41095890411
16: 1675.41095890411
21: 1675.41095890411
28: 1675.41095890411
29: 1675.41095890411
32: 1675.41095890411
35: 1675.41095890411
43: 1675.41095890411
44: 1675.41095890411
45: 1675.41095890411
47: 1675.41095890411
55: 1675.41095890411
58: 1675.41095890411
62: 1675.41095890411

Decision Tree Regressor - Model Performance Visualization

In [330]:

# Section: Decision Tree Regressor - Model Performance Visualization

# Generate a random sample of 10 indices from the test dataset
sample_indices <- sample(1:nrow(test_encoded_attributes), 20)

# Extract the corresponding actual and predicted values for the sampled data
actual_values_dtr <- test_encoded_attributes$count
predicted_values_dtr <- dtr_predictions

# Plot all actual and predicted values using Decision Tree Regressor
plot(1:length(actual_values_dtr), actual_values_dtr, type = "l", col = "blue",
     xlab = "Sample Index", ylab = "Count", main = "Actual vs Predicted Values (Decision Tree Regressor)",
     xlim = c(1, length(actual_values_dtr) + 20), ylim = c(0, max(actual_values_dtr, predicted_values_dtr)))
lines(length(actual_values_dtr) + 1:length(predicted_values_dtr), predicted_values_dtr, col = "orange")

# Add legend
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "orange"), lty = 1)

# Highlight the last observed data point
points(length(actual_values_dtr), actual_values_dtr[length(actual_values_dtr)], pch = 19, col = "red")

# Highlight the starting point of predicted values
points(length(actual_values_dtr) + 1, predicted_values_dtr[1], pch = 19, col = "green")

# Add information
text(length(actual_values_dtr) + 5, max(actual_values_dtr, predicted_values_dtr)/2,
     "Actual values and Predicted values  using Decision Tree Regressor",
     pos = 1, col = "darkgreen", cex = 0.8)

Evaluation of Decision tree regressor on Root mean squared error and mean absolute error

In [331]:

#Section: Evaluation of Decision tree regressor on Root mean squared error and mean absolute error

set.seed(6889)
#Root mean squared error
rmse<-RMSE(y_test,dtr_predictions)
print(rmse)
#Mean absolute error
mae<-MAE(y_test,dtr_predictions)
print(mae)

[1] 927.4832
[1] 684.4555

Residual Plot for Decision Tree Regressor

In [332]:

# Section: Residual Plot for Decision Tree Regressor

# Calculate residuals for Decision Tree Regressor predictions
residuals_dtr <- y_test- dtr_predictions

# Plot the Residual plot
plot(y_test, residuals_dtr, xlab = 'Observed', ylab = 'Residuals', main = 'Residual Plot for Decision Tree Regressor', col = 'blue', pch = 16)
abline(0, 0, col = 'red', lwd = 2)

# Add informative text
text(2500, 800, "Some data points have the same finite variance between them,\nwhile for others, it may not hold true.", col = "red", cex = 0.8)

Section: Random Forest Regression¶

In [333]:

# Section: Random Forest Regression
# Load the randomForest library
library(randomForest)
# Set seed for reproducibility
set.seed(6788271)

# Train the Random Forest model
rf_model <- randomForest(count ~ ., data = train_encoded_attributes, importance = TRUE, ntree = 200)
rf_model

Call:
 randomForest(formula = count ~ ., data = train_encoded_attributes,      importance = TRUE, ntree = 200) 
               Type of random forest: regression
                     Number of trees: 200
No. of variables tried at each split: 6

          Mean of squared residuals: 493390.5
                    % Var explained: 86.88

Cross Validation Prediction for Random Forest

In [334]:

# Section: Cross Validation Prediction for Random Forest

# Ignore warning messages during execution
options(warn = -1)

# Set seed for reproducibility
set.seed(6772)

# Load the ranger library for Random Forest CV
library(ranger)

# Define cross-validation resampling method
train.control <- trainControl(method = 'CV', number = 3)

# Cross-validation prediction
rf_CV_predict <- train(count ~ ., data = train_encoded_attributes, method = 'ranger', trControl = train.control)
rf_CV_predict

cat("The Random Forest model was trained on 511 samples with 19 predictors. The training utilized cross-validated resampling with 3 folds, resulting in sample sizes of 340, 342, and 340 across the folds. The model was tuned with different values for the number of variables randomly sampled at each split (mtry) and the split rule. The final model, selected based on the smallest Root Mean Squared Error (RMSE), had the parameters mtry = 18, splitrule = extratrees, and min.node.size = 5. The performance metrics for the selected model were RMSE = 733.25, Rsquared = 0.8505, and MAE = 503.57, indicating a good fit to the data.\n")

Random Forest 

511 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 340, 342, 340 
Resampling results across tuning parameters:

  mtry  splitrule   RMSE       Rsquared   MAE     
   2    variance     964.1860  0.8413592  761.2039
   2    extratrees  1041.8814  0.8222903  818.2320
  18    variance     729.9331  0.8598411  507.8067
  18    extratrees   730.2087  0.8599708  504.0057
  34    variance     742.8772  0.8542833  514.7405
  34    extratrees   734.6122  0.8578812  507.2129

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 18, splitrule = variance
 and min.node.size = 5.

The Random Forest model was trained on 511 samples with 19 predictors. The training utilized cross-validated resampling with 3 folds, resulting in sample sizes of 340, 342, and 340 across the folds. The model was tuned with different values for the number of variables randomly sampled at each split (mtry) and the split rule. The final model, selected based on the smallest Root Mean Squared Error (RMSE), had the parameters mtry = 18, splitrule = extratrees, and min.node.size = 5. The performance metrics for the selected model were RMSE = 733.25, Rsquared = 0.8505, and MAE = 503.57, indicating a good fit to the data.

Cross Validation Prediction Plot for Random Forest

In [335]:

# Section: Cross Validation Prediction Plot for Random Forest

# Extracting residuals from Random Forest cross-validation predictions
residuals <- resid(rf_CV_predict)

# Plotting the Cross Validation Prediction Plot
plot(y_train, residuals, xlab = 'Observed', ylab = 'Residuals',
     main = 'Cross Validation Prediction Plot for Random Forest', col = 'blue', pch = 16, cex.main = 1.2)

# Adding a reference line at y = 0
abline(0, 0, col = 'red', lwd = 2)

# Adding informative labels and title
text(6000, 800, "Residuals from Random Forest cross-validation predictions",
     col = "darkgreen", cex = 0.8)

# Adding grid lines for better readability
grid()

# Adding a legend
legend('topright', legend = 'Residuals', col = 'blue', pch = 16, cex = 0.8)

# Adding more context to the observation
text(5500, 500, "Observation: The residuals should be randomly scattered around the reference line at y = 0.",
     col = "purple", cex = 0.8)
text(5500, 400, "Patterns or trends in the residuals may suggest areas where the model can be improved.",
     col = "purple", cex = 0.8)

cat("Observation: The residuals should ideally be randomly scattered around the reference line at y = 0,\n")
cat("indicating unbiased predictions. Patterns or trends in the residuals may suggest areas where the model can be improved.\n")

Observation: The residuals should ideally be randomly scattered around the reference line at y = 0,
indicating unbiased predictions. Patterns or trends in the residuals may suggest areas where the model can be improved.

Model Performance on Test Data for Random Forest

In [336]:

# Section: Model Performance on Test Data for Random Forest

# Set seed for reproducibility
set.seed(7889)

# Predict the Random Forest model on the test dataset
rf_predictions <- predict(rf_model, test_encoded_attributes[,-c(7)])

# Display the first 10 predictions
head(rf_predictions, 10)

3: 1378.27140567766
4: 1338.93530952381
6: 1329.46125208125
7: 1478.20645726496
12: 1393.55608821734
14: 1448.84804761905
15: 1447.3196504329
16: 1397.88702589078
21: 1440.23854059829
28: 1173.40045238095

Model performation visualisation based on random forest regressor

In [337]:

#Section: Model performation visualisation based on random forest regressor

# Generate a random sample of 10 indices from the test dataset
sample_indices <- sample(1:nrow(test_encoded_attributes), 10)

# Extract the corresponding actual and predicted values for the sampled data
actual_values_rf <- test_encoded_attributes$count
predicted_values_rf <- rf_predictions

# Plot all actual and predicted values using Random forest Regressor
plot(1:length(actual_values_rf), actual_values_rf, type = "l", col = "blue",
     xlab = "Sample Index", ylab = "Count", main = "Actual vs Predicted Values (Random forest Regressor)",
     xlim = c(1, length(actual_values_rf) + 10), ylim = c(0, max(actual_values_rf, predicted_values_rf)))
lines(length(actual_values_rf) + 1:length(predicted_values_rf), predicted_values_rf, col = "orange")

# Add legend
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "orange"), lty = 1)

# Highlight the last observed data point
points(length(actual_values_rf), actual_values_rf[length(actual_values_rf)], pch = 19, col = "red")

# Highlight the starting point of predicted values
points(length(actual_values_rf) + 1, predicted_values_rf[1], pch = 19, col = "green")

# Add information
text(length(actual_values_rf) + 5, max(actual_values_rf, predicted_values_rf)/2,
     "Actual values and Predicted values  using Random forest Regressor",
     pos = 1, col = "darkgreen", cex = 0.8)

Model Performance Evaluation of Random Forest regressor based on RMSE and MAE

In [338]:

# Section: Model Performance Evaluation of Random Forest regressor based on RMSE and MAE

# Set seed for reproducibility
set.seed(667)

# Root Mean Squared Error (RMSE) calculation
rmse <- RMSE(y_test, rf_predictions)
print(rmse)

# Mean Absolute Error (MAE) calculation
mae <- MAE(y_test, rf_predictions)
print(mae)

[1] 683.8066
[1] 467.8042

Residual Plot for Random Forest Regressor

In [339]:

# Section: Residual Plot for Random Forest Regressor

# Calculate residuals for Random Forest predictions
residuals <- y_test - rf_predictions

# Plot the Residual Plot
plot(y_test, residuals,
     xlab = 'Observed', ylab = 'Residuals',
     main = 'Residual Plot for Random Forest Regressor', col = 'blue', pch = 16)

# Add a reference line at y = 0
abline(0, 0, col = 'red', lty = 2)

# Enhance plot aesthetics
grid(col = "lightgray")
title(main = "Residual Plot for Random Forest Regressor", col.main = "darkgreen", font.main = 2)
axis(side = 1, col = "darkblue", col.axis = "darkblue")
axis(side = 2, col = "darkblue", col.axis = "darkblue")

# Add informative labels
text(3000, 1000, "The plot shows the residuals from the Random Forest predictions.",
     col = "red", cex = 0.8)

# Add observations
cat("Observation: In an ideal scenario, residuals should be randomly scattered around the reference line at y = 0.\n")
cat("Patterns or trends in the residuals may suggest areas where the model can be improved.\n")

Observation: In an ideal scenario, residuals should be randomly scattered around the reference line at y = 0.
Patterns or trends in the residuals may suggest areas where the model can be improved.

Section: Selecting best Model in all three for futher prediction¶

In [340]:

#Section: Selecting best Model in all three for futher prediction

# Calculate RMSE and MAE for each model
lm_rmse <- RMSE(lm_predictions, test_encoded_attributes$count)
lm_mae <- MAE(lm_predictions, test_encoded_attributes$count)

dtr_rmse <- RMSE(dtr_predictions, test_encoded_attributes$count)
dtr_mae <- MAE(dtr_predictions, test_encoded_attributes$count)

rf_rmse <- RMSE(rf_predictions, test_encoded_attributes$count)
rf_mae <- MAE(rf_predictions, test_encoded_attributes$count)

# Create an accuracy table
accuracy_table <- data.frame(
  Model = c("Linear Regression", "Decision Tree Regressor", "Random Forest Regressor"),
  RMSE = c(lm_rmse, dtr_rmse, rf_rmse),
  MAE = c(lm_mae, dtr_mae, rf_mae)
)

# Find the best model based on the minimum RMSE and MAE
best_model <- accuracy_table[which.min(accuracy_table$RMSE + accuracy_table$MAE), ]


# Print the accuracy table
print(accuracy_table)

                    Model     RMSE      MAE
1       Linear Regression 780.1233 574.8627
2 Decision Tree Regressor 927.4832 684.4555
3 Random Forest Regressor 683.8066 467.8042

The performance metrics for different regression models applied to the dataset:

Linear Regression:
- RMSE (Root Mean Squared Error): 872.4698
- MAE (Mean Absolute Error): 632.1726
Decision Tree Regressor:
- RMSE: 1010.8259
- MAE: 724.8092
Random Forest Regressor:
- RMSE: 722.8018
- MAE: 478.9133

Lower values of RMSE and MAE indicate better model performance. Therefore, the Random Forest Regressor model shows the best performance among the three models evaluated.

In [341]:

# Print the best model
print("Best Model:")
print(best_model)

[1] "Best Model:"
                    Model     RMSE      MAE
3 Random Forest Regressor 683.8066 467.8042

Section: Selecting Final Model as Random Forest Regressor for Prediction of Bike Rental Count¶

In [342]:

# Section: Selecting Final Model as Random Forest Regressor for Prediction of Bike Rental Count

# Combine observed and predicted values
Bike_predictions <- data.frame(Observed = y_test, Predicted = rf_predictions)

# Write predictions to a CSV file
write.csv(Bike_predictions, 'Bike_Renting_Predictions.csv', row.names = FALSE)

# Display the predictions
Bike_predictions

A data.frame: 220 × 2
	Observed	Predicted
	<dbl>	<dbl>
3	1349	1378.271
4	1562	1338.935
6	1606	1329.461
7	1510	1478.206
12	1162	1393.556
14	1421	1448.848
15	1248	1447.320
16	1204	1397.887
21	1543	1440.239
28	1167	1173.400
29	1098	1046.900
32	1360	1271.782
35	1708	1584.079
43	1472	1554.543
44	1589	1777.310
45	1913	2260.890
47	2115	1897.504
55	1807	1649.068
58	2402	1620.425
62	1685	1656.711
71	2132	2029.090
72	2417	2013.939
73	2046	2062.932
74	2056	1871.097
78	3117	3778.040
83	1865	2021.938
86	1693	2290.974
87	2028	2397.372
88	2425	2368.786
89	1536	2223.834
⋮	⋮	⋮
617	5976	5816.432
621	7870	7665.649
623	8009	7707.574
624	8714	7639.816
626	6869	6294.597
631	8395	7287.318
636	7393	7000.209
638	8555	7703.401
639	6889	7427.717
641	4639	4759.513
642	7572	6076.654
643	7328	6800.629
646	3510	5801.521
649	7691	7364.260
655	7534	6995.307
665	7444	6207.183
667	4459	6683.840
673	5138	4969.967
674	5107	5022.264
676	5686	4809.488
677	5035	4937.191
679	5992	5449.521
684	5495	5031.985
688	4669	4731.366
695	2424	3992.329
697	3959	3986.974
703	6234	6107.642
709	3228	4464.756
722	1749	2686.222
726	441	2211.657

Section: Conclusion¶

In [343]:

# Conclusion: Random Forest Model Outperforms
cat("When comparing RMSE and MAE of all 3 models, the random forest model shows the least errors.\n")
cat("Thus, the random forest model is considered the best for predicting daily bike rental counts.\n")

When comparing RMSE and MAE of all 3 models, the random forest model shows the least errors.
Thus, the random forest model is considered the best for predicting daily bike rental counts.