Bike rental prediction at its core represents an advanced application of predictive analytics and machine learning, employing a robust Random Forest model to forecast bicycle rental demand with unparalleled precision. This sophisticated model goes beyond traditional approaches by meticulously analyzing an array of factors, including seasonal patterns, weather conditions, and temporal trends, to provide nuanced insights into user behavior and rental dynamics.
By harnessing the power of the Random Forest algorithm, known for its ensemble of decision trees and enhanced accuracy, the predictive model enables rental service providers to make data-driven decisions. This includes optimizing inventory levels, tailoring pricing strategies, and streamlining operational processes. The Random Forest model excels at capturing complex relationships within the data, ensuring a more accurate prediction of bike rental counts.
This predictive tool serves as a strategic asset, not only anticipating demand fluctuations but also acting as a catalyst for informed decision-making. It empowers businesses to proactively adapt to changing market conditions, enhance resource allocation, and deliver an exceptional and responsive rental experience.
In summary, the bike rental prediction model, driven by the Random Forest algorithm, is a powerful and sophisticated solution that transforms data into actionable insights, fostering operational efficiency and elevating customer satisfaction in the dynamic landscape of bike rentals.
1. Conclusion
# Section: Loading Packages and Libraries for Visualizing Dataset and Performing ML Analysis
## Loading Important Libraries
#Before proceeding, ensure to install all the packages listed inside the `library()` function.
# Install ranger package if not already installed
# install.packages("ranger")
# ggplot2: A powerful and flexible plotting system for creating visualizations in R.
library(ggplot2)
# tidyverse: A collection of packages (including ggplot2) that provide a consistent and efficient data manipulation workflow.
library(tidyverse)
# explore: A package for exploratory data analysis, offering various visualizations and summary statistics.
library(explore)
# GGally: Extension to ggplot2, designed for creating ggplot2 visualizations with multiple plots.
library(GGally)
# ggridges: Creates ridge plots for visualizing the distribution of a numeric value by one or more categorical factors.
library(ggridges)
# Metrics: Provides various metrics for evaluating machine learning models, including RMSE and MAE.
library(Metrics)
# car: Companion to Applied Regression package, offering various regression-related functions.
library(car)
# corrgram: Package for creating correlograms to visualize correlation matrices.
library(corrgram)
# corrplot: Package for creating correlation plots.
library(corrplot)
# caret: Classification and Regression Training package, a comprehensive framework for building predictive models.
library(caret)
# randomForest: Implements random forest algorithms for classification and regression.
library(randomForest)
# ranger: Another implementation of random forest algorithms for faster performance.
library(ranger)
# DMwR2: Data Mining with R, provides functions and data sets for teaching data mining.
library(DMwR2)
# ipred: Improved Predictors package, provides bagging and bootstrapping algorithms for predictive modeling.
library(ipred)
# caTools: Provides various tools for data splitting and manipulation, often used in predictive modeling.
library(caTools)
# viridis: A colorblind-friendly color palette for data visualizations.
library(viridis)
# lubridate: Simplifies date and time handling in R.
library(lubridate)
# readxl: Package for reading Excel files.
library(readxl)
We're utilizing the readxl library in R to read data from an Excel file into a data frame. First, we load the readxl library using the library() function. Then, we specify the file path to the Excel file we want to read and store it in the variable file_path. Finally, we use the read_excel() function, passing the file path as an argument, to read the Excel file and create a data frame named bike_df. This process enables us to access and analyze the data from the Excel file within the R environment for further manipulation and exploration.
library(readxl)
# Provide the correct path to your Excel file
file_path <- "/content/1657875746_day.xlsx"
# Read the Excel file into a data frame
bike_df <- read_excel(file_path)
To display the first few rows of a data frame named bike_df
in R. It utilizes the head()
function, which is a commonly used command to provide a glimpse of the dataset's structure and contents. By executing head(bike_df)
, we print the initial rows of the data frame to the console, allowing us to inspect the data quickly. Alternatively, there's a comment indicating that the head.matrix()
function can also be used for the same purpose, though in this case, it's not being utilized. Overall, this code facilitates an initial exploration of the dataset, aiding in understanding its format and content. and we are doing same with the tail()
exploring last few row and columns of dataset.
# Print the first few rows of the data frame using the `head` function
head(bike_df)
# Alternatively, you can use `head.matrix` to print the first few rows
#head.matrix(bike_df)
# Explanation:
# The `head` function is commonly used to display the first few rows of a data frame.
# It helps to quickly inspect the structure and contents of the dataset.
instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dttm> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.1604460 | 331 | 654 | 985 |
2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.2485390 | 131 | 670 | 801 |
3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.2483090 | 120 | 1229 | 1349 |
4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.1602960 | 108 | 1454 | 1562 |
5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.1869000 | 82 | 1518 | 1600 |
6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.0895652 | 88 | 1518 | 1606 |
The output displayed is a representation of the first 6 rows and 16 columns of a dataset stored in a tibble (a modern data frame object in R). Each row corresponds to a specific observation, while each column represents a different variable or attribute of those observations. Here's a breakdown of the columns:
instant
: A numeric identifier for each observation.dteday
: The date and time of each observation.season
: Represents the season (e.g., 1 for spring, 2 for summer, etc.).yr
: Indicates the year (0 for 2011, 1 for 2012).mnth
: Represents the month of the year.holiday
: Binary indicator for whether it's a holiday (0 for no holiday, 1 for holiday).weekday
: Indicates the day of the week (0 for Sunday, 6 for Saturday).workingday
: Binary indicator for whether it's a working day (0 for non-working day, 1 for working day).weathersit
: Represents the weather situation.temp
: Normalized temperature.atemp
: Normalized feeling temperature.hum
: Normalized humidity.windspeed
: Normalized wind speed.casual
: Count of casual users.registered
: Count of registered users.cnt
: Total count of bike rentals, including both casual and registered users.Each cell in the table contains the corresponding value of the variable for that observation. This output allows for a quick visual inspection of the dataset's structure and initial values, facilitating data exploration and understanding.
# Print the last few rows of the data frame using the `tail` function
#tail(bike_df)
# Alternatively, you can use `tail.matrix` to print the last few rows
tail.matrix(bike_df)
# Explanation:
# The `tail` function is commonly used to display the last few rows of a data frame.
# It helps to check the end of the dataset and ensures data has been read correctly.
instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dttm> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
726 | 2012-12-26 | 1 | 1 | 12 | 0 | 3 | 1 | 3 | 0.243333 | 0.220333 | 0.823333 | 0.316546 | 9 | 432 | 441 |
727 | 2012-12-27 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 | 247 | 1867 | 2114 |
728 | 2012-12-28 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 | 644 | 2451 | 3095 |
729 | 2012-12-29 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 | 159 | 1182 | 1341 |
730 | 2012-12-30 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 | 364 | 1432 | 1796 |
731 | 2012-12-31 | 1 | 1 | 12 | 0 | 1 | 1 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
The dim(bike_df)
function returns a vector containing the number of rows and columns in the dataset. By combining this information with a descriptive string, "Dimension of dataset: "
, using paste
, the code generates a human-readable output that displays the dataset's dimensions.
# Using the paste function to concatenate strings and print information about the dataset dimensions
# dim(bikes_df) returns a vector with the number of rows and columns in the dataset
# The concatenated string includes "Dimension of dataset: " followed by the number of rows and columns
paste("Dimension of dataset: ", dim(bike_df))
The output "Dimension of dataset: 731 16" indicates that the dataset contains 731 rows and 16 columns.
In this code segment, several preprocessing tasks are performed on the dataset bike_df
to prepare it for exploratory data analysis (EDA). The steps involved are as follows:
bike_df
are renamed for better clarity and understanding of the data attributes.datetime
column is converted to the Date format, and categorical variables are converted to factors.str()
function, allowing for a comprehensive understanding of variable types and distributions post-conversion.Throughout the process, the output provides visibility into the changes made, including the updated column names, summary statistics, and the revised dataset structure, ensuring the data is appropriately prepared for subsequent analysis. Additionally, the code assesses missing values, highlighting potential data integrity issues that may require further investigation.
#Rename the columns
names(bike_df)<-c('record_id','datetime','season','year','month','holiday','weekday','workingday','weather_condition','temp','atemp','humidity','windspeed','casual_users','registerd_users','count')
head(bike_df)
record_id | datetime | season | year | month | holiday | weekday | workingday | weather_condition | temp | atemp | humidity | windspeed | casual_users | registerd_users | count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dttm> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.1604460 | 331 | 654 | 985 |
2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.2485390 | 131 | 670 | 801 |
3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.2483090 | 120 | 1229 | 1349 |
4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.1602960 | 108 | 1454 | 1562 |
5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.1869000 | 82 | 1518 | 1600 |
6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.0895652 | 88 | 1518 | 1606 |
Summary()
function provides a quick overview of key statistics for each variable in the dataset. The summary includes various descriptive statistics such as the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum values for numerical variables. For categorical variables, it displays the frequency of each category. This summary serves as an initial exploration step to understand the distribution, central tendency, and spread of the data across different attributes. It helps in identifying potential issues such as outliers, skewed distributions, or missing values, thereby guiding further data preprocessing and analysis tasks.
# Summary of the dataset using the summary() function to get a quick overview of key statistics.
summary(bike_df)
record_id datetime season year Min. : 1.0 Min. :2011-01-01 00:00:00 Min. :1.000 Min. :0.0000 1st Qu.:183.5 1st Qu.:2011-07-02 12:00:00 1st Qu.:2.000 1st Qu.:0.0000 Median :366.0 Median :2012-01-01 00:00:00 Median :3.000 Median :1.0000 Mean :366.0 Mean :2012-01-01 00:00:00 Mean :2.497 Mean :0.5007 3rd Qu.:548.5 3rd Qu.:2012-07-01 12:00:00 3rd Qu.:3.000 3rd Qu.:1.0000 Max. :731.0 Max. :2012-12-31 00:00:00 Max. :4.000 Max. :1.0000 month holiday weekday workingday Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000 Median : 7.00 Median :0.00000 Median :3.000 Median :1.000 Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000 Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000 weather_condition temp atemp humidity Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200 Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267 Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302 Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725 windspeed casual_users registerd_users count Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152 Median :0.18097 Median : 713.0 Median :3662 Median :4548 Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956 Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
The summary output provides valuable insights into the dataset bike_df
, revealing key statistics for each variable:
record_id: The dataset consists of 731 records, with IDs ranging from 1 to 731. Each record represents a unique observation.
datetime: The observations span from January 1, 2011, to December 31, 2012, indicating data collected over two years. The median date falls on January 1, 2012, suggesting an equal distribution of observations across both years.
season: The dataset comprises four seasons, with each season having approximately the same number of observations. Season 3 has the highest frequency, indicating a relatively higher occurrence of observations during that season.
year: The dataset evenly represents both years 2011 and 2012, with approximately 50% of the observations belonging to each year.
month: Observations are spread across all twelve months, with a slightly higher median month (July), indicating a relatively higher number of observations during the summer months.
holiday: Most observations occur on non-holiday days, as indicated by the low median and mean values. However, there is a small proportion of observations on holidays.
weekday: Observations are evenly distributed across weekdays, with each weekday having a similar frequency of occurrences.
workingday: Approximately 68.4% of the observations represent working days, indicating a higher frequency of data collection during typical workdays.
weather_condition: The dataset captures three weather conditions. Condition 1 occurs most frequently, suggesting predominantly good weather conditions during the data collection period.
temp, atemp, humidity, windspeed: These numerical attributes exhibit varying distributions. Temperature and apparent temperature (atemp) range from 0.059 to 0.862, indicating a wide range of temperatures recorded. Humidity ranges from 0 to 0.973, while windspeed varies from 0.022 to 0.507, suggesting different weather conditions throughout the dataset.
casual_users, registered_users, count: The number of casual and registered users, as well as the total count of bike rentals, varies widely across observations. The mean count of bike rentals is approximately 4504, with a minimum of 22 and a maximum of 8714, highlighting significant variability in rental patterns.
Date type conversion of attributes
str(bike_df)
function offers a concise overview of the dataset's internal composition, showcasing the data types of each variable/column alongside a glimpse of their values. This succinct representation aids in understanding the dataset's structure, identifying any missing values, and grasping the distribution of variables.
# Display the structure of the dataset using the str() function to understand variable types and their distribution.
str(bike_df)
tibble [731 × 16] (S3: tbl_df/tbl/data.frame) $ record_id : num [1:731] 1 2 3 4 5 6 7 8 9 10 ... $ datetime : POSIXct[1:731], format: "2011-01-01" "2011-01-02" ... $ season : num [1:731] 1 1 1 1 1 1 1 1 1 1 ... $ year : num [1:731] 0 0 0 0 0 0 0 0 0 0 ... $ month : num [1:731] 1 1 1 1 1 1 1 1 1 1 ... $ holiday : num [1:731] 0 0 0 0 0 0 0 0 0 0 ... $ weekday : num [1:731] 6 0 1 2 3 4 5 6 0 1 ... $ workingday : num [1:731] 0 0 1 1 1 1 1 0 0 1 ... $ weather_condition: num [1:731] 2 2 1 1 1 1 2 2 1 1 ... $ temp : num [1:731] 0.344 0.363 0.196 0.2 0.227 ... $ atemp : num [1:731] 0.364 0.354 0.189 0.212 0.229 ... $ humidity : num [1:731] 0.806 0.696 0.437 0.59 0.437 ... $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ... $ casual_users : num [1:731] 331 131 120 108 82 88 148 68 54 41 ... $ registerd_users : num [1:731] 654 670 1229 1454 1518 ... $ count : num [1:731] 985 801 1349 1562 1600 ...
The output indicates the structure of the dataset, presenting it as a tibble with 731 rows and 16 columns. Each column is described along with its respective data type and example values.
record_id
, season
, year
, month
, holiday
, weekday
, workingday
, and weather_condition
are numeric variables with corresponding values.datetime
is of POSIXct type, representing date and time data, with dates ranging from "2011-01-01" to "2012-12-31".temp
, atemp
, humidity
, windspeed
, casual_users
, registered_users
, and count
are also numeric, with varying ranges of values.So Now, We will typecaste variables like we have a variable year in a dataset that represents years as numeric values (e.g., 2011, 2012), but we want to treat it as a categorical variable.
we have a variable datetime representing dates and times as character strings but want to treat them as dates.
#Typecasting the datetime and numerical attributes to category
# Convert 'datetime' to Date format
bike_df$datetime<- as.Date(bike_df$datetime)
bike_df$year<-as.factor(bike_df$year)
bike_df$month<-as.factor(bike_df$month)
bike_df$season <- as.factor(bike_df$season)
bike_df$holiday<- as.factor(bike_df$holiday)
bike_df$weekday<- as.factor(bike_df$weekday)
bike_df$workingday<- as.factor(bike_df$workingday)
bike_df$weather_condition<- as.factor(bike_df$weather_condition)
# Display the structure of the dataset using the str() function to understand variable types and their distribution.
str(bike_df)
tibble [731 × 16] (S3: tbl_df/tbl/data.frame) $ record_id : num [1:731] 1 2 3 4 5 6 7 8 9 10 ... $ datetime : Date[1:731], format: "2011-01-01" "2011-01-02" ... $ season : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ... $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... $ holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... $ weekday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ... $ workingday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ... $ weather_condition: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ... $ temp : num [1:731] 0.344 0.363 0.196 0.2 0.227 ... $ atemp : num [1:731] 0.364 0.354 0.189 0.212 0.229 ... $ humidity : num [1:731] 0.806 0.696 0.437 0.59 0.437 ... $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ... $ casual_users : num [1:731] 331 131 120 108 82 88 148 68 54 41 ... $ registerd_users : num [1:731] 654 670 1229 1454 1518 ... $ count : num [1:731] 985 801 1349 1562 1600 ...
Specifically, we convert the datetime column to Date format and the numerical attributes (year, month, season, holiday, weekday, workingday, and weather_condition) to factor variables. This conversion helps in treating these variables appropriately for analysis, ensuring they are interpreted correctly and facilitating categorical data manipulation.
Section: Missing Value ANalysis:
Missing value analysis involves identifying and quantifying missing values in a dataset, visualizing their patterns, and deciding on appropriate strategies for handling them, such as imputation or deletion, based on the analysis goals and the nature of the data.
#Missing values in dataset
missing_val<-data.frame(apply(bike_df,2,function(x){sum(is.na(x))}))
names(missing_val)[1]='missing_val'
missing_val
missing_val | |
---|---|
<int> | |
record_id | 0 |
datetime | 0 |
season | 0 |
year | 0 |
month | 0 |
holiday | 0 |
weekday | 0 |
workingday | 0 |
weather_condition | 0 |
temp | 0 |
atemp | 0 |
humidity | 0 |
windspeed | 0 |
casual_users | 0 |
registerd_users | 0 |
count | 0 |
Now, After handling missing values we don't have any missing values present in dataset.
Visualization of Numerical Variables
# Section: Visualization of Numerical Variables
# Visualize the relationships and distributions of key numerical variables using a pairs plot.
# This plot includes temperature ('temp'), "feels-like" temperature ('atemp'), humidity ('hum'),
# windspeed ('windspeed'), casual rentals ('casual'), registered rentals ('registered'),
# and the total number of rentals ('cnt'). The plot helps identify potential patterns, correlations,
# and outliers among these variables.
# Reduce the outer margins to avoid "figure margins too large" error
par(mar = c(0.1, 0.1, 0.1, 0.1))
num_vars <- c("temp", "atemp", "humidity", "windspeed", "casual_users", "registerd_users", "count")
pairs(bike_df[, num_vars],
main = "Pairs Plot of Numerical Variables",
col = bike_df$season, # Color points based on the season for additional insights
pch = 16) # Use filled circles for better visibility
# Adding custom titles for each panel of the pairs plot.
for (i in 1:length(num_vars)) {
for (j in 1:length(num_vars)) {
panel_var1 <- num_vars[i]
panel_var2 <- num_vars[j]
panel_title <- paste("Relationship between", panel_var1, "and", panel_var2)
panel_subtitle <- ifelse(i == j, "Distribution", paste("Colored by Season:", unique(bike_df$season)))
title(panel_title, line = 2.5, cex.main = 0.8)
title(panel_subtitle, line = 4, cex.sub = 0.6)
}
}
# Reset the outer margins to default after creating the plot
par(mar = c(5, 4, 4, 2) + 0.1)
Section: Exploring Bike Rentals Distribution
# Section: Exploring Bike Rentals Distribution
# Explore the distribution of bike rentals using a histogram.
# This plot provides an overview of the distribution of total bike rentals ('cnt') per observation.
# Plotting the histogram to visualize the distribution of bike rentals.
hist(bike_df$count,
main = "Distribution of Bike Rentals",
xlab = "Number of Rentals",
col = "#75AADB", # Custom color for better visibility
border = "#333333", # Border color
breaks = 30, # Adjust the number of bins
xlim = c(0, max(bike_df$count) + 50), # Set x-axis limits for better readability
las = 1, # Keep axis labels horizontal
cex.main = 1.5, # Increase main title size
cex.lab = 1.2, # Increase axis label size
cex.axis = 1.2) # Increase axis tick label size
# Adding informative elements to the plot.
abline(v = mean(bike_df$count), col = "red", lwd = 2, lty = 2) # Add a vertical line for mean
text(mean(bike_df$count) + 10, 250, "Mean", col = "red", font = 2, cex = 1.2) # Label for the mean
# Adding additional information inside the plot
mtext(" The distribution is slightly right-skewed.", side = 1, line = 2, cex = 1.1)
mtext("The mean number of rentals is marked by a red dashed line.", side = 3, line = 3, cex = 1.1)
Histogram of Target Variable- "count"
# Load the required library for data visualization
library(ggplot2)
# Create a histogram to explore the distribution of the target variable 'count'.
ggplot(bike_df, aes(x = count)) +
geom_histogram(bins = 30, colour = "black", fill = "#56B4E9") +
ggtitle("Distribution of Bike Rentals ('count')") +
xlab("Count Variable") + ylab("Density of the Sample") +
theme(
plot.title = element_text(color = "blue", size = 18, face = "bold"),
axis.title.x = element_text(color = "blue", size = 14),
axis.title.y = element_text(color = "blue", size = 14)
) +
# Adding informative elements on the plot
geom_vline(aes(xintercept = mean(bike_df$count)),
color = "red", linetype = "dashed", linewidth = 1.2) +
annotate("text", x = mean(bike_df$count) + 10, y = 30,
label = "Mean", color = "red", size = 5)
Log Transformation of Bike Rentals
# Section: Log Transformation of Bike Rentals
# Create a histogram and density plot after applying log transformation to 'count'
ggplot(bike_df, aes(x = log(count))) +
geom_histogram(aes(y = after_stat(density)), bins = 30, colour = "black", fill = "grey") +
geom_density(alpha = 0.2, fill = "cyan") +
ggtitle("Distribution of Log-Transformed Bike Rentals ('count')") +
xlab("Log-Transformed Count Variable") + ylab("Density of the Sample") +
theme(
plot.title = element_text(color = "black", size = 18, face = "bold"),
axis.title.x = element_text(color = "blue", size = 14),
axis.title.y = element_text(color = "blue", size = 14)
) +
# Adding additional information on the plot
geom_vline(aes(xintercept = mean(log(bike_df$count))),
color = "red", linetype = "dashed", linewidth = 1.2) +
annotate("text", x = mean(log(bike_df$count)) + 0.1, y = 0.2,
label = "Mean (log-transformed)", color = "red", size = 4) +
geom_vline(aes(xintercept = median(log(bike_df$count))),
color = "green", linetype = "dashed", linewidth = 1.2) +
annotate("text", x = median(log(bike_df$count)) + 0.1, y = 0.4,
label = "Median (log-transformed)", color = "green", size = 4) +
# Adding insights on the plot
annotate("text", x = 6, y = 0.6,
label = "Insights:", color = "red", size = 4, fontface = "bold") +
annotate("text", x = 6, y = 0.55,
label = "1. Applying a log transformation has helped in reducing skewness.", color = "red", size = 2) +
annotate("text", x = 6, y = 0.5,
label = "2. The mean and median values are marked for reference.", color = "red", size = 2)
Creating Correlogram with ggpairs
# Section: Creating Correlogram with ggpairs
# Create a new plotting device
#dev.new()
#install.packages("viridis")
# Load necessary libraries
library(GGally)
library(viridis)
# Create a correlogram with ggpairs
ggpairs(bike_df, title = "Correlogram with ggpairs()")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggcorr(bike_df, method = c("everything", "pearson"), label=TRUE) +
ggtitle("Plot of correlation variables") +
theme(plot.title = element_text(color="blue", size=18, face="bold"))
cat("With the correlation plot, we can see how the variables that most interfere/correlate with the target are temperature, apparent temperature (atemp), weather, year, season, and finally, wind.\n")
With the correlation plot, we can see how the variables that most interfere/correlate with the target are temperature, apparent temperature (atemp), weather, year, season, and finally, wind.
Exploring Data with 'explore' Package
# Load required packages
library(explore)
library(lubridate)
# Extract the year from the 'datetime' variable
bike_df$year_from_date <- year(bike_df$datetime)
# Set display format to show full numeric values
options(scipen = 999)
# Use 'explore' to analyze the dataset, focusing on the 'cnt' variable with the extracted 'year_from_date' as the target
explore(bike_df, count, target = year_from_date)
Exploring Correlation Among Numerical Variables
# Section: Exploring Correlation Among Numerical Variables
# Explore the correlation matrix for numerical variables.
# The matrix and the subsequent heatmap help identify strong correlations among variables.
# Calculate the correlation matrix
cor_matrix <- cor(bike_df[, c("temp", "atemp", "humidity", "windspeed", "casual_users", "registerd_users", "count")])
# Display the correlation matrix
cat("Correlation Matrix:\n")
print(cor_matrix)
Correlation Matrix: temp atemp humidity windspeed casual_users temp 1.0000000 0.9917016 0.12696294 -0.1579441 0.54328466 atemp 0.9917016 1.0000000 0.13998806 -0.1836430 0.54386369 humidity 0.1269629 0.1399881 1.00000000 -0.2484891 -0.07700788 windspeed -0.1579441 -0.1836430 -0.24848910 1.0000000 -0.16761335 casual_users 0.5432847 0.5438637 -0.07700788 -0.1676133 1.00000000 registerd_users 0.5400120 0.5441918 -0.09108860 -0.2174490 0.39528245 count 0.6274940 0.6310657 -0.10065856 -0.2345450 0.67280443 registerd_users count temp 0.5400120 0.6274940 atemp 0.5441918 0.6310657 humidity -0.0910886 -0.1006586 windspeed -0.2174490 -0.2345450 casual_users 0.3952825 0.6728044 registerd_users 1.0000000 0.9455169 count 0.9455169 1.0000000
This output is a correlation matrix, displaying the correlation coefficients between pairs of variables in the dataset. Each cell in the matrix represents the correlation coefficient between two variables, ranging from -1 to 1.
In this specific output:
temp
and atemp
have a very high positive correlation coefficient of around 0.99, indicating they are highly correlated.Humidity
and windspeed
have a negative correlation, implying that as humidity increases, windspeed tends to decrease, and vice versa.count
, representing the total bike rentals, shows a strong positive correlation with casual_users
and registered_users
, indicating that as the number of casual or registered users increases, the total count of bike rentals tends to increase as well.# Create a correlation heatmap for a visual representation of variable relationships.
# Strong positive or negative correlations are visually highlighted.
library(corrplot)
# Customize the correlation plot
corrplot(
cor_matrix,
method = "color",
font = 4,
col = colorRampPalette(c("#4575b4", "#91bfdb", "#e0f3f8", "#fee090", "#d73027"))(100), # Attractive color scheme
tl.col = "black",
tl.srt = 40,
addCoef.col = "black",
order = "hclust",
addrect = 3 # Highlight cells with correlations above a certain threshold
)
# Add informative elements to the plot
title("Correlation heatmap", line = 0.5, cex.main = 1.8) # Lowercase title
mtext("Strong Positive Correlation", side = 3, line = 2, col = "#d73027", cex = 0.8)
mtext("Strong Negative Correlation", side = 1, line = 3, col = "#4575b4", cex = 0.8)
Season and Weekday Analysis
# Load the required library
library(dplyr)
# Plot season-wise monthly distribution of counts
season_month_plot <- ggplot(bike_df, aes(x = month, y = count, fill = season)) +
theme_minimal() +
geom_col(position = "dodge", color = "black", alpha = 0.8) +
labs(x = 'Month', y = 'Total Count',
title = 'Season-wise Monthly Distribution of Counts',
subtitle = 'Comparison of counts across different seasons',
fill = 'Season') +
scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) # Custom colors
# Add data labels on top of each bar
season_month_plot <- season_month_plot +
geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3)
# Highlight the peak month for each season
peak_months <- bike_df %>%
group_by(season) %>%
slice(which.max(count))
season_month_plot <- season_month_plot +
geom_point(data = peak_months, aes(x = month, y = count), color = "red", size = 3) +
geom_text(data = peak_months, aes(x = month, y = count, label = paste("Peak\nMonth")),
vjust = -0.5, hjust = 1, color = "red", size = 3)
# Customize legend
season_month_plot <- season_month_plot +
guides(fill = guide_legend(title = "Season"))
# Print the enhanced season-wise plot
print(season_month_plot)
** Plot weekday-wise monthly distribution of counts**
# Plot weekday-wise monthly distribution of counts
weekday_month_plot <- ggplot(bike_df, aes(x = month, y = count, fill = factor(weekday))) +
theme_minimal() +
geom_col(position = "dodge", color = "white", alpha = 0.8) +
labs(x = 'Month', y = 'Total Count',
title = 'Weekday-wise Monthly Distribution of Counts',
subtitle = 'Comparison of counts across different weekdays',
fill = 'Weekday') +
scale_fill_discrete(name = 'Weekday', labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")) + # Customize weekday labels
# Add data labels on top of each bar
geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3)
# Highlight the peak day of the week
peak_days <- bike_df %>%
group_by(weekday) %>%
slice_max(order_by = count)
# Customize legend
weekday_month_plot <- weekday_month_plot +
guides(fill = guide_legend(title = "Weekday"))
# Print the enhanced weekday-wise plot
print(weekday_month_plot)
Bike Rentals by Season
# Section: Bike Rentals by Season
# Explore the distribution of bike rentals based on the season using a boxplot.
# This boxplot shows how the total number of rentals ('cnt') varies across different seasons.
# Create a boxplot to visualize bike rentals by season
ggplot(bike_df, aes(x = season, y = count, fill = as.factor(season))) +
geom_boxplot() +
ggtitle("Bike Rentals by Season") +
xlab("Season") + ylab("Number of Rentals") +
scale_fill_manual(values = c("#FF6666", "#FFCC66", "#99FF99", "#6666FF"), name = "Season") +
theme_minimal() +
theme(
plot.title = element_text(color = "blue", size = 18, face = "bold"),
axis.title.x = element_text(color = "blue", size = 14),
axis.title.y = element_text(color = "blue", size = 14),
legend.position = "top",
legend.title = element_text(color = "blue", size = 14)
) +
# Adding additional information on the plot
annotate("text", x = c(1, 2, 3, 4), y = 6000,
label = c("Spring", "Summer", "Fall", "Winter"),
color = "blue", size = 4, fontface = "bold") +
annotate("text", x = 1, y = 7000, label = "Median", color = "red", size = 4) +
annotate("text", x = 2, y = 7000, label = "Median", color = "red", size = 4) +
annotate("text", x = 3, y = 7000, label = "Median", color = "red", size = 4) +
annotate("text", x = 4, y = 7000, label = "Median", color = "red", size = 4) +
geom_hline(yintercept = median(bike_df$count), linetype = "dashed", color = "red", linewidth = 1.2)
Violin Plot for Yearly Distribution of Counts
# Section: Violin Plot for Yearly Distribution of Counts
library(ggplot2)
# Violin plot for Yearly wise distribution of counts
yearly_violin_plot <- ggplot(bike_df, aes(x = year, y = count, fill = as.factor(year))) +
geom_violin() +
theme_bw() +
labs(x = 'Year', y = 'Total Count',
title = 'Yearly Distribution of Counts',
subtitle = 'Comparison of counts across different years',
fill = 'Year') +
# Add data points on top of the violin plot for better insight
geom_jitter(aes(color = as.factor(year)), width = 0.2, alpha = 0.5) +
# Customize legend
scale_fill_manual(values = c('#1F78B4', '#33A02C'), name = 'Year') +
scale_color_manual(values = c('#1F78B4', '#33A02C'), name = 'Year') +
# Add a box plot inside the violin plot
geom_boxplot(width = 0.1, fill = 'white', color = 'black', alpha = 0.7) +
# Improve plot appearance
theme(legend.position = 'top') +
# Add informative annotations
annotate('text', x = c(0.8, 1.2), y = c(6000, 6000), label = c('2011', '2012'), size = 4, color = 'black') +
annotate('text', x = 1, y = 8000, label = 'Data Distribution', size = 6, color = 'black', fontface = 'bold') +
annotate('text', x = 1, y = 7000, label = 'Comparison of counts across different years using violin plots and box plots.', size = 3, color = 'black') +
annotate('text', x = 0.8, y = 6800, label = 'Data points are added for better insight.', size = 3, color = 'black')
# Print the Violin plot
print(yearly_violin_plot)
cat("From the voilin plot, we can analysis that the bike rental count distribution is highest in year 2012 then the previous year.\nIn the graph, year 0 represent 2011 and year 1 represent 2012 respectively\n")
From the voilin plot, we can analysis that the bike rental count distribution is highest in year 2012 then the previous year. In the graph, year 0 represent 2011 and year 1 represent 2012 respectively
Exploring Bike Rentals During Holidays
# Section: Exploring Bike Rentals During Holidays
# Explore summary statistics for the 'count' variable based on the 'holiday' status.
# This analysis provides insights into whether there are notable differences in rentals during holidays.
# Use tapply to calculate summary statistics for 'count' grouped by 'holiday' status
cnt_summary_by_holiday <- tapply(bike_df$count, bike_df$holiday, summary)
# Display the summary statistics
cat("Summary Statistics for 'count' Variable Based on 'holiday' Status:\n")
print(cnt_summary_by_holiday)
Summary Statistics for 'count' Variable Based on 'holiday' Status: $`0` Min. 1st Qu. Median Mean 3rd Qu. Max. 22 3214 4558 4527 5933 8714 $`1` Min. 1st Qu. Median Mean 3rd Qu. Max. 1000 1951 3351 3735 6034 7403
This output presents summary statistics for the variable 'count', representing the total bike rentals, categorized by 'holiday' status. Here's a breakdown:
For days classified as non-holidays (holiday status = 0):
During holidays (holiday status = 1):
Create an analytical plot to visualize the distribution of rentals during holidays and non-holidays
# Create an analytical plot to visualize the distribution of rentals during holidays and non-holidays
par(mfrow = c(1, 3)) # Set up a 1x3 grid for side-by-side plots
# Boxplot for Rentals by Holiday Status
boxplot(bike_df$count ~ bike_df$holiday, main = "Boxplot of Rentals by Holiday Status", xlab = "Holiday", ylab = "Number of Rentals", col = "#75AADB")
# Add statistics on the boxplot
# Calculate boxplot statistics
stats <- boxplot.stats(bike_df$count[bike_df$holiday == 1])
stats_text <- paste("Median:", round(stats$stats[3], 2), "\nIQR:", round(IQR(bike_df$count[bike_df$holiday == 1]), 2))
mtext(stats_text, side = 3, line = 0.5, at = 1.8, adj = 0, font = 1, cex = 0.8)
# Histogram for Distribution of Rentals on Non-Holidays
hist(bike_df$count[bike_df$holiday == 0], main = "Distribution of Rentals on Non-Holidays", xlab = "Number of Rentals", col = "#75AADB", border = "#333333", xlim = c(0, max(bike_df$count) + 50))
# Add statistics on the histogram
hist_stats_text <- paste("Mean:", round(mean(bike_df$count[bike_df$holiday == 0]), 2), "\nSD:", round(sd(bike_df$count[bike_df$holiday == 0]), 2))
mtext(hist_stats_text, side = 3, line = 0.5, at = max(bike_df$count[bike_df$holiday == 0]) * 0.8, adj = 0, font = 1, cex = 0.8)
# Add mean and sd lines
abline(v = mean(bike_df$count[bike_df$holiday == 0]), col = "red", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 0]) - sd(bike_df$count[bike_df$holiday == 0]), col = "purple", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 0]) + sd(bike_df$count[bike_df$holiday == 0]), col = "purple", lty = 2, lwd = 2)
# Mark the observation points on the histogram
rug(bike_df$count[bike_df$holiday == 0], col = "blue", lwd = 1.5)
# Histogram for Distribution of Rentals on Holidays
hist(bike_df$count[bike_df$holiday == 1], main = "Distribution of Rentals on Holidays", xlab = "Number of Rentals", col = "#75AADB", border = "#333333", xlim = c(0, max(bike_df$count) + 50))
# Add statistics on the histogram
hist_stats_text_holiday <- paste("Mean:", round(mean(bike_df$count[bike_df$holiday == 1]), 2), "\nSD:", round(sd(bike_df$count[bike_df$holiday == 1]), 2))
mtext(hist_stats_text_holiday, side = 3, line = 0.5, at = max(bike_df$count[bike_df$holiday == 1]) * 0.8, adj = 0, font = 1, cex = 0.8)
# Add mean and sd lines
abline(v = mean(bike_df$count[bike_df$holiday == 1]), col = "red", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 1]) - sd(bike_df$count[bike_df$holiday == 1]), col = "purple", lty = 2, lwd = 2)
abline(v = mean(bike_df$count[bike_df$holiday == 1]) + sd(bike_df$count[bike_df$holiday == 1]), col = "purple", lty = 2, lwd = 2)
# Mark the observation points on the histogram
rug(bike_df$count[bike_df$holiday == 1], col = "blue", lwd = 1.5)
Analyzing Bike Rentals on Holidays
# Section: Analyzing Bike Rentals on Holidays
# Column plot for holiday-wise distribution of counts
holiday_plot <- ggplot(bike_df, aes(x = factor(holiday), y = count, fill = season)) +
geom_col(position = "dodge", color = "black", alpha = 0.8) +
theme_minimal() +
labs(
x = 'Holiday',
y = 'Total Count',
title = 'Bike Rentals Analysis: Holidays vs. Non-Holidays',
subtitle = 'Comparing counts on holidays and non-holidays across different seasons',
fill = 'Season'
) +
scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) + # Custom colors +
geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
scale_x_discrete(labels = c("Non-Holiday", "Holiday")) + # Adding scale for 0 and 1 +
theme(legend.position = "top") +
guides(fill = guide_legend(title = "Season")) +
annotate("text", x = 1.5, y = max(bike_df$count) * 1.2,
label = "Insight: During non-holidays, bike rental counts are highest compared to holidays for different seasons.",
color = "red", size = 3, fontface = "bold")
# Print plot
print(holiday_plot)
Analyzing Bike Rentals on neiher weekend nor holiday vs other days
Column plot for working day-wise distribution of counts
# Column plot for working day-wise distribution of counts
workingday_plot <- ggplot(bike_df, aes(x = factor(workingday), y = count, fill = season)) +
geom_col(position = "dodge", color = "black", alpha = 0.8) +
theme_minimal() +
labs(
x = 'Working Day',
y = 'Total Count',
title = 'Bike Rentals Analysis: neiher weekend nor holiday vs. other days',
subtitle = 'Comparing counts on working days and non-working days across different seasons',
fill = 'Season'
) +
scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) + # Custom colors +
geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
scale_x_discrete(labels = c("neiher weekend nor holiday", "other days")) + # Adding scale for 0 and 1 +
theme(legend.position = "top") +
guides(fill = guide_legend(title = "Season")) +
annotate("text", x = 1.5, y = max(bike_df$count) * 1.1,
label = "Insight: neiher weekend nor holiday - Bike rental count is higher.\n other day - Bike rental count is lower",
color = "red", size = 3, fontface = "bold")
print(workingday_plot)
Impact of Weather Conditions on Bike Rentals
# Section: Impact of Weather Conditions on Bike Rentals
# Column plot for weather condition-wise distribution of counts
weather_condition_plot <- ggplot(bike_df, aes(x = factor(weather_condition), y = count, fill = season)) +
geom_col(position = "dodge", color = "black", alpha = 1) +
theme_minimal() +
labs(
x = 'Weather Condition',
y = 'Total Count',
title = 'Impact of Weather Conditions on Bike Rentals',
subtitle = 'Comparing bike rental counts under different weather conditions across different seasons',
fill = 'Season'
) +
scale_fill_manual(values = c('#E41A1C', '#377EB8', '#4DAF4A', '#FF7F00')) + # Custom colors +
geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
scale_x_discrete(labels = c("Clear", "Mist + Cloudy", "Light Snow/Light Rain", "Heavy Rain")) + # Adding scale +
theme(legend.position = "top") +
guides(fill = guide_legend(title = "Season")) +
annotate(
"text",
x = 3.0,
y = max(bike_df$count) * 1.1,
label = "Insight: 1. Clear - Bike rental count is very high.\n 2. Mist + Cloudy - Bike rental count is second highest.\n 3. Light Snow/Light Rain - Bike rental count is third highest.\n 4. Heavy Rain - there is no data of Bike renting.",
color = "red",
size = 3,
fontface = "bold"
)
# Print plot
print(weather_condition_plot)
# Convert temperature variables to Celsius
bike_df$temp_celsius <- bike_df$temp * 41 # Assuming temp is in normalized units, adjust accordingly
bike_df$atemp_celsius <- bike_df$atemp * 50 # Assuming atemp is in normalized units, adjust accordingly
# Section: Combined Temperature Analysis
# Convert temperature variables to Celsius
bike_df$temp_celsius <- bike_df$temp * 41 # Assuming temp is in normalized units, adjust accordingly
bike_df$atemp_celsius <- bike_df$atemp * 50 # Assuming atemp is in normalized units, adjust accordingly
Scatter plot for bike rentals against temperature and apparent temperature in Celsius
# Scatter plot for bike rentals against temperature and apparent temperature in Celsius
combined_temp_plot <- ggplot(bike_df, aes(x = temp_celsius, y = count, color = "Temperature")) +
geom_point() +
geom_point(aes(x = atemp_celsius, color = "Apparent Temperature"), alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black",
aes(group = 1), formula = y ~ x) + # Add a linear trend line
theme_minimal() +
labs(
x = 'Temperature (Celsius)',
y = 'Total Count',
title = 'Bike Rentals vs. Temperature and Apparent Temperature',
subtitle = 'Scatter plot showing the relationship between bike rentals and temperature variables',
color = 'Variable'
) +
scale_color_manual(values = c("Temperature" = "blue", "Apparent Temperature" = "red")) +
annotate("text", x = 25, y = 8000, label = "Trend Line: Linear Regression", color = "black", size = 4) +
annotate("text", x = 5, y = 3000, label = "Temperature", color = "blue", size = 4) +
annotate("text", x = 30, y = 6000, label = "Apparent Temperature", color = "red", size = 4)
# Print the combined temperature plot
print(combined_temp_plot)
Boxplot for Bike Rental Count with Outliers
# Section: Boxplot for Bike Rental Count with Outliers
# Boxplot for bike rental count with outliers
boxplot(bike_df$count, main = 'Bike Rental Count', sub = ifelse(length(boxplot.stats(bike_df$count)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$count)$out)),
ylab = 'Count', col = "cyan", border = "blue")
# Add statistical values
text(1, boxplot.stats(bike_df$count)$stats[1], paste("Min:", round(boxplot.stats(bike_df$count)$stats[1], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[2], paste("1st Quartile:", round(boxplot.stats(bike_df$count)$stats[2], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[3], paste("Median:", round(boxplot.stats(bike_df$count)$stats[3], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[4], paste("Mean:", round(mean(bike_df$count), 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[5], paste("3rd Quartile:", round(boxplot.stats(bike_df$count)$stats[5], 2)), pos = 4, cex = 1)
text(1, boxplot.stats(bike_df$count)$stats[6], paste("Max:", round(boxplot.stats(bike_df$count)$stats[6], 2)), pos = 4, cex = 1)
# Set up the layout for multiple boxplots
par(mfrow = c(2, 2))
# Box plot for temperature outliers
boxplot(bike_df$temp, main = "Temperature", sub = ifelse(length(boxplot.stats(bike_df$temp)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$temp)$out)),
col = "#FF6347", border = "#8B0000", notch = TRUE, outline = FALSE)
# Box plot for temperature outliers
boxplot(bike_df$atemp, main = "Feellike-Temperature", sub = ifelse(length(boxplot.stats(bike_df$atemp)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$atemp)$out)),
col = "pink", border = "red", notch = TRUE, outline = FALSE)
# Box plot for humidity outliers
boxplot(bike_df$humidity, main = "Humidity", sub = ifelse(length(boxplot.stats(bike_df$humidity)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$humidity)$out)),
col = "#87CEEB", border = "#1E90FF", notch = TRUE, outline = FALSE)
# Box plot for windspeed outliers
boxplot(bike_df$windspeed, main = "Windspeed", sub = ifelse(length(boxplot.stats(bike_df$windspeed)$out) == 0, "No Outliers", paste("Outliers: ", boxplot.stats(bike_df$windspeed)$out)),
col = "#98FB98", border = "#008000", notch = TRUE, outline = FALSE)
Replacing and Imputing Outliers in Humidity and Windspeed
# Outlier Replacement and Imputation
# Section: Replacing and Imputing Outliers in Humidity and Windspeed
# Install the dplyr package if not installed
# Load the required libraries
library(DMwR2)
library(dplyr)
# Create a subset for windspeed and humidity variables
wind_hum <- subset(bike_df, select = c('windspeed', 'humidity'))
# Function to replace outliers with NA
replace_outliers <- function(x) {
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
lower_bound <- q[1] - 1.5 * iqr
upper_bound <- q[2] + 1.5 * iqr
x[x < lower_bound | x > upper_bound] <- NA
return(x)
}
# Apply the function to each column
wind_hum <- wind_hum %>% mutate(across(everything(), replace_outliers))
wind_hum
windspeed | humidity |
---|---|
<dbl> | <dbl> |
0.1604460 | 0.805833 |
0.2485390 | 0.696087 |
0.2483090 | 0.437273 |
0.1602960 | 0.590435 |
0.1869000 | 0.436957 |
0.0895652 | 0.518261 |
0.1687260 | 0.498696 |
0.2668040 | 0.535833 |
0.3619500 | 0.434167 |
0.2232670 | 0.482917 |
0.1221320 | 0.686364 |
0.3046270 | 0.599545 |
0.3010000 | 0.470417 |
0.1265480 | 0.537826 |
0.1579630 | 0.498750 |
0.1884330 | 0.483750 |
0.1940170 | 0.537500 |
0.1467750 | 0.861667 |
0.2083170 | 0.741739 |
0.1959040 | 0.538333 |
0.3532420 | 0.457083 |
0.1719700 | 0.400000 |
0.2466000 | 0.436522 |
0.1583300 | 0.491739 |
0.1297960 | 0.616957 |
0.2938500 | 0.862500 |
0.1138370 | 0.687500 |
0.1233000 | 0.793043 |
0.1453650 | 0.651739 |
0.0739826 | 0.722174 |
⋮ | ⋮ |
0.1243790 | 0.823333 |
0.0827208 | 0.767500 |
0.1741290 | 0.733750 |
0.3240210 | 0.485000 |
0.1747540 | 0.508750 |
0.1306000 | 0.764167 |
0.1013790 | 0.911250 |
0.1579750 | 0.905417 |
0.1903080 | 0.925000 |
0.2960370 | 0.596667 |
0.1629370 | 0.538333 |
0.1741290 | 0.485833 |
0.1312290 | 0.642917 |
0.1063500 | 0.650417 |
0.1007420 | 0.838750 |
0.0982583 | 0.907083 |
0.2214040 | 0.666250 |
0.1840920 | 0.625417 |
0.1324630 | 0.667917 |
0.3743830 | 0.556667 |
NA | 0.441250 |
0.1330830 | 0.515417 |
0.0772304 | 0.791304 |
0.1687260 | 0.734783 |
0.3165460 | 0.823333 |
0.3501330 | 0.652917 |
0.1554710 | 0.590000 |
0.1243830 | 0.752917 |
0.3507540 | 0.483333 |
0.1548460 | 0.577500 |
# Impute missing values using mean imputation method
wind_hum$windspeed[is.na(wind_hum$windspeed)] <- mean(wind_hum$windspeed, na.rm = TRUE)
wind_hum$humidity[is.na(wind_hum$humidity)] <- mean(wind_hum$humidity, na.rm = TRUE)
#Section : Combining the imputed dataset and original dataset
new_df <- subset(bike_df, select = -c(windspeed, humidity)) # Remove original windspeed and humidity
bike_df <- cbind(new_df, wind_hum) # Combine new_df and wind_hum data frames
# Display the first 5 rows of the updated dataset
head(bike_df)
record_id | datetime | season | year | month | holiday | weekday | workingday | weather_condition | temp | atemp | casual_users | registerd_users | count | year_from_date | temp_celsius | atemp_celsius | windspeed | humidity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <date> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 331 | 654 | 985 | 2011 | 14.110847 | 18.18125 | 0.1604460 | 0.805833 |
2 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 131 | 670 | 801 | 2011 | 14.902598 | 17.68695 | 0.2485390 | 0.696087 |
3 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 120 | 1229 | 1349 | 2011 | 8.050924 | 9.47025 | 0.2483090 | 0.437273 |
4 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 108 | 1454 | 1562 | 2011 | 8.200000 | 10.60610 | 0.1602960 | 0.590435 |
5 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 82 | 1518 | 1600 | 2011 | 9.305237 | 11.46350 | 0.1869000 | 0.436957 |
6 | 6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 88 | 1518 | 1606 | 2011 | 8.378268 | 11.66045 | 0.0895652 | 0.518261 |
summary(bike_df)
record_id datetime season year month holiday Min. : 1.0 Min. :2011-01-01 1:181 0:365 1 : 62 0:710 1st Qu.:183.5 1st Qu.:2011-07-02 2:184 1:366 3 : 62 1: 21 Median :366.0 Median :2012-01-01 3:188 5 : 62 Mean :366.0 Mean :2012-01-01 4:178 7 : 62 3rd Qu.:548.5 3rd Qu.:2012-07-01 8 : 62 Max. :731.0 Max. :2012-12-31 10 : 62 (Other):359 weekday workingday weather_condition temp atemp 0:105 0:231 1:463 Min. :0.05913 Min. :0.07907 1:105 1:500 2:247 1st Qu.:0.33708 1st Qu.:0.33784 2:104 3: 21 Median :0.49833 Median :0.48673 3:104 Mean :0.49538 Mean :0.47435 4:104 3rd Qu.:0.65542 3rd Qu.:0.60860 5:104 Max. :0.86167 Max. :0.84090 6:105 casual_users registerd_users count year_from_date Min. : 2.0 Min. : 20 Min. : 22 Min. :2011 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152 1st Qu.:2011 Median : 713.0 Median :3662 Median :4548 Median :2012 Mean : 848.2 Mean :3656 Mean :4504 Mean :2012 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956 3rd Qu.:2012 Max. :3410.0 Max. :6946 Max. :8714 Max. :2012 temp_celsius atemp_celsius windspeed humidity Min. : 2.424 Min. : 3.953 Min. :0.02239 Min. :0.2542 1st Qu.:13.820 1st Qu.:16.892 1st Qu.:0.13495 1st Qu.:0.5223 Median :20.432 Median :24.337 Median :0.18097 Median :0.6292 Mean :20.311 Mean :23.718 Mean :0.18639 Mean :0.6294 3rd Qu.:26.872 3rd Qu.:30.430 3rd Qu.:0.22979 3rd Qu.:0.7302 Max. :35.328 Max. :42.045 Max. :0.37811 Max. :0.9725
plot for numerical variables in combined dataset
# Load the required libraries
library(car)
library(ggplot2)
# Select numerical columns for probability plots in combined dataset
numerical_columns <- sapply(bike_df, is.numeric)
for (column in names(bike_df[, numerical_columns])) {
hist(bike_df[, column], main = paste("Histogram for", column),
xlab = column, col = "skyblue", border = "black")
}
# Create normal probability plots for numerical variables in combined dataset
for (column in names(bike_df[, numerical_columns])) {
qqnorm(bike_df[, column], main = paste("Normal Probability Plot for", column))
qqline(bike_df[, column], col = 2)
# Add insight annotation
annotation <- "Some data points are deviating from normality in a good way."
text(quantile(bike_df[, column], 1.0), quantile(bike_df[, column], 0.1), annotation, adj = c(0, 1), cex = 0.8, col = "darkgreen")
}
Correlation Analysis of in combined dataset
# Section: Correlation Analysis of in combined dataset
# Load the corrgram package for correlation analysis
library(corrgram)
# Identify numeric columns for correlation analysis
numeric_columns <- sapply(bike_df[, 8:19], is.numeric)
# Create a correlation plot
corrgram(bike_df[, 8:19][, numeric_columns], order = FALSE, upper.panel = panel.pie, text.panel = panel.txt, main = 'Correlation Plot')
# Add insight on positive and negative correlations
cat("Positive Correlations: temp, atemp, and year have positive correlations with the target variable.\n")
cat("Negative Correlations: weather_condition, humidity, and windspeed have negative correlations with the target variable.\n")
Positive Correlations: temp, atemp, and year have positive correlations with the target variable. Negative Correlations: weather_condition, humidity, and windspeed have negative correlations with the target variable.
Identify variables that may not be needed for further analysis based on correlation
# Identify variables that may not be needed for further analysis based on correlation
cat("\nVariables with weak correlation (abs(correlation) <= 0.1) with the target variable:\n")
weak_corr_vars <- names(bike_df[, 8:19][, numeric_columns])[sapply(bike_df[, 8:19][, numeric_columns], function(x) abs(cor(x, bike_df$count)) <= 0.1)]
print(weak_corr_vars)
Variables with weak correlation (abs(correlation) <= 0.1) with the target variable: character(0)
# Load the purrr library for functions and vectors
library(purrr)
# Split the dataset based on simple random resampling
train_index <- sample(1:nrow(bike_df), 0.7 * nrow(bike_df))
train_data <- bike_df[train_index,]
test_data <- bike_df[-train_index,]
# Display dimensions of the training and testing datasets
cat("Dimensions of Training Data:", dim(train_data), "\n")
cat("Dimensions of Testing Data:", dim(test_data), "\n")
Dimensions of Training Data: 511 19 Dimensions of Testing Data: 220 19
#splitted data exploration
head(train_data)
record_id | datetime | season | year | month | holiday | weekday | workingday | weather_condition | temp | atemp | casual_users | registerd_users | count | year_from_date | temp_celsius | atemp_celsius | windspeed | humidity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <date> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
146 | 146 | 2011-05-26 | 2 | 0 | 5 | 0 | 4 | 1 | 1 | 0.708333 | 0.654688 | 758 | 3919 | 4677 | 2011 | 29.041653 | 32.73440 | 0.1996420 | 0.677500 |
312 | 312 | 2011-11-08 | 4 | 0 | 11 | 0 | 2 | 1 | 1 | 0.408333 | 0.412246 | 376 | 3829 | 4205 | 2011 | 16.741653 | 20.61230 | 0.0690375 | 0.721667 |
9 | 9 | 2011-01-09 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.138333 | 0.116175 | 54 | 768 | 822 | 2011 | 5.671653 | 5.80875 | 0.3619500 | 0.434167 |
240 | 240 | 2011-08-28 | 3 | 0 | 8 | 0 | 0 | 0 | 1 | 0.707059 | 0.647959 | 1415 | 2919 | 4334 | 2011 | 28.989419 | 32.39795 | 0.3046590 | 0.561765 |
633 | 633 | 2012-09-24 | 4 | 1 | 9 | 0 | 1 | 1 | 1 | 0.514167 | 0.502513 | 1001 | 6435 | 7436 | 2012 | 21.080847 | 25.12565 | 0.1424040 | 0.492917 |
394 | 394 | 2012-01-29 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0.282500 | 0.272721 | 558 | 2685 | 3243 | 2012 | 11.582500 | 13.63605 | 0.2400500 | 0.311250 |
head(test_data)
record_id | datetime | season | year | month | holiday | weekday | workingday | weather_condition | temp | atemp | casual_users | registerd_users | count | year_from_date | temp_celsius | atemp_celsius | windspeed | humidity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <date> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
3 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 120 | 1229 | 1349 | 2011 | 8.050924 | 9.47025 | 0.2483090 | 0.437273 |
4 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 108 | 1454 | 1562 | 2011 | 8.200000 | 10.60610 | 0.1602960 | 0.590435 |
6 | 6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 88 | 1518 | 1606 | 2011 | 8.378268 | 11.66045 | 0.0895652 | 0.518261 |
7 | 7 | 2011-01-07 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | 0.208839 | 148 | 1362 | 1510 | 2011 | 8.057402 | 10.44195 | 0.1687260 | 0.498696 |
12 | 12 | 2011-01-12 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.172727 | 0.160473 | 25 | 1137 | 1162 | 2011 | 7.081807 | 8.02365 | 0.3046270 | 0.599545 |
14 | 14 | 2011-01-14 | 1 | 0 | 1 | 0 | 5 | 1 | 1 | 0.160870 | 0.188413 | 54 | 1367 | 1421 | 2011 | 6.595670 | 9.42065 | 0.1265480 | 0.537826 |
Creating Subsets for Training and Testing
# Section: Creating Subsets for Training and Testing
# Create a new subset for train attributes
train <- subset(train_data, select = c('season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_condition', 'temp','atemp', 'humidity', 'windspeed', 'count'))
# Create a new subset for test attributes
test <- subset(test_data, select = c('season', 'year', 'month', 'holiday', 'weekday', 'workingday', 'weather_condition', 'temp', 'atemp','humidity', 'windspeed', 'count'))
# Display the first few rows of the training subset
cat("Training Subset:")
head(train)
Training Subset:
season | year | month | holiday | weekday | workingday | weather_condition | temp | atemp | humidity | windspeed | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
146 | 2 | 0 | 5 | 0 | 4 | 1 | 1 | 0.708333 | 0.654688 | 0.677500 | 0.1996420 | 4677 |
312 | 4 | 0 | 11 | 0 | 2 | 1 | 1 | 0.408333 | 0.412246 | 0.721667 | 0.0690375 | 4205 |
9 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.3619500 | 822 |
240 | 3 | 0 | 8 | 0 | 0 | 0 | 1 | 0.707059 | 0.647959 | 0.561765 | 0.3046590 | 4334 |
633 | 4 | 1 | 9 | 0 | 1 | 1 | 1 | 0.514167 | 0.502513 | 0.492917 | 0.1424040 | 7436 |
394 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0.282500 | 0.272721 | 0.311250 | 0.2400500 | 3243 |
# Display the first few rows of the testing subset
cat("\nTesting Subset:")
head(test)
Testing Subset:
season | year | month | holiday | weekday | workingday | weather_condition | temp | atemp | humidity | windspeed | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
3 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.2483090 | 1349 |
4 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.1602960 | 1562 |
6 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.0895652 | 1606 |
7 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | 0.208839 | 0.498696 | 0.1687260 | 1510 |
12 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.172727 | 0.160473 | 0.599545 | 0.3046270 | 1162 |
14 | 1 | 0 | 1 | 0 | 5 | 1 | 1 | 0.160870 | 0.188413 | 0.537826 | 0.1265480 | 1421 |
# Section: Creating Subsets for Training and Testing
# Create a new subset for train categorical attributes
train_cat <- subset(train, select = c('season', 'holiday', 'workingday', 'weather_condition', 'year'))
# Create a new subset for test categorical attributes
test_cat <- subset(test, select = c('season', 'holiday', 'workingday', 'weather_condition', 'year'))
# Create a new subset for train numerical attributes
train_num <- subset(train, select = c('weekday', 'month', 'temp', 'atemp','humidity', 'windspeed', 'count'))
# Create a new subset for test numerical attributes
test_num <- subset(test, select = c('weekday', 'month', 'temp', 'atemp','humidity', 'windspeed', 'count'))
# Display the first few rows of the training categorical attributes subset
cat("Training Categorical Attributes Subset:")
head(train_cat)
Training Categorical Attributes Subset:
season | holiday | workingday | weather_condition | year | |
---|---|---|---|---|---|
<fct> | <fct> | <fct> | <fct> | <fct> | |
146 | 2 | 0 | 1 | 1 | 0 |
312 | 4 | 0 | 1 | 1 | 0 |
9 | 1 | 0 | 0 | 1 | 0 |
240 | 3 | 0 | 0 | 1 | 0 |
633 | 4 | 0 | 1 | 1 | 1 |
394 | 1 | 0 | 0 | 1 | 1 |
# Display the first few rows of the testing categorical attributes subset
cat("\nTesting Categorical Attributes Subset:")
head(test_cat)
Testing Categorical Attributes Subset:
season | holiday | workingday | weather_condition | year | |
---|---|---|---|---|---|
<fct> | <fct> | <fct> | <fct> | <fct> | |
3 | 1 | 0 | 1 | 1 | 0 |
4 | 1 | 0 | 1 | 1 | 0 |
6 | 1 | 0 | 1 | 1 | 0 |
7 | 1 | 0 | 1 | 2 | 0 |
12 | 1 | 0 | 1 | 1 | 0 |
14 | 1 | 0 | 1 | 1 | 0 |
# Display the first few rows of the training numerical attributes subset
cat("\nTraining Numerical Attributes Subset:")
head(train_num)
Training Numerical Attributes Subset:
weekday | month | temp | atemp | humidity | windspeed | count | |
---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
146 | 4 | 5 | 0.708333 | 0.654688 | 0.677500 | 0.1996420 | 4677 |
312 | 2 | 11 | 0.408333 | 0.412246 | 0.721667 | 0.0690375 | 4205 |
9 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.3619500 | 822 |
240 | 0 | 8 | 0.707059 | 0.647959 | 0.561765 | 0.3046590 | 4334 |
633 | 1 | 9 | 0.514167 | 0.502513 | 0.492917 | 0.1424040 | 7436 |
394 | 0 | 1 | 0.282500 | 0.272721 | 0.311250 | 0.2400500 | 3243 |
# Display the first few rows of the testing numerical attributes subset
cat("\nTesting Numerical Attributes Subset:")
head(test_num)
Testing Numerical Attributes Subset:
weekday | month | temp | atemp | humidity | windspeed | count | |
---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
3 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.2483090 | 1349 |
4 | 2 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.1602960 | 1562 |
6 | 4 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.0895652 | 1606 |
7 | 5 | 1 | 0.196522 | 0.208839 | 0.498696 | 0.1687260 | 1510 |
12 | 3 | 1 | 0.172727 | 0.160473 | 0.599545 | 0.3046270 | 1162 |
14 | 5 | 1 | 0.160870 | 0.188413 | 0.537826 | 0.1265480 | 1421 |
Encoding Categorical Features (Train Dataset)
# Load the required libraries
library(caret)
# Define variables for dummy encoding
othervars <- c('month', 'weekday', 'temp', 'atemp', 'humidity', 'windspeed', 'count')
set.seed(2626)
# Identify categorical variables
vars <- setdiff(colnames(train), c(train$count, othervars))
vars
# Create a formula for encoding
f <- paste('~', paste(vars, collapse = ' + '))
# Use dummyVars to encode categorical variables
encoder <- dummyVars(as.formula(f), train)
encode_attributes <- predict(encoder, train)
# Combine numerical and encoded attributes
train_encoded_attributes <- cbind(train_num, encode_attributes)
# Display the head of the encoded dataset
head(train_encoded_attributes)
weekday | month | temp | atemp | humidity | windspeed | count | season.1 | season.2 | season.3 | season.4 | year.0 | year.1 | holiday.0 | holiday.1 | workingday.0 | workingday.1 | weather_condition.1 | weather_condition.2 | weather_condition.3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
146 | 4 | 5 | 0.708333 | 0.654688 | 0.677500 | 0.1996420 | 4677 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
312 | 2 | 11 | 0.408333 | 0.412246 | 0.721667 | 0.0690375 | 4205 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
9 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.3619500 | 822 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
240 | 0 | 8 | 0.707059 | 0.647959 | 0.561765 | 0.3046590 | 4334 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
633 | 1 | 9 | 0.514167 | 0.502513 | 0.492917 | 0.1424040 | 7436 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
394 | 0 | 1 | 0.282500 | 0.272721 | 0.311250 | 0.2400500 | 3243 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
Encoding Categorical Features (Test Dataset)
# Section: Encoding Categorical Features (Test Dataset)
# Load the required libraries
library(caret)
# Define variables for dummy encoding
othervars <- c('month', 'weekday', 'temp', 'atemp', 'humidity', 'windspeed', 'count')
set.seed(5662)
# Identify categorical variables in the test dataset
vars <- setdiff(colnames(test), c(test$count, othervars))
vars
# Create a formula for encoding
f <- paste('~', paste(vars, collapse = ' + '))
# Use dummyVars to encode categorical variables
encoder <- dummyVars(as.formula(f), test)
encode_attributes <- predict(encoder, test)
# Combine numerical and encoded attributes for the test dataset
test_encoded_attributes <- cbind(test_num, encode_attributes)
# Display the head of the encoded test dataset
head(test_encoded_attributes)
weekday | month | temp | atemp | humidity | windspeed | count | season.1 | season.2 | season.3 | season.4 | year.0 | year.1 | holiday.0 | holiday.1 | workingday.0 | workingday.1 | weather_condition.1 | weather_condition.2 | weather_condition.3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
3 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.2483090 | 1349 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
4 | 2 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.1602960 | 1562 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
6 | 4 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.0895652 | 1606 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
7 | 5 | 1 | 0.196522 | 0.208839 | 0.498696 | 0.1687260 | 1510 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
12 | 3 | 1 | 0.172727 | 0.160473 | 0.599545 | 0.3046270 | 1162 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
14 | 5 | 1 | 0.160870 | 0.188413 | 0.537826 | 0.1265480 | 1421 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
# Section: Modeling the Training Dataset with Linear Regression
# Set seed to reproduce the results of random sampling
set.seed(672)
# Train the Linear Regression model
lr_model <- lm(count ~ ., data = train_encoded_attributes[, -c(6)])
# Display the summary of the Linear Regression model
summary(lr_model)
Call: lm(formula = count ~ ., data = train_encoded_attributes[, -c(6)]) Residuals: Min 1Q Median 3Q Max -4085.8 -393.7 94.2 450.0 3005.3 Coefficients: (6 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 1862.06 534.81 3.482 0.000543 *** weekday1 181.92 137.31 1.325 0.185816 weekday2 226.49 133.63 1.695 0.090754 . weekday3 356.27 136.90 2.602 0.009540 ** weekday4 290.07 135.11 2.147 0.032292 * weekday5 348.66 141.28 2.468 0.013936 * weekday6 401.81 136.16 2.951 0.003320 ** month2 59.83 177.23 0.338 0.735819 month3 530.76 211.40 2.511 0.012374 * month4 42.72 322.36 0.133 0.894628 month5 562.02 347.59 1.617 0.106547 month6 183.59 360.41 0.509 0.610713 month7 -368.76 401.71 -0.918 0.359086 month8 195.75 384.50 0.509 0.610904 month9 595.97 334.97 1.779 0.075841 . month10 217.26 294.40 0.738 0.460886 month11 -449.11 285.24 -1.575 0.116020 month12 -305.80 222.82 -1.372 0.170575 temp 1583.91 1564.30 1.013 0.311789 atemp 3356.17 1622.48 2.069 0.039121 * humidity -1236.96 379.84 -3.256 0.001207 ** season.1 -1891.38 220.11 -8.593 < 0.0000000000000002 *** season.2 -767.10 274.21 -2.797 0.005356 ** season.3 -670.54 249.38 -2.689 0.007418 ** season.4 NA NA NA NA year.0 -2027.98 74.08 -27.374 < 0.0000000000000002 *** year.1 NA NA NA NA holiday.0 470.72 225.67 2.086 0.037513 * holiday.1 NA NA NA NA workingday.0 NA NA NA NA workingday.1 NA NA NA NA weather_condition.1 2298.63 269.87 8.518 < 0.0000000000000002 *** weather_condition.2 1882.27 253.93 7.413 0.000000000000559 *** weather_condition.3 NA NA NA NA --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 809.7 on 483 degrees of freedom Multiple R-squared: 0.8352, Adjusted R-squared: 0.826 F-statistic: 90.69 on 27 and 483 DF, p-value: < 0.00000000000000022
Cross Validation Prediction with Linear Regression
# Section: Cross Validation Prediction with Linear Regression
# Ignore warning messages
options(warn = -1)
# Set seed to reproduce results of random sampling
set.seed(623)
# Cross validation resampling method
train_control <- trainControl(method = 'cv', number = 3)
# Cross validation prediction
CV_predict <- train(count ~ .,data = train_encoded_attributes[, -c(6)],
method = 'lm', trControl = train_control)
# Display summary of cross validation prediction
summary(CV_predict)
# Observations from the output
cat("\n**Observations:**\n")
cat("- The model's residuals range from -3713.0 to 3165.0, indicating the prediction performance.\n")
cat("- Coefficients and p-values provide insights into variable significance.\n")
cat("- The adjusted R-squared (0.8316) indicates the model's goodness of fit.\n")
cat("- Variables with smaller p-values are considered statistically significant.\n")
cat("- Overall, the linear regression model captures relationships and offers insights into predictor importance and model fit.\n")
Call: lm(formula = .outcome ~ ., data = dat) Residuals: Min 1Q Median 3Q Max -4085.8 -393.7 94.2 450.0 3005.3 Coefficients: (6 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 1862.06 534.81 3.482 0.000543 *** weekday1 181.92 137.31 1.325 0.185816 weekday2 226.49 133.63 1.695 0.090754 . weekday3 356.27 136.90 2.602 0.009540 ** weekday4 290.07 135.11 2.147 0.032292 * weekday5 348.66 141.28 2.468 0.013936 * weekday6 401.81 136.16 2.951 0.003320 ** month2 59.83 177.23 0.338 0.735819 month3 530.76 211.40 2.511 0.012374 * month4 42.72 322.36 0.133 0.894628 month5 562.02 347.59 1.617 0.106547 month6 183.59 360.41 0.509 0.610713 month7 -368.76 401.71 -0.918 0.359086 month8 195.75 384.50 0.509 0.610904 month9 595.97 334.97 1.779 0.075841 . month10 217.26 294.40 0.738 0.460886 month11 -449.11 285.24 -1.575 0.116020 month12 -305.80 222.82 -1.372 0.170575 temp 1583.91 1564.30 1.013 0.311789 atemp 3356.17 1622.48 2.069 0.039121 * humidity -1236.96 379.84 -3.256 0.001207 ** season.1 -1891.38 220.11 -8.593 < 0.0000000000000002 *** season.2 -767.10 274.21 -2.797 0.005356 ** season.3 -670.54 249.38 -2.689 0.007418 ** season.4 NA NA NA NA year.0 -2027.98 74.08 -27.374 < 0.0000000000000002 *** year.1 NA NA NA NA holiday.0 470.72 225.67 2.086 0.037513 * holiday.1 NA NA NA NA workingday.0 NA NA NA NA workingday.1 NA NA NA NA weather_condition.1 2298.63 269.87 8.518 < 0.0000000000000002 *** weather_condition.2 1882.27 253.93 7.413 0.000000000000559 *** weather_condition.3 NA NA NA NA --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 809.7 on 483 degrees of freedom Multiple R-squared: 0.8352, Adjusted R-squared: 0.826 F-statistic: 90.69 on 27 and 483 DF, p-value: < 0.00000000000000022
**Observations:** - The model's residuals range from -3713.0 to 3165.0, indicating the prediction performance. - Coefficients and p-values provide insights into variable significance. - The adjusted R-squared (0.8316) indicates the model's goodness of fit. - Variables with smaller p-values are considered statistically significant. - Overall, the linear regression model captures relationships and offers insights into predictor importance and model fit.
Cross Validation Prediction Plot with Linear Regression
#Section: Cross Validation Prediction Plot with Linear Regression
# Increase the size of the plot
par(mfrow=c(1, 1), mar=c(5, 5, 2, 2))
# Cross-validation prediction plot
residuals <- resid(CV_predict)
y_train <- train_encoded_attributes$count
# Scatter plot with residuals
plot(y_train, residuals, ylab = 'Residuals', xlab = 'Observed', main = 'Cross Validation Prediction Plot', pch = 16, col = 'blue')
# Add a reference line at zero
abline(h = 0, col = 'red', lwd = 2)
# Observations from the plot
text(5000, -1500, 'Residuals spread around zero indicates a good model fit', col = 'darkgreen', cex = 0.8)
text(5000, -2500, 'Scatter points should be random and evenly distributed', col = 'darkgreen', cex = 0.8)
Model Performance on Test Data Analysis
# Section: Model Performance on Test Data Analysis
# Set seed for reproducibility
set.seed(6872)
# Suppress warning messages
options(warn = -1)
# Predict using the lr_model on the test_encoded_attributes
lm_predictions <- predict(lr_model, test_encoded_attributes[, -c(6)])
# Display the first 10 predictions
head(lm_predictions, 10)
Model Performance Visualization using Linear Regressor Model
# Section: Model Performance Visualization using Linear Regressor Model
# Extract actual and predicted values
actual_values <- test_encoded_attributes$count
predicted_values <- lm_predictions
# Plot past actual values and future 10 predicted values
plot(1:length(actual_values), actual_values, type = "l", col = "blue",
xlab = "Sample Index", ylab = "Count", main = "Actual vs Future Predicted Values",
xlim = c(1, length(actual_values) + 10), ylim = c(0, max(actual_values, predicted_values)))
lines(length(actual_values) + 1:length(predicted_values), predicted_values, col = "orange")
# Add lines for future predicted values
lines(length(actual_values) + 1:length(predicted_values), predicted_values, col = "orange")
# Add legend
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "orange"), lty = 1)
# Highlight the last observed data point
points(length(actual_values), actual_values[length(actual_values)], pch = 19, col = "red")
# Highlight the starting point of predicted values
points(length(actual_values) + 1, predicted_values[1], pch = 19, col = "green")
Model Evaluation Metrics on Rootmean squred error and mean abosolute error
# Section: Model Evaluation Metrics on Rootmean squred error and mean abosolute error
# Set seed for reproducibility
set.seed(688)
# Root mean squared error (RMSE)
rmse <- RMSE(lm_predictions, test_encoded_attributes$count)
print(paste("Root Mean Squared Error (RMSE):", round(rmse, 2)))
[1] "Root Mean Squared Error (RMSE): 780.12"
# Mean absolute error (MAE)
mae <- MAE(lm_predictions, test_encoded_attributes$count)
print(paste("Mean Absolute Error (MAE):", round(mae, 2)))
[1] "Mean Absolute Error (MAE): 574.86"
# Section: Residual Analysis
# Set heading for the code block
cat("Residual :\n")
# Calculate residuals
y_test <- test_encoded_attributes$count
residuals_lm <- y_test - lm_predictions
residuals_lm
Residual :
# Create an informative residual plot
plot(y_test, residuals_lm, xlab = 'Observed Count', ylab = 'Residuals', main = 'Residual Plot',
col = ifelse(residuals_lm >= 0, 'blue', 'red'), pch = 16, cex = 1.2)
# Add a reference line at y = 0
abline(h = 0, col = 'black', lty = 2, lw = 2)
# Add legend for positive and negative residuals
legend('topright', legend = c('Positive Residuals', 'Negative Residuals'), col = c('blue', 'red'), pch = 16)
# Interpretation:
cat("\nInterpretation:\n")
cat("The residual plot shows the difference between observed and predicted counts.\n")
cat("Positive residuals (in blue) indicate underpredictions, while negative residuals (in red) indicate overpredictions.\n")
cat("Overall, the model seems to perform well, with residuals centered around zero.\n")
Interpretation: The residual plot shows the difference between observed and predicted counts. Positive residuals (in blue) indicate underpredictions, while negative residuals (in red) indicate overpredictions. Overall, the model seems to perform well, with residuals centered around zero.
# Section: Decision Tree Regressor
# Set heading for the code block
cat("Decision Tree Regressor:\n")
# Set seed for reproducibility
set.seed(568)
# Load the rpart library for decision trees
library(rpart)
# Set control parameters for rpart
rpart.control <- rpart.control(minbucket = 2, cp = 0.01, maxcompete = 3, maxsurrogate = 4,
usesurrogate = 2, xval = 3, surrogatestyle = 0, maxdepth = 10)
# Train the decision tree regressor model
dtr <- rpart(train_encoded_attributes$count ~ ., data = train_encoded_attributes[,-c(6)],
control = rpart.control, method = 'anova', cp = 0.01)
# Display summary of the decision tree model
summary(dtr)
# Interpretation:
cat("\nInterpretation:\n")
cat("The decision tree regressor is trained with control parameters to control its performance.\n")
cat("The summary provides insights into the structure of the decision tree, including splits, terminal nodes, and node statistics.\n")
Decision Tree Regressor: Call: rpart(formula = train_encoded_attributes$count ~ ., data = train_encoded_attributes[, -c(6)], method = "anova", control = rpart.control, cp = 0.01) n= 511 CP nsplit rel error xerror xstd 1 0.38806588 0 1.0000000 1.0017384 0.04827550 2 0.21219513 1 0.6119341 0.6456161 0.03328136 3 0.08560458 2 0.3997390 0.4196777 0.03247223 4 0.04320558 3 0.3141344 0.3413733 0.02918864 5 0.03642203 4 0.2709288 0.3147070 0.03012851 6 0.02057878 5 0.2345068 0.2899876 0.03228089 7 0.01286813 6 0.2139280 0.2853027 0.03243160 8 0.01163559 7 0.2010599 0.2786163 0.03319573 9 0.01000000 8 0.1894243 0.2709887 0.03002257 Variable importance atemp temp month year.0 year.1 season.1 season.4 season.3 19 18 16 13 13 11 3 3 humidity 3 Node number 1: 511 observations, complexity param=0.3880659 mean=4525.638, MSE=3761088 left son=2 (213 obs) right son=3 (298 obs) Primary splits: temp < 0.432174 to the left, improve=0.3880659, (0 missing) atemp < 0.4308565 to the left, improve=0.3841343, (0 missing) season.1 < 0.5 to the right, improve=0.3307881, (0 missing) year.0 < 0.5 to the right, improve=0.3157981, (0 missing) Surrogate splits: atemp < 0.427125 to the left, agree=0.996, adj=0.991, (0 split) month splits as LLLRRRRRRRLL, agree=0.894, adj=0.746, (0 split) season.1 < 0.5 to the right, agree=0.789, adj=0.493, (0 split) season.3 < 0.5 to the left, agree=0.661, adj=0.188, (0 split) Node number 2: 213 observations, complexity param=0.08560458 mean=3096.653, MSE=2305870 left son=4 (109 obs) right son=5 (104 obs) Primary splits: year.1 < 0.5 to the left, improve=0.3349786, (0 missing) year.0 < 0.5 to the right, improve=0.3349786, (0 missing) season.4 < 0.5 to the left, improve=0.3221560, (0 missing) season.1 < 0.5 to the right, improve=0.2837440, (0 missing) Surrogate splits: year.0 < 0.5 to the right, agree=1.000, adj=1.000, (0 split) month splits as RRLL-----LRR, agree=0.582, adj=0.144, (0 split) atemp < 0.2578895 to the left, agree=0.582, adj=0.144, (0 split) temp < 0.2766665 to the left, agree=0.573, adj=0.125, (0 split) Node number 3: 298 observations, complexity param=0.2121951 mean=5547.027, MSE=2298442 left son=6 (142 obs) right son=7 (156 obs) Primary splits: year.1 < 0.5 to the left, improve=0.5954153, (0 missing) year.0 < 0.5 to the right, improve=0.5954153, (0 missing) humidity < 0.834375 to the right, improve=0.1504638, (0 missing) weather_condition.3 < 0.5 to the right, improve=0.1310491, (0 missing) Surrogate splits: year.0 < 0.5 to the right, agree=1.000, adj=1.000, (0 split) humidity < 0.6954165 to the right, agree=0.581, adj=0.120, (0 split) month splits as -RRRRLRRRRLR, agree=0.544, adj=0.042, (0 split) atemp < 0.6682585 to the right, agree=0.540, adj=0.035, (0 split) Node number 4: 109 observations, complexity param=0.03642203 mean=2238.174, MSE=1099667 left son=8 (73 obs) right son=9 (36 obs) Primary splits: season.4 < 0.5 to the left, improve=0.5839973, (0 missing) season.1 < 0.5 to the right, improve=0.5208747, (0 missing) month splits as LLLL-----RRR, improve=0.4553324, (0 missing) atemp < 0.2443185 to the left, improve=0.2553346, (0 missing) Surrogate splits: month splits as LLLL-----RRR, agree=0.927, adj=0.778, (0 split) season.1 < 0.5 to the right, agree=0.872, adj=0.611, (0 split) atemp < 0.379725 to the left, agree=0.679, adj=0.028, (0 split) Node number 5: 104 observations, complexity param=0.04320558 mean=3996.404, MSE=1988094 left son=10 (58 obs) right son=11 (46 obs) Primary splits: season.1 < 0.5 to the right, improve=0.4016095, (0 missing) season.4 < 0.5 to the left, improve=0.3570016, (0 missing) atemp < 0.2862915 to the left, improve=0.3428914, (0 missing) temp < 0.2804165 to the left, improve=0.3328967, (0 missing) Surrogate splits: season.4 < 0.5 to the left, agree=0.933, adj=0.848, (0 split) month splits as LLLR-----RRR, agree=0.894, adj=0.761, (0 split) temp < 0.295 to the left, agree=0.702, adj=0.326, (0 split) atemp < 0.296882 to the left, agree=0.702, adj=0.326, (0 split) Node number 6: 142 observations, complexity param=0.01286813 mean=4320.873, MSE=686675.4 left son=12 (7 obs) right son=13 (135 obs) Primary splits: humidity < 0.9041665 to the right, improve=0.2536354, (0 missing) weather_condition.3 < 0.5 to the right, improve=0.2522281, (0 missing) atemp < 0.4592605 to the left, improve=0.1985392, (0 missing) temp < 0.4708335 to the left, improve=0.1877138, (0 missing) Surrogate splits: weather_condition.3 < 0.5 to the right, agree=0.979, adj=0.571, (0 split) month splits as -RRRRRRRRRRL, agree=0.958, adj=0.143, (0 split) Node number 7: 156 observations, complexity param=0.02057878 mean=6663.141, MSE=1151324 left son=14 (4 obs) right son=15 (152 obs) Primary splits: humidity < 0.8322915 to the right, improve=0.22020730, (0 missing) temp < 0.4920835 to the left, improve=0.08552324, (0 missing) weather_condition.1 < 0.5 to the left, improve=0.08512799, (0 missing) atemp < 0.47979 to the left, improve=0.08466814, (0 missing) Node number 8: 73 observations mean=1675.411, MSE=389698.5 Node number 9: 36 observations mean=3379.333, MSE=594878.7 Node number 10: 58 observations, complexity param=0.01163559 mean=3200.638, MSE=961525.5 left son=20 (18 obs) right son=21 (40 obs) Primary splits: temp < 0.2575 to the left, improve=0.4009904, (0 missing) atemp < 0.294643 to the left, improve=0.3939220, (0 missing) month splits as RRR--------L, improve=0.2544737, (0 missing) humidity < 0.6497195 to the right, improve=0.1667501, (0 missing) Surrogate splits: atemp < 0.2607295 to the left, agree=0.966, adj=0.889, (0 split) month splits as RRR--------L, agree=0.776, adj=0.278, (0 split) weekday splits as RLRRRRR, agree=0.741, adj=0.167, (0 split) Node number 11: 46 observations mean=4999.761, MSE=1477300 Node number 12: 7 observations mean=2488.143, MSE=229917 Node number 13: 135 observations mean=4415.904, MSE=527163.2 Node number 14: 4 observations mean=3559.25, MSE=4341614 Node number 15: 152 observations mean=6744.822, MSE=807167.1 Node number 20: 18 observations mean=2275, MSE=413281.7 Node number 21: 40 observations mean=3617.175, MSE=649169.6 Interpretation: The decision tree regressor is trained with control parameters to control its performance. The summary provides insights into the structure of the decision tree, including splits, terminal nodes, and node statistics.
Visualizing Enhanced Decision Tree Model
# Section: Visualizing Enhanced Decision Tree Model
# Set heading for the code block
cat("Visualizing Enhanced Decision Tree Model:\n")
# Load the rpart.plot library for plotting the learned decision tree model
library(rpart.plot)
# Set custom color palette for the decision tree plot
custom_palette <- c("#FF0000", "#FF6347", "#FFA500", "#008000", "#0000FF", "#4B0082")
# Plot the learned decision tree model with enhanced visualization
# Use 'col' parameter to set the color palette
rpart.plot(dtr, col = custom_palette, shadow.col = "blue", nn = TRUE, roundint = FALSE,
main = "Enhanced Decision Tree Model", branch.lty = 3, under = TRUE, box.col = "lightblue",
split.cex = 1.5, compress = TRUE, extra = 101, varlen = 0, tweak = 2.0)
# Interpretation:
cat("\nInterpretation:\n")
cat("The enhanced visualization of the learned decision tree model improves aesthetics and clarity.\n")
cat("Custom color palette and styling options are applied to make the plot more attractive.\n")
cat("Understanding the decision tree structure becomes more engaging and informative.\n")
Visualizing Enhanced Decision Tree Model: Interpretation: The enhanced visualization of the learned decision tree model improves aesthetics and clarity. Custom color palette and styling options are applied to make the plot more attractive. Understanding the decision tree structure becomes more engaging and informative.
Cross Validation Prediction for Decision Tree Regressor
# Section: Cross Validation Prediction for Decision Tree Regressor
# Set heading for the code block
cat("Cross Validation Prediction for Decision Tree Regressor:\n")
# Ignore warning messages during execution
options(warn = -1)
# Load the caret package
library(caret)
# Set seed for reproducibility of results
set.seed(5769)
# Define cross-validation resampling method
train.control <- trainControl(method = 'CV', number = 3)
# Perform cross-validation prediction using the decision tree regressor
dtr_CV_predict <- train(count ~ ., data = train_encoded_attributes, method = 'rpart', trControl = train.control)
# Display the cross-validation results
dtr_CV_predict
Cross Validation Prediction for Decision Tree Regressor:
CART 511 samples 19 predictor No pre-processing Resampling: Cross-Validated (3 fold) Summary of sample sizes: 341, 340, 341 Resampling results across tuning parameters: cp RMSE Rsquared MAE 0.08560458 1255.472 0.5801237 981.6119 0.21219513 1380.823 0.4889382 1133.4253 0.38806588 1762.949 0.2659245 1462.8221 RMSE was used to select the optimal model using the smallest value. The final value used for the model was cp = 0.08560458.
Cross-validation prediction plot for decision tree regression
# Section: Cross-validation prediction plot for decision tree regression
# Create a new color palette for the plot
plot_colors <- c('skyblue', 'darkred')
# Extract predicted values from the list
predicted_counts <- as.numeric(dtr_CV_predict$pred)
# Calculate residuals
observed_counts <- as.numeric(train_encoded_attributes$count)
residuals_dr <- resid(dtr_CV_predict)
residuals_dr
# Cross-validation prediction plot
plot(observed_counts, residuals_dr,
xlab = 'Observed Counts', ylab = 'Residuals',
main = 'Cross Validation Residual Plot', col = plot_colors[1], pch = 16)
# Add a horizontal line at y = 0 for reference
abline(h = 0, col = 'gray', lty = 2)
# Add a smooth line to visualize trends
lines(lowess(y_train, residuals_dr), col = plot_colors[2], lwd = 2)
# Add informative labels
text(4000, 1200, 'Residuals should be randomly scattered around zero.',
adj = c(0, 0), col = 'darkgreen', font = 1)
text(4000, -800, 'A clear pattern indicates a model misfit.',
adj = c(0, 0), col = 'darkred', font = 1)
text(4000, 400, 'Lowess smoother line to identify trends.',
adj = c(0, 0), col = plot_colors[2], font = 1)
text(4000, 200, 'Plot show some finite varience b/w them & for some not.',
adj = c(0, 0), col = plot_colors[2], font = 1)
# Add legend
legend('topright', legend = c('Residuals', 'Lowess Smoother'),
col = plot_colors, pch = 16, lwd = 2, cex = 0.8)
Model Performance on the Test Dataset
# Section: Model Performance on the Test Dataset
# Set seed for reproducibility
set.seed(7882)
# Predict using the trained decision tree regression model
dtr_predictions <- predict(dtr, test_encoded_attributes[,-c(6)])
# Display the first 10 predictions
head(dtr_predictions, 20)
Decision Tree Regressor - Model Performance Visualization
# Section: Decision Tree Regressor - Model Performance Visualization
# Generate a random sample of 10 indices from the test dataset
sample_indices <- sample(1:nrow(test_encoded_attributes), 20)
# Extract the corresponding actual and predicted values for the sampled data
actual_values_dtr <- test_encoded_attributes$count
predicted_values_dtr <- dtr_predictions
# Plot all actual and predicted values using Decision Tree Regressor
plot(1:length(actual_values_dtr), actual_values_dtr, type = "l", col = "blue",
xlab = "Sample Index", ylab = "Count", main = "Actual vs Predicted Values (Decision Tree Regressor)",
xlim = c(1, length(actual_values_dtr) + 20), ylim = c(0, max(actual_values_dtr, predicted_values_dtr)))
lines(length(actual_values_dtr) + 1:length(predicted_values_dtr), predicted_values_dtr, col = "orange")
# Add legend
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "orange"), lty = 1)
# Highlight the last observed data point
points(length(actual_values_dtr), actual_values_dtr[length(actual_values_dtr)], pch = 19, col = "red")
# Highlight the starting point of predicted values
points(length(actual_values_dtr) + 1, predicted_values_dtr[1], pch = 19, col = "green")
# Add information
text(length(actual_values_dtr) + 5, max(actual_values_dtr, predicted_values_dtr)/2,
"Actual values and Predicted values using Decision Tree Regressor",
pos = 1, col = "darkgreen", cex = 0.8)
Evaluation of Decision tree regressor on Root mean squared error and mean absolute error
#Section: Evaluation of Decision tree regressor on Root mean squared error and mean absolute error
set.seed(6889)
#Root mean squared error
rmse<-RMSE(y_test,dtr_predictions)
print(rmse)
#Mean absolute error
mae<-MAE(y_test,dtr_predictions)
print(mae)
[1] 927.4832 [1] 684.4555
Residual Plot for Decision Tree Regressor
# Section: Residual Plot for Decision Tree Regressor
# Calculate residuals for Decision Tree Regressor predictions
residuals_dtr <- y_test- dtr_predictions
# Plot the Residual plot
plot(y_test, residuals_dtr, xlab = 'Observed', ylab = 'Residuals', main = 'Residual Plot for Decision Tree Regressor', col = 'blue', pch = 16)
abline(0, 0, col = 'red', lwd = 2)
# Add informative text
text(2500, 800, "Some data points have the same finite variance between them,\nwhile for others, it may not hold true.", col = "red", cex = 0.8)
# Section: Random Forest Regression
# Load the randomForest library
library(randomForest)
# Set seed for reproducibility
set.seed(6788271)
# Train the Random Forest model
rf_model <- randomForest(count ~ ., data = train_encoded_attributes, importance = TRUE, ntree = 200)
rf_model
Call: randomForest(formula = count ~ ., data = train_encoded_attributes, importance = TRUE, ntree = 200) Type of random forest: regression Number of trees: 200 No. of variables tried at each split: 6 Mean of squared residuals: 493390.5 % Var explained: 86.88
Cross Validation Prediction for Random Forest
# Section: Cross Validation Prediction for Random Forest
# Ignore warning messages during execution
options(warn = -1)
# Set seed for reproducibility
set.seed(6772)
# Load the ranger library for Random Forest CV
library(ranger)
# Define cross-validation resampling method
train.control <- trainControl(method = 'CV', number = 3)
# Cross-validation prediction
rf_CV_predict <- train(count ~ ., data = train_encoded_attributes, method = 'ranger', trControl = train.control)
rf_CV_predict
cat("The Random Forest model was trained on 511 samples with 19 predictors. The training utilized cross-validated resampling with 3 folds, resulting in sample sizes of 340, 342, and 340 across the folds. The model was tuned with different values for the number of variables randomly sampled at each split (mtry) and the split rule. The final model, selected based on the smallest Root Mean Squared Error (RMSE), had the parameters mtry = 18, splitrule = extratrees, and min.node.size = 5. The performance metrics for the selected model were RMSE = 733.25, Rsquared = 0.8505, and MAE = 503.57, indicating a good fit to the data.\n")
Random Forest 511 samples 19 predictor No pre-processing Resampling: Cross-Validated (3 fold) Summary of sample sizes: 340, 342, 340 Resampling results across tuning parameters: mtry splitrule RMSE Rsquared MAE 2 variance 964.1860 0.8413592 761.2039 2 extratrees 1041.8814 0.8222903 818.2320 18 variance 729.9331 0.8598411 507.8067 18 extratrees 730.2087 0.8599708 504.0057 34 variance 742.8772 0.8542833 514.7405 34 extratrees 734.6122 0.8578812 507.2129 Tuning parameter 'min.node.size' was held constant at a value of 5 RMSE was used to select the optimal model using the smallest value. The final values used for the model were mtry = 18, splitrule = variance and min.node.size = 5.
The Random Forest model was trained on 511 samples with 19 predictors. The training utilized cross-validated resampling with 3 folds, resulting in sample sizes of 340, 342, and 340 across the folds. The model was tuned with different values for the number of variables randomly sampled at each split (mtry) and the split rule. The final model, selected based on the smallest Root Mean Squared Error (RMSE), had the parameters mtry = 18, splitrule = extratrees, and min.node.size = 5. The performance metrics for the selected model were RMSE = 733.25, Rsquared = 0.8505, and MAE = 503.57, indicating a good fit to the data.
Cross Validation Prediction Plot for Random Forest
# Section: Cross Validation Prediction Plot for Random Forest
# Extracting residuals from Random Forest cross-validation predictions
residuals <- resid(rf_CV_predict)
# Plotting the Cross Validation Prediction Plot
plot(y_train, residuals, xlab = 'Observed', ylab = 'Residuals',
main = 'Cross Validation Prediction Plot for Random Forest', col = 'blue', pch = 16, cex.main = 1.2)
# Adding a reference line at y = 0
abline(0, 0, col = 'red', lwd = 2)
# Adding informative labels and title
text(6000, 800, "Residuals from Random Forest cross-validation predictions",
col = "darkgreen", cex = 0.8)
# Adding grid lines for better readability
grid()
# Adding a legend
legend('topright', legend = 'Residuals', col = 'blue', pch = 16, cex = 0.8)
# Adding more context to the observation
text(5500, 500, "Observation: The residuals should be randomly scattered around the reference line at y = 0.",
col = "purple", cex = 0.8)
text(5500, 400, "Patterns or trends in the residuals may suggest areas where the model can be improved.",
col = "purple", cex = 0.8)
cat("Observation: The residuals should ideally be randomly scattered around the reference line at y = 0,\n")
cat("indicating unbiased predictions. Patterns or trends in the residuals may suggest areas where the model can be improved.\n")
Observation: The residuals should ideally be randomly scattered around the reference line at y = 0, indicating unbiased predictions. Patterns or trends in the residuals may suggest areas where the model can be improved.
Model Performance on Test Data for Random Forest
# Section: Model Performance on Test Data for Random Forest
# Set seed for reproducibility
set.seed(7889)
# Predict the Random Forest model on the test dataset
rf_predictions <- predict(rf_model, test_encoded_attributes[,-c(7)])
# Display the first 10 predictions
head(rf_predictions, 10)
Model performation visualisation based on random forest regressor
#Section: Model performation visualisation based on random forest regressor
# Generate a random sample of 10 indices from the test dataset
sample_indices <- sample(1:nrow(test_encoded_attributes), 10)
# Extract the corresponding actual and predicted values for the sampled data
actual_values_rf <- test_encoded_attributes$count
predicted_values_rf <- rf_predictions
# Plot all actual and predicted values using Random forest Regressor
plot(1:length(actual_values_rf), actual_values_rf, type = "l", col = "blue",
xlab = "Sample Index", ylab = "Count", main = "Actual vs Predicted Values (Random forest Regressor)",
xlim = c(1, length(actual_values_rf) + 10), ylim = c(0, max(actual_values_rf, predicted_values_rf)))
lines(length(actual_values_rf) + 1:length(predicted_values_rf), predicted_values_rf, col = "orange")
# Add legend
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "orange"), lty = 1)
# Highlight the last observed data point
points(length(actual_values_rf), actual_values_rf[length(actual_values_rf)], pch = 19, col = "red")
# Highlight the starting point of predicted values
points(length(actual_values_rf) + 1, predicted_values_rf[1], pch = 19, col = "green")
# Add information
text(length(actual_values_rf) + 5, max(actual_values_rf, predicted_values_rf)/2,
"Actual values and Predicted values using Random forest Regressor",
pos = 1, col = "darkgreen", cex = 0.8)
Model Performance Evaluation of Random Forest regressor based on RMSE and MAE
# Section: Model Performance Evaluation of Random Forest regressor based on RMSE and MAE
# Set seed for reproducibility
set.seed(667)
# Root Mean Squared Error (RMSE) calculation
rmse <- RMSE(y_test, rf_predictions)
print(rmse)
# Mean Absolute Error (MAE) calculation
mae <- MAE(y_test, rf_predictions)
print(mae)
[1] 683.8066 [1] 467.8042
Residual Plot for Random Forest Regressor
# Section: Residual Plot for Random Forest Regressor
# Calculate residuals for Random Forest predictions
residuals <- y_test - rf_predictions
# Plot the Residual Plot
plot(y_test, residuals,
xlab = 'Observed', ylab = 'Residuals',
main = 'Residual Plot for Random Forest Regressor', col = 'blue', pch = 16)
# Add a reference line at y = 0
abline(0, 0, col = 'red', lty = 2)
# Enhance plot aesthetics
grid(col = "lightgray")
title(main = "Residual Plot for Random Forest Regressor", col.main = "darkgreen", font.main = 2)
axis(side = 1, col = "darkblue", col.axis = "darkblue")
axis(side = 2, col = "darkblue", col.axis = "darkblue")
# Add informative labels
text(3000, 1000, "The plot shows the residuals from the Random Forest predictions.",
col = "red", cex = 0.8)
# Add observations
cat("Observation: In an ideal scenario, residuals should be randomly scattered around the reference line at y = 0.\n")
cat("Patterns or trends in the residuals may suggest areas where the model can be improved.\n")
Observation: In an ideal scenario, residuals should be randomly scattered around the reference line at y = 0. Patterns or trends in the residuals may suggest areas where the model can be improved.
#Section: Selecting best Model in all three for futher prediction
# Calculate RMSE and MAE for each model
lm_rmse <- RMSE(lm_predictions, test_encoded_attributes$count)
lm_mae <- MAE(lm_predictions, test_encoded_attributes$count)
dtr_rmse <- RMSE(dtr_predictions, test_encoded_attributes$count)
dtr_mae <- MAE(dtr_predictions, test_encoded_attributes$count)
rf_rmse <- RMSE(rf_predictions, test_encoded_attributes$count)
rf_mae <- MAE(rf_predictions, test_encoded_attributes$count)
# Create an accuracy table
accuracy_table <- data.frame(
Model = c("Linear Regression", "Decision Tree Regressor", "Random Forest Regressor"),
RMSE = c(lm_rmse, dtr_rmse, rf_rmse),
MAE = c(lm_mae, dtr_mae, rf_mae)
)
# Find the best model based on the minimum RMSE and MAE
best_model <- accuracy_table[which.min(accuracy_table$RMSE + accuracy_table$MAE), ]
# Print the accuracy table
print(accuracy_table)
Model RMSE MAE 1 Linear Regression 780.1233 574.8627 2 Decision Tree Regressor 927.4832 684.4555 3 Random Forest Regressor 683.8066 467.8042
The performance metrics for different regression models applied to the dataset:
Linear Regression:
Decision Tree Regressor:
Random Forest Regressor:
Lower values of RMSE and MAE indicate better model performance. Therefore, the Random Forest Regressor model shows the best performance among the three models evaluated.
# Print the best model
print("Best Model:")
print(best_model)
[1] "Best Model:" Model RMSE MAE 3 Random Forest Regressor 683.8066 467.8042
# Section: Selecting Final Model as Random Forest Regressor for Prediction of Bike Rental Count
# Combine observed and predicted values
Bike_predictions <- data.frame(Observed = y_test, Predicted = rf_predictions)
# Write predictions to a CSV file
write.csv(Bike_predictions, 'Bike_Renting_Predictions.csv', row.names = FALSE)
# Display the predictions
Bike_predictions
Observed | Predicted | |
---|---|---|
<dbl> | <dbl> | |
3 | 1349 | 1378.271 |
4 | 1562 | 1338.935 |
6 | 1606 | 1329.461 |
7 | 1510 | 1478.206 |
12 | 1162 | 1393.556 |
14 | 1421 | 1448.848 |
15 | 1248 | 1447.320 |
16 | 1204 | 1397.887 |
21 | 1543 | 1440.239 |
28 | 1167 | 1173.400 |
29 | 1098 | 1046.900 |
32 | 1360 | 1271.782 |
35 | 1708 | 1584.079 |
43 | 1472 | 1554.543 |
44 | 1589 | 1777.310 |
45 | 1913 | 2260.890 |
47 | 2115 | 1897.504 |
55 | 1807 | 1649.068 |
58 | 2402 | 1620.425 |
62 | 1685 | 1656.711 |
71 | 2132 | 2029.090 |
72 | 2417 | 2013.939 |
73 | 2046 | 2062.932 |
74 | 2056 | 1871.097 |
78 | 3117 | 3778.040 |
83 | 1865 | 2021.938 |
86 | 1693 | 2290.974 |
87 | 2028 | 2397.372 |
88 | 2425 | 2368.786 |
89 | 1536 | 2223.834 |
⋮ | ⋮ | ⋮ |
617 | 5976 | 5816.432 |
621 | 7870 | 7665.649 |
623 | 8009 | 7707.574 |
624 | 8714 | 7639.816 |
626 | 6869 | 6294.597 |
631 | 8395 | 7287.318 |
636 | 7393 | 7000.209 |
638 | 8555 | 7703.401 |
639 | 6889 | 7427.717 |
641 | 4639 | 4759.513 |
642 | 7572 | 6076.654 |
643 | 7328 | 6800.629 |
646 | 3510 | 5801.521 |
649 | 7691 | 7364.260 |
655 | 7534 | 6995.307 |
665 | 7444 | 6207.183 |
667 | 4459 | 6683.840 |
673 | 5138 | 4969.967 |
674 | 5107 | 5022.264 |
676 | 5686 | 4809.488 |
677 | 5035 | 4937.191 |
679 | 5992 | 5449.521 |
684 | 5495 | 5031.985 |
688 | 4669 | 4731.366 |
695 | 2424 | 3992.329 |
697 | 3959 | 3986.974 |
703 | 6234 | 6107.642 |
709 | 3228 | 4464.756 |
722 | 1749 | 2686.222 |
726 | 441 | 2211.657 |
# Conclusion: Random Forest Model Outperforms
cat("When comparing RMSE and MAE of all 3 models, the random forest model shows the least errors.\n")
cat("Thus, the random forest model is considered the best for predicting daily bike rental counts.\n")
When comparing RMSE and MAE of all 3 models, the random forest model shows the least errors. Thus, the random forest model is considered the best for predicting daily bike rental counts.