#!/usr/bin/env python
# coding: utf-8

# # Analysis of Airbnb bookings across Seattle
#
# In this project, I analyse and answer a few questions on the Airbnb Seattle dataset available on Kaggle. The implementation follows the CRISP-DM process, i.e., the project is organized into the following steps:
#
# 1. Business understanding
# 2. Data understanding
# 3. Preparation of the data
# 4. Modelling (if necessary)
# 5. Evaluation of the results
# 6. Conclusion
#
# The questions answered in this project are:
# 1. Can we predict a suitable price based on the features of the room {property_type, room_type, no. of bedrooms, no. of bathrooms, etc.}?
# 2. What is the probable price for a room based on the neighbourhood?
# 3. Which types of properties are common in each neighbourhood?
# 4. What is the effect of selecting a "Strict" cancellation policy on the frequency of bookings?
# 5. In which neighbourhood is the price highest despite providing minimal amenities?

# In[1]:

import pandas as pd
import warnings
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

warnings.filterwarnings('ignore')
pio.templates.default = "plotly_white"

listings = pd.read_csv('./Data/listings.csv')
calendar = pd.read_csv('./Data/calendar.csv')
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# In[2]:

listings.describe()

# In[3]:

listings.head(2)

# Among the plethora of columns, we first select the ones that are relevant to our questions.

# # Processing the Data
#
# Since the data is presented in a straightforward way, we can skip directly to preparing it for answering our questions. This step usually takes a relatively long time, as there are several aspects that need attention.
#
# For this project we will mostly work with aggregations, so the important aspects to take care of are:
# - make sure that there are no null values;
# - check for outliers and exclude them from the data;
# - finally, before the modelling step, convert all categorical variables into dummy/indicator columns.
#
# Since most of the questions can be answered once the outliers are processed, the final step is done just before answering the last question, i.e., the one that requires predicting from the available data.

# In[4]:

relevant_columns = ['id', 'neighbourhood_group_cleansed', 'property_type', 'room_type',
                    'accommodates', 'bathrooms', 'square_feet', 'bedrooms', 'beds', 'bed_type',
                    'amenities', 'cancellation_policy', 'minimum_nights', 'instant_bookable',
                    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
                    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
                    'review_scores_value', 'price']
listings_new = listings[relevant_columns].copy()  # copy to avoid chained-assignment issues later

# In[5]:

listings_new.head()

# Now it is time to reformat the price.
# In[6]:

listings_new.loc[:, "price"] = listings_new.price.str.replace("[$, ]", "", regex=True).astype("float")

# In[7]:

listings_new.head()

# ### Tackling the null values

# In[8]:

100 * listings_new.isnull().sum().sort_values(ascending=False) / listings_new.shape[0]

# The square-feet information is missing for most of the listings, so we assume it has very little usable effect on the other parameters and drop the column. The review columns, however, cannot be handled the same way: values are present for most listings and only a minority of entries are missing in the related fields, so we impute them instead.

# In[9]:

listings_new.drop('square_feet', axis=1, inplace=True)
listings_new.dtypes

# In[10]:

review_cols = [x for x in listings_new.columns if 'review' in x]
listings_new[review_cols].describe()

# The value at the 50th percentile (the median) is assumed to be a reasonable replacement for the missing review values.

# In[11]:

percentile50 = listings_new[review_cols].quantile(0.5)
listings_new.loc[:, review_cols] = listings_new[review_cols].fillna(percentile50)

# In[12]:

listings_new.isnull().sum().sort_values(ascending=False)

# Now it is time to deal with the remaining quantitative null values.

# In[13]:

listings_new.loc[:, ['bathrooms', 'bedrooms', 'beds']] = listings_new.loc[:, ['bathrooms', 'bedrooms', 'beds']].fillna(
    listings_new.loc[:, ['bathrooms', 'bedrooms', 'beds']].quantile(0.50))

# For the property type, let us check whether we can infer property_type from room_type.

# In[14]:

listings_new[listings_new.property_type.isnull()]

# Based on the room_type, we can say that the property_type is Apartment.

# In[15]:

listings_new.loc[listings_new.property_type.isnull(), 'property_type'] = "Apartment"  # the single missing property_type

# In[16]:

listings_new.isnull().sum().sort_values(ascending=False)

# ### Checking for outliers

# To check for outliers we can use the pre-defined `.describe()` method or graphs. In this case, I start with the `describe()` method.

# In[17]:

# For reference...
listings_new.head(2)

# In[18]:

quant_features = listings_new[['accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights',
                               'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
                               'review_scores_checkin', 'review_scores_communication',
                               'review_scores_location', 'review_scores_value']]
description = quant_features.describe()

# Computing higher quantiles for better information
higher_quantiles = quant_features.quantile([0.9, 0.99, 0.999]).rename(index={0.9: '90%', 0.99: '99%', 0.999: '99.9%'})

# Combining the dataframes to get a complete description
description = pd.concat([description, higher_quantiles])

# In[19]:

description.index

# In[20]:

description.loc[['min', 'mean', 'max', '25%', '50%', '75%', '90%', '99%', '99.9%'], :]

# We can identify outliers by observing the gap between the mean and the min/max values; the extra quantile rows are there for additional context.
#
# From the above description, the following can be flagged as outliers:
# * In the fields describing the room (accommodates, bathrooms, bedrooms, minimum_nights), the outliers are the listings with unusually high values.
# * In the fields describing the reviews, the listings with very low ratings/scores can be identified as outliers, as even the 25th percentile is far above the minimum values.
#
# These can be represented graphically using histograms, box plots and scatter plots.
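# As a numeric complement to the plots, we could also compute Tukey-style fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR)
# for the quantitative features. The snippet below is only an illustrative sketch of that alternative check,
# reusing the `quant_features` frame defined above; the actual filtering in this notebook is based on the
# boxplots and the quantile tables.

q1 = quant_features.quantile(0.25)
q3 = quant_features.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are candidate outliers (room parameters)
lower_fence = q1 - 1.5 * iqr  # values below this are candidate outliers (review scores)
print(pd.DataFrame({'lower_fence': lower_fence, 'upper_fence': upper_fence}))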
# In[21]:

# Boxplots
px.box(listings_new, y=['accommodates'])

# So, any listing that accommodates more than 7 guests can be treated as an outlier.

# In[22]:

original_entries = listings_new.shape[0]
listings_new = listings_new[listings_new.accommodates <= 7]
print(f'No. of excluded entries = {original_entries - listings_new.shape[0]}')

# Since stays that accommodate more than 7 guests are excluded, the other room-related columns should largely be corrected automatically as well. This is done to keep the accommodates values as close to typical conditions as possible. Once this parameter is corrected, the remaining room parameters are mostly at the host's discretion and cannot be processed much further.

# In[23]:

quant_features_rooms = listings_new[['accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights']]
higher_quantiles = quant_features_rooms.quantile([0.9, 0.99, 0.999]).rename(index={0.9: '90%', 0.99: '99%', 0.999: '99.9%'})
pd.concat([quant_features_rooms.describe(), higher_quantiles])

# Though the assumption was correct to a certain extent, let us look at the boxplot once more to decide how to correct the minimum nights.

# In[24]:

px.box(listings_new, y='minimum_nights')

# By zooming into the plotly figure we can find the upper fence, which is 3. However, given the context of the data, it is plausible that some hosts require a minimum stay of a month, so taking that into consideration the upper limit for minimum_nights is set to 31.

# In[25]:

original_entries = listings_new.shape[0]
listings_new = listings_new[listings_new.minimum_nights <= 31]
print(f'No. of excluded entries = {original_entries - listings_new.shape[0]}')

# So only one entry was excluded, and the remaining values seem logical: some hosts may not want guests staying for just 1 or 2 days and prefer stays of up to a month.

# Now it is time to remove outliers from the **review** features.

# In[26]:

quant_reviews = listings_new[['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
                              'review_scores_checkin', 'review_scores_communication',
                              'review_scores_location', 'review_scores_value']]

# Computing lower quantiles for better information
lower_quantiles = quant_reviews.quantile([0.1, 0.01]).rename(index={0.1: '10%', 0.01: '1%'})

# Combining the dataframes to get a complete description
pd.concat([quant_reviews.describe(), lower_quantiles]).loc[['min', '1%', '10%', '25%'], :]

# From the above table, we can set the lower limits to (approximately) the values at the 1% quantile.

# In[27]:

original_entries = listings_new.shape[0]
listings_new = listings_new[(listings_new.review_scores_rating >= 72) &
                            (listings_new.review_scores_accuracy >= 7.32) &
                            (listings_new.review_scores_cleanliness >= 6) &
                            (listings_new.review_scores_checkin >= 8) &
                            (listings_new.review_scores_communication >= 8) &
                            (listings_new.review_scores_location >= 8) &
                            (listings_new.review_scores_value >= 7)]
print(f'No. of excluded entries = {original_entries - listings_new.shape[0]}')

# Hence, from the initial state to the final stage, a total of 283 entries were excluded to get a more general picture of the data. Now all that is left is to:
# * reset the index to get contiguous values,
# * remove the excluded listings from the calendar dataframe,
# * answer the questions.

# In[28]:

# Resetting the index
listings_new = listings_new.reset_index(drop=True)
listings_new.shape

# In[29]:

listings_new.head()

# In[30]:

# Formatting the calendar dataframe.
calendar.head()

# In[31]:

valid_listings = listings_new.id.to_list()
calendar = calendar[calendar.listing_id.isin(valid_listings)]

# In[32]:

calendar.listing_id.nunique()

# Now that the null values and the outliers are dealt with, there is one small step left before we can start answering the questions: unpacking the amenities in the listings. Since we will also use them in the upcoming model, we generate dummy columns for the amenities right away.

# In[33]:

listings_new.amenities = listings_new.amenities.str.replace('"', '', regex=False)
listings_new.amenities = listings_new.amenities.str.replace('{', '', regex=False)
listings_new.amenities = listings_new.amenities.str.replace('}', '', regex=False).str.split(",")
new_df = listings_new.amenities
new_df.head(10)

# In[34]:

# Collect the set of all distinct amenities across the listings
temp_list = []
for x in new_df:
    temp_list += x
net_contents = set(temp_list)
net_contents = [x.replace('"', '') for x in net_contents]
net_contents = [x for x in net_contents if x != '']

# In[35]:

# Sanity check: no stray newline tokens remain among the amenity names
'\n' in net_contents

# Now that we have all the amenity categories, it is time to create the dummy columns.

# In[36]:

amenity_cols = ['amenity_' + x for x in net_contents]
am_dummies = pd.DataFrame(columns=amenity_cols, index=new_df.index)

# In[37]:

am_dummies.head(10)

# In[38]:

# This is the straightforward approach; a vectorized alternative is sketched after this step.
for i in range(len(new_df)):
    for j in range(len(net_contents)):
        if net_contents[j] in new_df[i]:
            am_dummies.iloc[i, j] = 1
        else:
            am_dummies.iloc[i, j] = 0

# Adding the total number of amenities offered by the stay
am_dummies['Total_amenities'] = am_dummies.sum(axis=1)

# Removing the source column
listings_new.drop('amenities', axis=1, inplace=True)

# Adding the amenities dataframe to the listings_new dataframe
listings_new = pd.concat([listings_new, am_dummies], axis=1)

# Confirming that the datatypes are int
am_cols = [col for col in listings_new.columns if 'amenity_' in col]
listings_new.loc[:, am_cols] = listings_new.loc[:, am_cols].astype('int')

# In[39]:

[len(x) for x in new_df][:10]

# In[40]:

listings_new.head(10)
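# As flagged in the comment above the encoding loop, the same dummy columns can be built without an explicit
# Python loop. The snippet below is only a sketch of that alternative (it is not used for the analysis); it
# assumes the `new_df` series of amenity lists created above and should yield the same 0/1 columns plus the
# total count.

amenity_lists = new_df.apply(lambda ams: [a for a in ams if a])           # drop empty strings
am_dummies_fast = amenity_lists.str.join('|').str.get_dummies(sep='|')   # one 0/1 column per amenity
am_dummies_fast = am_dummies_fast.add_prefix('amenity_')
am_dummies_fast['Total_amenities'] = am_dummies_fast.sum(axis=1)
am_dummies_fast.head()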
# # Answering the Questions
#
# We now work through the questions listed at the beginning of the notebook.

# ## What is the probable price for a room based on the neighbourhood?

# To answer this question, we aggregate the mean price and the mean number of amenities per neighbourhood and property type.

# In[41]:

mean_price_per_nbhood = listings_new.groupby(['neighbourhood_group_cleansed', 'property_type'])[['price', 'Total_amenities']].mean().reset_index()
mean_price_per_nbhood.Total_amenities = round(mean_price_per_nbhood.Total_amenities)
priceVsnbhood = mean_price_per_nbhood.groupby('neighbourhood_group_cleansed').price.mean().reset_index()

# In[42]:

fig = make_subplots(rows=1, cols=2, column_widths=[0.95, 0.05],
                    subplot_titles=("Neighbourhood analysis", "Box plot of prices"))
fig.add_trace(go.Box(x=mean_price_per_nbhood['neighbourhood_group_cleansed'],
                     y=mean_price_per_nbhood['price'],
                     name='price by property type'),
              row=1, col=1)
fig.add_trace(go.Scatter(x=priceVsnbhood.neighbourhood_group_cleansed,
                         y=priceVsnbhood.price,
                         mode='lines+markers',
                         name='mean price by neighbourhood',
                         marker_color='green'),
              row=1, col=1)
fig.add_box(y=mean_price_per_nbhood.price, name='Avg price', row=1, col=2)
fig.show()

# In[43]:

fig = go.Figure()
fig.add_trace(go.Scatter(x=mean_price_per_nbhood['neighbourhood_group_cleansed'],
                         y=mean_price_per_nbhood['price'],
                         marker_color=mean_price_per_nbhood['Total_amenities'],
                         text=mean_price_per_nbhood.Total_amenities,
                         mode='markers',
                         name='Price while considering qty of amenities'))
fig.show()

# In[44]:

listings_new[(listings_new.Total_amenities == 0) & (listings_new.neighbourhood_group_cleansed == 'Beacon Hill')]

# In[45]:

mean_price_per_nbhood = listings_new.groupby(['neighbourhood_group_cleansed', 'property_type'])[['price', 'Total_amenities']].mean().reset_index()
mean_price_per_nbhood.Total_amenities = round(mean_price_per_nbhood.Total_amenities)
fig = px.scatter(mean_price_per_nbhood, x='property_type', y='price', color='Total_amenities',
                 hover_data=['neighbourhood_group_cleansed'], template='plotly_dark')
fig.show()

# ### Results:
# From the above two graphs, we can decide on the probable price of each property type in each neighbourhood. The results can be summarized as follows:
#
# - If you want to set the price based solely on the neighbourhood, prices are highest in the Magnolia region, followed by West Seattle, Capitol Hill, Queen Anne and Downtown, with the lowest prices for listings in Northgate, University District and Delridge.
#
# By adding more information, such as the property_type, we can gather further observations:
# - Listing a "Boat" in some of the neighbourhoods commands a high price, followed by listing a camper in Beacon Hill.
# - Among the listings that are not vehicles, condominiums in West Seattle and Magnolia command high prices, while the lowest-priced listing is a "Tent" in a few of the neighbourhoods.
#
# Overall, one should be able to book a listing in Seattle for a day in most neighbourhoods with a budget in the range of \\$84.00 to \\$132.00.

# ## What is the neighbourhood where the price is highest while providing minimum amenities?

# If we look closely, there are some cases where no amenities are provided but the price of the listing is still high; one such example is the camper listing at Beacon Hill. Other listings with no amenities provided include **tents** and **treehouses**. Let us first take a look at them.

# In[46]:

no_amenity_list = list(listings_new[listings_new.Total_amenities == 0].property_type.unique())
print(no_amenity_list)

# So, there are also apartment and house listings with none of the amenities provided.
# Since we can reasonably infer that tent rentals cannot be priced high, we exclude them from further analysis.

# In[47]:

no_amenity_list.remove('Tent')
no_amenity_list

# Now our goal is to compare neighbourhood against price for these property types.

# In[48]:

# Extracting the information needed
df = listings_new[(listings_new.property_type.isin(no_amenity_list)) & (listings_new.Total_amenities == 0)].loc[:, ['neighbourhood_group_cleansed', 'property_type', 'room_type', 'price']]
df.shape

# In[49]:

fig = px.scatter(df, x='neighbourhood_group_cleansed', y='price', color='property_type', hover_data=['room_type'])
fig.show()

# Using the interactivity of **plotly**, we can read off the highest price of each property type per neighbourhood. The results are summarized in the following table.
#
# | Neighbourhood | Property Type   | Price (\\$) |
# |---------------|-----------------|-------------|
# | Downtown      | Apartment       | 300         |
# | West Seattle  | House           | 350         |
# | West Seattle  | Treehouse       | 55          |
# | Beacon Hill   | Camper/RV       | 375         |
# | Capitol Hill  | Bed & Breakfast | 165         |
#
# One thing to note is that excluded factors, such as the square footage of the listings, might play a big role in these figures; however, to limit the scope of the analysis, they are not considered here.

# ## Which type of properties are common in each neighbourhood?

# In[50]:

nbhood_listings = listings_new.loc[:, ['neighbourhood_group_cleansed', 'property_type']]
nbhood_listings = nbhood_listings.groupby('neighbourhood_group_cleansed').value_counts()
nbhood_listings = nbhood_listings.reset_index().rename(columns={0: 'count'})
nbhood_listings.head()

# In[51]:

total_listings = listings_new.loc[:, ['neighbourhood_group_cleansed', 'property_type']].groupby('neighbourhood_group_cleansed').count().reset_index()
total_listings = total_listings.rename(columns={'property_type': 'tot_count'})

# In[52]:

total_listings

# In[53]:

# Share of each property type within its neighbourhood
# (a loop-free alternative using groupby().transform() is sketched below, after the observations)
temp = pd.Series(index=range(nbhood_listings.shape[0]))
temp = temp.fillna(0)
for neighbourhood in total_listings.neighbourhood_group_cleansed.to_list():
    bool1 = nbhood_listings.neighbourhood_group_cleansed == neighbourhood
    bool2 = total_listings.neighbourhood_group_cleansed == neighbourhood
    temp[bool1] = nbhood_listings[bool1]['count'] / int(total_listings[bool2]['tot_count'])
nbhood_listings['percentage'] = round(100 * temp, 3)

# In[54]:

nbhood_listings.head()

# Since we have to find the common property types of each neighbourhood, we can exclude those that contribute less than 5% of the count.

# In[55]:

nbhood_listings = nbhood_listings[nbhood_listings.percentage > 5]

# In[56]:

fig = px.histogram(nbhood_listings, x='neighbourhood_group_cleansed', y='percentage', color='property_type', barmode='group', histfunc='avg')
fig.show()

# **Observations**:
# - In all neighbourhoods, the most common listing types are House and Apartment, except for Downtown, where the listings are predominantly apartments.
# - After the House and Apartment listings, the next most common kind is the Townhouse, found in Beacon Hill, Delridge, Central Area and Magnolia.
# - In Interbay, Loft listings are more frequent than in the remaining neighbourhoods.
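# As an aside, the per-neighbourhood percentage built with the explicit loop in In[53] can also be computed
# with a grouped transform. The helper below is only an illustrative sketch (the analysis above keeps the
# loop-based result); the commented call shows how it would reproduce the 'percentage' column if applied
# before the >5% filter.

def pct_within_group(frame, group_col, count_col):
    """Share of each row's count within its group, in percent."""
    return 100 * frame[count_col] / frame.groupby(group_col)[count_col].transform('sum')

# nbhood_listings['percentage'] = round(
#     pct_within_group(nbhood_listings, 'neighbourhood_group_cleansed', 'count'), 3)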
# ## What is the effect of selecting a "Strict" cancellation policy on the frequency of bookings?

# In[57]:

calendar.drop(['price'], axis=1, inplace=True)
pd.get_dummies(calendar.available, drop_first=True).head()

# In[58]:

# Encode availability as 1 (available) / 0 (booked)
calendar['available'] = pd.get_dummies(calendar['available'], drop_first=True)
calendar.head()

# In[59]:

booked_data = calendar.groupby('listing_id').available.sum()
listings_new = listings_new.set_index('id')

# Combining the dataframe and the series along the listing ids
listings_new = pd.concat([listings_new, booked_data], axis=1).reset_index().rename(columns={'index': 'id'})

# In[60]:

round(listings_new.groupby('cancellation_policy').available.mean())

# **Generalization**
#
# From the above result we can see that, although the cancellation policy appears to have only a small effect, some generalizations can be made:
# - Listings with a **strict** cancellation policy are slightly more available, i.e., booked slightly less than listings with the other cancellation policies.
# - The flexible and moderate cancellation policies show close to no difference in how frequently the listings are booked.
#
# Having reached this conclusion for Seattle as a whole, it is time to look a bit deeper into the differences between neighbourhoods and try to relate our earlier answers to this one.

# In[61]:

nbhood_cancel_df = round(listings_new.groupby(['neighbourhood_group_cleansed', 'cancellation_policy']).available.mean()).reset_index()
nbhood_cancel_df['booked_days'] = round(100 * (365 - nbhood_cancel_df.available) / 365, 3)  # share of the year booked, in percent
fig = px.histogram(nbhood_cancel_df, x='neighbourhood_group_cleansed', y='booked_days', color='cancellation_policy', barmode='group', histfunc='avg')
fig.show()

# **Observations**:
# Based on the graph, most of the places follow our intuitive understanding, **except** for *Interbay, Delridge, Rainier Valley and Seward Park*. Many factors could drive the higher bookings there, so an easy conclusion cannot be drawn for these cases.

# ## Predicting a suitable price based on the features of the room {property_type, room_type, no. of bedrooms, no. of bathrooms, etc.}

# This question is answered by building a multivariate linear regression model, where the target y is the price and the features are all the remaining columns.

# In[62]:

listings_new.head()

# In[63]:

listings_new.dtypes

# In[64]:

listings_new.drop('available', inplace=True, axis=1)
categorical_cols = listings_new.select_dtypes(include=['object']).columns
categorical_cols

# In[65]:

# Replace each categorical column with its dummy/indicator columns
for col in categorical_cols:
    listings_new = pd.concat([listings_new.drop(col, axis=1),
                              pd.get_dummies(listings_new[col], prefix=col, drop_first=True)], axis=1)
listings_new.head()

# In[66]:

x = listings_new.drop('price', axis=1)
y = listings_new['price']

# Since we have separated the inputs and the output, it is now time to split the dataset. To do that, we can simply use the train_test_split() method and then initialize the model.

# In[67]:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=35)

# Initializing the model and training it
# (the `normalize` argument works on older scikit-learn releases; see the note after this cell for newer versions)
model = LinearRegression(normalize=True)
model.fit(x_train, y_train)
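# Note: the `normalize` argument of LinearRegression was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, a roughly equivalent setup (a sketch only, not the model evaluated below) scales the
# features explicitly inside a pipeline.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(x_train, y_train)  # kept separate; the evaluation below uses `model`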
# ## Evaluation of the model
#
# To evaluate the model, the following metrics are selected:
# 1. R2 score - the proportion of the variance in the dependent variable (price) that is predictable from the independent variables (the features).
# 2. Root mean squared error (RMSE) - the typical difference between the predicted values and the true values, in the same units as the price.

# In[68]:

# Predict and score the model (taking the square root of the MSE gives the RMSE)
y_train_preds = model.predict(x_train)
print(f"Score on train data:\n R2-Score: {r2_score(y_train, y_train_preds):.2f} \n RMSE: {mean_squared_error(y_train, y_train_preds) ** 0.5:.2f}")

y_test_preds = model.predict(x_test)
print(f"Score on test data:\n R2-Score: {r2_score(y_test, y_test_preds):.2f} \n RMSE: {mean_squared_error(y_test, y_test_preds) ** 0.5:.2f}")

# The model ends up doing slightly worse on the train data than on the test data, which is a good sign against overfitting, but there is clearly scope for improving the R2 score, probably by adding more features that are relevant to the price. Let us get more information about the predictions on the test set.

# In[69]:

fig = go.Figure()
fig.add_trace(go.Scatter(x=y_test, y=y_test_preds, mode='markers', name='predicted vs actual'))
fig.add_trace(go.Scatter(x=y_test, y=y_test, line=dict(color='blue', dash='dash'), name='ideal fit'))
fig.update_xaxes(range=[0, 400])
fig.update_yaxes(range=[0, 400])
fig.show()

# From this plot we can infer that the gap between the predicted and actual prices grows as the actual price increases.

# # Conclusions
#
# The main takeaways from the analysis are as follows:
# - In most cases, choosing a strict cancellation policy has only a small effect on bookings, except in a few regions.
# - Most of the listings are priced between \\$84 and \\$133, though there are some outliers in all cases.
# - Houses and apartments are common listings everywhere, while the other listing types are largely region specific; for example, 'Townhouses' are found mainly in the Central Area. Note that **Downtown** is an exception for house listings, as less than 5% of its listings are houses.
# - Irrespective of the property type, the highest prices are most often observed in **Magnolia**, while the lowest prices are observed in the **University District**.
# - The linear regression model can be considered a reasonable baseline: from our analysis of the prices, we know that most prices lie in the range of \\$84 to \\$133, which falls in the region where the model's predictions are good approximations.