Google Play Store and formerly Android Market, is a digital distribution service operated and developed by Google. It serves as the official app store for certified devices running on the Android operating system.
Android Market was announced by Google on August 28, 2008
Applications are available through Google Play either free of charge or at a cost. They can be downloaded directly on an Android device through the proprietary Play Store mobile app or by deploying the application to a device from the Google Play website.
Google Play was launched on March 6, 2012, bringing together Android Market.
By 2017, Google Play featured more than 3.5 million Android applications.
This subject facinates me, and I would like to drill into the dataset and get a better grasp of the statistics, moreover in my humble opinion I think it's essential phase for each company who is planning to develop an app to invastigate and apply some research on the android market app. since billions of dollars are invested in this domain.
Therefore I have decided to expolre on my own the data, and to come to vital conclusions about this industry.
The source of the data is from KAGGLE, The information comes in the table, The table is full of details about apps in the Google Store, such as In most columns the information is represented as numeric numbers In some columns the information is represented as a string. (therefore, in preparing the data we converted them to int)
I have applied EDA and preprocessed the data for future statistics analytics. The reasearch questions are listed below.
Question #1: Can we predict the popularity of an app based on given few features?
Question #2: Can we predict the number of apps in a given future year?
Question #3: Can we predict the price of an individual app by given few features?
Question #4: For maximizing the profit a company what would be the suggested Category for an app?
!pip install plotly &> /dev/null
!pip install scikit-learn &> /dev/null
!pip uninstall scikit-learn -y &> /dev/null
!pip install -U scikit-learn &> /dev/null
!mkdir census_package &> /dev/null
!pip install geocoder &> /dev/null
!pip install squarify &> /dev/null
!pip install shap &> /dev/nul
import sklearn
from sklearn import metrics
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
import seaborn as sns
from google.colab import drive
import random
#py.init_notebook_mode(connected=True)
# importing visualization libraries
import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
from plotly.offline import iplot
%matplotlib inline
from IPython.display import HTML
import shap
!wget https://raw.githubusercontent.com/jasonchang0/kaggle-google-apps/master/google-play-store-apps/googleplaystore.csv
--2022-06-10 11:21:55-- https://raw.githubusercontent.com/jasonchang0/kaggle-google-apps/master/google-play-store-apps/googleplaystore.csv Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 1360155 (1.3M) [text/plain] Saving to: ‘googleplaystore.csv’ googleplaystore.csv 100%[===================>] 1.30M --.-KB/s in 0.05s 2022-06-10 11:21:55 (23.8 MB/s) - ‘googleplaystore.csv’ saved [1360155/1360155]
df1 = pd.read_csv("/content/googleplaystore.csv")
App: Application name
Category: Category the app belongs to
Rating: Overall user rating of the app (as when scraped)
Reviews: Number of user reviews for the app (as when scraped)
Size: Size of the app (as when scraped)
Installs: Number of user downloads/installs for the app (as when scraped)
Type: Paid or Free
Price: Price of the app (as when scraped)
Genres: An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres.
Last Updated: Date when the app was last updated on Play Store (as when scraped)
Current Ver: Current version of the app available on Play Store (as when scraped)
Android Ver: Min required Android version (as when scraped)
For the following steps, in order to process the data in the machine learning algorithms, we need to first convert it from text to numbers, as from what i understand, most algorithms run better that way. From most of the books I've read, data cleaning/preprocessing is THE most important part of any machine learning process, as high quality data translates to high quality predictions and models.
Checking out the info, there's a not of null values that need to be addressed. Since my main objective is predicting the ratings of the apps, I deleted all the NaN values, just for simplicity sake.
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10841 entries, 0 to 10840 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 10841 non-null object 1 Category 10841 non-null object 2 Rating 9367 non-null float64 3 Reviews 10841 non-null object 4 Size 10841 non-null object 5 Installs 10841 non-null object 6 Type 10840 non-null object 7 Price 10841 non-null object 8 Content Rating 10840 non-null object 9 Genres 10841 non-null object 10 Last Updated 10841 non-null object 11 Current Ver 10833 non-null object 12 Android Ver 10838 non-null object dtypes: float64(1), object(12) memory usage: 1.1+ MB
df1.dropna(inplace = True)
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9360 entries, 0 to 10840 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 9360 non-null object 1 Category 9360 non-null object 2 Rating 9360 non-null float64 3 Reviews 9360 non-null object 4 Size 9360 non-null object 5 Installs 9360 non-null object 6 Type 9360 non-null object 7 Price 9360 non-null object 8 Content Rating 9360 non-null object 9 Genres 9360 non-null object 10 Last Updated 9360 non-null object 11 Current Ver 9360 non-null object 12 Android Ver 9360 non-null object dtypes: float64(1), object(12) memory usage: 1023.8+ KB
For the following steps, in order to process the data in the machine learning algorithms, we need to first convert it from text to numbers, as from what i understand, most algorithms run better that way. From most of the books I've read, data cleaning/preprocessing is THE most important part of any machine learning process, as high quality data translates to high quality predictions and models.
From the categorical column, I converted each category into an individual number. In the later sections when we do apply machine learning, two methods will be applied to the code, being integer encoding(which we are doing now) and one-hot encoding, aka dummy variables.
The main reason as to why I understand we do this transformation is mainly because integer encoding relies on the fact that there's a relationship between each category(e.g. think age range vs types of animals). In this case however, it's hard to really determine such a relationship, hence dummy/one-hot encoding might help provide better predictive accuracy.
# Cleaning Categories into integers
CategoryString = df1["Category"]
categoryVal = df1["Category"].unique()
categoryValCount = len(categoryVal)
category_dict = {}
for i in range(0,categoryValCount):
category_dict[categoryVal[i]] = i
df1["Category_c"] = df1["Category"].map(category_dict).astype(int)
Technically when doing the cleaning of genres, one-hot should also be applied in this instance. However, I did not as firstly, it's a subset of the categorical column and secondly, application of a dummy variable would significantly increase the number of independent variables.
So to combat this instead, we ran two seperate regressions, one including and one excluding such genre data. When including the data, we only considered in the impact/information provided via the genre section purely based on it's numeric value.
#Cleaning of genres
GenresL = df1.Genres.unique()
GenresDict = {}
for i in range(len(GenresL)):
GenresDict[GenresL[i]] = i
df1['Genres_c'] = df1['Genres'].map(GenresDict).astype(int)
Cleaning of the prices of the apps to floats
df1['Price'] = df1['Price'].apply(lambda x: str(x).replace('$', '') if '$' in str(x) else str(x))
df1['Price'] = df1['Price'].apply(lambda x: float(x))
# Removing punctuation and plus marks. So that the quantities become from string to int.
def get_number_from_string(my_string):
if isinstance(my_string, str):
my_string = my_string.replace(",","")
my_string = my_string.replace("+","")
number = int(my_string)
return number
Cleaning of sizes of the apps and also filling up the missing values using ffill
# This function is responsible for converting measurments in column "Size" from either Mega and Kilo to Bytes.
def handle_size(str):
if str[-1] == "M":
return (float(str[:-1])*10**6)
elif str[-1] == "k":
return (float(str[:-1])*10**3)
This function is responsible for measuring the number of days passed from 2010 for each row in the column "Last Updated"
# This function is responsible for measuring the number of days passed from 2010 for each row in the column "Last Updated"
def handle_last_updated(str) :
input_data = datetime.strptime(str, "%B %d, %Y")
lower_bound = datetime(2010, 1, 1, 0, 0)
return (input_data - lower_bound).days
relevant_rows = df1['Installs']!= "Free"
relevant_rows
df1 = df1.loc[relevant_rows,:]
# 4.2 Invoking the preprocessing functions
# Run the command across the entire column
# get_number_from_string("80,000+")
df1["Installs"] = df1["Installs"].apply(lambda x:get_number_from_string(x))
#handle_size(df1.loc[0,"Size"])
df1["Size"] = df1["Size"].apply(lambda x:handle_size(x))
df1["Last Updated"] = df1["Last Updated"].apply(lambda x:handle_last_updated(x))
# convert reviews to numeric
df1['Reviews'] = df1['Reviews'].astype(int)
df1.isnull().sum()
App 0 Category 0 Rating 0 Reviews 0 Size 1637 Installs 0 Type 0 Price 0 Content Rating 0 Genres 0 Last Updated 0 Current Ver 0 Android Ver 0 Category_c 0 Genres_c 0 dtype: int64
After we saw that the measure column is missing values in 1637 cells. We will fill those cells with the average value of all the values in the column.
df1['Size'] = df1.Size.fillna(df1.Size.median())
df1.isnull().sum()
App 0 Category 0 Rating 0 Reviews 0 Size 0 Installs 0 Type 0 Price 0 Content Rating 0 Genres 0 Last Updated 0 Current Ver 0 Android Ver 0 Category_c 0 Genres_c 0 dtype: int64
print("The data table size is:", df1.shape)
print("*"*100)
print("The columns name are:", df1.columns)
print("*"*100)
print("The distibution values is:", df1["Type"].value_counts())
print("*"*100)
print("The average rating score of all apps is: ", df1["Rating"].mean())
print("*"*100)
print("The min rating score of all apps is: ", df1["Rating"].min())
print("*"*100)
print("The max rating score of all apps is: ", df1["Rating"].max())
print("*"*100)
print("Printing the first 5 rows of the table: ")
print(df1.head(n=5))
print("Printing the last 5 rows of the table: ")
print(df1.tail(n=5))
The data table size is: (9360, 15) **************************************************************************************************** The columns name are: Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver', 'Category_c', 'Genres_c'], dtype='object') **************************************************************************************************** The distibution values is: Free 8715 Paid 645 Name: Type, dtype: int64 **************************************************************************************************** The average rating score of all apps is: 4.191837606837612 **************************************************************************************************** The min rating score of all apps is: 1.0 **************************************************************************************************** The max rating score of all apps is: 5.0 **************************************************************************************************** Printing the first 5 rows of the table: App Category Rating \ 0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 1 Coloring book moana ART_AND_DESIGN 3.9 2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 Reviews Size Installs Type Price Content Rating \ 0 159 19000000.0 10000 Free 0.0 Everyone 1 967 14000000.0 500000 Free 0.0 Everyone 2 87510 8700000.0 5000000 Free 0.0 Everyone 3 215644 25000000.0 50000000 Free 0.0 Teen 4 967 2800000.0 100000 Free 0.0 Everyone Genres Last Updated Current Ver Android Ver \ 0 Art & Design 2928 1.0.0 4.0.3 and up 1 Art & Design;Pretend Play 2936 2.0.0 4.0.3 and up 2 Art & Design 3134 1.2.4 4.0.3 and up 3 Art & Design 3080 Varies with device 4.2 and up 4 Art & Design;Creativity 3092 1.1 4.4 and up Category_c Genres_c 0 0 0 1 0 1 2 0 0 3 0 0 4 0 2 Printing the last 5 rows of the table: App Category \ 10834 FR Calculator FAMILY 10836 Sya9a Maroc - FR FAMILY 10837 Fr. Mike Schmitz Audio Teachings FAMILY 10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE Rating Reviews Size Installs Type Price Content Rating \ 10834 4.0 7 2600000.0 500 Free 0.0 Everyone 10836 4.5 38 53000000.0 5000 Free 0.0 Everyone 10837 5.0 4 3600000.0 100 Free 0.0 Everyone 10839 4.5 114 14000000.0 1000 Free 0.0 Mature 17+ 10840 4.5 398307 19000000.0 10000000 Free 0.0 Everyone Genres Last Updated Current Ver \ 10834 Education 2725 1.0.0 10836 Education 2762 1.48 10837 Education 3108 1.0 10839 Books & Reference 1844 Varies with device 10840 Lifestyle 3127 Varies with device Android Ver Category_c Genres_c 10834 4.1 and up 18 12 10836 4.1 and up 18 12 10837 4.1 and up 18 12 10839 Varies with device 3 5 10840 Varies with device 16 28
This is the basic exploratory analysis to look for any evident patterns or relationships between the features.
x = df1['Rating'].dropna()
y = df1['Size'].dropna()
z = df1['Installs'][df1.Installs!=0].dropna()
p = df1['Reviews'][df1.Reviews!=0].dropna()
t = df1['Type'].dropna()
price = df1['Price']
p = sns.pairplot(pd.DataFrame(list(zip(x, y, np.log(z), np.log10(p), t, price)),
columns=['Rating','Size', 'Installs', 'Reviews', 'Type', 'Price']), hue='Type', palette="Set2")
From the initial results we can immediately see an interesting result regarding columns: installations and ratings - there is a clear linear relationship - as the number of installations increases so does the amount of ratings and vice versa)
let's look at how many percent of paid and free apps
column = 'Type'
grouped = df1[column].value_counts().reset_index()
grouped = grouped.rename(columns={column:'count','index':column})
print(grouped)
# Now plot the data
trace = go.Pie(labels=grouped[column],values=grouped['count'],pull=[0.05,0])
layout = {'title':'The Distribution of paid and not paid apps in the app store'}
fig = go.Figure(data=[trace],layout=layout)
iplot(fig)
# show it
plt.tight_layout()
plt.show()
Type count 0 Free 8715 1 Paid 645
<Figure size 432x288 with 0 Axes>
From the distribution we see that the percentage of free apps is about 93, compared to a total of 7 percent of paid apps
vc=df1["Content Rating"].value_counts().reset_index()
vc.rename(columns={'Content Rating': 'count','index':"type" }, inplace=True)
vc['percent']=vc['count'].apply(lambda x : 100*x/sum(vc['count']))
vc=vc.sort_values("percent")
vc
trace = go.Bar(x=vc["type"], y=vc["percent"], name="Group", marker=dict(color="#6ad49b"))
#layout={'title':"The number of ",'xaxis':{'title':"x title"}}
layout={'title':'The size of each ranking group','xaxis':{'title':"Group name"}}
fig = go.Figure(data=trace, layout=layout)
iplot(fig)
# show it
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
We see from the graph that the "everyone" type rating group is the most popular.
Which category has the highest share of (active) apps in the market?
number_of_apps_in_category = df1['Category'].value_counts().sort_values(ascending=True)
data = [go.Pie(
labels = number_of_apps_in_category.index,
values = number_of_apps_in_category.values,
hoverinfo = 'label+value'
)]
plotly.offline.iplot(data, filename='active_category')
# show it
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
Lets make a treemap to see this a litte more
import squarify #for making treemap, we need squarify
plt.figure(figsize=(20,8))
labels = df1['Rating'].value_counts().index.tolist()
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
squarify.plot(sizes = df1['Rating'].value_counts(), label = labels, color = colors, alpha = 0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x7f04b3cf5590>
From the treemap, we can see that most of the apps had 4.4 as rating and 1.2 as leas
Do any apps perform really good or really bad?
data = [go.Histogram(
x = df1.Rating,
xbins = {'start': 1, 'size': 0.1, 'end' :5}
)]
print('Average app rating = ', np.mean(df1['Rating']))
plotly.offline.iplot(data, filename='overall_rating_distribution')
Average app rating = 4.191837606837612
Generally, most apps do well with an average rating of 4.17.
Let's break this down and inspect if we have categories which perform exceptionally good or bad.
How do app sizes impact the app rating?
groups = df1.groupby('Category').filter(lambda x: len(x) >= 50).reset_index()
# sns.set_style('ticks')
# fig, ax = plt.subplots()
# fig.set_size_inches(8, 8)
sns.set_style("darkgrid")
ax = sns.jointplot(df1['Size'], df1['Rating'])
#ax.set_title('Rating Vs Size')
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
We can easily notice apps which weight in the range of between 0 to 0.4x10^8 (40MB) the rating varies between 0 to 5.0, which means there is no guarantee the app will recieve high rating. On the other hand we can easily notice, apps which weight above 0.8x10^8 ~(80MB) have a higher probablity to recive high scores from the users (the variance is much low).
subset_df = df1[df1.Size > 40]
groups_temp = subset_df.groupby('Category').filter(lambda x: len(x) >20)
groups_temp['Category'].value_counts().head(n=8)
FAMILY 1746 GAME 1097 TOOLS 733 PRODUCTIVITY 351 MEDICAL 350 COMMUNICATION 328 FINANCE 323 SPORTS 319 Name: Category, dtype: int64
How do app prices impact app rating?
paid_apps = df1[df1.Price>0]
p = sns.jointplot( "Price", "Rating", paid_apps)
Pass the following variables as keyword args: x, y, data. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
Most top rated apps are optimally priced between ~1$ to ~30$. There are only a very few apps priced above 20$.
Suprisingly we would expect that people who have paid over 350$ would be pleased with the app, but we can see crystal clear that consumers have been dissatisfied with their purchase, no one has left a higher rating higher than 4.5.
subset_df = df1[df1.Category.isin(['GAME', 'FAMILY', 'PHOTOGRAPHY', 'MEDICAL', 'TOOLS', 'FINANCE',
'LIFESTYLE','BUSINESS'])]
sns.set_style('darkgrid')
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Price", y="Category", data=subset_df, jitter=True, linewidth=1)
title = ax.set_title('App pricing trend across categories')
I am Shocking to discover...Apps priced above 250$ !!!
subset_df.loc[subset_df['Price']>250,'Category'].value_counts()
FINANCE 6 LIFESTYLE 5 FAMILY 4 Name: Category, dtype: int64
Let's focus on cheap apps and drill inside those apps, which the price tag is less than 100$:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
subset_df_price = subset_df[subset_df.Price<100]
p = sns.stripplot(x="Price", y="Category", data=subset_df_price, jitter=True, linewidth=1)
title = ax.set_title('App pricing trend across categories - after filtering for junk apps')
Clearly, Medical and Family apps are the most expensive. Some medical apps extend even upto 80$.
All other apps are priced under 30$.
Surprisingly, all game apps are reasonably priced below 20$.
#print(df1.head(n=5))
df1.Type.value_counts()
# groups = df1.groupby(['Category', 'Type'])
# for category_type, group in groups:
# print("category_type: ", category_type)
# print("group size: ", group.shape[0])
Free 8715 Paid 645 Name: Type, dtype: int64
# Stacked bar graph for top 5-10 categories - Ratio of paid and free apps
#fig, ax = plt.subplots(figsize=(15,10))
new_df = df1.groupby(['Category', 'Type']).agg({'App' : 'count'}).reset_index()
#print(new_df)
# outer_group_names = df1['Category'].sort_values().value_counts()[:5].index
# outer_group_values = df1['Category'].sort_values().value_counts()[:5].values
outer_group_names = ['GAME', 'FAMILY', 'MEDICAL', 'TOOLS']
outer_group_values = [len(df1.App[df1.Category == category]) for category in outer_group_names]
a, b, c, d=[plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples]
inner_group_names = ['Paid', 'Free'] * 4
inner_group_values = []
#inner_colors = ['#58a27c','#FFD433']
for category in outer_group_names:
for t in ['Paid', 'Free']:
x = new_df[new_df.Category == category]
try:
#print(x.App[x.Type == t].values[0])
inner_group_values.append(int(x.App[x.Type == t].values[0]))
except:
#print(x.App[x.Type == t].values[0])
inner_group_values.append(0)
explode = (0.025,0.025,0.025,0.025)
# First Ring (outside)
fig, ax = plt.subplots(figsize=(10,10))
ax.axis('equal')
mypie, texts, _ = ax.pie(outer_group_values, radius=1.2, labels=outer_group_names, autopct='%1.1f%%', pctdistance=1.1,
labeldistance= 0.75, explode = explode, colors=[a(0.6), b(0.6), c(0.6), d(0.6)], textprops={'fontsize': 16})
plt.setp( mypie, width=0.5, edgecolor='black')
# Second Ring (Inside)
mypie2, _ = ax.pie(inner_group_values, radius=1.2-0.5, labels=inner_group_names, labeldistance= 0.7,
textprops={'fontsize': 12}, colors = [a(0.4), a(0.2), b(0.4), b(0.2), c(0.4), c(0.2), d(0.4), d(0.2)])
plt.setp( mypie2, width=0.5, edgecolor='black')
plt.margins(0,0)
# show it
plt.title("The inner distribution of paid and free apps among Game,Tools,Medical and Family")
plt.tight_layout()
plt.show()
trace0 = go.Box(
y=np.log10(df1['Installs'][df1.Type=='Paid']),
name = 'Paid',
marker = dict(
color = 'rgb(214, 12, 140)',
),
boxpoints='all'
)
trace1 = go.Box(
y=np.log10(df1['Installs'][df1.Type=='Free']),
name = 'Free',
marker = dict(
color = 'rgb(0, 128, 128)',
),
boxpoints='all'
)
layout = go.Layout(
title = "Number of downloads of paid apps Vs free apps",
yaxis= {'title': 'Number of downloads (log-scaled)'}
)
data = [trace0, trace1]
plotly.offline.iplot({'data': data, 'layout': layout})
Paid apps have a relatively lower number of downloads than free apps. However, it is not too bad.There is no significant difference.
corrmat = df1.corr()
#f, ax = plt.subplots()
p =sns.heatmap(corrmat, annot=True, cmap=sns.diverging_palette(220, 20, as_cmap=True))
df_copy = df1.copy()
df_copy = df_copy[df_copy.Reviews > 10]
df_copy = df_copy[df_copy.Installs > 0]
df_copy['Installs'] = np.log10(df1['Installs'])
df_copy['Reviews'] = np.log10(df1['Reviews'])
sns.lmplot("Reviews", "Installs", data=df_copy)
ax = plt.gca()
_ = ax.set_title('Number of Reviews Vs Number of Downloads (Log scaled)')
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
A moderate positive correlation of 0.64 exists between the number of reviews and number of downloads. This means that customers tend to download a given app more if it has been reviewed by a larger number of people.
This also means that many active users who downloaded an app usually also leave back a review or feedback.
So, getting your app reviewed by more people maybe a good idea to increase your app's capture in the market!
Step of preparing the data after graphs
I prepared these values in the table after the graphs, because there were graphs that were not displayed correctly due to the change in values
#Converting Type classification into binary
def type_cat(types):
if types == 'Free':
return 0
else:
return 1
df1['Type'] = df1['Type'].map(type_cat)
Converting of the content rating section into integers. In this specific instance, given that the concent rating is somewhat relatable and has an order to it, we do not use one-hot encoding.
#Cleaning of content rating classification
RatingL = df1['Content Rating'].unique()
RatingDict = {}
for i in range(len(RatingL)):
RatingDict[RatingL[i]] = i
df1['Content Rating'] = df1['Content Rating'].map(RatingDict).astype(int)
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9360 entries, 0 to 10840 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 9360 non-null object 1 Category 9360 non-null object 2 Rating 9360 non-null float64 3 Reviews 9360 non-null int64 4 Size 9360 non-null float64 5 Installs 9360 non-null int64 6 Type 9360 non-null int64 7 Price 9360 non-null float64 8 Content Rating 9360 non-null int64 9 Genres 9360 non-null object 10 Last Updated 9360 non-null int64 11 Current Ver 9360 non-null object 12 Android Ver 9360 non-null object 13 Category_c 9360 non-null int64 14 Genres_c 9360 non-null int64 dtypes: float64(3), int64(7), object(5) memory usage: 1.4+ MB
I dropped these portions of information as i deemed it unecessary for our machine learning algorithm
Here we are building a linear regression model based on the the following columns: 'Reviews', 'Size', 'Installs', 'Price', 'Content Rating', 'Last Updated', 'Category_c', 'Category_numeric'
for predicting the Rating vector.
#dropping of unrelated and unnecessary items
df1.drop(labels = ['Last Updated','Current Ver','Android Ver','App'], axis = 1, inplace = True)
df1.head()
Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Category_c | Genres_c | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ART_AND_DESIGN | 4.1 | 159 | 19000000.0 | 10000 | 0 | 0.0 | 0 | Art & Design | 0 | 0 |
1 | ART_AND_DESIGN | 3.9 | 967 | 14000000.0 | 500000 | 0 | 0.0 | 0 | Art & Design;Pretend Play | 0 | 1 |
2 | ART_AND_DESIGN | 4.7 | 87510 | 8700000.0 | 5000000 | 0 | 0.0 | 0 | Art & Design | 0 | 0 |
3 | ART_AND_DESIGN | 4.5 | 215644 | 25000000.0 | 50000000 | 0 | 0.0 | 1 | Art & Design | 0 | 0 |
4 | ART_AND_DESIGN | 4.3 | 967 | 2800000.0 | 100000 | 0 | 0.0 | 0 | Art & Design;Creativity | 0 | 2 |
# for dummy variable encoding for Categories
df2 = pd.get_dummies(df1, columns=['Category'])
df2.head()
Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Category_c | Genres_c | ... | Category_PERSONALIZATION | Category_PHOTOGRAPHY | Category_PRODUCTIVITY | Category_SHOPPING | Category_SOCIAL | Category_SPORTS | Category_TOOLS | Category_TRAVEL_AND_LOCAL | Category_VIDEO_PLAYERS | Category_WEATHER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.1 | 159 | 19000000.0 | 10000 | 0 | 0.0 | 0 | Art & Design | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 3.9 | 967 | 14000000.0 | 500000 | 0 | 0.0 | 0 | Art & Design;Pretend Play | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 4.7 | 87510 | 8700000.0 | 5000000 | 0 | 0.0 | 0 | Art & Design | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 4.5 | 215644 | 25000000.0 | 50000000 | 0 | 0.0 | 1 | Art & Design | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4.3 | 967 | 2800000.0 | 100000 | 0 | 0.0 | 0 | Art & Design;Creativity | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 43 columns
Dimensionality reduction is way to reduce the complexity of a model and avoid overfitting. we will discuss the Principal Component Analysis (PCA) algorithm used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information.
from sklearn.decomposition import PCA
X = df1.drop(labels = ['Category','Rating','Genres'],axis = 1)
y = df1.Rating
print("The number of columns (features) before Dimension Reduction is: ", X.shape[1])
pca = PCA().fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
print(pca.explained_variance_ratio_)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
The number of columns (features) before Dimension Reduction is: 8 [9.46602338e-01 5.27434959e-02 6.54166228e-04 1.35113768e-13 2.83906071e-14 1.68140865e-15 6.74579367e-17 6.86686639e-18]
We have applied the PCA Algorithm, we have refered to the the field "explained_variance_ratio_", which presents the Percentage of variance explained by each of the selected components. As we can see most of the information can be spot in a single principal component. We can see there is no big difference between 3 components and above.
Now we would like to explore PCA with only 3 principal component, and see how the data is scattered in 3 dimension.
# Dimension Reduction:
pca = PCA(n_components=3)
x_res = pca.fit_transform(X)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.set_title('Data representation with 3 components', fontsize=10)
ax.scatter(x_res[:, 0], x_res[:, 1], x_res[:, 2],cmap=plt.cm.nipy_spectral,marker='o',edgecolor="k")
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.subplots_adjust(right=1.3)
print(f"Before PCA we had {X.shape[1]} features after applying PCA we got only 3 Principal components" )
Before PCA we had 8 features after applying PCA we got only 3 Principal components
After our final checks for the preprocessing of our data, looks like we can start work! So the next question is what exactly are we doing and how are we doing it.
So the goal of this instance is to see if we can use existing data provided(e.g. File size, no of reviews) to predict the ratings of the google applications. In other words, our dependent variable Y, would be the rating of the apps.
One important factor to note is that the dependent variable Y, is a continuous variable(aka infinite no of combinations), as compared to a discrete variable. Naturally there are ways to convert our Y to a discrete variable but I decided to keep Y as a continuous variable for the purposes of this machine learning session.
Next question, what models should we apply and how should we evaluate them?
Model wise, I'm not too sure as well as there are like a ton of models out there that can be used for machine learning. Hence, I basically just chose the 3 most common models that I use, being linear regression, SVM, and random forest regressor.
we consider one-hot vs integer encoded results for the category section, as well as including/excluding the genre section. eventually we have trained 3 different models.
We then evaluate the models by comparing the predicted results against the actual results graphically, as well as use the mean squared error, mean absolute error and mean squared log error as possible benchmarks. Moreover The MSLE is usually used when we don't want to penalize the large differences in the predicted and the actual values when the predicted and the actual values are big numbers, in our case the numbers are quite small, but of curiosity we decided to explore this metric too.
The use of the error term will be evaluated right at the end after running through all the models.
let's use 3 different regression models with two different techniques on treating the categorical variable
So before we start, the following is code to obtain the error terms for the various models, for comparability.
#for evaluation of error term and
def Evaluation_metrics(y_true, y_predict):
print ('Mean Squared Error: '+ str(metrics.mean_squared_error(y_true,y_predict)))
print ('Mean absolute Error: '+ str(metrics.mean_absolute_error(y_true,y_predict)))
print ('Mean squared Log Error: '+ str(metrics.mean_squared_log_error(y_true,y_predict)))
#to add into results_index for evaluation of error term
def Evaluationmatrix_dict(y_true, y_predict, name = 'Linear - Integer'):
dict_matrix = {}
dict_matrix['Series Name'] = name
dict_matrix['Mean Squared Error'] = metrics.mean_squared_error(y_true,y_predict)
dict_matrix['Mean Absolute Error'] = metrics.mean_absolute_error(y_true,y_predict)
dict_matrix['Mean Squared Log Error'] = metrics.mean_squared_log_error(y_true,y_predict)
return dict_matrix
We start off by looking at linear regression model (without the genre label)
#excluding Genre label
#Integer encoding
X = df1.drop(labels = ['Category','Rating','Genres','Genres_c'],axis = 1)
y = df1.Rating
# print(X.columns)
# print(df1.shape)
# print(X.shape)
y = pd.DataFrame(y)
# print(y.columns)
# print(y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
lr_model = LinearRegression()
lr_model.fit(X_train,y_train)
results = lr_model.predict(X_test)
print("The linear regression model without including the 'Category' column:")
Evaluation_metrics(y_test, results)
print('*'*100)
#dummy encoding
X_d = df2.drop(labels = ['Rating','Genres','Category_c','Genres_c'],axis = 1)
y_d = df2.Rating
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.30)
lr_model2 = LinearRegression()
lr_model2.fit(X_train_d,y_train_d)
results_d = lr_model2.predict(X_test_d)
print("The linear regression model with including the 'Category' column:")
Evaluation_metrics(y_test, results_d)
The linear regression model without including the 'Category' column: Mean Squared Error: 0.26034174488430706 Mean absolute Error: 0.35950301525190353 Mean squared Log Error: 0.012390708425322507 **************************************************************************************************** The linear regression model with including the 'Category' column: Mean Squared Error: 0.27395312892305324 Mean absolute Error: 0.37451271395173386 Mean squared Log Error: 0.012917690037432377
We can easily conclude from the metrics we got in the two models:
The first model is more accurate than the second model, we see it's quite obvious by the results in all three metrics which the error of the first model is lower. therefore we can easily conclude the Column 'Category' is not relevant for the prediction of Rating.
plt.figure(figsize=(12,7))
sns.regplot(results,y_test,color='teal', label = 'Integer', marker = 'x')
sns.regplot(results_d,y_test_d,color='orange',label = 'Dummy')
plt.legend()
plt.title('Linear model - Excluding Genres')
plt.xlabel('Predicted Ratings')
plt.ylabel('Actual Ratings')
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
print ('Actual mean of population:' + str(y.mean()))
print ('Integer encoding(mean) :' + str(results.mean()))
print ('Dummy encoding(mean) :'+ str(results_d.mean()))
print ('Integer encoding(std) :' + str(results.std()))
print ('Dummy encoding(std) :'+ str(results_d.std()))
Actual mean of population:Rating 4.191838 dtype: float64 Integer encoding(mean) :4.191422875385875 Dummy encoding(mean) :4.1962304579996985 Integer encoding(std) :0.05645292177229634 Dummy encoding(std) :0.10752740037768857
At first glance, it's hard to really see which model(dummy vs one-hot) is better in terms of predictive accuracy. What is striking however is the that at first glance, the dummy model seems favors the outcome of a lower rating compared to the integer model.
Although if we look at the actual mean of the predictive results, both are approximately the same, however the dummy encoded results have a much larger standard deviation as compared to the integer encoded model.
Next is looking at the linear model including the genre label as a numeric value.
explainer = shap.explainers.Linear(lr_model, X)
shap_values = explainer(X)
# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])
It can be seen that category_c column have the greatest impact!!!
#Including genre label
#Integer encoding
X = df1.drop(labels = ['Category','Rating','Genres'],axis = 1)
y = df1.Rating
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
lr_model3 = LinearRegression()
lr_model3.fit(X_train,y_train)
Results = lr_model3.predict(X_test)
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results, name = 'Linear(inc Genre) - Integer'),ignore_index = True)
#dummy encoding
X_d = df2.drop(labels = ['Rating','Genres','Category_c'],axis = 1)
y_d = df2.Rating
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.30)
lr_model4 = LinearRegression()
lr_model4.fit(X_train_d,y_train_d)
Results_d = lr_model4.predict(X_test_d)
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test_d,Results_d, name = 'Linear(inc Genre) - Dummy'),ignore_index = True)
plt.figure(figsize=(12,7))
sns.regplot(Results,y_test,color='teal', label = 'Integer', marker = 'x')
sns.regplot(Results_d,y_test_d,color='orange',label = 'Dummy')
plt.legend()
plt.title('Linear model - Including Genres')
plt.xlabel('Predicted Ratings')
plt.ylabel('Actual Ratings')
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
print ('Integer encoding(mean) :' + str(Results.mean()))
print ('Dummy encoding(mean) :'+ str(Results_d.mean()))
print ('Integer encoding(std) :' + str(Results.std()))
print ('Dummy encoding(std) :'+ str(Results_d.std()))
Integer encoding(mean) :4.191776448197456 Dummy encoding(mean) :4.197336191442146 Integer encoding(std) :0.053148569036122514 Dummy encoding(std) :0.0990839083109886
When including the genre data, we see a slight difference in the mean between the integer and dummy encoded linear models. The dummy encoded model's std is still higher than the integer encoded model.
What's striking to me personally is that the dummy encoded regression line in the scatterplot is now flatter than the integer encoded regression line, which might suggest a "worse" outcome, given that usually you would want your regression's beta value to be closer to 1 than to 0.
explainer = shap.explainers.Linear(lr_model3, X)
shap_values = explainer(X)
# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])
Here,Too, It can be seen that category_c column have the greatest impact!!!
Next up is the SVM model.
#Excluding genres
from sklearn import svm
#Integer encoding
X1 = df1.drop(labels = ['Category','Rating','Genres','Genres_c'],axis = 1)
print(X1.columns)
y = df1.Rating
X_train, X_test, y_train, y_test = train_test_split(X1.values, y.values, test_size=0.30)
svm_model1 = svm.SVR()
svm_model1.fit(X_train,y_train)
Results2 = svm_model1.predict(X_test)
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results2, name = 'SVM - Integer'),ignore_index = True)
#dummy based
X_d = df2.drop(labels = ['Rating','Genres','Category_c','Genres_c',],axis = 1)
y_d = df2.Rating
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.30)
svm_model2 = svm.SVR()
svm_model2.fit(X_train_d,y_train_d)
Results2_d = svm_model2.predict(X_test_d)
Evaluation_metrics(y_test_d, Results2_d)
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test_d,Results2_d, name = 'SVM - Dummy'),ignore_index = True)
Index(['Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Category_c'], dtype='object') Mean Squared Error: 0.27220199920482613 Mean absolute Error: 0.3479069223229053 Mean squared Log Error: 0.013183825089110602
plt.figure(figsize=(12,7))
sns.regplot(Results2,y_test,color='teal', label = 'Integer', marker = 'x')
sns.regplot(Results2_d,y_test_d,color='orange',label = 'Dummy')
plt.legend()
plt.title('SVM model - excluding Genres')
plt.xlabel('Predicted Ratings')
plt.ylabel('Actual Ratings')
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
print ('Integer encoding(mean) :' + str(Results2.mean()))
print ('Dummy encoding(mean) :'+ str(Results2_d.mean()))
print ('Integer encoding(std) :' + str(Results2.std()))
print ('Dummy encoding(std) :'+ str(Results2_d.std()))
Integer encoding(mean) :4.283442794770366 Dummy encoding(mean) :4.282999831225259 Integer encoding(std) :0.054188497106618445 Dummy encoding(std) :0.05285191693223024
The results are quite interesting. Overall the model predicted quite a bit of ratings to be approximately at 4.2, even though the actual ratings were not. Looking at the scatterplot, the integer encoded model seems to have performed better in this instance.
As usual, the dummy encoded model has a higher std than the integer encoded model.
#Integer encoding, including Genres_c
svm_model3 = svm.SVR()
X = df1.drop(labels = ['Category','Rating','Genres'],axis = 1)
y = df1.Rating
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
svm_model3.fit(X_train,y_train)
Results2a = svm_model3.predict(X_test)
#evaluation
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results2a, name = 'SVM(inc Genres) - Integer'),ignore_index = True)
#dummy encoding, including Genres_c
svm_model4 = svm.SVR()
X_d = df2.drop(labels = ['Rating','Genres','Category_c'],axis = 1)
y_d = df2.Rating
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.30)
svm_model4.fit(X_train_d,y_train_d)
Results2a_d = svm_model4.predict(X_test_d)
#evaluation
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test_d,Results2a_d, name = 'SVM(inc Genres) - Dummy'),ignore_index = True)
plt.figure(figsize=(12,7))
sns.regplot(Results2a,y_test,color='teal', label = 'Integer', marker = 'x')
sns.regplot(Results2a_d,y_test_d,color='orange',label = 'Dummy')
plt.legend()
plt.title('SVM model - including Genres')
plt.xlabel('Predicted Ratings')
plt.ylabel('Actual Ratings')
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
print ('Integer encoding(mean) :' + str(Results2a.mean()))
print ('Dummy encoding(mean) :'+ str(Results2a_d.mean()))
print ('Integer encoding(std) :' + str(Results2a.std()))
print ('Dummy encoding(std) :'+ str(Results2a_d.std()))
Integer encoding(mean) :4.279173831239593 Dummy encoding(mean) :4.282408756288392 Integer encoding(std) :0.05522668369882636 Dummy encoding(std) :0.054516967736945234
With the inclusion of the genre variable, the dummy encoding model now seems to be performing better, as we see the regression line comparing the actual vs the predicted results to be very similar to that of the integer encoded model.
Furthermore the std of the dummy encoded model has fallen significantly, and now has a higher mean compared to the integer encoded model.
Next up is the random forest regressor model. Honestly this is my favorite model as not only is it fast, it also allows you to see what independent variables significantly affect the outcome of the model.
from sklearn.ensemble import RandomForestRegressor
#Integer encoding
X = df1.drop(labels = ['Category','Rating','Genres','Genres_c'],axis = 1)
y = df1.Rating
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
rfr_model1 = RandomForestRegressor(max_depth=10)
rfr_model1.fit(X_train,y_train)
Results3 = rfr_model1.predict(X_test)
#evaluation
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results3, name = 'RFR - Integer'),ignore_index = True)
#dummy encoding
X_d = df2.drop(labels = ['Rating','Genres','Category_c','Genres_c'],axis = 1)
y_d = df2.Rating
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.30)
rfr_model2 = RandomForestRegressor()
rfr_model2.fit(X_train_d,y_train_d)
Results3_d = rfr_model2.predict(X_test_d)
#evaluation
#esultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results3_d, name = 'RFR - Dummy'),ignore_index = True)
plt.figure(figsize=(12,7))
sns.regplot(Results3,y_test,color='teal', label = 'Integer', marker = 'x')
sns.regplot(Results3_d,y_test_d,color='orange',label = 'Dummy')
plt.legend()
plt.title('RFR model - excluding Genres')
plt.xlabel('Predicted Ratings')
plt.ylabel('Actual Ratings')
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
print ('Integer encoding(mean) :' + str(Results3.mean()))
print ('Dummy encoding(mean) :'+ str(Results3_d.mean()))
print ('Integer encoding(std) :' + str(Results3.std()))
print ('Dummy encoding(std) :'+ str(Results3_d.std()))
Integer encoding(mean) :4.189496403088552 Dummy encoding(mean) :4.197710131766382 Integer encoding(std) :0.22657328634390755 Dummy encoding(std) :0.29170396286221345
At first glance, I would say that the RFR model produced the best predictive results, just looking at the scatter graph plotted. Overall both models, the integer and the dummy encoded models seem to perform relatively similar, although the dummy encoded model has a higher overall predicted mean.
#for integer
plt.figure(figsize=(12,7))
Feat_impt = {}
for col,feat in zip(X.columns,rfr_model2.feature_importances_):
Feat_impt[col] = feat
Feat_impt_df = pd.DataFrame.from_dict(Feat_impt,orient = 'index')
Feat_impt_df.sort_values(by = 0, inplace = True)
Feat_impt_df.rename(index = str, columns = {0:'Pct'},inplace = True)
plt.figure(figsize= (14,10))
Feat_impt_df.plot(kind = 'barh',figsize= (14,10),legend = False)
plt.show()
<Figure size 864x504 with 0 Axes>
<Figure size 1008x720 with 0 Axes>
If we look at what influences the ratings, the top 4 being reviews, size, category, and number of installs seem to have the highest influence. This is quite an interesting observation, while also rationalizable.
#for dummy
Feat_impt_d = {}
for col,feat in zip(X_d.columns,rfr_model1.feature_importances_):
Feat_impt_d[col] = feat
Feat_impt_df_d = pd.DataFrame.from_dict(Feat_impt_d,orient = 'index')
Feat_impt_df_d.sort_values(by = 0, inplace = True)
Feat_impt_df_d.rename(index = str, columns = {0:'Pct'},inplace = True)
plt.figure(figsize= (14,10))
Feat_impt_df_d.plot(kind = 'barh',figsize= (14,10),legend = False)
plt.show()
<Figure size 1008x720 with 0 Axes>
Looking at the breakdown even further, it would seem that indeed Reviews, size and number of install remain as a significant contributer to the predictiveness of app ratings. What's interesting to me is that how the Tools category of apps have such a high level of predictiveness in terms of ratings, as say compared to the Food and Drink category.
#Including Genres_C
#Integer encoding
X = df1.drop(labels = ['Category','Rating','Genres'],axis = 1)
y = df1.Rating
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
rfr_model3 = RandomForestRegressor()
rfr_model3.fit(X_train,y_train)
Results3a = rfr_model3.predict(X_test)
#evaluation
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results3a, name = 'RFR(inc Genres) - Integer'),ignore_index = True)
#dummy encoding
X_d = df2.drop(labels = ['Rating','Genres','Category_c'],axis = 1)
y_d = df2.Rating
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.30)
rfr_model4 = RandomForestRegressor()
rfr_model4.fit(X_train_d,y_train_d)
Results3a_d = rfr_model4.predict(X_test_d)
#evaluation
#resultsdf = resultsdf.append(Evaluationmatrix_dict(y_test,Results3a_d, name = 'RFR(inc Genres) - Dummy'),ignore_index = True)
plt.figure(figsize=(12,7))
sns.regplot(Results3a,y_test,color='teal', label = 'Integer', marker = 'x')
sns.regplot(Results3a_d,y_test_d,color='orange',label = 'Dummy')
plt.legend()
plt.title('RFR model - including Genres')
plt.xlabel('Predicted Ratings')
plt.ylabel('Actual Ratings')
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
print ('Integer encoding(mean) :' + str(Results3.mean()))
print ('Dummy encoding(mean) :'+ str(Results3_d.mean()))
print ('Integer encoding(std) :' + str(Results3.std()))
print ('Dummy encoding(std) :'+ str(Results3_d.std()))
Integer encoding(mean) :4.189496403088552 Dummy encoding(mean) :4.197710131766382 Integer encoding(std) :0.22657328634390755 Dummy encoding(std) :0.29170396286221345
Again with the inclusion of the genre variable, the results do not seem to defer significantly as compared to the previous results.
#for integer
Feat_impt = {}
for col,feat in zip(X.columns,rfr_model3.feature_importances_):
Feat_impt[col] = feat
Feat_impt_df = pd.DataFrame.from_dict(Feat_impt,orient = 'index')
Feat_impt_df.sort_values(by = 0, inplace = True)
Feat_impt_df.rename(index = str, columns = {0:'Pct'},inplace = True)
plt.figure(figsize= (14,10))
Feat_impt_df.plot(kind = 'barh',figsize= (14,10),legend = False)
plt.show()
<Figure size 1008x720 with 0 Axes>
From the results, it would seem that the genre section actually plays an important part in the decision tree making. Yet the exclusion of it dosent seem to significantly impact results. This to me is quite interesting.
#for dummy
Feat_impt_d = {}
for col,feat in zip(X_d.columns,rfr_model4.feature_importances_):
Feat_impt_d[col] = feat
Feat_impt_df_d = pd.DataFrame.from_dict(Feat_impt_d,orient = 'index')
Feat_impt_df_d.sort_values(by = 0, inplace = True)
Feat_impt_df_d.rename(index = str, columns = {0:'Pct'},inplace = True)
plt.figure(figsize= (14,10))
Feat_impt_df_d.plot(kind = 'barh',figsize= (14,10),legend = False)
plt.show()
<Figure size 1008x720 with 0 Axes>
Finally, looking at the results, it is not easy to conclude which model has the best predictive accuracy and lowest error term. Using this round of data as a basis, the dummy encoded SVM model including genres has the lowest overall error rates, followed by the integer encoded RFR model including genes. Yet, all models seem to be very close in terms of it's error term, so this result is likely to change.
What is very surprising to me is how the RFR dummy model has such a significantly more error term compared to all the other models, even though on the surface it seemed to perform very similarly to the RFR integer model.
The Android market is 12 years old, it got huge success since many developers have been designing, developing apps.
According to an article which was published 3 months ago: "Mobile Apps Market Size To Grow By USD 653.91 billion" Therefore a lot of capital is going through those domains of mobile app developemnt. Some well known Israeli successful mobile apps are: Joytunes, and Lightricks.
As we have noticed most of the apps are free, in this project we have delve into the secret sauce of the rating metric. we explored in depth the features for eatch app, and tried to predict a numeric value of the rating. We have built many types of regressor models (Linear Regressor, SVM, Random Forest Regressor), some of them where better than the rest.
In conclusion, we were not able to answer all the questions we asked, but we have important conclusions for example: the higher the rating of the app, the more it was downloaded from the store, The developer concludes that he must, that the user of the app will have great pleasure, and the developer will have to invest in the user's enjoyment ....