TOPIC:
Identifying the profile of a profitable app for the Android and iOS markets.
ABSTRACT:
As a company that builds free apps, our main source of revenue is in-app advertising. Profit therefore depends mostly on the number of active users. Our goal is to analyze the data and tell our developers which kinds of apps are likely to attract more active users.
DATA:
We will analyze two data sets that seem suitable for our purpose:
A data set containing data about approximately ten thousand Android apps from Google Play.
A data set containing data about approximately seven thousand iOS apps from the App Store.
#Load libraries
import csv
from csv import reader
import os
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Load data
android=pd.read_csv('googleplaystore.csv')
android.head(2)
android['Category'].unique()
#Genres gives a more detailed description than Category, so we will use it in the analysis
android['Genres'].unique()[:10]
DROP THE DUPLICATES, KEEP THE ONE WITH THE MOST REVIEWS
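#Reviews may be read as text, so sort on its numeric value rather than alphabetically
clean_android=android.sort_values('Reviews',ascending=False,key=lambda s: pd.to_numeric(s,errors='coerce')).drop_duplicates('App',keep='first')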
clean_android.reset_index(drop=True,inplace=True)
clean_android.head(2)
#count the duplicated rows that were removed
dupli_lines=len(android)-len(clean_android)
print('Duplicated lines:',dupli_lines)
print('Original number of rows:',len(android))
print('No duplicates, number of rows:',len(clean_android))
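The deduplication step only keeps the most-reviewed copy of an app if the review counts are compared as numbers. The following is a minimal sketch on a toy frame, assuming `Reviews` arrives as strings; the app names and values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): the same app appears twice with different review counts.
toy = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Reviews": ["9", "120", "35"],  # strings, as a CSV column with mixed values would be read
})

# Convert to numbers first: as strings, "9" sorts above "120"
# because strings are compared character by character.
toy["Reviews"] = pd.to_numeric(toy["Reviews"], errors="coerce")

# For each app, keep the row with the most reviews.
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates("App", keep="first")
       .reset_index(drop=True)
)
print(deduped)
```

Without the conversion, the `Alpha` row with 9 reviews would be kept instead of the one with 120.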
Find the NaN values. We have 1463 rows with no rating.
clean_android.isnull().sum()
#rows where Rating is NaN
clean_android[clean_android['Rating'].isnull()].head(2)
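#check the installs (a proxy for users) of the rows with NaN in Rating
na_rows=clean_android[clean_android['Rating'].isnull()]
#Installs is still text here (with '+' and ','), so sort on its numeric value
na_rows.sort_values('Installs',ascending=False,key=lambda s: pd.to_numeric(s.str.replace('[+,]','',regex=True),errors='coerce')).head()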
- Rows with a NaN rating have very few installs compared with the top 20 most-installed apps, and very few reviews. They add little to our analysis, so we can drop them.
- Row 4484 also has invalid rating and install values, so we will drop it too. This row is already included in the list of NaN rows we will drop.
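#Let's compare with the most-installed apps (Installs is still text, so sort on its numeric value)
clean_android.sort_values('Installs',ascending=False,key=lambda s: pd.to_numeric(s.str.replace('[+,]','',regex=True),errors='coerce')).head(3)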
#Find the rows with NaN
na_rows=clean_android.isnull().sum(axis=1)
rows_to_drop=na_rows[na_rows!=0].index
rows_to_drop
#DROP NaN rows
clean_android.drop(rows_to_drop,inplace=True)
clean_android.reset_index(drop=True,inplace=True)
#confirm that no NaN is left
clean_android.isnull().sum()
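The same "count, drop, confirm" sequence can be written more compactly with `dropna`. This is a minimal sketch on a made-up frame, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one app has no rating.
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Rating": [4.5, np.nan, 4.1],
    "Installs": ["1,000+", "10+", "500+"],
})

# Count missing values per column before dropping anything.
missing_before = toy.isnull().sum()

# Drop every row that has at least one NaN, then renumber the rows.
cleaned = toy.dropna().reset_index(drop=True)

# Confirm that nothing is missing any more.
missing_after = cleaned.isnull().sum()
print(missing_before, missing_after, sep="\n\n")
```

`dropna()` with no arguments removes a row if any column is missing, which matches the row-wise NaN count used above.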
Let's look at how often each genre appears in the Android market.
top10_frequent_app=clean_android['Genres'].value_counts().head(10)
top10_frequent_app
fig,ax=plt.subplots(figsize=(16,5))
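#pass x and y as keywords (recent seaborn versions reject positional data) and draw on the axes created above
sns.barplot(x=top10_frequent_app.index,y=top10_frequent_app.values,ax=ax)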
print(type(clean_android['Rating'][0]))
print(type(clean_android['Reviews'][0]))
print(type(clean_android['Installs'][0]))
Convert the Reviews and Installs columns to numbers so that we can check their correlation and find the top 10 most installed, reviewed and rated categories.
clean_android['Reviews']=[int(i) for i in clean_android['Reviews']]
#installs is a string
print(type(clean_android['Installs'][0]))
clean_android['Installs'].unique()
#strip '+' and ',' from the install counts and convert them to float
to_float=[]
for i in clean_android['Installs']:
    s=i.replace('+','').replace(',','')
    to_float.append(float(s))
clean_android['Installs']=to_float
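The same cleaning can be done without an explicit loop by using pandas string methods. This is a minimal sketch on made-up sample values; it converts to integers rather than floats, which is a choice of this example.

```python
import pandas as pd

# Made-up sample values in the same "1,000+" style as the Installs column.
installs = pd.Series(["1,000+", "500,000,000+", "10+"])

# Remove every '+' and ',' with one regular expression, then convert to integers.
installs_clean = installs.str.replace(r"[+,]", "", regex=True).astype("int64")

print(installs_clean.tolist())
```

If the column may contain values that are not numbers at all, `pd.to_numeric(..., errors="coerce")` turns them into NaN instead of raising an error.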
#top_10 of the most installed apps
top_10_installs=clean_android.sort_values('Installs',ascending=False).head(10)
top_10_installs
The most reviewed apps are not the top rated ones. However, since we have no information about login and logout times, reviews are our best sign of sustained interaction: a user who reviews an app has spent time exploring and evaluating it. Advertisements are therefore more likely to be seen by these users.
top_10_reviewed=clean_android.sort_values('Reviews',ascending=False).head(10)
top_10_reviewed.head()
top10_reviewed_and_installed=clean_android.sort_values(['Installs','Reviews'],ascending=[False,False]).head(10)
top10_reviewed_and_installed.head(10)
We can see below that most of the most-installed apps are rated between 4.0 and 4.5 and also have the highest review counts.
#Check the relations between the columns Rating, Reviews and Installs
scatter_matrix(clean_android[['Rating','Reviews','Installs']],figsize=(16,6))
plt.show()
clean_android.plot.scatter('Rating','Reviews')
We observe two clusters among the most-installed apps.
clean_android.plot.scatter('Rating','Installs')
clean_android.plot.scatter('Installs','Reviews')
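The scatter plots show the relations visually; a rank correlation gives a numeric summary that is less sensitive to the few very large install and review counts. This is a minimal sketch on made-up numbers. Correlating the ranks of each column is the definition of the Spearman coefficient, so it is computed here as `rank()` followed by the default Pearson `corr()`.

```python
import pandas as pd

# Toy data (hypothetical values), with the same three columns as the plots above.
toy = pd.DataFrame({
    "Rating": [4.5, 4.1, 3.9, 4.4],
    "Reviews": [120, 90000, 15, 4000],
    "Installs": [1e4, 1e8, 1e3, 1e6],
})

# Spearman correlation = Pearson correlation of the ranks.
spearman = toy.rank().corr()
print(spearman.round(2))
```

In this toy frame Reviews and Installs rise together perfectly, while Rating is only weakly related to either. The real data has to be checked the same way before drawing conclusions.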
Let's have a look at the iOS apps.
ios=pd.read_csv('AppleStore.csv')
ios.head(2)
Here we have no install or review counts, so we cannot know the total number of users. We do have the total number of ratings, which tells us how many users interacted with each app. Based on that, we will try to see which categories are the most popular here.
#top 10 apps by total number of ratings
ios_top10_rated=ios.sort_values('rating_count_tot',ascending=False).head(10)
ios_top10_rated.head()
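Ranking individual apps shows the leaders, but popularity by category needs an aggregation. This is a minimal sketch on made-up rows. It assumes the iOS data has a genre column named `prime_genre` and an app-name column named `track_name`; neither name appears in the notebook above, so check them against `ios.columns` before use.

```python
import pandas as pd

# Toy data (hypothetical apps and counts); column names other than
# rating_count_tot are assumptions about the iOS data set.
toy_ios = pd.DataFrame({
    "track_name": ["App A", "App B", "App C", "App D", "App E"],
    "prime_genre": ["Social", "Games", "Social", "Games", "Music"],
    "rating_count_tot": [3000000, 2100000, 2000000, 900000, 1100000],
})

# Total, typical (median) and number of apps per genre, most-rated genre first.
by_genre = (
    toy_ios.groupby("prime_genre")["rating_count_tot"]
           .agg(["sum", "median", "count"])
           .sort_values("sum", ascending=False)
)
print(by_genre)
```

The median is included because a single very popular app can dominate a genre's total.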
CONCLUSION:
We saw here too that the top positions are dominated by social apps, with Instagram combining social and photography. iOS has more games in its top 10, but the leading apps are still in the social and communication categories. One idea for a new app is therefore an app that teaches users how to edit video and photos, and lets them publish their work on a social app (Facebook, Instagram) as part of a competition. Part of the score would come from criteria built into the app, and part from other users of the app or of the connected social app (for example Facebook). Combined with a prize provided by commercial companies, this could attract the interest of many customers and therefore be profitable for the company through advertisements.