TOPIC:
Identifying the profile of a profitable app for the Android and iOS markets.

ABSTRACT:
As a company that builds free apps, our main source of revenue is in-app advertising. Profit therefore depends mostly on the number of active users. Our goal is to analyze the data and tell our developers what kinds of apps are likely to attract more active users.

DATA:
We will analyze two data sets that seem suitable for our purpose: a data set with approximately ten thousand Android apps from Google Play, and a data set with approximately seven thousand iOS apps from the App Store.

In [1]:
#Load libraries
import csv
from csv import reader
import os
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [2]:
#Load data
android=pd.read_csv('googleplaystore.csv')
android.head(2)
Out[2]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
In [3]:
android['Category'].unique()
Out[3]:
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       '1.9'], dtype=object)
In [4]:
#Genres gives a more detailed description than Category, so we will use it in the analysis
android['Genres'].unique()[:10]
Out[4]:
array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity'], dtype=object)
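
Some Genres values combine several labels separated by ';' (for example 'Art & Design;Pretend Play'). The following is a minimal sketch of how such values could be split so that each label is counted on its own. The toy rows and the names `exploded` and `genre_counts` are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd

# Toy rows mimicking the 'Genres' column; this is not the real data set.
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Genres': ['Art & Design', 'Art & Design;Pretend Play', 'Comics;Creativity'],
})

# Split on ';' and give every label its own row.
exploded = toy.assign(Genre=toy['Genres'].str.split(';')).explode('Genre')

# Count how often each single label occurs.
genre_counts = exploded['Genre'].value_counts()
print(genre_counts)
```

Whether combined labels should be split or kept as they are depends on how fine-grained the genre comparison needs to be.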

DROP THE DUPLICATES, KEEPING THE ENTRY WITH THE MOST REVIEWS

In [5]:
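#Note: 'Reviews' is still a string column here, so this sort is lexicographic (e.g. '9992' sorts before '999'), not numeric
clean_android=android.sort_values('Reviews',ascending=False).drop_duplicates('App',keep='first')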
clean_android.reset_index(drop=True,inplace=True)
clean_android.head(2)
Out[5]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 GollerCepte Live Score SPORTS 4.2 9992 31M 1,000,000+ Free 0 Everyone Sports May 23, 2018 6.5 4.1 and up
1 Ad Block REMOVER - NEED ROOT TOOLS 3.3 999 91k 100,000+ Free 0 Everyone Tools December 17, 2013 3.2 2.2 and up
In [6]:
#Confirm the number of duplicated rows
dupli_lines=len(android)-len(clean_android)
print('Duplicated lines:',dupli_lines)
print('Original number of rows:',len(android))
print('No duplicates, number of rows:',len(clean_android))
Duplicated lines: 1181
Original number of rows: 10841
No duplicates, number of rows: 9660
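
Because 'Reviews' is read as text, sorting it before the conversion to numbers compares strings character by character. The sketch below uses made-up rows to show how the kept duplicate can differ between a string sort and a numeric sort. The helper column `Reviews_num` is a hypothetical name, and `pd.to_numeric(..., errors='coerce')` is one possible way to handle values that cannot be parsed.

```python
import pandas as pd

# Toy data: 'Reviews' stored as strings, as in the raw CSV.
toy = pd.DataFrame({
    'App': ['X', 'X', 'Y'],
    'Reviews': ['999', '10000', '5'],
})

# String sort: '999' > '10000' lexicographically, so the smaller count is kept.
by_string = toy.sort_values('Reviews', ascending=False).drop_duplicates('App', keep='first')

# Numeric sort: convert first (unparseable values become NaN), then sort.
toy['Reviews_num'] = pd.to_numeric(toy['Reviews'], errors='coerce')
by_number = toy.sort_values('Reviews_num', ascending=False).drop_duplicates('App', keep='first')

print(by_string[['App', 'Reviews']])
print(by_number[['App', 'Reviews']])
```

The number of rows removed is the same either way; only the choice of which duplicate survives changes.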

Find NaN values. We have 1463 rows with no rating.

In [7]:
clean_android.isnull().sum()
Out[7]:
App                  0
Category             0
Rating            1463
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64
In [8]:
#The NaN rows of Rating
clean_android[clean_android['Rating'].isnull()].head(2)
Out[8]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
74 Voice Tables - no internet PARENTING NaN 970 71M 100,000+ Free 0 Everyone Parenting May 28, 2018 2.0 4.0.3 and up
113 Gold Quote - Gold.fr FINANCE NaN 96 1.5M 10,000+ Free 0 Everyone Finance May 19, 2016 2.3 2.2 and up
In [9]:
#Check the installs (a proxy for users) of the rows with NaN in Rating; Installs is still a string here, so the ordering is lexicographic
na_rows=clean_android[clean_android['Rating'].isnull()]
na_rows.sort_values('Installs',ascending=False).head()
Out[9]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
5704 Young Speeches LIBRARIES_AND_DEMO NaN 2221 2.4M 500,000+ Free 0 Everyone Libraries & Demo January 8, 2017 1.1 2.3 and up
8793 EJ.by NEWS_AND_MAGAZINES NaN 10 2.3M 500+ Free 0 Everyone News & Magazines October 27, 2015 1.2 4.0.3 and up
4503 Poteau BA FAMILY NaN 3 4.0M 500+ Free 0 Everyone Education July 16, 2017 1.0.2 5.0 and up
1896 CD JUANITO SPORTS NaN 6 16M 500+ Free 0 Everyone Sports October 26, 2017 6.0 4.1 and up
1900 F TOOLS NaN 6 4.9M 500+ Free 0 Everyone Tools May 15, 2018 1.0.2 4.0 and up

- Rows with NaN in Rating appear to have very few installs compared with the most installed apps, and they also have very few reviews. They do not seem to add valuable information to our analysis, so we can drop them.
- In addition, row 4484 (shown in the next output) has misaligned values: its Category is '1.9' and its Rating is 19.0. It contains NaN values as well, so it is already among the rows we will drop.

In [10]:
#Compare with the rows at the top of the Installs ordering (first 3)
clean_android.sort_values('Installs',ascending=False).head(3)
Out[10]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
4484 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN
7372 My Talking Tom GAME 4.5 14892469 Varies with device 500,000,000+ Free 0 Everyone Casual July 19, 2018 4.8.0.132 4.1 and up
5682 Candy Crush Saga GAME 4.4 22430188 74M 500,000,000+ Free 0 Everyone Casual July 5, 2018 1.129.0.2 4.1 and up
In [11]:
#Find the rows with NaN
na_rows=clean_android.isnull().sum(axis=1)
rows_to_drop=na_rows[na_rows!=0].index    
rows_to_drop
Out[11]:
Int64Index([  74,  113,  118,  146,  148,  175,  270,  308,  309,  312,
            ...
            9650, 9651, 9652, 9653, 9654, 9655, 9656, 9657, 9658, 9659],
           dtype='int64', length=1470)
In [12]:
#Drop the rows that contain NaN
clean_android.drop(rows_to_drop,inplace=True)
In [13]:
clean_android.reset_index(drop=True,inplace=True)
In [14]:
#Confirm that no NaN values are left
clean_android.isnull().sum()
Out[14]:
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64
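
The steps above (find the rows with any NaN, drop them, reset the index) can also be expressed with `DataFrame.dropna`. The following is a minimal sketch on made-up rows, offered as an alternative form rather than as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame with one complete row and two rows that each miss a value.
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.1, np.nan, 3.9],
    'Type': ['Free', 'Free', None],
})

# Drop every row that has at least one missing value, then renumber the index.
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned)
```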

Take a look at the most frequent app genres in the Android market

In [15]:
top10_frequent_app=clean_android['Genres'].value_counts().head(10)
top10_frequent_app
Out[15]:
Tools              717
Entertainment      471
Education          429
Finance            302
Productivity       301
Lifestyle          300
Personalization    296
Action             292
Medical            290
Sports             266
Name: Genres, dtype: int64
In [16]:
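#Bar chart of the 10 most frequent genres (keyword arguments, since recent seaborn versions require them)
fig,ax=plt.subplots(figsize=(16,5))
sns.barplot(x=top10_frequent_app.index,y=top10_frequent_app.values,ax=ax)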
In [17]:
print(type(clean_android['Rating'][0]))
print(type(clean_android['Reviews'][0]))
print(type(clean_android['Installs'][0]))
<class 'numpy.float64'>
<class 'str'>
<class 'str'>

Convert the values of the Reviews and Installs columns to numbers in order to check their correlation and to find the top 10 most installed, most reviewed and best rated apps.

In [18]:
clean_android['Reviews']=clean_android['Reviews'].astype(int)
In [19]:
#installs is a string
print(type(clean_android['Installs'][0]))
clean_android['Installs'].unique()
<class 'str'>
Out[19]:
array(['1,000,000+', '100,000+', '500,000+', '50,000+', '10,000,000+',
       '5,000,000+', '10,000+', '50,000,000+', '100,000,000+', '5,000+',
       '1,000+', '1,000,000,000+', '500+', '100+', '10+', '50+',
       '500,000,000+', '5+', '1+'], dtype=object)
In [20]:
#Strip '+' and ',' from the strings and convert them to float
clean_android['Installs']=clean_android['Installs'].str.replace('[+,]','',regex=True).astype(float)
In [21]:
#top_10 of the most installed apps
top_10_installs=clean_android.sort_values('Installs',ascending=False).head(10)
top_10_installs
Out[21]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
7823 Google Photos PHOTOGRAPHY 4.5 10859051 Varies with device 1.000000e+09 Free 0 Everyone Photography August 6, 2018 Varies with device Varies with device
1397 Instagram SOCIAL 4.5 66577446 Varies with device 1.000000e+09 Free 0 Teen Social July 31, 2018 Varies with device Varies with device
405 Google News NEWS_AND_MAGAZINES 3.9 878065 13M 1.000000e+09 Free 0 Teen News & Magazines August 1, 2018 5.2.0 4.4 and up
4802 YouTube VIDEO_PLAYERS 4.3 25655305 Varies with device 1.000000e+09 Free 0 Teen Video Players & Editors August 2, 2018 Varies with device Varies with device
2541 Google+ SOCIAL 4.2 4831125 Varies with device 1.000000e+09 Free 0 Teen Social July 26, 2018 Varies with device Varies with device
2686 Gmail COMMUNICATION 4.3 4604483 Varies with device 1.000000e+09 Free 0 Everyone Communication August 2, 2018 Varies with device Varies with device
95 Google Chrome: Fast & Secure COMMUNICATION 4.3 9643041 Varies with device 1.000000e+09 Free 0 Everyone Communication August 1, 2018 Varies with device Varies with device
3738 Hangouts COMMUNICATION 4.0 3419513 Varies with device 1.000000e+09 Free 0 Everyone Communication July 21, 2018 Varies with device Varies with device
6873 Google Play Books BOOKS_AND_REFERENCE 3.9 1433233 Varies with device 1.000000e+09 Free 0 Teen Books & Reference August 3, 2018 Varies with device Varies with device
823 Facebook SOCIAL 4.1 78158306 Varies with device 1.000000e+09 Free 0 Teen Social August 3, 2018 Varies with device Varies with device

The most reviewed apps are not the top rated ones. However, since we have no information about session times (login and logout), we treat the number of reviews as a sign of steadier interaction: a user who reviews an app has probably spent time going through it. Such users are more likely to see advertisements.
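
One way to make this engagement idea measurable is the number of reviews per install, aggregated by genre. The sketch below uses made-up numbers. The metric `reviews_per_install` is an assumption introduced here for illustration and is not computed in the notebook. Since Installs values are lower bounds such as '1,000+', the ratio is only approximate.

```python
import pandas as pd

# Toy rows; the columns mirror the cleaned Android frame.
toy = pd.DataFrame({
    'Genres': ['Social', 'Social', 'Tools', 'Tools'],
    'Reviews': [800, 200, 10, 30],
    'Installs': [10000.0, 10000.0, 1000.0, 1000.0],
})

# Aggregate per genre, then divide total reviews by total installs.
totals = toy.groupby('Genres')[['Reviews', 'Installs']].sum()
totals['reviews_per_install'] = totals['Reviews'] / totals['Installs']

# Genres whose users review most often come first.
ranked = totals.sort_values('reviews_per_install', ascending=False)
print(ranked)
```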

In [22]:
top_10_reviewed=clean_android.sort_values('Reviews',ascending=False).head(10)
top_10_reviewed.head()
Out[22]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
823 Facebook SOCIAL 4.1 78158306 Varies with device 1.000000e+09 Free 0 Teen Social August 3, 2018 Varies with device Varies with device
1271 WhatsApp Messenger COMMUNICATION 4.4 69119316 Varies with device 1.000000e+09 Free 0 Everyone Communication August 3, 2018 Varies with device Varies with device
1397 Instagram SOCIAL 4.5 66577446 Varies with device 1.000000e+09 Free 0 Teen Social July 31, 2018 Varies with device Varies with device
1968 Messenger – Text and Video Chat for Free COMMUNICATION 4.0 56646578 Varies with device 1.000000e+09 Free 0 Everyone Communication August 1, 2018 Varies with device Varies with device
2791 Clash of Clans GAME 4.6 44893888 98M 1.000000e+08 Free 0 Everyone 10+ Strategy July 15, 2018 10.322.16 4.1 and up
In [23]:
top10_reviewed_and_installed=clean_android.sort_values(['Installs','Reviews'],ascending=[False,False]).head(10)
top10_reviewed_and_installed.head(10)
Out[23]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
823 Facebook SOCIAL 4.1 78158306 Varies with device 1.000000e+09 Free 0 Teen Social August 3, 2018 Varies with device Varies with device
1271 WhatsApp Messenger COMMUNICATION 4.4 69119316 Varies with device 1.000000e+09 Free 0 Everyone Communication August 3, 2018 Varies with device Varies with device
1397 Instagram SOCIAL 4.5 66577446 Varies with device 1.000000e+09 Free 0 Teen Social July 31, 2018 Varies with device Varies with device
1968 Messenger – Text and Video Chat for Free COMMUNICATION 4.0 56646578 Varies with device 1.000000e+09 Free 0 Everyone Communication August 1, 2018 Varies with device Varies with device
4544 Subway Surfers GAME 4.5 27725352 76M 1.000000e+09 Free 0 Everyone 10+ Arcade July 12, 2018 1.90.0 4.1 and up
4802 YouTube VIDEO_PLAYERS 4.3 25655305 Varies with device 1.000000e+09 Free 0 Teen Video Players & Editors August 2, 2018 Varies with device Varies with device
7823 Google Photos PHOTOGRAPHY 4.5 10859051 Varies with device 1.000000e+09 Free 0 Everyone Photography August 6, 2018 Varies with device Varies with device
7921 Skype - free IM & video calls COMMUNICATION 4.1 10484169 Varies with device 1.000000e+09 Free 0 Everyone Communication August 3, 2018 Varies with device Varies with device
95 Google Chrome: Fast & Secure COMMUNICATION 4.3 9643041 Varies with device 1.000000e+09 Free 0 Everyone Communication August 1, 2018 Varies with device Varies with device
218 Maps - Navigate & Explore TRAVEL_AND_LOCAL 4.3 9235373 Varies with device 1.000000e+09 Free 0 Everyone Travel & Local July 31, 2018 Varies with device Varies with device

The plots below show that most of the most installed apps are rated between 4.0 and 4.5 and also have the highest numbers of reviews.

In [24]:
#Check relations between the columns Rating, Reviews and Installs
scatter_matrix(clean_android[['Rating','Reviews','Installs']],figsize=(16,6))
plt.show()
In [25]:
clean_android.plot.scatter('Rating','Reviews')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2065ef46860>

We observe two clusters among the apps with the most installs.

In [26]:
clean_android.plot.scatter('Rating','Installs')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2065f40d400>
In [27]:
clean_android.plot.scatter('Installs','Reviews')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x2065f4b7a90>
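
The scatter plots can be complemented with a correlation table. Reviews and Installs are heavily skewed, so the sketch below correlates ranks instead of raw values, which is the Spearman rank correlation. The numbers are made up, and the choice of a rank correlation is an assumption rather than something the notebook does.

```python
import pandas as pd

# Toy values spanning several orders of magnitude, like the real columns.
toy = pd.DataFrame({
    'Rating':   [4.0, 4.5, 3.0, 4.2, 4.4],
    'Reviews':  [100, 5000, 10, 800, 20000],
    'Installs': [1e4, 1e6, 1e3, 1e5, 1e7],
})

# Rank each column, then take the Pearson correlation of the ranks
# (equivalent to the Spearman rank correlation).
rank_corr = toy.rank().corr()
print(rank_corr.round(2))
```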

Let's have a look at the iOS apps

In [28]:
ios=pd.read_csv('AppleStore.csv')
In [32]:
ios.head(2)
Out[32]:
id track_name size_bytes currency price rating_count_tot rating_count_ver user_rating user_rating_ver ver cont_rating prime_genre sup_devices.num ipadSc_urls.num lang.num vpp_lic
0 284882215 Facebook 389879808 USD 0.0 2974676 212 3.5 3.5 95.0 4+ Social Networking 37 1 29 1
1 389801252 Instagram 113954816 USD 0.0 2161558 1289 4.5 4.0 10.23 12+ Photo & Video 37 0 29 1

Here we have neither installs nor reviews, so we cannot know the total number of users. We do have the total number of ratings (rating_count_tot), which shows how many users interacted with the app. Based on that, we will try to see which categories are the most popular here.

In [31]:
#top_10 of most rated apps
ios_top10_rated=ios.sort_values('rating_count_tot',ascending=False).head(10)
ios_top10_rated.head()
Out[31]:
id track_name size_bytes currency price rating_count_tot rating_count_ver user_rating user_rating_ver ver cont_rating prime_genre sup_devices.num ipadSc_urls.num lang.num vpp_lic
0 284882215 Facebook 389879808 USD 0.0 2974676 212 3.5 3.5 95.0 4+ Social Networking 37 1 29 1
1 389801252 Instagram 113954816 USD 0.0 2161558 1289 4.5 4.0 10.23 12+ Photo & Video 37 0 29 1
2 529479190 Clash of Clans 116476928 USD 0.0 2130805 579 4.5 4.5 9.24.12 9+ Games 38 5 18 1
3 420009108 Temple Run 65921024 USD 0.0 1724546 3842 4.5 4.0 1.6.2 9+ Games 40 5 1 1
4 284035177 Pandora - Music & Radio 130242560 USD 0.0 1126879 3594 4.0 4.5 8.4.1 12+ Music 37 4 1 1
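
To compare categories rather than individual apps, the rating counts could be averaged per `prime_genre`. The following is a minimal sketch on made-up rows. Restricting to free apps (`price == 0.0`) is an assumption based on the company's focus on free apps; the notebook itself does not apply this filter.

```python
import pandas as pd

# Toy rows using the same column names as AppleStore.csv.
toy = pd.DataFrame({
    'track_name': ['A', 'B', 'C', 'D'],
    'price': [0.0, 0.0, 1.99, 0.0],
    'rating_count_tot': [3000, 1000, 9000, 500],
    'prime_genre': ['Social Networking', 'Games', 'Games', 'Games'],
})

# Keep free apps only, since the business case is ad-supported free apps.
free = toy[toy['price'] == 0.0]

# Average number of ratings per app in each genre, largest first.
avg_ratings = (free.groupby('prime_genre')['rating_count_tot']
                   .mean()
                   .sort_values(ascending=False))
print(avg_ratings)
```

An average per genre can be dominated by a few very large apps, so the number of apps behind each average is worth checking as well.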

CONCLUSION:

Here too, the first positions are dominated by social apps (Instagram combines social networking with photography). iOS has more games in its top 10, but the leading apps still belong to the social and communication categories.

One idea for a new app is therefore an app that teaches users how to edit video and photos and lets them publish their work on a connected social app (Facebook, Instagram) as part of a competition. One part of the score would come from criteria built into the app, and another part from other users of the app or of the connected social app (for example Facebook). Combined with a prize provided by commercial companies, this could attract many users and therefore be profitable for the company through advertisements.