The aim of this project is to identify profitable Android (Google Play) and iOS (the App Store) mobile apps.
The apps under consideration are free to download and install, and the company's main source of revenue is in-app ads. This means the revenue for any given app is mostly influenced by the number of its users: the more users who see and engage with the ads, the better. Hence it is necessary to analyze the available data to understand what types of apps are likely to attract more users on both Google Play and the App Store.
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
Collecting data for over 4 million apps requires a significant amount of time and money, so we'll first check whether relevant existing data is available at no cost and analyze a sample instead. For this purpose, 2 data sets are available in the form of CSV files: googleplaystore.csv (Android apps) and AppleStore.csv (iOS apps).
To open and explore these two data sets, a function explore_data() was created:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
# Opening the data sets and saving both as lists of lists
from csv import reader
# encoding='utf8' is passed explicitly because some app names contain non-ASCII characters
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]
explore_data(android, 0, 3, True)
explore_data(ios, 0, 3, True)
# Android data set columns
print(android_header)
print('\n')
# iOS data set columns
print(ios_header)
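The loading steps above could also be wrapped in a small helper. The following is only a sketch: the name load_csv and the demo file are hypothetical, and it assumes the CSV files are UTF-8 encoded. It uses a with block so that the file is closed automatically.

```python
import csv
import os
import tempfile

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; rows is a list of lists."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demo on a throwaway file with made-up content (not the real data sets)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        f.write('App,Category\nInstagram,SOCIAL\n"Docs To Go™, Free",BUSINESS\n')
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would look like android_header, android = load_csv('googleplaystore.csv').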
The Google Play data set (Android apps) contains 10,841 apps and 13 columns. The most informative columns for us seem to be the following: 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Content Rating' and 'Genres'.
The App Store data set (iOS apps) contains 7,197 apps and 16 columns. The columns potentially useful for our data analysis might be the following: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'cont_rating' and 'prime_genre'.
For further details about both data sets and the meaning of each column, the corresponding data set documentation can be consulted: Android apps data set and iOS apps data set.
Discussion sections are available for both data sets: for Google Play and for the App Store. One of the topics in the Google Play discussion section reports a wrong value in row 10,472 (a missing 'Rating' and a column shift for the following columns).
print(android_header)
print('\n')
print(android[10472])
Inspecting the reported row, we can see that the missing value is actually not 'Rating' but 'Category', and that 'Genres' has no value either. For comparison, let's check another row of this data set:
print(android_header)
print('\n')
print(android[5])
Hence row 10,472 indeed has a missing value for 'Category', an empty cell for 'Genres', and all the values in between are shifted to the left. This row has to be removed from the data set:
del android[10472]
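Instead of relying on a reported row number, shifted rows can sometimes be found automatically. The sketch below assumes that a row with a missing value ends up shorter than the header, which is not verified here for row 10,472; the helper name find_malformed_rows and the toy rows are hypothetical.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' missing, values shifted left
    ['App C', 'GAME', '4.5'],
]

bad = find_malformed_rows(toy_rows, toy_header)
print(bad)

# Delete from the end so that the earlier indices stay valid
for i in reversed(bad):
    del toy_rows[i]
print(toy_rows)
```

Deleting in reverse order matters as soon as more than one row is removed, because each del shifts the indices of all later rows.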
Exploring the Google Play data set revealed that some apps have duplicate entries. For instance, Instagram has 4 entries:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
In total, there are 1,181 cases where an app occurs more than once:
# Creating the lists of duplicate apps and unique apps
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])
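The same count can be obtained with collections.Counter from the standard library, which avoids scanning a list for every row. This is a sketch on toy data; the helper name count_duplicates is hypothetical.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of surplus rows, number of unique names)."""
    counts = Counter(row[name_index] for row in dataset)
    # An app that occurs n times contributes n - 1 surplus rows
    surplus = sum(n - 1 for n in counts.values())
    return surplus, len(counts)

toy = [['Instagram'], ['Instagram'], ['Slack'], ['Instagram'], ['Slack'], ['Box']]
surplus, unique = count_duplicates(toy)
print('Number of duplicate rows:', surplus)
print('Number of unique apps:', unique)
```

The surplus value corresponds to len(duplicate_apps) in the loop above, and unique corresponds to len(unique_apps).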
We need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.
Returning to the rows we printed for the Instagram app, the main difference appears in the 4th position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.
# Creating a dictionary with the highest number of reviews for each app
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
Since 1,181 duplicates were detected in the Google Play data set, we should be left with 9,659 rows after removing them. We also expect the length of the dictionary to be equal to 9,659:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))
# Creating a new data set without duplicates (one entry per app)
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    # already_added guards against duplicates that share the same maximum number of reviews
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
Checking the length of the resulting data set (again, expected value is 9,659):
print(len(android_clean))
print(android_clean[:5])
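The two passes above (building reviews_max, then filtering) can also be folded into a single pass that stores the best row per app directly. This is only a sketch with a hypothetical helper name, keep_most_reviewed, run on toy rows; note that the order of the resulting rows may differ from android_clean, since each app keeps the position of its first occurrence.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app: the one with the highest number of reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        # Strict '>' keeps the first row when review counts tie
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Box', 'BUSINESS', '4.2', '50'],
    ['Instagram', 'SOCIAL', '4.5', '300'],
    ['Instagram', 'SOCIAL', '4.5', '300'],
]
toy_clean = keep_most_reviewed(toy)
print(toy_clean)
```

A dictionary lookup is also faster than the name not in already_added test, which scans a list for every row.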
Since our company uses only English to develop its apps, we'd like to analyze only the apps that are directed toward an English-speaking audience.
An inspection of both data sets showed that they also contain apps with non-English names, i.e. names containing symbols that are unusual in English text (anything other than English letters, digits 0-9, punctuation marks, and special symbols). These apps have to be removed.
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[442][0])
print(android_clean[7940][0])
According to the ASCII system, the numbers corresponding to the set of common English characters are all in the range 0-127. Hence we have to create a function that checks whether every symbol of an app name falls within this range. If any symbol doesn't, the app cannot be considered for further data analysis and has to be removed from the data set.
def english_apps(string):
    for symbol in string:
        if ord(symbol) > 127:
            return False
    return True
Let's check this function on some apps:
print(english_apps('Instagram'))
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('Instachat 😜'))
The results show that the function wrongly rejects some English app names that contain emojis or special characters outside the ASCII range (such as ™). If we used it as is, we would lose valuable data.
To minimize the impact of data loss, we'll only remove an app if its name has more than 3 characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to 3 such symbols will still be labeled as English.
# Editing the previous function
def english_apps(string):
    acceptable = 0  # counts the characters that fall outside the ASCII range
    for symbol in string:
        if ord(symbol) > 127:
            acceptable += 1
    if acceptable > 3:
        return False
    return True
# Checking the updated function
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('Instachat 😜'))
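To see why a threshold of 3 separates these examples, it helps to print how many characters of each name fall outside the ASCII range. The helper name count_non_ascii below is hypothetical; the sample names are the ones used above.

```python
def count_non_ascii(string):
    """Count the characters whose code point is above 127."""
    return sum(1 for symbol in string if ord(symbol) > 127)

samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    n = count_non_ascii(name)
    # n <= 3 mirrors the rule used by the updated english_apps()
    print(n, n <= 3, name)
```

The rule is still a heuristic: a short non-English name with three or fewer such characters would pass the filter.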
Now we will filter out non-English apps from both data sets:
android_cleaned_filtered = []
ios_filtered = []
for app in android_clean:
    if english_apps(app[0]):
        android_cleaned_filtered.append(app)
for app in ios:
    if english_apps(app[1]):
        ios_filtered.append(app)
explore_data(android_cleaned_filtered, 0, 3, True)
explore_data(ios_filtered, 0, 3, True)
After filtering, the Android data set has 9,614 rows and the iOS data set has 6,183 rows.
The company builds only free apps. Hence, before proceeding to the data analysis step, we have to remove all non-free apps from both data sets.
android_final = []
ios_final = []
for app in android_cleaned_filtered:
    if app[7] == '0':
        android_final.append(app)
for app in ios_filtered:
    if app[4] == '0.0':
        ios_final.append(app)
print('Final number of android apps:', len(android_final))
print('Final number of iOS apps:', len(ios_final))
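The string comparisons above depend on the exact spelling of a zero price ('0' on Google Play, '0.0' on the App Store). A numeric check is less brittle. The sketch below uses a hypothetical helper, is_free, and assumes that paid prices are plain numbers optionally prefixed with a dollar sign.

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' means zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0
    except ValueError:
        # Prices that cannot be parsed are treated as not free
        return False

for price in ['0', '0.0', '$4.99', '1.99', 'Everyone']:
    print(repr(price), is_free(price))
```

With this helper, the same condition could be used for both data sets, e.g. is_free(app[7]) for Android rows and is_free(app[4]) for iOS rows.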
Now we have 8,864 Android apps and 3,222 iOS apps for further data analysis.
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.
To minimize risks and overhead, our validation strategy for an app idea consists of 3 steps:
Because our final goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.
Let's begin the analysis by getting a sense of the most common genres in each market. For the Google Play data set, the genres of the apps are described in the columns 'Genres' and 'Category'; for the App Store data set, in the column 'prime_genre'.
We'll build two functions we can use to analyze the frequency tables:
def freq_table(dataset, index):
    dictionary = {}
    number_apps = 0
    for row in dataset:
        number_apps += 1
        dictionary[row[index]] = dictionary.get(row[index], 0) + 1
    # Converting the counts into percentages
    dictionary_percent = {}
    for key in dictionary:
        dictionary_percent[key] = (dictionary[key] / number_apps) * 100
    return dictionary_percent
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    # Sorting by percentage in descending order
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
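To make the output of these two functions concrete, here is a compact variant run on toy data. It is a sketch, not a replacement: the name freq_table_percent is hypothetical, and it relies on collections.Counter instead of a hand-written counting loop.

```python
from collections import Counter

def freq_table_percent(dataset, index):
    """Map each value found at `index` to its share of the rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_percent(toy, 1)

# Print the genres from the most to the least frequent
for genre, percent in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percent)
```

Three of the four toy rows are Games, so that genre gets 75 percent and Music gets the remaining 25 percent.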
# Prime_genre column
display_table(ios_final, -5)
Among free English iOS apps, the most common genre is Games (58%), followed at a large distance by Entertainment (7.9%). The general impression is that apps designed for entertainment (games, photo and video, social networking, sports, music) significantly dominate the App Store compared with apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle).
Judging only by the frequency table, we still cannot recommend an app profile for the App Store market, because a large number of apps for a particular genre does not necessarily imply that apps of that genre have a large number of users.
# Category column
display_table(android_final, 1)
Among free English Android apps, the most common categories are also entertainment-oriented: FAMILY (18.9%) and GAME (9.7%). However, the spread of percentages across the other categories is not as large as for iOS apps, and in general a more balanced landscape of both practical and fun apps is observed. The number of categories is comparable with the number of iOS app genres.
If we look at the Genres column for Android apps, we will see that it is much more granular, so the number of genres is no longer comparable with the number of iOS app genres:
# Genres column
display_table(android_final, 9)
Like in the previous case, from these frequency tables alone we cannot deduce anything about the genres (categories) with the most users and cannot recommend an app profile for Google Play.
One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
# Calculating the average number of user ratings per app genre on the App Store:
prime_genre = freq_table(ios_final, -5)
for genre in prime_genre:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            number_rating = float(app[5])
            total += number_rating
            len_genre += 1
    average_number_rating = total / len_genre
    print(genre, ':', average_number_rating)
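The nested loop above rescans the whole data set once per genre. The per-genre averages can also be accumulated in a single pass, as in the following sketch; the helper name average_by_group and the toy rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric values at value_index}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

toy = [['Games', '100'], ['Games', '300'], ['Book', '50']]
averages = average_by_group(toy, 0, 1)
for genre, average in averages.items():
    print(genre, ':', average)
```

For the App Store data, the equivalent call would be average_by_group(ios_final, -5, 5).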
Looking at the results, a preliminary conclusion is that the most popular app genres (based on the average number of user ratings) are the following: Navigation, Reference, Social Networking, Music, Weather and Book.
Let's look at each of them in more detail, in particular at the apps they contain:
print('Navigation')
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])
print('\n')
print('Reference')
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])
print('\n')
print('Social Networking')
for app in ios_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])
print('\n')
print('Music')
for app in ios_final:
    if app[-5] == 'Music':
        print(app[1], ':', app[5])
print('\n')
print('Weather')
for app in ios_final:
    if app[-5] == 'Weather':
        print(app[1], ':', app[5])
print('\n')
print('Book')
for app in ios_final:
    if app[-5] == 'Book':
        print(app[1], ':', app[5])
Thus, the most promising iOS app profiles seem to be Social Networking and Book.
Our next step is to provide an app profile recommendation for the Google Play market. We have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise, with most values being open-ended (100+, 1,000+, 5,000+, etc.). We'll use these data anyway after some cleaning: we take the numbers at face value, remove the commas and plus characters, and convert the numbers to the float type.
# Calculating the average number of installs per app genre on Google Play
categories = freq_table(android_final, 1)
for category in categories:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            number_installs = app[5]
            number_installs = number_installs.replace('+', '')
            number_installs = number_installs.replace(',', '')
            total += float(number_installs)
            len_category += 1
    average_number_installs = total / len_category
    print(category, ':', average_number_installs)
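The cleaning of the install strings is repeated in several places, so it could be isolated in a small helper. The name parse_installs below is hypothetical; the sketch only assumes the formats mentioned above (digits with optional commas and a trailing plus sign).

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to a float (1000000.0)."""
    return float(installs.replace(',', '').replace('+', ''))

for raw in ['100+', '5,000+', '1,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
```

Because the plus sign is simply dropped, each value is a lower bound on the real number of installs, so the resulting averages are lower bounds as well.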
We see that the most popular app genres (based on the average number of installs) are the following: COMMUNICATION (38,456,119), VIDEO_PLAYERS (24,727,872), SOCIAL (23,253,652), PHOTOGRAPHY (17,840,110), PRODUCTIVITY (16,787,331), GAME (15,588,015) and TRAVEL_AND_LOCAL (13,984,077).
Let's look at the apps these genres contain in more detail. First, these seemingly popular genres appear to be dominated by a few giant apps with 100 million installs or more. Such values heavily skew the averages.
for app in android_final:
    if app[1] == 'COMMUNICATION':
        number_installs = app[5]
        number_installs = number_installs.replace('+', '')
        number_installs = number_installs.replace(',', '')
        if float(number_installs) >= 100000000:
            print(app[0], ':', app[5])
If we exclude these numerous giant apps of the COMMUNICATION genre, the average drops roughly tenfold:
under_100_millions = []
for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_millions.append(float(n_installs))
average_number_installs = sum(under_100_millions) / len(under_100_millions)
print('COMMUNICATION')
print('Before: 38456119')
print('After: ', average_number_installs)
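The effect of a single giant app on the mean is easy to reproduce with made-up numbers. The install counts below are purely illustrative and are not taken from the data set.

```python
def mean(values):
    return sum(values) / len(values)

# Four ordinary apps and one giant app (illustrative numbers only)
toy_installs = [10_000, 50_000, 100_000, 500_000, 1_000_000_000]
threshold = 100_000_000

without_giants = [n for n in toy_installs if n < threshold]

print('Mean with the giant app:   ', mean(toy_installs))
print('Mean without the giant app:', mean(without_giants))
```

Here one app out of five raises the mean from 165,000 to over 200 million, which is why the averages are recomputed after excluding apps at or above the threshold.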
The same tendency is observed for all the other seemingly most popular genres:
# The 'Before' averages are the values obtained in the previous step
averages_before = {
    'VIDEO_PLAYERS': 24727872,
    'SOCIAL': 23253652,
    'PHOTOGRAPHY': 17840110,
    'PRODUCTIVITY': 16787331,
    'GAME': 15588015,
    'TRAVEL_AND_LOCAL': 13984077,
}
for category, average_before in averages_before.items():
    under_100_millions = []
    for app in android_final:
        if app[1] != category:
            continue
        n_installs = float(app[5].replace(',', '').replace('+', ''))
        if n_installs >= 100000000:
            # Giant apps are printed and left out of the new average
            print(app[0], ':', app[5])
        else:
            under_100_millions.append(n_installs)
    average_number_installs = sum(under_100_millions) / len(under_100_millions)
    print(category)
    print('Before:', average_before)
    print('After: ', average_number_installs)
This investigation reveals some insights for each of the most popular genres.
When investigating the app genres of the App Store, we also identified the Book profile as a potential candidate. For Google Play, the corresponding category (BOOKS_AND_REFERENCE) is not among the most popular and ranks only 11th among the 33 categories. It could also be difficult to derive ideas for a social networking app from this category. Hence, for creating apps that are profitable on both markets, books don't seem to be the best choice.
All in all, after a thorough analysis of the most common and the most popular app genres in both data sets, the Social Networking profile appears to be the most interesting for our purposes, i.e. creating profitable free English apps with revenue based on in-app ads for both the App Store and Google Play. To stand out among the existing apps of this kind and to overcome the competition, the right theme has to be selected. Possible ideas include an online quiz, a quest, other online games involving many people or teams, or a social networking app dedicated to finding co-travellers and discussing itineraries and places to visit.