As data scientists at a company that builds apps for the App Store and Google Play markets, our goal is to help developers make data-driven decisions about new app ideas.
In this project our purpose is to find a profitable app profile: the app will be free to download and install, and our main revenue will rely on in-app ads. That's why we aim to reach more users and want them to spend more time in our app; the more users we have, the more people will see and engage with the ads.
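As a rough, purely illustrative sketch of this revenue model (every number below is a hypothetical assumption, not data from either market):

```python
# Back-of-the-envelope ad revenue estimate.
# All inputs are invented for illustration only.
monthly_users = 100_000          # assumed active users per month
sessions_per_user = 5            # assumed sessions per user per month
ads_per_session = 3              # assumed ad impressions per session
revenue_per_impression = 0.002   # assumed revenue per impression, in dollars

monthly_revenue = (monthly_users * sessions_per_user
                   * ads_per_session * revenue_per_impression)
print(monthly_revenue)  # 3000.0
```

The point is simply that revenue scales with both the number of users and the time they spend in the app, which is why both metrics matter to us.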
To find out what kinds of apps might be interesting for users, we need to analyze data that helps our developers understand what types of apps attract more users.
For this project we're going to study App Store data and Google Play data in order to find the best direction for building a new app; we want to make our app available on both markets.
We will focus on free, English-language apps.
For this task we're going to use samples of data that are already available online.
Fortunately, we have data available for both platforms.
In the code below we open each CSV file, read it with csv.reader, and convert it into a list of lists (list_data). After these steps we want to find out which columns could be potentially interesting for our analysis.
from csv import reader
def open_file(dataset, has_header=True):
    opened_file = open(dataset, encoding='utf8')
    read_file = reader(opened_file)
    list_data = list(read_file)
    if has_header:
        return list_data[0], list_data[1:]
    else:
        return list_data
ios_header, ios = open_file('AppleStore.csv', has_header=True)
android_header, android = open_file('googleplaystore.csv', has_header=True)
The explore_data function lets us explore our datasets from different perspectives:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print(ios_header)
print('\n')
explore_data(ios, 2, 5, rows_and_columns=True)
After a quick glance over the header we can identify a few columns that might be interesting for our analysis. For more detailed information about the columns, check the dataset's documentation.
print(android_header)
print('\n')
explore_data(android, 2, 5, rows_and_columns=True)
After a quick glance over the header we can identify a few columns that might be interesting for our analysis. For more detailed information about the columns, check the dataset's documentation.
In these previous outputs we can observe that the following datasets have:
1. iOS apps:
* 16 columns
* 7197 rows
2. Android apps:
* 13 columns
* 10841 rows
So we might think that there are many more apps on the Google Play market and that we should build our app for that platform. However, the discrepancy might simply come from how each sample was collected.
Before getting started with our analysis we need to figure out whether our datasets are clean: they should have no errors, and all data points should meet our requirements.
The function below checks whether all rows of a dataset have the same length; more exactly, we compare the length of each row to the length of the header.
def length_row(header, dataset):
    header_length = len(header)
    for index, row in enumerate(dataset):
        # return the position of the first row whose length differs from the header
        if header_length != len(row):
            return index, row
ios_error = length_row(ios_header, ios)
android_error_row, android_error_line = length_row(android_header, android)
print(ios_error)
As the function returned None, our iOS dataset doesn't have any such length problem.
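The same length check can be expressed more compactly with all(); here is a minimal, self-contained sketch on invented rows (the header and rows below are hypothetical, not taken from our datasets):

```python
header = ['id', 'name', 'price']
rows = [
    ['1', 'AppA', '0.0'],
    ['2', 'AppB', '1.99'],
]
# True only if every row has exactly as many cells as the header
all_rows_ok = all(len(row) == len(header) for row in rows)
print(all_rows_ok)  # True
```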
Below we print the position of the faulty row as well as the row itself, our header, and another row; this helps us visually understand where the problem is.
print(android_error_row)
print(android_error_line)
print()
print(android_header)
print()
print(android[0])
Now we can see that the Google Play dataset has a missing cell for the 'Category' column (index 1) on row 10472, so we delete this row to avoid any problems.
print(len(android))
del android[10472]
print(len(android))
Before moving on we need to make sure that our datasets contain no duplicates: we want to keep just one entry per app, otherwise duplicates could mislead our conclusions.
Below we wrote a function that checks whether our datasets contain duplicate applications, and how many.
def if_duplicate(dataset, index):
    unique_apps = []
    duplicate_apps = []
    for app in dataset:
        name = app[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    return unique_apps, duplicate_apps
ios_unique, ios_duplicate = if_duplicate(ios, 1)
android_unique, android_duplicate = if_duplicate(android, 0)
print('Number of unique iOS apps is ', len(ios_unique))
print('Number of duplicate iOS apps is ', len(ios_duplicate))
print('\n')
print('Number of unique Android apps is ', len(android_unique))
print('Number of duplicate Android apps is ', len(android_duplicate))
Those duplicates should be removed, but we want to be able to choose which entries are less useful. For this reason we're going to inspect the rows of duplicate apps in order to find a good criterion for deleting/keeping.
First of all we're going to find the most frequent duplicates; more occurrences give us a better view of the value differences.
For this we create a frequency table with the app name as key and its frequency as value. Afterwards we check which duplicates are the most frequent in the Google Play dataset by looking for apps with 7 or more occurrences.
duplicates_freq = {}
for app in android:
    name = app[0]
    if name not in duplicates_freq:
        duplicates_freq[name] = 1
    else:
        duplicates_freq[name] += 1

for app in duplicates_freq:
    if duplicates_freq[app] >= 7:
        print(app)
In the code below we print all rows for the app named Duolingo: Learn Languages Free.
print(android_header)
for app in android:
    name = app[0]
    if name == 'Duolingo: Learn Languages Free':
        print('\n')
        print(app)
We have 7 occurrences with small but nevertheless important differences.
This explains why we cannot delete rows randomly. Of course, in this case we're dealing with the same app, but maybe in different versions, or maybe it's just a data-scraping issue (data collection performed at different times).
That's why it is more relevant to keep the entry with the highest number of reviews.
print(ios_duplicate)
The iOS app dataset has only two duplicates.
for app in ios:
    name = app[1]
    if name == 'VR Roller Coaster':
        print('\n')
        print(app)
print('\n')
print(ios_header)
The main difference between the duplicate rows lies in the total rating count and the current-version rating count. For this reason we're going to use the review count as the criterion for deleting duplicates.
To do so we first build a dictionary that maps each app name to its highest review count:
def highest_review_app(dataset, ind_app, ind_review):
    max_reviews = {}
    for row in dataset:
        app = row[ind_app]
        review = float(row[ind_review])
        if app in max_reviews and max_reviews[app] < review:
            max_reviews[app] = review
        elif app not in max_reviews:
            max_reviews[app] = review
    return max_reviews
ios_max_review = highest_review_app(ios, 1, 5)
android_max_review = highest_review_app(android, 0, 3)
print('Expected Google Market dataset length : ', len(android) - len(android_duplicate))
print('Actual Google Market dataset length : ', len(android_max_review))
print('\n')
print('Expected Apple Store dataset length : ', len(ios) - len(ios_duplicate))
print('Actual Apple Store dataset length : ', len(ios_max_review))
The function below uses this dictionary to keep only one row per app: the one with the highest review count.
We need the already_added list because several rows may share both the same name and the same number of reviews; without this extra condition we could end up keeping several duplicates.
def cleaning(dataset, ind_app, ind_review):
    apps_clean = []
    already_added = []
    max_review = highest_review_app(dataset, ind_app, ind_review)
    for app in dataset:
        name = app[ind_app]
        n_review = float(app[ind_review])
        if (name not in already_added) and (n_review == max_review[name]):
            apps_clean.append(app)
            already_added.append(name)
    return apps_clean
android_clean = cleaning(android, 0, 3)
explore_data(android_clean, 0, 3, True)
ios_clean = cleaning(ios, 1, 5)
explore_data(ios_clean, 0, 3, True)
A little reminder: our goal is to create an app for English-speaking users. This is why our second step will be cleaning the datasets of non-English apps. Unfortunately, we cannot simply delete every app whose name contains non-ASCII characters:
some apps contain non-ASCII characters but are still English-language apps. For example, some app names contain emojis or other characters outside the ASCII range of 0-127.
print(ord('™'))
print(ord('😜'))
In our case we're going to allow at most 2 non-ASCII characters in an app name: names with 3 or more non-ASCII characters, or consisting entirely of non-ASCII characters, are treated as non-English.
def is_english(string):
    non_ascii = 0
    for ch in string:
        if ord(ch) > 127:
            non_ascii += 1
    if non_ascii == len(string) or non_ascii >= 3:
        return False
    else:
        return True
print(is_english('Instachat'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
Now, using this function, we can filter our datasets, keeping only the apps that are English by our definition.
android_en = []
ios_en = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_en.append(app)

for app in ios_clean:
    name = app[1]
    if is_english(name):
        ios_en.append(app)
explore_data(android_en, 0, 3, True)
explore_data(ios_en, 0, 3, True)
As mentioned above, we're planning to build only free apps, so for our decision-making process we want to keep only free apps.
In the code below we create new lists that contain only free apps.
android_final = []
ios_final = []

for app in android_en:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_en:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
print(len(android_final))
print(len(ios_final))
We're left with 8848 Android apps and 3196 iOS apps.
After the cleaning process we can study our data in order to find the presumably most profitable category to develop, following a low-risk strategy.
We will study both datasets because in the future we're planning to engage with both markets, and we need to find the most successful niche.
Our next step is discovering the most popular genres within each dataset. In the code below we create a frequency table for each genre.
def freq_table(dataset, index):
    table = {}
    count = 0
    # creating a frequency table
    for row in dataset:
        elem = row[index]
        count += 1
        if elem in table:
            table[elem] += 1
        else:
            table[elem] = 1
    # converting counts into percentages
    for elem in table:
        percentage = (table[elem] / count) * 100
        table[elem] = percentage
    return table
print(freq_table(ios_final, -5))
These results are not very readable, so we're going to sort our genres/categories by percentage in descending order.
To do so we convert the frequency table into a sortable list of (percentage, genre) tuples:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        table_display.append((table[key], key))
    table_sort = sorted(table_display, reverse=True)
    for elem in table_sort:
        print(elem[1], ' : ', elem[0])
display_table(ios_final, -5)  # exploring prime_genre
Observations on the iOS dataset:
- Games is the largest category, with more than half of the apps (58.26%)
- Entertainment takes second place with almost 8%
- Photo & Video reaches 5%
- Education reaches 3.69%
This shows that most apps in the dataset are designed for fun rather than for practical purposes (education, productivity, utilities, weather, business, etc.). In this dataset, app categories are not distributed equally.
Nevertheless, given this information, we cannot assume that these apps have the greatest number of users: demand might be lower than supply.
display_table(android_final, -4)  # exploring Genres
These proportions show that the Google Play market has more practical apps than fun apps.
The top genres are mostly practical ones, and other practical apps are more evenly represented across the whole set of genres. This was not the case for iOS apps.
display_table(android_final, 1)  # exploring Category
From this representation we can assume that the biggest share of apps was designed for practical purposes: all the top categories, except the Game category with 10%, are represented by practical apps.
However, the Family category is a bit vague. We're going to see what kinds of apps are represented in this category and what genres are assigned to it.
In the code below we build a frequency table of the genres assigned to the apps whose category is FAMILY:
def category_genre(dataset, elem, ind_imp, ind_comp):
    table = {}
    l_comp = []
    for row in dataset:
        if elem == row[ind_imp]:
            l_comp.append(row[ind_comp])
    for el in l_comp:
        if el in table:
            table[el] += 1
        else:
            table[el] = 1
    return table, len(l_comp)
cat_to_gen, len_category = category_genre(android_final, 'FAMILY', 1, -4)
genre_sort = []
for genre in cat_to_gen:
    genre_sort.append((cat_to_gen[genre], genre))
genre_sort = sorted(genre_sort, reverse=True)

print('Google Play Market contains ', len_category, ' apps whose category is FAMILY')
print()
for genre in genre_sort:
    print(genre[1], ' : ', genre[0])
As we expected, most of the apps allocated to the FAMILY category have a fun purpose: the Entertainment genre, together with various sub-types of entertainment and games, takes the biggest part of this category.
Nevertheless, the Google Play app distribution between categories is more balanced than in the iOS dataset.
We need to find out which genres/categories are the most popular among users. For this we want the average number of installs for each app genre.
There is no install-count information in our iOS dataset, so we're going to use the rating_count_tot column as a proxy.
In the code below we compute the average rating count per genre:
freq_genre_ios = freq_table(ios_final, -5)
for genre in freq_genre_ios:
    total = 0
    len_genre = 0
    for row in ios_final:
        genre_app = row[-5]
        if genre == genre_app:
            rating = float(row[5])
            total += rating
            len_genre += 1
    average_rating = total / len_genre
    print(genre, ' : ', average_rating)
It seems the apps that received the most ratings belong to the Social Networking, Reference, Music, Navigation and Education genres.
These genres must be dominated by the largest players: Navigation by Waze, Google Maps, etc.; Social Networking by Facebook, Pinterest, Instagram, etc.; Music by Spotify, Shazam, Pandora, etc.
These giants have a big impact on the average for the whole genre.
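One way to see the effect of such giants is to compare the mean with the median, which is robust to outliers. A small sketch on invented rating counts (the numbers below are hypothetical, not taken from our dataset):

```python
# Hypothetical rating counts for a genre dominated by a few giant apps.
ratings = [2_974_676, 2_161_558, 1_126_879, 5_000, 3_200, 1_800, 900, 450]

mean = sum(ratings) / len(ratings)

sorted_r = sorted(ratings)
mid = len(sorted_r) // 2
# even number of values: average the two middle ones
median = (sorted_r[mid - 1] + sorted_r[mid]) / 2

print(round(mean))  # the few giants pull the mean far above the median
print(median)
```

If the mean were driven by typical apps rather than a few giants, the two values would be close; the large gap is what we mean by the giants' impact on the whole genre.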
def most_comm(dataset, pattern):
    for app in dataset:
        name = app[1]
        genre = app[-5]
        if genre == pattern:
            print(name, ':', app[5])
print('Most common apps within the Navigation category:')
most_comm(ios_final, 'Navigation')
print('Most common apps within the Social Networking category:')
most_comm(ios_final, 'Social Networking')
As for the Reference category, this result is influenced by the Bible app and Dictionary.com.
print('Most common apps within the Reference category:')
most_comm(ios_final, 'Reference')
Nevertheless, we can explore this category further: for example, we might take a popular book and turn it into an app, adding features beyond the raw text of the book.
print('Most common apps within the Education category:')
most_comm(ios_final, 'Education')
For example, we can notice that the Education category is dominated by language-learning and training apps. So we could build an app at the intersection of three categories: Reference, Education and Book.
Other categories are of less interest for us:
- Weather apps: people do not spend a lot of time checking the forecast, so our chances of profiting from in-app ads are quite low; getting reliable weather data may also require connecting to non-free APIs.
- Food and Drink: dominated by huge businesses (Starbucks, Dunkin' Donuts, McDonald's, etc.), and we might need an actual cooking and delivery service.
- Finance apps: would require us to hire domain experts to put banking, payment systems, transfers, etc. in place.

On the Google Play market we have the number of installs, which we can use to find the most popular apps.
However, the values in this column are not precise:
print(display_table(android_final, 5)) #installs column
Information presented like this doesn't tell us whether an app was downloaded 1,000,000 times or 4,000,000 times.
However, we don't need meticulous precision, and we'll take the numbers as they are: 100,000+ will correspond to 100,000 downloads.
Thus, we will transform these entries into numbers.
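The conversion itself is just two string replacements followed by a numeric cast; a minimal sketch:

```python
raw = '100,000+'
# strip the '+' suffix and the thousands separators, then cast
cleaned = raw.replace('+', '').replace(',', '')
n_installs = int(cleaned)
print(n_installs)  # 100000
```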
freq_category_android = freq_table(android_final, 1)
sort_list = []
for category in freq_category_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    average_installs = total / len_category
    sort_list.append((average_installs, category))

sort_list = sorted(sort_list, reverse=True)
for el in sort_list:
    print(el[1], ' : ', el[0])
As we can see, the most-installed apps belong to the COMMUNICATION category. This category is dominated by WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts, each with a billion installs, while other big apps have over 100 or 500 million installs.
def highest_android_installs(dataset, category, ind_instal):
    for app in dataset:
        if app[1] == category and app[ind_instal] in ('1,000,000,000+', '500,000,000+', '100,000,000+'):
            print(app[0], ' : ', app[ind_instal])
highest_android_installs(android_final, 'COMMUNICATION', 5)
If we remove these giants from the category, we'll notice how much the average decreases.
under_100_m = []
for app in android_final:
    category = 'COMMUNICATION'
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = n_installs.replace(',', '')
    if category == app[1] and float(n_installs) < 100000000:
        under_100_m.append(float(n_installs))
print(sum(under_100_m) / len(under_100_m))

# difference between the Communication average with and without the giant apps
print(sort_list[0][0] - (sum(under_100_m) / len(under_100_m)))
The same pattern can be observed for the VIDEO_PLAYERS category, which is largely dominated by 9 apps.
highest_android_installs(android_final, 'VIDEO_PLAYERS', 5)
That's why these apps may create the impression that these categories are extremely popular and deserve our attention.
The BOOKS_AND_REFERENCE category seems popular as well, and it is not dominated by just a few apps.
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

highest_android_installs(android_final, 'BOOKS_AND_REFERENCE', 5)
The top of this niche seems to be occupied mainly by software for processing books and libraries, so this category may be very interesting for us.
Creating a similar app would be risky, but we could create an app based on a specific book and include different fun features.
In this project we analyzed data about App Store and Google Play mobile apps in order to advise our developer team on the future app.
We concluded that the best idea would be to take a popular modern book and create an app around it with fun features. This idea seems profitable for both markets.