In this project, our aim is to find out what types of apps attract more users. We're working as data analysts for a company that builds Android and iOS mobile apps that are free to download and install.
The main source of revenue consists of in-app ads. This means the more users who see and engage with the ads in our apps, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.
First, we need to open the two data sets and then explore them.
from csv import reader
# for App Store
opened_file1 = open('AppleStore.csv', encoding='utf8')  # app names contain non-ASCII characters
read_file1 = reader(opened_file1)
ios = list(read_file1)
ios_header = ios[0]
ios = ios[1:]
# for Google Play Store
opened_file2 = open('googleplaystore.csv', encoding='utf8')  # app names contain non-ASCII characters
read_file2 = reader(opened_file2)
android = list(read_file2)
android_header = android[0]
android = android[1:]
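The two loading blocks above repeat the same steps. As a minimal sketch, the pattern could be wrapped in one helper that also closes the file when it is done. The name load_csv is hypothetical, and encoding='utf8' is an assumption about how the files were saved.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # The with-block closes the file even if reading fails.
    # newline='' is what the csv module documentation recommends for open().
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With this helper, `ios_header, ios = load_csv('AppleStore.csv')` would stand in for the first block, and the same call with 'googleplaystore.csv' for the second.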
To explore the two data sets we created a function named explore_data() that we can repeatedly use to print rows in a readable way.
def explore_data(dataset, start, end, rows_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds an empty line after each row

    if rows_columns:
        print('number of rows:', len(dataset))
        print('number of columns:', len(dataset[0]))
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)
print(android_header)
print('\n')
explore_data(android, 0, 3, True)
The Google Play data set has a dedicated discussion section, and one of the discussions describes an error in the row with index 10472.
print(android[10472])
# another approach for finding the defective row
for row in android:
    if len(row) != len(android_header):
        print(row)
        print('\n')
        print(android.index(row))
        print('\n')
print(android_header)
print(len(android))
del android[10472]
print(len(android))
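The row-length check above can be turned into a reusable helper. This is only a sketch on toy data: the name find_malformed_rows and the sample rows are hypothetical and do not come from the real data set. Deleting from the end of the list keeps the remaining indices valid, which matters if more than one row is malformed.

```python
def find_malformed_rows(rows, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data (not from the real data set): the second row is missing a value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['A', 'TOOLS', '4.1'],
    ['B', '1.9'],
    ['C', 'GAME', '4.5'],
]

bad = find_malformed_rows(sample_rows, sample_header)
print(bad)

# Delete from the end so that earlier indices stay valid.
for i in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```

Unlike a bare `del android[10472]`, this version is safe to rerun, because it only deletes rows that still fail the length check.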
If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])
print('\n')
print('Number of unique apps:', len(unique_apps)) # this is the actual number of apps we should analyze. the apps we s
We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.
If you examine the rows we printed for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times.
The higher the number of reviews, the more recent the data should be, so we'll keep only the row with the highest number of reviews and remove the other entries for any given app.
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Actual length:', len(reviews_max))
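Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code below, android_clean stores the cleaned rows and already_added tracks the app names we have already kept, because some duplicate rows share the same highest number of reviews and would otherwise be added more than once.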
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
explore_data(android_clean, 0, 3, True)
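The two-step approach above (build reviews_max, then filter) can also be done in one pass by storing the best row per name instead of only its review count. The following is a sketch on toy data, not the method used in this project: the function name dedupe_by_max_reviews and the sample rows are hypothetical.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict > keeps the first row when several share the maximum.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows (hypothetical values), laid out as [App, Category, Rating, Reviews].
sample = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Instagram', 'SOCIAL', '4.5', '300'],
    ['Maps', 'TRAVEL', '4.3', '50'],
    ['Instagram', 'SOCIAL', '4.5', '300'],
]
clean = dedupe_by_max_reviews(sample)
print(clean)
```

One design difference worth noting: the result is ordered by the first appearance of each app name, whereas the list-based version orders rows by where the kept row sits in the data set. The dictionary lookup also avoids the repeated `name not in already_added` scan over a list.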
We use English for the apps we develop at our company. If we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience. We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).
Each character we use in a string has a corresponding number associated with it. The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. We can get the number for each character using the built-in ord() function.
The function below uses ord() to check each character of a string and returns False as soon as it finds one outside the ASCII range.
def detect(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
print(detect('Instagram'))
print(detect('爱奇艺PPS -《欢乐颂2》电视剧热播'))
The function seems to work fine, but some English app names use emoji or other symbols like ™ that fall outside the ASCII range. If we use the function in its current form, we'll remove useful apps.
print(detect('Docs To Go™ Free Office Suite'))
print(detect('Instachat 😜'))
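To see why these names are rejected, it can help to look at the numbers ord() returns for a few of the characters involved. The snippet below is a small illustration: the choice of characters is ours, taken from the app names used in the examples.

```python
# Map a few characters from the app names above to their code points.
characters = ['a', 'Z', '9', '+', '™', '😜', '爱']
code_points = {character: ord(character) for character in characters}

for character, code_point in code_points.items():
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(character), code_point, label)
```

The letters, digit, and plus sign all fall at or below 127, while ™ (8482), 😜 (128540), and 爱 (29233) are far above it, so a single such character is enough to make the first version of detect() return False.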
To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.
We will do this by editing the function we built above (detect()).
def detect(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    if non_ascii > 3:
        return False
    else:
        return True
print(detect('Docs To Go™ Free Office Suite'))
print(detect('Instachat 😜'))
print(detect('爱奇艺PPS -《欢乐颂2》电视剧热播'))
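The threshold of three is a judgment call, so it may be useful to make it a parameter. Here is a hedged sketch of an equivalent check: the name is_english and the max_non_ascii parameter are hypothetical and not part of the code above.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

# Setting the threshold to 0 reproduces the first, stricter version of detect().
print(is_english('Instachat 😜', max_non_ascii=0))
```

With the default threshold this should agree with detect() on every input, and changing the threshold makes it easy to check how sensitive the row counts are to that choice.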
android_english = []
ios_english = []
for app in android_clean:
    name = app[0]
    if detect(name):
        android_english.append(app)

for app in ios:
    name = app[1]
    if detect(name):
        ios_english.append(app)
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)
We can see that we're left with 9614 Android apps and 6183 iOS apps.
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.
android_free = [] # the final dataset
ios_free = [] # the final dataset
for app in android_english:
    price = app[7]
    if '$' in price:
        price = price[1:]
    if float(price) == 0:
        android_free.append(app)

for app in ios_english:
    price = float(app[4])
    if price == 0:
        ios_free.append(app)
print(len(android_free))
print(len(ios_free))
We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
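The price cleaning used to isolate the free apps can be sketched as a small helper that handles both price formats in one place. The names parse_price and is_free are hypothetical, and the sketch assumes prices appear either as plain numbers ('0', '4.99') or with a leading dollar sign ('$4.99').

```python
def parse_price(raw):
    """Convert a price string such as '0', '4.99', or '$4.99' to a float."""
    # strip() drops surrounding whitespace; lstrip('$') drops a leading dollar sign.
    return float(raw.strip().lstrip('$'))

def is_free(raw):
    """Return True when the parsed price is exactly zero."""
    return parse_price(raw) == 0

print(parse_price('$4.99'))
print(is_free('0'))
print(is_free('$0.99'))
```

Both loops could then share the same test, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.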
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.
To minimize risks, our validation strategy for an app idea is:
Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.
We'll start the analysis by asking what the most common genres are in each market. To do this, we'll build frequency tables for the prime_genre column of the App Store data set and the Genres and Category columns of the Google Play data set.
We'll build two functions we can use to analyze the frequency tables:
def freq_table(dataset, index):
    frequency = {}
    for row in dataset:
        value = row[index]
        if value in frequency:
            frequency[value] += 1
        else:
            frequency[value] = 1

    frequency_percentage = {}
    for value in frequency:
        percentage = (frequency[value] / len(dataset)) * 100
        frequency_percentage[value] = percentage

    return frequency_percentage
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
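The same frequency table can be built with collections.Counter from the standard library. This is an alternative sketch rather than the approach used in this project: the names freq_table_counter and top_values, and the toy rows, are hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

def top_values(dataset, index, n=None):
    """Return (value, percentage) pairs sorted from most to least common."""
    table = freq_table_counter(dataset, index)
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]  # n=None returns every pair

# Toy rows (hypothetical), laid out as [App, Category].
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for value, percentage in top_values(sample, 1):
    print(value, ':', percentage)
```

Returning the sorted pairs instead of printing them makes the table easier to reuse, for example to show only the top five categories with `top_values(sample, 1, n=5)`.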
We start by examining the Category and Genres columns of the Google Play data set.
# for play store
display_table(android_free, 1) # category
After analyzing the frequency table of the Category column in the Google Play data set, we see that most of the apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids.
# for play store
display_table(android_free, 9) # genres
The frequency table for the Genres column shows a similar picture to the Category column: most of the apps are designed for practical purposes. We also notice that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.
# for app store
display_table(ios_free, -5) # prime genres
We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.
The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for entertainment (games, photo and video, social networking, sports, music), while apps with practical purposes are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users.
Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.
Now we'd like to get an idea about the kinds of apps that have the most users. To do so, we need to calculate the average number of installs for each app genre.
For the Google Play data set, we will use the Installs column. Because the same information is missing for the App Store, we will use the rating_count_tot column as a proxy instead.
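To calculate the average number of user ratings per app genre, we will need to isolate the apps of each genre, sum up the user ratings for those apps, and divide the sum by the number of apps belonging to that genre.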
genre_frequency = freq_table(ios_free, -5)
for genre in genre_frequency:
    total = 0  # This variable will store the sum of user ratings specific to each genre.
    len_genre = 0  # This variable will store the number of apps specific to each genre.
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ':', avg_ratings)
Although games are the most common apps in the App Store, they do not have the highest average number of user ratings per app. The genres with the highest averages are Navigation (86090), Reference (74942), and Social Networking (71548).
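The per-genre averages are computed above with a nested loop that rescans the whole data set once per genre. As a sketch of an alternative, the same averages can be accumulated in a single pass. The name average_by_group and the toy rows are hypothetical, and the column positions are passed in rather than fixed.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the numeric column at value_idx}."""
    totals = defaultdict(float)  # group -> running sum of values
    counts = defaultdict(int)    # group -> number of rows seen
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (hypothetical values), laid out as [App, Ratings, Genre].
sample = [
    ['A', '100', 'Games'],
    ['B', '300', 'Games'],
    ['C', '50', 'Reference'],
]
averages = average_by_group(sample, group_idx=2, value_idx=1)
print(averages)
```

For the App Store data this would be called as `average_by_group(ios_free, -5, 5)`, using the same column positions as the loop above.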
Let's look deeper into Navigation, Reference and Social Networking categories.
for app in ios_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])  # print name and number of ratings
For the Navigation genre in the App Store, we see that Waze has by far the most ratings (and so, by our proxy, the most users), followed by Google Maps.
for app in ios_free:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])
For the Reference genre in the App Store, we see that the Bible and Dictionary apps skew the average number of ratings upward.
for app in ios_free:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])
The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.
As we mentioned before, the majority of the apps in the App Store are for fun rather than practical purposes. It could be promising to create a Reference app and add social features to it, producing a hybrid Reference and Social Networking app.
One thing we could do is take another popular book, turn it into an app, and add a community feature that lets users discuss the book with each other and share their fan art.
For the Google Play data set, we have data about the number of installs in the Installs column. The install numbers don't seem precise enough: we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). However, we don't need perfect precision with respect to the number of users, so we'll consider that an app with 100,000+ installs has 100,000 installs, an app with 1,000,000+ installs has 1,000,000 installs, and so on.
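The convention described above, where each open-ended value is read as its lower bound, can be sketched as a one-line helper. The name parse_installs is hypothetical, and the sketch returns an int rather than the float used elsewhere in this project.

```python
def parse_installs(raw):
    """Turn an open-ended install string such as '1,000,000+' into an int."""
    # Remove the thousands separators and the trailing plus sign, then convert.
    return int(raw.replace(',', '').replace('+', ''))

for raw in ['100+', '1,000+', '100,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Keeping the conversion in one function means the same cleaning does not have to be repeated in every loop that reads the install counts.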
display_table(android_free, 5)
categories_frequency = freq_table(android_free, 1)
for category in categories_frequency:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
At first glance, the categories with the highest average number of installs are COMMUNICATION with 38456119, VIDEO_PLAYERS with 24727872, SOCIAL with 23253652, PHOTOGRAPHY, GAME, TRAVEL_AND_LOCAL, TOOLS, NEWS_AND_MAGAZINES and BOOKS_AND_REFERENCE. Let's dive deeper into those categories.
Let's build a function that displays the name and number of installs of each app in a given category.
def app_installs(category):
    for app in android_free:
        category_app = app[1]
        app_name = app[0]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            n_installs = float(n_installs)
            print(app_name, ':', app[5])
app_installs('COMMUNICATION')
That is a long list, so we'll narrow it down to apps with between 100,000,000 and 1,000,000,000 installs.
To do this, we'll create a new function to use as a filtering tool.
def installs_filter(category):
    for app in android_free:
        app_name = app[0]
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            n_installs = float(n_installs)
            if 100000000 <= n_installs <= 1000000000:
                print(app_name, ':', app[5])
installs_filter('COMMUNICATION')
From the previous output, we conclude that the COMMUNICATION category is dominated by apps from big-name companies like Facebook, Google, and Skype, which will be hard to compete against.
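The filter used here hard-codes its bounds and reads the global android_free list. A hedged variant that takes the data set and the bounds as parameters, and returns the matches instead of printing them, might look like the sketch below. The function name apps_in_install_range, its parameters, and the toy rows are all hypothetical.

```python
def apps_in_install_range(dataset, category, low, high,
                          name_idx=0, category_idx=1, installs_idx=5):
    """Return (name, installs) pairs for apps in `category` within [low, high]."""
    matches = []
    for app in dataset:
        if app[category_idx] != category:
            continue
        # Read an open-ended value such as '1,000,000+' as its lower bound.
        n_installs = int(app[installs_idx].replace(',', '').replace('+', ''))
        if low <= n_installs <= high:
            matches.append((app[name_idx], n_installs))
    return matches

# Toy rows (hypothetical), laid out as [App, Category, Rating, Reviews, Size, Installs].
sample = [
    ['Chat A', 'COMMUNICATION', '4.0', '10', '1M', '1,000,000,000+'],
    ['Chat B', 'COMMUNICATION', '4.0', '10', '1M', '500,000+'],
    ['Book C', 'BOOKS_AND_REFERENCE', '4.0', '10', '1M', '100,000,000+'],
]
big_apps = apps_in_install_range(sample, 'COMMUNICATION', 100_000_000, 1_000_000_000)
print(big_apps)
```

Because the bounds are arguments, the same function could also be used to look at mid-sized apps in a category without writing a second filter.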
installs_filter('VIDEO_PLAYERS')
installs_filter('SOCIAL')
installs_filter('PHOTOGRAPHY')
installs_filter('PRODUCTIVITY')
We see the same pattern for the video players category, which is the runner-up with an average of 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
installs_filter('GAME')
For the GAME category, the market already looks saturated, so developing an app in this genre is probably not the best move.
The other categories, such as TRAVEL_AND_LOCAL, TOOLS, and NEWS_AND_MAGAZINES, don't seem interesting enough, except for BOOKS_AND_REFERENCE, which has an average of 8767811 installs.
Let's take a look at the BOOKS_AND_REFERENCE category.
app_installs('BOOKS_AND_REFERENCE')
installs_filter('BOOKS_AND_REFERENCE')
This category seems promising to work with: the competition between apps is less intense, and it is not dominated by giant companies in the way the other categories are. However, this does not mean that there are no apps with a high number of installs.
We can also build on the idea we had for the App Store and develop an app around a popular book, adding some features that enrich its content, since our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.
We concluded that taking a popular book (perhaps a more recent one) and turning it into an app could be profitable on both Google Play and the App Store, provided that we add some special features besides the raw version of the book. These might include a community feature that lets users discuss the book with each other and share their fan art.