The aim of this project is to provide criteria for identifying the types of English-language apps that could attract more users. Since our company builds English Android and iOS mobile apps, I will focus on English apps available on Google Play and the App Store. In addition, since our company builds only free-to-download apps and our main source of revenue consists of in-app ads, I am interested in spotting the apps that attract the highest number of users, i.e. the highest number of people engaging with the in-app ads.
I started by opening and exploring the two datasets that will be the subject of my investigation. The first dataset contains more than 7,000 iOS apps from the App Store (data collected in July 2017). This dataset is called 'AppleStore.csv':
from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
applestore_apps_data = list(read_file)
The second dataset contains around 10,000 Android apps from Google Play (data collected in August 2018). This dataset is called 'googleplaystore.csv':
from csv import reader

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
googleplay_apps_data = list(read_file)
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
I now look in more detail at the type of data contained in each dataset, and calculate both the total number of apps (i.e. rows) and the parameters associated with each app (i.e. columns).
#Exploring Applestore
explore_data(applestore_apps_data, 0, 4, True)
#Exploring GooglePlay
explore_data(googleplay_apps_data, 0, 4, True)
Before performing my analysis, I need to remove duplicated data and correct errors.
I first check whether some of the rows have missing information, by comparing the length of each dataset's header with the length of every row in that dataset. I notice that in the Google Play dataset one row is shorter than the header. By looking at the original source file I discover that the "Category" value is missing. The row containing the mistake is row 10473.
#Identifying rows with missing information on Apple Store
for row in applestore_apps_data:
    if len(row) != len(applestore_apps_data[0]):
        print(applestore_apps_data[0])
        print(len(applestore_apps_data[0]))
        print(row)
        print(len(row))
#Identifying rows with missing information on Google Play
for row in googleplay_apps_data:
    if len(row) != len(googleplay_apps_data[0]):
        print(googleplay_apps_data[0])  # the dataset header
        print(len(googleplay_apps_data[0]))  # length of the header (i.e. a normal row)
        print(row)  # the row containing the mistake
        print(len(row))  # length of the shorter row
        print(googleplay_apps_data.index(row))  # index of the row containing the mistake
I therefore delete the row with the missing information.
del(googleplay_apps_data[10473])
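One pitfall with this `del` statement: re-running it in a notebook would silently remove a second, perfectly valid row. A guard based on the row-length check from the previous step avoids this. Below is a minimal sketch on made-up rows (the app names are illustrative, not from the dataset):

```python
rows = [
    ['App', 'Category', 'Rating'],   # header
    ['GoodApp', 'GAME', '4.1'],
    ['BadApp', '4.0'],               # malformed row: one field missing
    ['OtherApp', 'TOOLS', '3.9'],
]

# Only delete while the malformed row is still present; re-running this
# cell is then harmless, because the new rows[2] has the full length.
if len(rows[2]) != len(rows[0]):
    del rows[2]

print(len(rows))  # 3
```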
Another possible source of mistakes is the presence of duplicated entries (i.e. some apps appear multiple times). I check whether this is the case in both datasets using code that identifies and displays duplicate entries:
duplicated_apps = []
unique_apps = []
for app in applestore_apps_data[1:]:  # skip the header row
    name = app[1]
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)
print(duplicated_apps[:5])
print('number of unique apps_apple: ', len(unique_apps))
print('number of duplicated apps_apple: ', len(duplicated_apps))
duplicated_apps = []
unique_apps = []
for app in googleplay_apps_data[1:]:  # skip the header row
    name = app[0]
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)
print(duplicated_apps[:5])
print('number of unique apps_google: ', len(unique_apps))
print('number of duplicated apps_google: ', len(duplicated_apps))
I observe multiple duplicated apps, especially in Google Play, so I will clean this dataset first. I need to establish a criterion for selecting which copy of each duplicated app to keep. I notice that the only difference between the duplicated versions of the same app is the total number of reviews, possibly because the information was retrieved at different moments during the same day. Indeed, the numbers, albeit different, are all quite similar. I therefore decide to keep only the duplicate with the highest number of reviews. To do so, I first build a dictionary mapping each unique Google Play app to its maximum number of reviews.
reviews_max = {}
for app in googleplay_apps_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Google Play reviews max length = ', len(reviews_max)) #number of unique elements in the Google Play database.
Now I use this dictionary to build a new, clean dataset (called unique_googleplay_apps_data) that contains each app only once. I loop through the Google Play dataset and add an app to the clean dataset only if its number of reviews matches the maximum recorded in "reviews_max" and its name has not already been added (this second condition guards against apps whose maximum review count appears in more than one duplicate). The resulting list contains 9,659 entries, plus the header, as expected.
unique_googleplay_apps_data = []
duplicated_googleplay_apps_data = []
unique_googleplay_apps_data.append(googleplay_apps_data[0])
for app in googleplay_apps_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in duplicated_googleplay_apps_data):
        unique_googleplay_apps_data.append(app)
        duplicated_googleplay_apps_data.append(name)  # inside the if block, so each name is added only once
print("Unique Google Play list elements = ", len(unique_googleplay_apps_data))
print(unique_googleplay_apps_data[:3]) #first three entries in the cleaned Google Play dataset.
To clean the Apple Store dataset, I know from a previous analysis that the names of the repeated apps are the following: 'Mannequin Challenge' and 'VR Roller Coaster'. I therefore look for the entries with these names and compare them manually.
for app in applestore_apps_data:
    name = app[1]
    if name == 'Mannequin Challenge':
        print(app)
print("Apple Store header, for comparing the columns above = ", applestore_apps_data[0])
for app in applestore_apps_data:
    name = app[1]
    if name == 'VR Roller Coaster':
        print(app)
print("Apple Store header, for comparing the columns above = ", applestore_apps_data[0])
From this analysis it is not clear whether the Apple Store dataset contains duplicated entries or not. However, according to this thread (https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409), the apps appear to be all genuine. I therefore do not delete any of them.
For this project, I am only interested in apps directed towards an English-speaking audience. I therefore have to remove all apps whose names are not written in English. As a rule of thumb, all apps whose names contain non-English characters are excluded. English characters correspond to ASCII codes 0-127. The function below helps me distinguish between English and non-English apps: it returns "True" when an app name is plausibly English, and "False" when it is not. As the last two examples below make clear, however, this system is not optimal, since special characters are sometimes used in English names as well, for example emoji or trademark symbols.
def name_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
print(name_english('Instagram'))
print(name_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(name_english('Docs To Go™ Free Office Suite'))
print(name_english('Instachat 😜'))
Here I modify the previous function to allow up to three non-ASCII characters in an app name before excluding it from the English apps dataset:
def name_english1(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:  # allow up to three non-ASCII characters in the app name
        return False
    return True
print(name_english1('Instagram'))
print(name_english1('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(name_english1('Docs To Go™ Free Office Suite'))
print(name_english1('Instachat 😜'))
I apply this function to trim all non-English apps out of the two datasets.
English_googleplay_apps_data = []
English_googleplay_apps_data.append(unique_googleplay_apps_data[0])
for app in unique_googleplay_apps_data[1:]:
    name = app[0]
    if name_english1(name):
        English_googleplay_apps_data.append(app)
print('Number of English apps in Google Play = ', len(English_googleplay_apps_data))
print(English_googleplay_apps_data[:4]) #I show the first four rows of the dataset.
English_applestore_apps_data = []
English_applestore_apps_data.append(applestore_apps_data[0])
for app in applestore_apps_data[1:]:
    name = app[1]
    if name_english1(name):
        English_applestore_apps_data.append(app)
print('Number of English apps in Apple Store = ', len(English_applestore_apps_data))
print(English_applestore_apps_data[:4]) #I show the first four rows of the dataset.
As I am only interested in free English apps for my analysis, I use the same procedure to isolate them from both the Apple Store and Google Play datasets.
free_eng_googleplay_apps = []
free_eng_googleplay_apps.append(English_googleplay_apps_data[0])
for app in English_googleplay_apps_data[1:]:
    price = app[7]
    if price == '0':
        free_eng_googleplay_apps.append(app)
print(free_eng_googleplay_apps[:3])
print('Number of free English apps in Google Play = ', len(free_eng_googleplay_apps[1:]))
free_eng_applestore_apps = []
free_eng_applestore_apps.append(English_applestore_apps_data[0])
for app in English_applestore_apps_data[1:]:
    price = app[4]
    if price == '0.0':
        free_eng_applestore_apps.append(app)
print(free_eng_applestore_apps[:3])
print('Number of free English apps in Apple Store = ', len(free_eng_applestore_apps[1:]))
At this point I have finished cleaning my data. There are 8,864 apps left for Google Play and 3,222 for the Apple Store. I consider these sufficiently big samples to proceed with the analysis.
I now start analysing the two cleaned app datasets. The company I work for develops apps for Google Play first; if an app gets a good response from users, it is developed further and launched on the Apple Store six months later. I am therefore interested in finding apps that are profitable on both Google Play and the Apple Store. I will start by having a closer look at the data and the types of apps present in the two datasets. From the headers of both datasets, I can see that the apps have an associated "Category" (in Google Play) or "prime_genre" (in the Apple Store).
print('Google Play table header = ', free_eng_googleplay_apps[0])
print('Apple Store table header = ', free_eng_applestore_apps[0])
Here, I classify all apps according to their type. To do so, I create a function that returns the frequency table of app types for both markets.
def freq_table(dataset, index):
    freq_dictionary = {}
    for row in dataset:
        value = row[index]
        if value in freq_dictionary:
            freq_dictionary[value] += 1
        else:
            freq_dictionary[value] = 1
    perc_freq = {}
    for element in freq_dictionary:
        perc_freq[element] = (freq_dictionary[element] / len(dataset)) * 100
    return perc_freq
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
Below, I use the code generated above to analyse the Category and Genres columns of Google Play and the prime_genre column of the Apple Store.
print("Percentage of each Category in Google Play")
display_table(free_eng_googleplay_apps[1:], 1)
print("Percentage of each Genres in Google Play")
display_table(free_eng_googleplay_apps[1:], 9)
print("Percentage of each Category in Apple Store")
display_table(free_eng_applestore_apps[1:], 11)
At first sight, it is important to notice that the two datasets are not easy to compare, as they have been classified in different ways, with different numbers of categories and different names. In the Google Play market, the Family category is the most common (18%), followed by Games (9.7%) and Tools (8.5%). Among Genres, Tools and Entertainment are the most common, although they represent only 8.4% and 6.1% of the whole market, respectively. In the Apple Store market, the Games genre is clearly the most common, representing 58% of the whole market, followed by Entertainment (a very similar category), which represents only 7.9%. It is also noteworthy that in Google Play 10 categories and 87 Genres each represent less than 1% of the market, while in the Apple Store 8 prime_genres represent less than 1% of the market.
From this preliminary analysis, it appears that the Apple Store contains a majority of entertainment-related apps, while Google Play is more heterogeneous. However, this is not sufficient to recommend an app category as the most promising to develop. To reach this goal, I will now also take into account the number of users per app category.
The number of users per app is a known parameter for the Google Play market (the "Installs" column) but not for the Apple Store. For this market, the only available information is the total number of ratings, which can be considered a proxy for the total number of users. I will start by calculating the average number of ratings per prime_genre in the Apple Store.
genres_ios = freq_table(free_eng_applestore_apps[1:], -5)
print("Apple Store app categories and their associated average number of ratings:")
for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in free_eng_applestore_apps[1:]:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
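A side note on the loop above: it rescans the entire dataset once per genre, which is fine at this scale but wasteful on larger data. The same per-genre averages can be accumulated in a single pass. A minimal sketch on made-up rows (the genre names and rating counts below are illustrative, not taken from the dataset):

```python
from collections import defaultdict

# Each toy row: (genre, number_of_ratings); values are illustrative only.
toy_rows = [
    ('Navigation', 100.0),
    ('Navigation', 300.0),
    ('Games', 50.0),
]

totals = defaultdict(float)  # sum of ratings per genre
counts = defaultdict(int)    # number of apps per genre

for genre, n_ratings in toy_rows:  # one pass over the data
    totals[genre] += n_ratings
    counts[genre] += 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}
print(averages)  # {'Navigation': 200.0, 'Games': 50.0}
```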
For the Apple Store, the categories with the highest average number of user ratings are Navigation (average: 86,090 ratings) and Reference (average: 74,942 ratings). Here I inspect the names of these apps:
print("Apple Store Navigation apps and their associated average number of ratings:")
for app in free_eng_applestore_apps:
genre = app[-5]
if genre == "Navigation":
print(app[1], app[5])
print("Apple Store Reference apps and their associated average number of ratings:")
for app in free_eng_applestore_apps:
genre = app[-5]
if genre == "Reference":
print(app[1], app[5])
In comparison, entertainment-type apps have a lower number of ratings. For example, the Games category is in 14th position out of 23, with a mean of 22,788 ratings, which is less than 30% of the Navigation average. Here again, it is interesting to see the total number of apps that belong to the Games category:
print("Appe Store Game apps and their associated average number of ratings:")
Game_apps = []
for app in free_eng_applestore_apps:
genre = app[-5]
if genre == "Games":
Game_apps.append(app)
print(app[1], app[5])
print("Total Apple Store game apps = ", len(Game_apps))
Indeed, not only is the mean number of ratings not the highest for the Games category, but the market appears to be saturated, with nearly 2,000 apps of this type, obviously of varying quality and interest for users, as one can see by looking at the ratings per app.
Based on the Apple Store data analyzed above, game apps are overrepresented on the store, so the visibility of any single app is expected to be low. However, the two top categories (Navigation and Reference) seem to be very successful, judging by their mean number of ratings. These categories also contain fewer apps, meaning that there should still be room for new, successful ones. For example, specific reference-type apps (e.g. dictionaries targeting less common languages, like Romanian or Greek) could attract a fair number of users in specific countries and quickly gain visibility. Niche dictionaries, such as dictionaries of business or tourism terms, could also be profitable.
Google Play provides data about the number of app users in the following way: each app is assigned to a class that reflects its total number of installs. The classes are not exact numbers but open-ended ranges (e.g. '100,000+' or '1,000,000+'). For the purpose of this analysis, I treat the actual number of users per app as equal to the lower bound of its class, even though this is just an approximation.
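Converting such a class label into a number just means stripping the commas and the trailing plus sign. A minimal sketch (the helper name installs_to_float and the example labels are illustrative):

```python
def installs_to_float(installs):
    """Turn a Google Play install class like '1,000,000+' into its lower bound."""
    return float(installs.replace(',', '').replace('+', ''))

print(installs_to_float('1,000,000+'))  # 1000000.0
print(installs_to_float('500+'))        # 500.0
```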
print("Distribution of Google Play app categories and associated average users:")
genres_google = freq_table(free_eng_googleplay_apps[1:], 1)
for category in genres_google:
total = 0
len_category = 0
for app in free_eng_googleplay_apps[1:]:
category_app = app[1]
if category_app == category:
n_ratings = float(app[3])
total += n_ratings
len_category += 1
avg_n_ratings = total / len_category
print(category, ':', avg_n_ratings)
The distribution observed in Google Play is very different from that of the Apple Store. The categories with the highest average number of users here are Communication, Social and Game. Since I did not have to use the number of ratings as a proxy for the number of users, these data are also more reliable than those for the Apple Store. It is therefore interesting to analyse the top three categories of this list in more detail.
communication_apps = []
print("List of Communication apps on Google Play:")
for app in free_eng_googleplay_apps[1:]:
    category = app[1]
    if category == "COMMUNICATION":
        print(app[0], app[5])  # app name and number of installs
        communication_apps.append(app)
print("Number of Communication apps on Google Play =", len(communication_apps))
socialnetwork_apps = []
print("List of Social Network apps on Google Play:")
for app in free_eng_googleplay_apps[1:]:
    category = app[1]
    if category == "SOCIAL":
        print(app[0], app[5])  # app name and number of installs
        socialnetwork_apps.append(app)
print("Number of Social Network apps on Google Play =", len(socialnetwork_apps))
games_apps = []
print("List of Games apps on Google Play:")
for app in free_eng_googleplay_apps[1:]:
    category = app[1]
    if category == "GAME":
        print(app[0], app[5])  # app name and number of installs
        games_apps.append(app)
print("Number of Games apps on Google Play =", len(games_apps))
It appears that all three top categories on Google Play (especially Games) are saturated, with more than 200 apps each. In addition, with so many apps present, the mean number of users is skewed: some apps are almost unused, while a few others (for example Facebook) have millions of users and substantially inflate the overall average. For the Apple Store market, I suggested focusing on the Reference category, by building more advanced and interactive dictionaries or targeting less common language pairs. To test whether this would also be a good idea on the basis of the Google Play dataset, I check the number of apps and their users in the "Books and Reference" category:
reference_apps = []
print("List of Books and Reference apps on Google Play:")
for app in free_eng_googleplay_apps[1:]:
    category = app[1]
    if category == "BOOKS_AND_REFERENCE":
        print(app[0], app[5])  # app name and number of installs
        reference_apps.append(app)
print("Number of Books and Reference apps on Google Play =", len(reference_apps))
This is also a very populated category, but the result is skewed by the high number of book apps. True reference apps are quite few; indeed, the word "dictionary" is rarely present in the app names:
dictionary_apps = []
print("List of dictionary apps on Google Play:")
for app in free_eng_googleplay_apps[1:]:
    name = app[0]
    if "dict" in name.lower():  # matches 'dict', 'Dict', 'DICT', etc.
        print(app[0], app[5])  # app name and number of installs
        dictionary_apps.append(app)
print("Number of dictionary apps on Google Play =", len(dictionary_apps))
The list containing only dictionary apps is much smaller, with only 35 apps in total. Around one third of them appear to be quite popular, with ten apps having more than 100,000 users. It is noteworthy that only a few language pairs are present; for example, Spanish, Portuguese, Italian and even Chinese (very common languages) are missing.
There seems to be high potential in expanding this app category with more dictionaries targeting more languages for English-speaking people. The category is still poorly developed, with several subniches that could be rapidly and easily exploited. One third of the apps currently present are highly successful, which suggests that other good-quality apps could attract the attention of potential users. Building dictionary apps does not require the specific knowledge needed for other kinds of apps, such as gaming or business administration, and should therefore be feasible for our company and ensure profit.
In this project I analyzed the apps present in both the Apple Store and Google Play markets (data collected in July 2017 and August 2018, respectively). The limitations of this analysis come from the fact that the two datasets were not homogeneous. Two main differences were present: i) the app classification was done differently in the two databases, and ii) the Apple Store dataset did not contain information about the number of users per app, for which I had to use the number of app ratings as a proxy.
Nevertheless, I identified an underdeveloped niche (dictionaries) that has the potential to attract a high number of users and can easily be developed further, with low effort and no specific knowledge required. More specifically, I suggest developing dictionaries for new language pairs (English-Spanish, English-Greek, English-Italian, etc.) or subniches (business English, small dictionaries for tourism-related situations).