Collecting data from different app stores provides an avenue for app-building businesses to make data-driven decisions. Success ultimately comes from the ability of developers to draw insights that help them capture target consumers in both the Android and iOS markets.
For this project, we will assume that we are working as data analysts for a company that builds mobile Android and iOS apps. The company only builds apps that are free to download and install, and revenue comes mainly from in-app ads. Thus, the more users who see and engage with these ads, the better the revenue stream. Our goal is to analyze data to help our developers understand what types of apps are likely to attract more users.
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
You can also find more information about this report at Statista: Number of apps available in leading app stores
Collecting information for over 4 million Android and iOS apps would require a significant amount of time, effort, and money. To avoid spending resources on collecting new data ourselves, we will analyze a sample of the data instead. Luckily, there are two datasets that can help us analyze some relevant existing data at no cost:
During the course of this project, we will need to read files and create visualizations. For these purposes, we will import the reader (for reading files), then plotly's make_subplots and graph_objects (for creating visualizations):
from csv import reader
from plotly.subplots import make_subplots
import plotly.graph_objects as go
We will start by opening and exploring the two datasets we have selected.
To make things easier, we will create a function named extract_data(). This function takes a file path as its argument, reads the file, then returns the header row and the remaining rows as a list of lists.
def extract_data(filename):
    """Reads the file at the given path, then returns its header row and data rows"""
    opened_file = open(filename)
    read_file = reader(opened_file)
    result = list(read_file)
    return result[0], result[1:]
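A quick way to sanity-check this helper is with a throwaway CSV file. The sketch below restates the function (with a context manager added so the file handle is closed) so the snippet runs on its own; the file contents are made up for illustration:

```python
import csv
import os
import tempfile

def extract_data(filename):
    """Reads a CSV file, then returns its header row and data rows."""
    with open(filename) as opened_file:
        result = list(csv.reader(opened_file))
    return result[0], result[1:]

# Build a tiny throwaway CSV to exercise the helper
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w', newline='') as f:
    csv.writer(f).writerows([['App', 'Installs'],
                             ['AppOne', '1000'],
                             ['AppTwo', '500']])

header, data = extract_data(path)
print(header)     # ['App', 'Installs']
print(len(data))  # 2
os.remove(path)
```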
For easy exploration, we will also create another function called explore_data(). This function will be used to display dataset rows in a readable format.
def explore_data(dataset, start, end, rows_and_columns=False):
    """
    Displays dataset rows in a readable format

    Parameters:
        dataset (list): a list of lists
        start (int): start index for the dataset slice
        end (int): end index for the dataset slice
        rows_and_columns (bool): specifies whether to print the number of rows and columns

    Output:
        prints the sliced dataset rows
        prints the number of dataset rows and columns if rows_and_columns is True
    """
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
Let's open and explore the Google Play Store dataset using the functions we created.
# reading and extracting the dataset header and data
google_header, google = extract_data('googleplaystore.csv')
# exploring the dataset
print(google_header, '\n')
explore_data(google,0,3,True)
We see that the Google Play Store dataset contains 10,841 rows and 13 columns. At first glance, we can see that the following columns will be useful to consider in our analysis: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.
Now, we will repeat the same process to understand the iOS Store dataset.
# reading and extracting the dataset header and data
apple_header, apple = extract_data('AppleStore.csv')
# exploring the dataset
print(apple_header, '\n')
explore_data(apple,0,3,True)
This iOS dataset comprises 7,197 rows and 16 columns. The following columns will aid our analysis: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.
Some of these column names are not as self-explanatory as the Google Play Store columns. However, you can find sufficient information about each individual column here
Before beginning our analysis, we need to make sure the data we analyze is accurate and relevant to our objectives. To avoid erroneous conclusions, it is advisable to carry out the following steps:
Ensure accuracy
Ensure relevance
Recall that our company only builds apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to remove the non-English apps and the apps that aren't free from both datasets.
Let's begin by detecting and deleting wrong or inaccurate data. To do this, we will define a function called detect_error(). This function takes a dataset and its header as arguments, then prints any row with incomplete data alongside its index.
def detect_error(dataset, header):
    """Prints any row in the dataset that does not match the header row in length, along with its index"""
    for row in dataset:
        if len(row) != len(header):
            print(header, '\n')
            print(row, '\n')
            print('The index of the erroneous row is: {}'.format(dataset.index(row)))
Next, we'll check the Google Play Store dataset for inaccurate entries:
detect_error(google,google_header)
The erroneous row 10472 has one less column than the header row, which signifies that the entry for one column is missing from this record. We can also see that the Category value for this row is '1.9', which can never be a valid category for a Google Play Store app. The dedicated discussion section for this dataset has also outlined this problem. To ensure the accuracy of our analysis, we will delete this row using Python's built-in del statement.
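del removes a list element in place by its index; a minimal illustration on a throwaway list:

```python
toy = ['a', 'b', 'c', 'd']
del toy[2]   # removes 'c', shifting later items left
print(toy)   # ['a', 'b', 'd']
```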
print(len(google)) # check the number of records before deleting
del google[10472] # we need to be careful to run this only once
print(len(google)) # this should be one less than the count before deleting
Next, we'll check the iOS Store dataset for inaccurate entries:
detect_error(apple,apple_header)
The function returns no result because all the column entries are complete in the iOS dataset.
To ensure that there are no duplicates in our dataset, we will define the function count_duplicates(), which takes a dataset as an argument and prints the number of duplicate records. Another function, print_duplicates(), allows us to see all the entries for a particular duplicate app. Let's define these functions:
def count_duplicates(dataset):
    """Prints the total number of duplicate records in a given dataset"""
    unique = []
    duplicates = {}
    for row in dataset:
        app = row[0]
        if app in unique:
            if app in duplicates:
                duplicates[app] += 1
            else:
                duplicates[app] = 1
        else:
            unique.append(app)

    result = sum(duplicates.values())  # collates the total number of duplicate occurrences
    if result == 0:
        print('There are no duplicates here')
    else:
        print(result)
def print_duplicates(dataset, item, index=0):
    """
    Parameters:
        dataset (list): The dataset of interest.
        item (str): The entry for which duplicates are printed.
        index (int): location of item in a dataset row.

    Output:
        prints all duplicates of the given item in the dataset
    """
    for row in dataset:
        if row[index] == item:
            print(row)
We will now check for duplicates in the Google App store dataset:
count_duplicates(google)
In total, there are 1,181 cases where an app occurs more than once. We can view these duplicates using print_duplicates(). In this case, we will examine the duplicate entries for 'WeChat' and 'Instagram' below:
print(google_header)
print('\n')
print_duplicates(google, 'WeChat')
print('\n')
print_duplicates(google, 'Instagram')
Now let's check the iOS dataset for duplicates:
count_duplicates(apple)
Although we have a lot of duplicate records in the Google Play Store dataset, there are no duplicates in the iOS dataset. We do not want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we can probably find a better way.
If we examine the rows we printed for the WeChat and Instagram apps, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times, with the most recent collection having the highest review count.
When we examined the Google Play dataset, we found 1,181 duplicates. Since the length of this dataset is 10,840 rows, we should be left with 10,840 - 1,181 = 9,659 rows after we remove the duplicates:
print('Dataset length without duplicates:', len(google) - 1181)
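The same arithmetic can be cross-checked with collections.Counter. The sketch below uses a made-up toy dataset for illustration:

```python
from collections import Counter

# Toy dataset: 'B' appears twice, so exactly one duplicate row must go
toy_rows = [['A', '10'], ['B', '5'], ['B', '7'], ['C', '1']]
name_counts = Counter(row[0] for row in toy_rows)

# Number of rows to drop = occurrences beyond the first for each name
n_duplicates = sum(count - 1 for count in name_counts.values())
expected_length = len(toy_rows) - n_duplicates
print(expected_length)  # 3
```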
To remove the duplicates, we will first build a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app. Let's create the dictionary:
unique_apps = {}  # dictionary that will hold each unique app name and its highest review count
for row in google:
    app_name = row[0]
    reviews = int(row[3])
    if app_name not in unique_apps:
        unique_apps[app_name] = reviews
    elif unique_apps[app_name] < reviews:
        unique_apps[app_name] = reviews

# Check the length of the unique_apps dictionary
print('The length of the unique apps dictionary is: ', len(unique_apps))
Note that the length of the dictionary is exactly 9,659, which is what we estimated the Google Play Store dataset would contain if there were no duplicates.
Now, we'll use the information stored in the unique_apps dictionary to create a new dataset that does not have duplicate entries. To do this, we will start with two empty lists: google_clean (which will store our new cleaned dataset) and already_added (which will keep track of the app names added to google_clean). We will then loop through the Google Play data, isolating each row's app name (app_name) and review count (n_reviews). Finally, we will add the current row to google_clean, and the app name to already_added, if the review count matches the highest count recorded in the unique_apps dictionary and the app name is not already in already_added. We need the second condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same for each). If we only checked for unique_apps[app_name] == n_reviews, we would still end up with duplicate entries for some apps.
google_clean = []
already_added = []
for row in google:
    app_name = row[0]
    n_reviews = int(row[3])
    if (unique_apps[app_name] == n_reviews) and (app_name not in already_added):
        google_clean.append(row)
        already_added.append(app_name)
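The same two-pass idea can be seen end to end on a toy dataset (a standalone sketch; the rows are made up), keeping the row with the highest review count per app:

```python
# Toy rows of [name, reviews]; 'B' is duplicated with different review counts
toy = [['A', '10'], ['B', '5'], ['B', '9'], ['C', '3']]

# Pass 1: record the highest review count per app name
highest = {}
for name, reviews in toy:
    n = int(reviews)
    if name not in highest or highest[name] < n:
        highest[name] = n

# Pass 2: keep one row per app, matching the recorded maximum
clean, added = [], []
for row in toy:
    name, n = row[0], int(row[1])
    if highest[name] == n and name not in added:
        clean.append(row)
        added.append(name)

print(clean)  # [['A', '10'], ['B', '9'], ['C', '3']]
```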
Now let's check the length of the google_clean list to be sure it contains exactly 9,659 data rows:
print('The length of google_clean is: ', len(google_clean))
Since we only intend to develop English apps, we would like to analyze only the apps designed for an English-speaking audience. Therefore, we'll check both datasets for app names that suggest they are not designed for an English-speaking audience. If we find any such entries, we will remove them.
Our criterion will be screening for app names that contain characters outside the ASCII range. To minimize data loss, we'll only remove an app if its name has more than three characters that fall outside the ASCII range. We will build a function called is_english() to help us perform this screening exercise:
def is_english(string):
    """Returns False if the string has more than three non-ASCII characters"""
    counter = 0
    for char in string:
        if ord(char) > 127:  # the ASCII range spans 0 - 127
            counter += 1
            if counter > 3:  # allow up to 3 non-ASCII characters
                return False
    return True
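A few quick checks of the screening logic, restating the function so the snippet runs on its own (the sample names are chosen for illustration):

```python
def is_english(string):
    """Returns False if the string has more than three non-ASCII characters."""
    counter = 0
    for char in string:
        if ord(char) > 127:
            counter += 1
            if counter > 3:
                return False
    return True

print(is_english('Instagram'))                      # True (all ASCII)
print(is_english('Docs To Go™ Free Office Suite'))  # True (one non-ASCII character)
print(is_english('Instachat 😜'))                    # True (one emoji)
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # False (many non-ASCII characters)
```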
This means all English apps with up to three emoji or other special characters will still be labeled as English. Our criterion is not perfect, but it should be fairly effective for identifying the apps we want. We will build another function called extract_english() that goes through a dataset and returns only the English apps.
def extract_english(dataset, index=0):
    """Returns only the rows whose app name (at the given index) is English"""
    result = []
    for row in dataset:
        app_name = row[index]
        if is_english(app_name):
            result.append(row)
    return result
Let's extract the English apps from the google_clean and apple data into variables named google_eng and apple_eng, respectively. We will also use explore_data() to preview our results.
google_eng = extract_english(google_clean)
apple_eng = extract_english(apple, 1)
explore_data(google_eng,0,3,True)
explore_data(apple_eng,0,3,True)
After removing the non-English apps, we are left with 9,614 rows of Google Play Store data and 6,183 rows of iOS data.
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.
First, we will define a function called extract_free() that extracts the free apps, taking a dataset and the column index for price as arguments. This function also prints the number of extracted rows.
def extract_free(dataset, index):
    """
    Parameters:
        dataset (list): The dataset of interest.
        index (int): location of the price info in a dataset row.

    Output:
        extracts the free apps into a list of lists
        prints the number of records in the resulting list of lists
    """
    result = []
    for row in dataset:
        price = row[index]
        if price == '0' or price == '0.0':
            result.append(row)
    print(len(result))
    return result
Then, we will extract the free Google Play and iOS apps into google_free and apple_free, respectively:
google_free, apple_free = extract_free(google_eng, 7), extract_free(apple_eng, 4)
This leaves us with 8,864 records from the Google Play dataset and 3,222 records from the iOS Store dataset.
To recall, our goal is to determine the kinds of apps that are likely to attract more users. The more people use our apps, the greater our revenue. It would also be advisable to select app profiles that allow us to maximize the potential for advertising.
A profile that is successful in both the Android and iOS markets will provide more avenues for expansion in the future.
We'll begin the analysis by determining the most common app genres for each market. For this, we'll build frequency tables for the prime_genre column of the iOS dataset, and for the Genres and Category columns of the Google Play dataset. We will use three helper functions, starting with freq_table(), which builds a frequency table (in percentages) for any column:
def freq_table(dataset, index):
    """Returns a frequency table (in rounded percentages) for the column at the given index"""
    result = {}
    for row in dataset:
        value = row[index]
        if value in result:
            result[value] += 1
        else:
            result[value] = 1

    total = sum(result.values())
    for item in result:
        result[item] /= total   # obtain a fraction of the total
        result[item] *= 100     # convert the fraction to a percentage
        result[item] = round(result[item], 2)
    return result
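A toy run of this frequency-table logic, restated (in a slightly more compact form) so the snippet is self-contained; the rows are made up for illustration:

```python
def freq_table(dataset, index):
    """Returns a frequency table (in rounded percentages) for a column."""
    result = {}
    for row in dataset:
        value = row[index]
        result[value] = result.get(value, 0) + 1
    total = sum(result.values())
    return {key: round(count / total * 100, 2) for key, count in result.items()}

toy = [['A', 'Games'], ['B', 'Games'], ['C', 'Tools'], ['D', 'Games']]
print(freq_table(toy, 1))  # {'Games': 75.0, 'Tools': 25.0}
```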
Next, display_table() sorts and prints the entries of a frequency table in descending order:
def display_table(dataset, index=None):
    if isinstance(dataset, list):  # if the dataset is a list of lists, compute the frequency table first
        dictionary = freq_table(dataset, index)
    else:
        dictionary = dataset       # else treat the dataset as a ready-made dictionary

    result = []
    for key, value in dictionary.items():
        result.append((value, key))  # append a (value, key) tuple to the results

    result = sorted(result, reverse=True)  # sort the resulting list in descending order
    for item in result:
        print(item[-1], ': ', item[0])
    return result
Finally, show_visuals() turns the results into charts:
def show_visuals(dataset, index=None, title_a='', title_b='', main_title='', y_label=''):
    '''Computes analysis tables, then displays Bar and Pie charts obtained from analysis of the dataset'''
    # store the resulting list from calling the display_table function
    item = display_table(dataset, index)
    # convert the list to a dictionary
    item = dict(item)
    # assign chart coordinates from the dictionary values
    y_value = list(item.keys())
    x_value = list(item.values())
    # create a Bar and a Pie chart using the assigned coordinates
    fig = make_subplots(rows=1, cols=2,
                        specs=[[{"type": "xy"}, {"type": "domain"}]],
                        subplot_titles=(title_a, title_b))
    fig.add_trace(go.Bar(x=x_value[:5],
                         y=y_value[:5],
                         text=y_value,
                         textposition='outside',
                         showlegend=False), row=1, col=1)
    fig.update_yaxes(title_text=y_label, showticklabels=False, row=1, col=1)
    fig.add_trace(go.Pie(labels=x_value,
                         values=y_value,
                         textposition='inside',
                         textinfo='percent+label'), row=1, col=2)
    fig.update_layout(template='plotly_white', title_text=main_title)
    fig.show('svg', width='950')
Now that we have defined our functions, we can start by examining the frequency table for the prime_genre column of the iOS dataset. This column has an index of 11. We will also run a quick visualization of this result using show_visuals():
show_visuals(apple_free, 11, 'Top 5 Most Frequent App Categories',
             'Overall Distribution of Categories',
             'Distribution of iOS Apps by Category',
             'Frequency (%)')
More than half (58.16%) of the free English Apps in the iOS store are Games. Entertainment apps constitute about 8%, while Photo and video apps comprise almost 5%. Only 3.66% of these apps are designed for Education, and Social networking apps account for 3.29% of the apps in our data set.
We can infer that the free English apps segment of the iOS Store is dominated by apps that are made for fun (games, entertainment, photo and video, social networking, sports, music, etc.). Practical-purpose apps (education, shopping, utilities, productivity, lifestyle, etc.) are relatively rare.
However, we cannot recommend an app profile for the App Store market based on this information alone. The fact that there is a large proportion of fun apps does not imply that they also have the greatest number of users; the supply of these apps may not necessarily correlate with their demand.
Let's also examine the Category and Genres columns of the Google Play dataset (two columns which seem to be related). The indices of these columns are 1 and 9, respectively:
# compute frequency tables and display visuals for the Google Play 'Category' column
show_visuals(google_free, 1, 'Top 5 Most Frequent App Categories',
             'Overall Distribution of Categories',
             'Distribution of Google Store Apps by Category', 'Frequency (%)')
Things are significantly different on the Google Play Store: aside from video games, not many apps are designed for fun.
It seems that a good number of free apps are designed for practical purposes (Family, Tools, Business, Lifestyle, Productivity, etc.). However, if we investigate this further, we can see that the Family category (which accounts for almost 19% of the apps) consists mostly of games for kids, as running the code block below shows:
wanted_rows = 15
for row in google_free:
    if row[1] == 'FAMILY':
        print(row[0])
        wanted_rows -= 1
    if wanted_rows <= 0:
        break
Even so, practical apps seem to have better representation on Google Play compared to the iOS Store. This can be confirmed by computing a visual for the Genres column:
show_visuals(google_free, 9, 'Top 5 Most Frequent App Genres',
             'Overall Distribution of Genres',
             'Distribution of Google Store Apps by Genre', 'Frequency (%)')
We can immediately observe that, while appearing to mean the same thing, the Genres column is much more detailed than the Category column. Our interest is the bigger picture, so we'll only consider the Category column moving forward.
Up to this point, we have found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.
One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing from the iOS dataset. As a workaround, we'll use the total number of user ratings instead, which we can find in the rating_count_tot column of the iOS dataset.
We will define a function below called compute_average(). This function takes a dataset and the indices of the categorical and measurement variables of interest, then computes and returns a dictionary containing the results:
def compute_average(dataset, cat_index, m_index):
    '''Groups by cat_index while computing the average of m_index. Returns the result as a dictionary'''
    unique_dict = freq_table(dataset, cat_index)
    unique_arr = list(unique_dict.keys())
    result = {}
    for item in unique_arr:
        total, count = 0, 0
        for row in dataset:
            genre = row[cat_index]
            n_reviews = int(row[m_index])
            if item == genre:
                total += n_reviews
                count += 1
        average = total / count
        result[item] = round(average, 2)
    return result
Let's compute and visualize the average number of installs (proxied by total user ratings) for each iOS app genre:
# compute the average installs for each iOS app genre
apple_avg = compute_average(apple_free, 11, 5)
# display a visual of the computed averages
show_visuals(apple_avg, title_a='Top 5 Most Installed',
             title_b='Distribution of Average Installs',
             main_title='Distribution of Average iOS App Installs by Genre',
             y_label='Average Number of Installs')
We can see that the top five iOS app genres with the highest average installs are: Navigation, Reference, Social Networking, Music, and Weather.
Since we are dealing with averages, it is advisable to probe these categories further, just to be sure that no outliers are overstating the average:
for row in apple_free:
    if row[11] == 'Navigation':
        print(row[1], ': ', row[5])
It becomes immediately apparent that the high average we obtained for Navigation apps is the result of users downloading the essential navigation apps Waze and Google Maps.
Now, let's take a closer look at the Reference category:
for row in apple_free:
    if row[11] == 'Reference':
        print(row[1], ': ', row[5])
Here we see a wider range of offerings that users interact with, ranging from Bibles and Qurans to dictionaries and translation tools. There are some outliers here too (e.g., the Bible and dictionary apps have a considerably larger number of installs compared to other apps within the same group).
Using an average will also inflate values in the Social Networking and Music categories, because there is a large selection of popular apps within the Social (Facebook, Pinterest, Skype, Messenger) and Music (Pandora, Spotify, Shazam, etc.) categories:
wanted_rows = 15
for row in apple_free:
    if row[11] == 'Social Networking':
        print(row[1], ': ', row[5])
        wanted_rows -= 1
    if wanted_rows <= 0:
        break

wanted_rows = 15
for row in apple_free:
    if row[11] == 'Music':
        print(row[1], ': ', row[5])
        wanted_rows -= 1
    if wanted_rows <= 0:
        break
In this case, it would be advisable to use a more robust measure of central tendency: the median.
We will define a compute_median() function that performs essentially the same task as compute_average(), but uses the median instead of the mean as its statistic:
def compute_median(dataset, cat_index, m_index):
    '''Groups by cat_index while computing the median of m_index. Returns the result as a dictionary'''
    unique_dict = freq_table(dataset, cat_index)
    unique_arr = list(unique_dict.keys())
    result = {}
    for item in unique_arr:
        value_set = []
        for row in dataset:
            genre = row[cat_index]
            n_reviews = int(row[m_index])
            if item == genre:
                value_set.append(n_reviews)
        value_set = sorted(value_set)
        value_length = len(value_set)
        if value_length % 2 == 1:
            index = (value_length - 1) // 2
            result[item] = value_set[index]
        else:
            index_a = value_length // 2
            index_b = (value_length - 2) // 2
            result[item] = (value_set[index_a] + value_set[index_b]) / 2
    return result
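Python's statistics module implements the same even/odd median logic, which gives us a quick cross-check on toy values:

```python
import statistics

odd_values = [3, 1, 7]       # odd length: the middle element after sorting
even_values = [4, 1, 3, 2]   # even length: the mean of the two middle elements

print(statistics.median(odd_values))   # 3
print(statistics.median(even_values))  # 2.5
```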
Now, we'll compute and visualize the median number of installs for each iOS App genre:
# compute the median installs for each iOS app genre
apple_med = compute_median(apple_free, 11, 5)
# display a visual of the computed medians
show_visuals(apple_med, title_a='Top 5 Most Installed',
             title_b='Distribution of Median Installs',
             main_title='Distribution of Median iOS App Installs by Genre',
             y_label='Median Number of Installs')
We see that the order has changed: the top five categories with the highest median installs are now Productivity, Navigation, Reference, Shopping, and Social Networking.
From earlier observations, we noticed some skew in the Navigation and Reference categories. However, we can explore the Productivity and Shopping sections in a bit more detail. Let's examine the top 15 Productivity apps on the iOS Store:
wanted_rows = 15
for row in apple_free:
    if row[11] == 'Productivity':
        print(row[1], ': ', row[5])
        wanted_rows -= 1
    if wanted_rows <= 0:
        break
The Productivity category shows a relatively even spread of installs. However, we need to be careful with our decision making. I believe that the worst time any user would want to see an ad is when they are trying to be productive or studying reference books like the Bible or Quran. As a result, we will err on the side of caution and not explore these categories further for now.
This leaves us with the Shopping category, which we can explore further too. Let's print the top 15 Shopping apps in the iOS Store:
wanted_rows = 15
for row in apple_free:
    if row[11] == 'Shopping':
        print(row[1], ': ', row[5])
        wanted_rows -= 1
    if wanted_rows <= 0:
        break
We immediately notice that this category shows a far more even distribution of installs than the other categories we have explored, and the outliers are relatively few. The great news is: who does not want to see fun ads for things to buy while shopping?
I am certainly less annoyed by ads in a shopping app than I would be if I saw the same ad while reading an e-Bible or e-book. There are also lots of options we could explore further with shopping.
We could create an app that collates the deals of the day from the most popular online shopping platforms, displaying additional ads to users with minimal effort. Now, let's explore the Google Play dataset.
Previously, we came up with an app profile recommendation for the iOS Store based on the distribution of user ratings. We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):
show_goog = display_table(google_free, 5)
For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users.
We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
To perform computations, however, we'll need to convert each install number from a string to an integer. This means we need to remove the commas and the plus characters, or the conversion will fail with an error. We will do this using the .replace() method, then calculate the average and median number of installs per app genre for the Google Play dataset:
temp = []
for row in google_free:
    n_installs = row[5]
    formatted_installs = n_installs.replace('+', '')
    formatted_installs = int(formatted_installs.replace(',', ''))
    row[5] = formatted_installs
    temp.append(row)

google_free = temp
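As a spot check, the same .replace() chain on a single open-ended install value:

```python
# '100,000+' should become the integer 100000 after stripping '+' and ','
raw = '100,000+'
converted = int(raw.replace('+', '').replace(',', ''))
print(converted)  # 100000
```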
# compute the average installs for each Google Play app category
google_average = compute_average(google_free, 1, 5)
# display a visual of the computed averages
show_visuals(google_average, title_a='Top 5 Most Installed',
             title_b='Distribution of Average Installs',
             main_title='Distribution of Average Android App Installs by Genre',
             y_label='Average Number of Installs')
On average, communication apps have the most installs: 38,456,119. Video players follow with about 24,727,872 and Social apps with an average of approximately 23,253,652 installs.
Since we are now aware of the possibility of outliers, we will probe into these categories further:
for row in google_free:
    if row[1] == 'COMMUNICATION' and row[5] >= 500000000:
        print(row[0], ': ', row[5])
We can see that the average for the Communication category is skewed upward by a few apps with over one billion installs (WhatsApp, Messenger, Skype, Google Chrome, Gmail, and Hangouts). We will even out this distribution using the median later. For now, let's move on to the Social category:
for row in google_free:
    if row[1] == 'SOCIAL' and row[5] >= 500000000:
        print(row[0], ': ', row[5])
The Social category is also influenced by apps with huge download volumes like Facebook, Google, Instagram and Snapchat.
Let's correct for these outliers using the median as a measure:
google_med = compute_median(google_free,1,5)
show_google_med = display_table(google_med, 5)
Again, we see a redistribution of app categories: the Shopping category is now in the top five by median installs, while Communication and Social apps are no longer on the leaderboard. This is because the median corrects for the outliers that could have misled us into assuming that these categories were more widely downloaded in general.
We will briefly explore the Shopping category before making our conclusions:
for row in google_free:
    if row[1] == 'SHOPPING' and row[5] >= 500000000:
        print(row[0], ': ', row[5])
The code above returned no output, which shows that there are no applications in this group with excessively large download counts that could have skewed the average. On closer observation, we find an interesting distribution of downloads for apps in this group. Most of the Shopping apps are successful in the Google Play Store, with values revolving around 1,000,000 downloads:
for row in google_free:
    if row[1] == 'SHOPPING':
        print(row[0], ': ', row[5])