Apps are of two types: paid and free. Revenue works differently for each type. For paid apps, the revenue comes from the purchase a user makes when downloading the app. For free apps, the main source of revenue is in-app ads, which means the number of users determines the revenue: the more users who see and engage with the ads, the better.
Since we are developing a free app, we would like to see what types of free apps have the most users. The main aim of this project is to determine what types of apps are most likely to attract users.
Because the goal of the analysis is to find out what kinds of free mobile apps users are more likely to download, we need to collect and analyze data about mobile apps available on both Google Play and the App Store.
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
The datasets used for this analysis are readily available on Kaggle.
Let's start by exploring the datasets.
from csv import reader

# Open the App Store dataset
open_apple = open('AppleStore.csv', encoding='utf8')
read_apple = reader(open_apple)
ios = list(read_apple)  # Convert the rows into a list of lists

# Open the Google Play dataset
open_google = open('googleplaystore.csv', encoding='utf8')
read_google = reader(open_google)
android = list(read_google)  # Convert the rows into a list of lists
# Since there are two datasets, define a function that prints rows in a readable way.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
Let's take a look at the dataset of iOS and print a few rows of the dataset.
explore_data(ios, 0, 3, True)
Similarly, let's take a look at the dataset of Google Play and print a few rows.
explore_data(android, 0, 3, True)
We can see that the App Store data has 16 columns, whereas the Google Play data has 13 columns. Let's take a look at the columns of each dataset and figure out what information each column holds.
ios_header = ios[0]
android_header = android[0]
print(ios_header)
print('\n')
print(android_header)
We have opened the datasets and explored them a bit. Before beginning our analysis, we must make sure that the data is accurate; otherwise, our analysis will be wrong. This basically means we need to remove or correct inaccurate data and remove duplicate entries.
We also need to remove apps that are not free and apps that are not in English.
Looking at one of the discussions about the Google Play dataset, we found that it contains an error. Let's check the error and delete the affected row.
print(android_header) # Printing header to check the values with the data
print('\n')
print(android[10473]) # The incorrect data
print('\n')
print(android[1]) # Checking it against a proper data
We can see that the incorrect row is missing its category value, which has shifted the remaining columns of that row. Now, let's get rid of it.
# Remove the incorrect row using the del statement
del android[10473]
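A row like this can also be found without relying on the discussion thread, by comparing the length of every row with the length of the header. The sketch below runs on a small made-up sample rather than the real file, and `find_malformed_rows` is a hypothetical helper name, not something from the original analysis.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    header_length = len(dataset[0])
    return [
        index
        for index, row in enumerate(dataset)
        if len(row) != header_length
    ]


# Made-up sample: the last row is missing its category value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Row', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

If more than one index is reported, deleting from the highest index down avoids shifting the positions of rows that still need to be removed.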
Looking further at the discussion, we notice that some apps have duplicate entries. Instagram is one example:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
Let's count how many times an app occurs more than once and separate the duplicate apps from the unique ones.
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])
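As a cross-check on the loop above, the same count can be obtained with `collections.Counter` from the standard library. This is only a sketch on a made-up list of names; `count_duplicates` is a hypothetical helper, and it counts every repeat after the first occurrence, which is what the loop does as well.

```python
from collections import Counter


def count_duplicates(names):
    """Return how many entries are repeats of an earlier name."""
    counts = Counter(names)
    return sum(count - 1 for count in counts.values())


# Made-up sample of app names.
sample_names = ['Instagram', 'Slack', 'Instagram', 'Instagram', 'Box', 'Slack']
n_duplicates = count_duplicates(sample_names)
print('Number of duplicate entries:', n_duplicates)
```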
We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.
If we look closely at the rows we printed for the Instagram app, the fourth position of each row corresponds to the number of reviews. The different numbers show that the data was collected at different times.
We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.
Now we know that the number of duplicate entries is 1181. After we remove the duplicates, we should be left with 9660 rows (including the header).
print('Expected length:', len(android) - 1181)
To remove the duplicates, we first build a dictionary that maps each app name to the highest number of reviews found for that app:
reviews_max = {}

for row in android[1:]:  # Excluding the header
    name = row[0]  # Name of the app
    n_reviews = float(row[3])  # The number of reviews
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
Let's confirm that the length of the dictionary is the same as our expected length.
print('Expected length:', len(android[1:]) - 1181) #Excluding the header
print('Length of dictionary:', len(reviews_max))
Now that we have a dictionary with one entry per app, we will use it to clean the main Google Play dataset and remove all duplicate rows.
android_clean = []  # Will store the cleaned dataset
already_added = []  # Will store the app names already kept

for row in android[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
Checking the length of the android_clean list to make sure it matches the expected length.
len(android_clean)
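The two-step approach above can also be written as a single pass that stores the best row per app directly in a dictionary. The following is a minimal sketch on illustrative sample rows, not a replacement for the code above; `keep_most_reviewed` and its parameter names are hypothetical. On ties it keeps the first row encountered, which matches the behavior of the `already_added` check.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews.
        if (name not in best_rows
                or n_reviews > float(best_rows[name][reviews_index])):
            best_rows[name] = row
    return list(best_rows.values())


# Illustrative sample rows: name, category, rating, number of reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```

Looking names up in a dictionary also avoids scanning a growing list for every row, which matters more as the dataset gets larger.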
Now that we have got rid of the duplicate data, let's move on to the next step. The app we are going to create is geared towards an English-speaking audience, but both datasets contain apps whose names suggest they are not designed for English speakers. Let's take a look at a few such apps.
print(ios[814][1])
print(ios[6732][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])
The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII system. Based on this range, we will build a function that checks whether the characters of an app name belong to the set of common English characters. If the number is equal to or less than 127, the character belongs to that set. To avoid discarding English names that contain an emoji or a symbol such as ™, the function only rejects a name when more than three of its characters fall outside this range.
def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    # Allow up to three non-ASCII characters (emoji, symbols such as ™)
    if non_ascii > 3:
        return False
    else:
        return True
# Checking if the function works properly with a few arguments
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
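To see why a small allowance for non-ASCII characters is useful, it helps to count how many such characters each of the test names contains. The sketch below uses a hypothetical helper, `count_non_ascii`, which is not part of the original analysis.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above the ASCII range."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for name in names:
    print(name, ':', count_non_ascii(name))
```

Names with a single symbol or emoji stay well under the limit of three, while a name written mostly in another script exceeds it.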
Let's use this function to filter out non-English apps from both datasets.
ios_english = []
android_english = []

# For the App Store
for row in ios[1:]:
    name = row[1]  # The app name is at index 1 in the App Store data
    if is_english(name):
        ios_english.append(row)

# For Google Play
for row in android_clean:
    name = row[0]
    if is_english(name):
        android_english.append(row)
Now let's explore both cleaned datasets.
explore_data(ios_english, 0, 3, True)
print('\n')
explore_data(android_english, 0, 3, True)
So far we have removed inaccurate data, duplicate entries, and non-English apps. We are focusing only on free apps, because our main source of revenue will be in-app ads.
Our datasets contain both free and non-free apps, so we need to isolate the free apps for our analysis.
free_apps_ios = []
free_apps_android = []

for row in ios_english:
    price = row[4]
    if price == '0.0':
        free_apps_ios.append(row)

for row in android_english:
    price = row[7]
    if price == '0':
        free_apps_android.append(row)

print(len(free_apps_ios))
print(len(free_apps_android))
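The comparisons above depend on the exact text each dataset uses for a zero price ('0.0' in one, '0' in the other). A slightly more tolerant alternative is to convert the price to a number first. The sketch below is only an illustration: `is_free` is a hypothetical helper, and it assumes paid prices may carry a leading dollar sign.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: paid prices may be written with a leading '$'.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```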
As we mentioned at the beginning, the goal of the analysis is to determine the kinds of apps that are likely to attract more users, because the number of people using our app affects our revenue.
To minimize risks and overhead, our validation strategy is to launch on one market first and expand to the other only if the app does well.
The end goal is to have the app on both Google Play and the App Store, so we need to find an app profile that works for both markets.
We will do this by finding the most common genres for each market. To do that, we will build frequency tables for a few columns.
We will build two functions to generate and display the frequency tables:
def freq_table(dataset, index):
    table = {}
    length = 0

    for value in dataset:
        length += 1
        name = value[index]
        if name in table:
            table[name] += 1
        else:
            table[name] = 1

    table_percentages = {}
    for key in table:
        percentage = (table[key] / length) * 100
        table_percentages[key] = percentage

    return table_percentages


def display_table(dataset, index):
    '''Takes in two parameters - dataset and index.
    Generates a frequency table using the freq_table() function
    Transforms the frequency table into a list of tuples, then sorts the list in descending order
    Prints the entries of the frequency table in descending order'''
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
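To make the idea of a percentage frequency table concrete, here is a small self-contained sketch on made-up genre values. It uses `collections.Counter` instead of the manual counting above, and `percentage_table` is a hypothetical name chosen for this illustration.

```python
from collections import Counter


def percentage_table(values):
    """Map each distinct value to its share of the list, in percent."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up sample of eight genre values.
genres = ['Games', 'Games', 'Sports', 'Games', 'Music', 'Sports', 'Games', 'Music']
table = percentage_table(genres)

# Print the entries from the most to the least common.
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```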
First, let's use the above functions to build a frequency table for the prime_genre column of the App Store dataset.
display_table(free_apps_ios, -5)
The games genre accounts for a large share of these apps, which in turn makes it harder for a new app in that genre to become popular. The sports genre has only about 1.94%, so it can be considered a good genre for our app. If a person follows multiple sports, we can make a sports app where users add the sports they follow, and the app shows the scores, news, and headlines for their curated sports list.
Now, similarly, let's take a look at the Genres and Category columns of the Google Play dataset.
# Genres column
display_table(free_apps_android, 9)
In the Genres column of the Google Play dataset, tools is at the top with just above 8%, followed by entertainment with 6% and education with 5%. The sports genre has only about 3%, which makes it a good possibility for our app.
# Category column
display_table(free_apps_android, 1)
In the Category column, family is at the top with 18%, followed by games and tools with 9% and 8% respectively. The sports category has only about 3.3%.
The frequency tables we analyzed showed us that apps designed for fun dominate the App Store, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to determine which kinds of apps have the most users, which also tells us which apps are popular.
For the Google Play dataset, we can find this information in the Installs column, but this information is missing from the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
Let's calculate the average number of user ratings per app genre on the App Store. For that, we will do the following.
# Build a frequency table for the prime_genre column to get the unique app genres.
ios_genres = freq_table(free_apps_ios, -5)

for genre in ios_genres:
    total = 0  # Will store the sum of the ratings counts
    len_genre = 0  # Will store the number of apps in this genre
    for row in free_apps_ios:
        genre_app = row[-5]
        if genre_app == genre:
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
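The nested loop above scans the whole dataset once per genre. The same averages can be computed in a single pass by accumulating a running total and a count for each genre. The sketch below shows this on made-up rows; `average_by_group` and its parameters are hypothetical names, not part of the original code.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column at value_index for each distinct group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows: name, genre, number of ratings.
sample_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Sports', '50'],
]
averages = average_by_group(sample_rows, group_index=1, value_index=2)
for genre, average in averages.items():
    print(genre, ':', average)
```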
Genres like social networking, music, and weather are dominated by big tech companies, so there would be immense competition in those genres. We can make an app for the sports genre, as mentioned above. The competition would be lower, which gives our app a better chance of becoming popular.
Now, similarly, let's have a look at Google Play. We have the Installs column, so it should be easier in this case.
display_table(free_apps_android, 5)
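However, the install numbers don't seem precise enough: we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). To compute averages, we need to convert them to numbers by removing the commas and plus signs, so that 100+ is treated as 100, for example.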
# Generating a frequency table
categories_android = freq_table(free_apps_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_apps_android:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
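The cleaning step inside the loop can be isolated into a small function to see exactly what it does to each value. This is a sketch with a hypothetical helper name, `parse_installs`. Note that a value such as 1,000+ means at least 1,000 installs, so the converted number is a lower bound rather than an exact count.

```python
def parse_installs(installs_text):
    """Convert a string such as '1,000+' to a float lower bound."""
    return float(installs_text.replace(',', '').replace('+', ''))


for text in ['100+', '1,000+', '1,000,000+']:
    print(text, '->', parse_installs(text))
```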
Even in this case, the sports category doesn't have many installs, so we can develop the app for the sports genre on Google Play.
We will create an app where users can follow multiple sports, and the app gives them curated news, headlines, and even live scores for the sports they follow. The competition for this genre is low in both the App Store and Google Play. So if the app becomes popular, we can develop it for iOS, too.