The goal of this project is to analyze data to understand what types of apps are likely to attract more users on Google Play and the App Store. To do this, I'll need to collect, explore and analyze data about mobile apps available on these platforms.
As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. The two datasets used in my analysis were found on Kaggle and were scraped at roughly the same time, in 2018.
The first step is to open the CSV files 'AppleStore.csv' and 'googleplaystore.csv'. For that task I have created a function read_csv().
from csv import reader
# create a function to read csv files
def read_csv(path):
    # the with-block closes the file once all rows have been read
    with open(path, encoding='utf8') as opened_csv:
        dataset = list(reader(opened_csv))
    return dataset
# read Apple Store
ios = read_csv('AppleStore.csv')
header_ios = ios[0]
ios = ios[1:]
# read Google Play Store
android = read_csv('googleplaystore.csv')
header_android = android[0]
android = android[1:]
To make it easier to explore the datasets, I use a function named explore_data() that prints the number of rows and columns, the column names, and a slice of rows.
Next, I apply that function to both datasets.
# define the function
def explore_data(dataset, start, end, header=None):
    print('Number of rows in dataset: {}.'.format(len(dataset)))
    print('Number of columns in dataset: {}.'.format(len(dataset[0])))
    if header is None:
        print('Column names are:', dataset[0])
    else:
        print('Column names are', header)
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print() # adds a new (empty) line after each row
# explore ios dataset
explore_data(ios, 0, 3, header_ios)
# explore android dataset
explore_data(android, 0, 3, header_android)
# find free apps in Apple Store dataset
ios_free = []
for row in ios:
    if row[4] == '0.0':
        ios_free.append(row)
print('Apple Store free apps dataset includes', len(ios_free), 'rows', '\n')
# find free apps in Google Play Store dataset
android_free = []
for row in android:
    if row[6] == 'Free':
        android_free.append(row)
print('Google Play Store free apps dataset includes', len(android_free), 'rows', '\n')
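Comparing the price with the literal string '0.0' only works if every free app is written exactly that way. As a slightly more defensive sketch, the price can be parsed as a number first. The helper name is_free and the sample rows are made up for illustration. Only the price index (4) is taken from the code above.

```python
def is_free(price_text):
    """Return True when a price string parses to zero (hypothetical helper)."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number (for example a malformed or shifted row): not free.
        return False

# Toy rows in the Apple Store layout: the price sits at index 4.
sample_rows = [
    ['1', 'App A', '100', 'USD', '0.0'],
    ['2', 'App B', '100', 'USD', '0'],
    ['3', 'App C', '100', 'USD', '1.99'],
]
free_rows = [row for row in sample_rows if is_free(row[4])]
print(len(free_rows))
```

With this variant, '0', '0.0' and '$0' are all treated as free, while non-numeric text is simply skipped instead of raising an error.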
As a first step, I check both datasets for duplicates and print some of them to see how closely the duplicate rows match each other.
# check for duplicates in Apple Store dataset
unique_names_ios = []
duplicate_names_ios = []
for row in ios_free:
    name = row[1]
    if name in unique_names_ios:
        duplicate_names_ios.append(name)
    else:
        unique_names_ios.append(name)
print('There are', len(duplicate_names_ios), 'duplicates in the Apple Store dataset.', '\n')
for row in ios_free:
    name = row[1]
    if name in duplicate_names_ios[:1]:
        print(row)
# check for duplicates in Google Play Store dataset
unique_names_android = []
duplicate_names_android = []
for row in android_free:
    name = row[0]
    if name in unique_names_android:
        duplicate_names_android.append(name)
    else:
        unique_names_android.append(name)
print('There are', len(duplicate_names_android), 'duplicates in the Google Play Store dataset.', '\n')
print('Here are duplicate rows for one application. We can see how they differ from each other.', '\n')
for row in android_free:
    name = row[0]
    if name in duplicate_names_android[:1]: # print only duplicates for one app
        print(row)
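The `name in unique_names` test scans a list on every iteration, which becomes slow as the dataset grows. As an alternative sketch (not the approach used above), collections.Counter from the standard library counts every name in a single pass. The function name count_duplicates and the sample rows are invented for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_index):
    """Return {name: occurrences} for names that appear more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

# Toy rows: app name at index 0, review count at index 1.
sample_rows = [
    ['Instagram', '100'],
    ['Notes', '5'],
    ['Instagram', '120'],
    ['Instagram', '90'],
]
duplicates = count_duplicates(sample_rows, 0)
# Number of surplus rows, comparable to len(duplicate_names_android) above.
surplus = sum(n - 1 for n in duplicates.values())
print(duplicates, surplus)
```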
The main difference between duplicate rows is the number of user ratings or reviews (column 6 in the Apple Store dataset and column 4 in the Google Play Store dataset). The different numbers show that the data was collected at different times.
The next step in my data cleaning process is to keep only the row with the maximum number of reviews or ratings for each app and add it to the cleaned dataset.
Finally, I check the number of rows in the cleaned datasets by comparing two ways of counting: the original length minus the number of duplicates, and the length of the cleaned dataset itself.
# select only rows with maximum number of ratings in Apple Store dataset
clean_data_ios_dict = {}
for row in ios_free:
    name = row[1]
    n_ratings = float(row[5])
    if name not in clean_data_ios_dict or n_ratings > float(clean_data_ios_dict[name][5]):
        clean_data_ios_dict[name] = row
# convert dictionary with Apple Store data to list of lists
clean_data_ios = list(clean_data_ios_dict.values())
# check the result
print('Expected length of cleaned Apple Store dataset is', len(ios_free)-len(duplicate_names_ios))
print('Length of final clean Apple Store dataset is:', len(clean_data_ios), '\n')
# select only rows with maximum number of reviews in Google Play Store dataset
clean_data_android_dict = {}
for row in android_free:
    name = row[0]
    n_reviews = float(row[3])
    if name not in clean_data_android_dict or n_reviews > float(clean_data_android_dict[name][3]):
        clean_data_android_dict[name] = row
# convert dictionary with Google Play Store data to list of lists
clean_data_android = list(clean_data_android_dict.values())
# check the result
print('Expected length of cleaned Google Play Store dataset is', len(android_free)-len(duplicate_names_android))
print('Length of final clean Google Play Store dataset is:', len(clean_data_android), '\n')
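To make the dictionary-based cleaning easier to follow, here is a self-contained sketch of the same idea on three made-up rows (name at index 0, review count at index 1). The function name keep_max_reviews is hypothetical. The logic mirrors the loops above.

```python
def keep_max_reviews(rows, name_index, reviews_index):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample_rows = [['Maps', '10'], ['Chat', '7'], ['Maps', '25']]
cleaned = keep_max_reviews(sample_rows, 0, 1)
print(cleaned)  # [['Maps', '25'], ['Chat', '7']]
```

When two rows tie on the review count, the first one encountered is kept, because the comparison is strictly greater than.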
Since my target markets are English-speaking, I'd like to remove applications with non-English names from both datasets. To do that, I write a function is_english() that checks whether the name of an application is English, and apply it to the Apple Store and Google Play Store datasets. A name counts as English if it contains fewer than four characters outside the ASCII range.
# define a function to check the name
def is_english(app):
    number_of_false = 0
    for letter in app:
        if ord(letter) > 127:
            number_of_false += 1
    # names with fewer than four non-ASCII characters count as English
    return number_of_false < 4
# iterate over Apple Store dataset
apple_store = []
for row in clean_data_ios:
    name = row[1]
    if is_english(name):
        apple_store.append(row)
print("Final list of Apple Store apps has {} rows".format(len(apple_store)))
# iterate over Google Play Store
google_play_store = []
for row in clean_data_android:
    name = row[0]
    if is_english(name):
        google_play_store.append(row)
print("Final list of Google Play Store apps has {} rows".format(len(google_play_store)))
The goal of the analysis is to find an app idea that could be successful in both markets, the Apple Store and the Google Play Store. According to the assignment, the app will first be developed for the Google Play Store and, if it is successful, rolled out to the Apple Store.
My first step is to identify the most common genres. I will use the column prime_genre for the Apple Store (12th position) and the column Genres for the Google Play Store (10th position) to count the most common genres.
# display once more names of the columns
print('Column names for Apple Store dataset are: ', header_ios, '\n')
print('Column names for Google Play Store are: ', header_android, '\n')
# define function to create a frequency table
def freq_table(dataset, index):
    counts = {}  # renamed so it no longer shadows the function name
    total = len(dataset)
    for row in dataset:
        token = row[index]
        if token in counts:
            counts[token] += 1
        else:
            counts[token] = 1
    # calculate percentages
    freq_percentages = {}
    for key in counts:
        percentage = (counts[key]/total)*100
        freq_percentages[key] = round(percentage, 2)
    # print result in descending order
    for key in sorted(freq_percentages, key=freq_percentages.get, reverse=True):
        print(key, ':', freq_percentages[key])
    return freq_percentages
# iterate over Apple Store dataset
prime_genres = freq_table(apple_store, 11)
print('The column prime genre includes {} genres total.'.format(len(prime_genres)), '\n')
# iterate over Google Play Store dataset
# review column 'Genres'
genres = freq_table(google_play_store, 9)
print('The column Genres includes {} genres total.'.format(len(genres)), '\n')
# review column 'Category'
category = freq_table(google_play_store, 1)
print('The column Category includes {} categories total.'.format(len(category)))
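The same frequency table can also be built with collections.Counter, whose most_common() method already returns values from most to least frequent. This is an alternative sketch rather than the function used above. The name percentage_table and the sample rows are assumptions for illustration.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return [(value, percent), ...] sorted from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(n / total * 100, 2)) for value, n in counts.most_common()]

# Toy rows: app name at index 0, genre at index 1.
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Books'], ['d', 'Games']]
table = percentage_table(sample_rows, 1)
print(table)  # [('Games', 75.0), ('Books', 25.0)]
```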
Apple Store dataset
Google Play Store
Comparison
One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play Store dataset, I can find this information in the Installs column, but it is missing from the App Store dataset. As a workaround, I'll take the total number of user ratings as a proxy, which can be found in the rating_count_tot column.
# collect the number of user ratings per genre for Apple Store dataset
ratings_dict = {}
for row in apple_store:
    genre = row[11]
    n_ratings = float(row[5])
    ratings_dict.setdefault(genre, []).append(n_ratings)
# calculate the average number of ratings per genre
for genre, ratings in ratings_dict.items():
    avg_ratings = round(sum(ratings)/len(ratings))
    ratings_dict[genre] = avg_ratings
# sort and print resulting dictionary
for genre in sorted(ratings_dict, key=ratings_dict.get, reverse=True):
    print(genre, ratings_dict[genre])
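The grouping-and-averaging step can be written as one reusable helper. The sketch below keeps a running total and count per key instead of storing lists. The name average_by_key and the sample numbers are invented for illustration.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Return {key: mean of the numeric column} for each distinct key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += float(row[value_index])
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy rows: genre at index 0, number of ratings at index 1.
sample_rows = [['Games', '100'], ['Games', '300'], ['Reference', '50']]
averages = average_by_key(sample_rows, 0, 1)
print(averages)  # {'Games': 200.0, 'Reference': 50.0}
```

Because it returns a new dictionary, the helper avoids overwriting the lists of values while looping over them.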
To use the Installs column from the Google Play Store dataset, I first need to explore how that data is formatted and represented.
After printing a few values from the column, I realise that they are strings containing commas and a trailing plus sign, so they are lower bounds rather than exact counts.
I will clean the values and convert them to integers, and use these numbers as an approximation of the real number of installs.
# check the format of the column Installs
for row in google_play_store[:5]:
    print(row[5])
# convert the column Installs from Google Play Store dataset to integers
# (run this only once: the values are replaced in place)
for row in google_play_store:
    row[5] = int(row[5].replace('+', '').replace(',', ''))
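The conversion can also be isolated in a small function, which makes it easy to test on a few strings before touching the dataset. This is a sketch under the assumption that every value looks like '10,000+' or a plain number. parse_installs is a hypothetical name.

```python
def parse_installs(text):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(text.replace('+', '').replace(',', ''))

samples = ['10,000+', '1,000,000+', '0']
parsed = [parse_installs(value) for value in samples]
print(parsed)  # [10000, 1000000, 0]
```

Building a new list instead of overwriting row[5] means the cell can be re-run without failing on values that are already integers.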
# collect the number of installs per category for Google Play Store
installs_dict = {}
for row in google_play_store:
    category = row[1]
    installs = row[5]
    installs_dict.setdefault(category, []).append(installs)
# calculate the average number of installs per category
for category, installs in installs_dict.items():
    avg_installs = round(sum(installs)/len(installs))
    installs_dict[category] = avg_installs
# sort and print resulting dictionary
for category in sorted(installs_dict, key=installs_dict.get, reverse=True):
    print(category, installs_dict[category])
For the Apple Store dataset, the top-3 most popular genres are Navigation, Reference and Social Networking. For the Google Play Store, the top-3 categories are Communication, Video Players and Social. But there is a high probability that each of these categories is dominated by a few leaders, e.g. Facebook with millions of users in the Social category.
To check that, I'll first examine the most-installed apps (more than 100 million installs) in each of the top-3 Google Play Store categories.
# check the most installed apps at Google Play Store in top-3 categories
# communication
print('The most popular apps in communication category are:', '\n')
for row in google_play_store:
    if row[1] == 'COMMUNICATION' and row[5] > 100000000:
        print(row[0], ':', row[5])
print('\n')
# video players
print('The most popular apps in video players category are:', '\n')
for row in google_play_store:
    if row[1] == 'VIDEO_PLAYERS' and row[5] > 100000000:
        print(row[0], ':', row[5])
print('\n')
# social
print('The most popular apps in social category are:', '\n')
for row in google_play_store:
    if row[1] == 'SOCIAL' and row[5] > 100000000:
        print(row[0], ':', row[5])
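The loops above select apps by a fixed install threshold. To list a fixed number of leaders per category instead, one option is to sort the category's rows and slice. The sketch below assumes the Google Play column layout used above (name at index 0, category at index 1, installs as integers at index 5). top_apps and the sample rows are made up.

```python
def top_apps(rows, category, n, name_index=0, category_index=1, installs_index=5):
    """Return the n most installed (name, installs) pairs in one category."""
    in_category = [row for row in rows if row[category_index] == category]
    ranked = sorted(in_category, key=lambda row: row[installs_index], reverse=True)
    return [(row[name_index], row[installs_index]) for row in ranked[:n]]

# Toy rows: only the name, category and installs columns carry real values.
sample_rows = [
    ['App A', 'SOCIAL', '', '', '', 500],
    ['App B', 'SOCIAL', '', '', '', 1000],
    ['App C', 'TOOLS', '', '', '', 9000],
    ['App D', 'SOCIAL', '', '', '', 10],
]
print(top_apps(sample_rows, 'SOCIAL', 2))  # [('App B', 1000), ('App A', 500)]
```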
Indeed, the top-3 categories are dominated by a few apps attracting millions of users. If we want to compete with them successfully, the only way is to offer fundamentally new approaches or functionality, which is certainly not an easy task. A better approach is probably to choose categories in the middle of the list. Let's look at categories from the top 10, such as Travel_and_local and Game.
# calculate average total number of installs for Google Play Store
total = 0
for category in installs_dict:
    total += int(installs_dict[category])
mean_installs = total/len(installs_dict)
print(round(mean_installs))
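One caveat about the figure printed above: a mean is pulled upwards by a few very large values, so it can sit well above what a typical category reaches. Comparing it with the median gives a quick sense of that skew. The numbers below are invented purely to illustrate the effect and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical per-category averages, with one dominant category.
category_averages = {
    'A': 38_000_000,
    'B': 8_000_000,
    'C': 7_000_000,
    'D': 1_000_000,
    'E': 500_000,
}
values = list(category_averages.values())
print(round(mean(values)))  # 10900000
print(median(values))       # 7000000
```

In this toy case four of the five categories fall below the mean, which is why looking near the median can be a useful complement when searching for mid-range categories.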
The average number of installs across categories is about 7,281,600. Two categories in the list have an average in that range: Books_and_reference and Shopping. Next I will explore them.
# sort apps by installs; the name sorted_apps avoids shadowing the built-in sorted()
sorted_apps = sorted(google_play_store, key=lambda x: x[5], reverse=True)
# books_and_reference
print('The most popular apps in Books_and_reference category are:', '\n')
for row in sorted_apps:
    if row[1] == 'BOOKS_AND_REFERENCE' and row[5] > 10000000:
        print(row[0], ':', row[5])
print('\n')
# shopping
print('The most popular apps in Shopping category are:', '\n')
for row in sorted_apps:
    if row[1] == 'SHOPPING' and row[5] > 10000000:
        print(row[0], ':', row[5])
The category Books_and_reference looks more interesting, mostly because apart from a few leaders like Google Play Books or Amazon Kindle, which sell all sorts of books, it contains an app with free books (Wattpad) as well as religious texts. I'd like to check which apps fall in the middle range of total installs.
# check middle range
print('Apps of the middle range in Books_and_reference category:', '\n')
for row in sorted_apps:
    if row[1] == 'BOOKS_AND_REFERENCE' and 4000000 < row[5] < 10000000:
        print(row[0], ':', row[5])
print('\n')
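Since the same filter is repeated with different categories and bounds, it could be wrapped in a helper. The following is a minimal sketch assuming the column layout used above (name at index 0, category at index 1, integer installs at index 5). apps_in_range and the sample rows are hypothetical.

```python
def apps_in_range(rows, category, low, high, category_index=1, installs_index=5):
    """Return (name, installs) pairs with low < installs < high, largest first."""
    selected = [
        (row[0], row[installs_index])
        for row in rows
        if row[category_index] == category and low < row[installs_index] < high
    ]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Toy rows: only the name, category and installs columns carry real values.
sample_rows = [
    ['Reader X', 'BOOKS_AND_REFERENCE', '', '', '', 5_000_000],
    ['Reader Y', 'BOOKS_AND_REFERENCE', '', '', '', 10_000_000],
    ['Dictionary Z', 'BOOKS_AND_REFERENCE', '', '', '', 9_000_000],
    ['Shop Q', 'SHOPPING', '', '', '', 5_000_000],
]
middle = apps_in_range(sample_rows, 'BOOKS_AND_REFERENCE', 4_000_000, 10_000_000)
print(middle)  # [('Dictionary Z', 9000000), ('Reader X', 5000000)]
```

Both bounds are exclusive, matching the comparison used in the middle-range check, so an app with exactly 10,000,000 installs is left out.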
Apps in the middle range also include religious texts, not only big booksellers or dictionaries. The next step is to find out whether the same genres are popular among Apple Store users. Let's check the top-3 genres in the Apple Store dataset.
# check apps mostly rated at Apple Store
# navigation
print('There are few apps at Navigation genre:')
for row in apple_store:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])
print('\n')
# reference
print('There are more apps at Reference genre dominated by dictionaries and religious texts:')
for row in apple_store:
    if row[11] == 'Reference':
        print(row[1], ':', row[5])
print('\n')
# social networking
print('There are plenty apps at Social Networking genre but dominated by few of them:')
for row in apple_store:
    if row[11] == 'Social Networking':
        print(row[1], ':', row[5])
After exploring the Apple Store dataset, I see that the Reference genre is quite diverse: dictionaries and religious texts are prominent, but the genre is not dominated by a few leaders, so building an app in this genre could be an option.
In this project, I analyzed datasets from the Apple Store and the Google Play Store with the goal of recommending an app genre that can be profitable in both markets. My final recommendation is to build an application based around a niche but commonly used text. A perfect candidate would be some sort of religious or quasi-religious cult text (Marie Kondo, Harry Potter or the Satanic Bible?). The user experience can be enhanced with audio versions or even gamification. Of course, that recommendation should be backed by further study.