I'm a Data Analyst for a company that builds mobile apps that are free to download from the Google Play Store and the Apple App Store. Our main source of revenue is in-app ads, so revenue is driven mostly by the number of users: the more users who see and engage with the ads, the better for the company's revenue.
The objective of this project is to analyze data to help the company understand which types of apps are likely to attract more users. This could help the company significantly grow its mobile app user base and, in turn, its mobile revenue stream.
The analysis uses two sample datasets drawn from published material available on the web: one for Google Play apps (googleplaystore.csv) and one for Apple App Store apps (AppleStore.csv).
The analysis begins by reading in the sample data downloaded from the internet. The data is stored in separate lists for Google apps and Apple apps, with each header row kept in its own variable.
# Read the data
import csv

# Google Play dataset: the first row is the header
with open('googleplaystore.csv', encoding='utf8') as file_google:
    data_google = list(csv.reader(file_google))
data_google_header = data_google[0]
data_google = data_google[1:]

# Apple App Store dataset: the first row is the header
with open('AppleStore.csv', encoding='utf8') as file_apple:
    data_apple = list(csv.reader(file_apple))
data_apple_header = data_apple[0]
data_apple = data_apple[1:]
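The load-and-split pattern used for both files can be wrapped in a small helper. The sketch below is an illustration, not part of the original notebook: `read_dataset` is a hypothetical name, and an in-memory `io.StringIO` sample with made-up rows stands in for the CSV files so the example runs on its own.

```python
import csv
import io

def read_dataset(file_obj):
    """Return (header, rows) from an open CSV file or file-like object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Made-up sample standing in for a real CSV file
sample = io.StringIO("App,Category\nFoo,GAME\nBar,TOOLS\n")
header, rows = read_dataset(sample)
print(header)
print(rows)
```

With a real file, the same helper could be called inside a `with open(...)` block.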
The data_explore function defined below prints a slice of rows from the Google and Apple datasets and, optionally, the number of rows and columns.
# This function prints rows from the dataset
def data_explore(data, start, end, rows_and_columns=False):
    data_slice = data[start:end]
    for row in data_slice:
        print(row)
        print('\n')  # blank line after each row
    if rows_and_columns:
        print('Number of rows:', len(data))
        print('Number of columns:', len(data[0]))
Based on the headers and sample rows displayed below, the columns describing each app's name, category, number of reviews, number of downloads, and price will be the most useful for finding out which categories or genres of published apps tend to do well on these stores. Price also acts as a filter: the company is mainly interested in free mobile apps on the Play Store and the App Store.
# Preview the data
print(data_google_header)
data_explore(data_google, 1, 6, True)
print('\n')
print(data_apple_header)
data_explore(data_apple, 1, 6, True)
Before going further, we need to clean the datasets by detecting and removing inaccurate rows and duplicate entries.
For the time being, my company is only interested in developing free mobile apps for an English-speaking audience, so we also have to narrow the datasets down to apps that fit those criteria.
# According to the discussion section of the Google dataset,
# the row at index 10472 (excluding the header) is missing
# its category value, so it is printed for inspection and removed.
print(data_google[10472])
del data_google[10472]
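The row above was found through the dataset's discussion section. As a rough programmatic check, a row that is missing a value will often have fewer columns than the header. The sketch below assumes that is how the problem shows up; `find_malformed_rows` is a hypothetical helper and the sample rows are made up.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Made-up sample: the second row is missing its category value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Foo', 'GAME', '4.5'],
    ['Bar', '3.9'],
    ['Baz', 'TOOLS', '4.1'],
]
bad_indexes = find_malformed_rows(sample_header, sample_rows)
print(bad_indexes)
```

A check like this only catches rows with a shifted column count. A row with the right number of columns but a wrong value would still need manual inspection.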
The duplicate function defined below helps find duplicate entries in the Google and Apple datasets.
# This function checks for duplicates
def duplicate(data, index, dataset):
    duplicate_list = []
    unique_list = []
    for app in data:
        name = app[index]
        if name in unique_list:
            duplicate_list.append(name)
        else:
            unique_list.append(name)
    # Report how many duplicate entries were found, with a few examples
    print('There are', len(duplicate_list), 'duplicate apps in the ' + dataset)
    print('\n')
    print('Examples of duplicate apps in the ' + dataset, ':', duplicate_list[:16])
    print('\n')

duplicate(data_google, 0, 'playstore dataset')
duplicate(data_apple, 1, 'appstore dataset')
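Checking membership in a list gets slow as the list grows, so a counting approach may scale better. The sketch below uses `collections.Counter` from the standard library. `duplicate_names` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def duplicate_names(rows, index):
    """Map each name that appears more than once to its number of appearances."""
    counts = Counter(row[index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up sample: 'Foo' appears three times
sample_rows = [['Foo', '10'], ['Bar', '5'], ['Foo', '12'], ['Foo', '12']]
dupes = duplicate_names(sample_rows, 0)
# Number of rows that would be dropped if one row per name were kept
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes, extra_rows)
```

Here `extra_rows` plays the role of `len(duplicate_list)` in the function above: the number of surplus entries, not the number of distinct duplicated names.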
The higher the number of reviews, the more recent the data should be. For this analysis, only the row with the highest number of reviews is kept for each app, and the other duplicate entries are removed.
# Record the highest review count seen for each app name
reviews_max = {}
for app in data_google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('There will be', len(reviews_max), 'rows left when duplicates are removed from the playstore dataset')
The data_google_clean list will not contain duplicate entries, because only the row with the highest number of reviews is kept for each app.
data_google_clean = []
already_added = []
for app in data_google:
    name = app[0]
    n_reviews = float(app[3])
    # already_added guards against ties, where several rows share the max
    if n_reviews == reviews_max[name] and name not in already_added:
        data_google_clean.append(app)
        already_added.append(name)
data_explore(data_google_clean, 1, 6, True)
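The two passes above (find the maximum, then collect the matching rows) can also be folded into one. The sketch below is an alternative, not the notebook's method: `keep_most_reviewed` is a hypothetical helper, the sample rows are made up, and the output order follows each name's first appearance rather than the position of its winning row.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample: 'Foo' appears three times, twice with the top count
sample_rows = [
    ['Foo', '10'],
    ['Bar', '5'],
    ['Foo', '12'],
    ['Foo', '12'],
]
deduped = keep_most_reviewed(sample_rows, 0, 1)
print(deduped)
```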
Some mobile apps in the datasets target non-English audiences. These will be removed, since the company is focused on developing English-language apps. Python's built-in ord function returns the code point of a character, and the characters commonly used in English text fall in the ASCII (American Standard Code for Information Interchange) range of 0-127. App names with more than three characters outside that range are treated as non-English, which leaves room for the occasional emoji or symbol such as ™.
# Preview data on non-english apps
print(data_apple[813])
print(data_apple[6731])
The data_english function defined below checks whether an app name looks English, so that rows for non-English apps can be removed.
# This function returns False when a name has more than three
# non-ASCII characters, which is treated as a sign that the app
# is not aimed at an English-speaking audience
def data_english(dstr):
    count = 0
    for char in dstr:
        dord = ord(char)
        if dord > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

print(data_english('Instagram'))
print(data_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(data_english('Docs To Go™ Free Office Suite'))
print(data_english('Instachat 😜'))
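To see why the threshold matters, it can help to look at the non-ASCII count itself and not only the final True or False. The sketch below splits the check into two hypothetical helpers, `count_non_ascii` and `looks_english`, with the threshold exposed as a parameter. The default of 3 mirrors the function above.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)

def looks_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

names = [
    'Instagram',
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in names:
    print(name, ':', count_non_ascii(name), looks_english(name))
```

This is still a heuristic. A non-English name written entirely in ASCII characters would pass, and an English name with four or more emoji would be rejected.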
The data_eng_google and data_eng_apple lists will contain only English-language apps, because the non-English rows are filtered out of the datasets.
data_eng_google = []
data_eng_apple = []
for app in data_google_clean:
    name = app[0]
    if data_english(name):
        data_eng_google.append(app)
for app in data_apple:
    name = app[1]
    if data_english(name):
        data_eng_apple.append(app)
# Preview data
data_explore(data_eng_google, 0, 3, True)
data_explore(data_eng_apple, 0, 3, True)
Paid apps will be removed by building final lists that exclude any app with a price greater than 0.0. The data_final_google and data_final_apple lists will be the basis for the recommendation to the company once the data has been analysed further.
data_final_google = []
data_final_apple = []
# Prices are stored as text: '0' in the playstore data, '0.0' in the appstore data
for app in data_eng_google:
    price = app[7]
    if price == '0':
        data_final_google.append(app)
for app in data_eng_apple:
    price = app[4]
    if price == '0.0':
        data_final_apple.append(app)
print('There are', len(data_final_google), 'mobile apps remaining in the playstore dataset')
print('There are', len(data_final_apple), 'mobile apps remaining in the appstore dataset')
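Comparing price strings works here because each store writes a free price in one fixed way ('0' or '0.0'). A numeric comparison is less sensitive to formatting. The sketch below assumes paid prices may carry a leading dollar sign (for example '$4.99'); `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: prices are plain numbers, optionally prefixed with '$'
    cleaned = price_text.replace('$', '').strip()
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$4.99', '1.99']:
    print(price, ':', is_free(price))
```

With a helper like this, the same condition could be used for both datasets.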
The data is now ready to be analysed. The objective of this project is to analyze the sample data on mobile apps in the Google Play Store and the Apple App Store to help the company understand which types of apps are likely to attract more users. This could help the company significantly grow its mobile app user base and, in turn, its mobile revenue stream.
# This function builds a frequency table for the
# prime_genre column of the app store dataset
# and the genres and category columns of the playstore
# dataset, with the values expressed as percentages
def data_freq(data, index):
    datafreq = {}
    total = 0
    for app in data:
        total += 1
        datac = app[index]
        if datac in datafreq:
            datafreq[datac] += 1
        else:
            datafreq[datac] = 1
    data_perc = {}
    for app in datafreq:
        perc = (datafreq[app] / total) * 100
        data_perc[app] = perc
    return data_perc
# This function displays the percentages
# computed by data_freq in descending order
def display_table(data, index):
    dataf = data_freq(data, index)
    data_display = []
    for key in dataf:
        key_val_as_tuple = (dataf[key], key)
        data_display.append(key_val_as_tuple)
    data_sorted = sorted(data_display, reverse=True)
    for entry in data_sorted:
        print(entry[1], ':', entry[0])
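The counting and sorting done by the two functions above can also be expressed with `collections.Counter`, whose `most_common` method returns entries from most to least frequent. The sketch below is an alternative under that approach: `percentage_table` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample: three apps in Games, one in Weather
sample_rows = [['A', 'Games'], ['B', 'Games'], ['C', 'Games'], ['D', 'Weather']]
table = percentage_table(sample_rows, 1)
for value, pct in table:
    print(value, ':', pct)
```

One difference from display_table: when two values have the same count, `most_common` keeps them in first-seen order, while display_table breaks ties by sorting the names in reverse.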
Based on the output below for the Apple App Store dataset, the Games genre is far ahead in representation at 58%, while the Entertainment genre is a distant second at 7.8%. The general impression is that apps designed for entertainment (games, social networking) are more common than apps designed for practical purposes (productivity, news, weather, finance).
display_table(data_final_apple, -5)
Based on the output below for the Google Play Store dataset, the Family and Game categories have the highest shares, taking first and second place with a combined 28.64%. As with the App Store dataset, the general impression is that apps designed for entertainment are more heavily represented than apps designed for practical purposes (productivity, finance, medical).
On closer inspection, however, the two stores differ. The top two fun/entertainment genres account for a combined 66% of the App Store dataset, compared with only 28.64% in the Google Play Store dataset. This suggests that practical apps are also well represented on Google Play: the tools, business, productivity, finance, and medical categories combine for 24.11%.
The data shows a more balanced mix of practical and fun apps in the Google Play Store than in the Apple App Store, which seems to be dominated by apps designed for fun and entertainment.
display_table(data_final_google, 1)
The code below finds the most popular genres in the App Store dataset based on the number of users. Remember, the objective is to find the genres with the most users in both datasets. For the Play Store dataset, the Installs column serves this purpose. The App Store dataset has no install counts, so the total number of user ratings, found in the rating_count_tot column, is used as a proxy.
data_genres_apple = data_freq(data_final_apple, -5)
for genre in data_genres_apple:
    total = 0      # sum of rating counts for this genre
    len_genre = 0  # number of apps in this genre
    for app in data_final_apple:
        data_genre = app[-5]
        if data_genre == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
    ratings_avg = total / len_genre
    print(genre, ':', ratings_avg)
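The nested loop above rescans the whole dataset once per genre. The same averages can be accumulated in a single pass by keeping a running total and count per group. The sketch below shows that idea with `collections.defaultdict`. `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: average of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample: name, genre, rating count
sample_rows = [
    ['A', 'Games', '100'],
    ['B', 'Games', '300'],
    ['C', 'Weather', '50'],
]
averages = average_by_group(sample_rows, 1, 2)
print(averages)
```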
The data shows that the Navigation genre has the highest average number of user ratings, but this figure is heavily influenced by the popularity of Google Maps and Waze, which together have close to half a million user ratings. The same pattern appears in the Social Networking genre, where the average is heavily influenced by apps from large organisations such as Facebook and Microsoft.
for app in data_final_apple:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])
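When a handful of very large apps dominate a genre, the mean can sit far above what a typical app achieves. Comparing it with the median is one quick way to gauge that skew. The sketch below uses the standard library's `statistics` module on invented rating counts, not values from the dataset.

```python
from statistics import mean, median

# Invented rating counts: two giants and four small apps
rating_counts = [500000, 150000, 4000, 3000, 1000, 500]

avg = mean(rating_counts)
mid = median(rating_counts)
print('mean:', avg)
print('median:', mid)
```

A mean many times larger than the median suggests the average is being carried by a few outliers, which is the pattern described for Navigation above.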
Besides the Navigation and Social Networking genres, whose averages may be skewed by a few high-profile apps, the Reference genre seems to be doing quite well on the App Store, with 74,942 user ratings per app on average. Looking further into this genre below, the Bible and Dictionary.com apps appear to pull the average up.
Publishing a free mobile app in the Reference genre could be promising for the company, since the objective is to significantly increase the user base. The genre appears to be doing very well and keeping up with the popularity of genres like navigation and social networking. Publishing mobile versions of highly popular books could be a successful way to bring in more users. One possibility is a book geared towards family and fun that integrates puzzles and mini games, building elements of these highly popular genres into the app.
for app in data_final_apple:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])
The table below shows the percentage of apps in each install bracket for the Play Store dataset. These brackets are not precise: for instance, we cannot tell whether an app listed with 100,000+ installs has 100,000, 200,000, or 350,000 installs.
# Preview data based on installs
display_table(data_final_google, 5)
The code below computes the average number of installs for each category in the Google Play Store dataset, to support the analysis and the resulting app recommendation.
google_categories = data_freq(data_final_google, 1)
for category in google_categories:
    total = 0
    len_category = 0
    for app in data_final_google:
        category_app = app[1]
        if category_app == category:
            # Strip ',' and '+' so that e.g. '1,000,000+' becomes '1000000'
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
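The string clean-up inside the loop can be pulled out into a small function, which makes the conversion easy to check on its own. The sketch below is a hypothetical helper, `parse_installs`, and it assumes install values always look like digits with optional commas and a trailing plus sign.

```python
def parse_installs(installs_text):
    """Convert an install bracket such as '1,000,000+' to its lower bound."""
    cleaned = installs_text.replace(',', '').replace('+', '')
    return int(cleaned)

for text in ['0', '100+', '100,000+', '1,000,000,000+']:
    print(text, '->', parse_installs(text))
```

The result is the lower bound of the bracket, so averages built from it understate the true install counts by an unknown amount.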
The Google Play Store seems to be dominated by highly popular apps from high-profile organisations. The data shows that communication apps have the highest average number of installs, 38,456,119, but as observed in the App Store dataset, this number is skewed by hugely popular social networking apps, some with more than one billion installs, for instance WhatsApp, Facebook Messenger, YouTube, and Instagram.
for app in data_final_google:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
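One way to gauge how much the giants inflate a category average is to recompute it with the largest apps left out. The sketch below does that on made-up rows. `average_installs_below` is a hypothetical helper, and the 100,000,000 cap is an arbitrary choice for illustration.

```python
def average_installs_below(rows, installs_index, cap):
    """Average the installs of rows whose install count is below cap."""
    values = []
    for row in rows:
        installs = int(row[installs_index].replace(',', '').replace('+', ''))
        if installs < cap:
            values.append(installs)
    # Return 0.0 when no row is below the cap, to avoid dividing by zero
    return sum(values) / len(values) if values else 0.0

# Made-up sample: one giant and two small apps
sample_rows = [
    ['Giant Messenger', '1,000,000,000+'],
    ['Small Chat', '1,000,000+'],
    ['Tiny Chat', '500,000+'],
]
avg_all = average_installs_below(sample_rows, 1, float('inf'))
avg_capped = average_installs_below(sample_rows, 1, 100_000_000)
print('all apps:', avg_all)
print('below cap:', avg_capped)
```

A large gap between the two figures indicates that the category's average says more about its biggest apps than about a typical one.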
On further analysis, the Game and Books and Reference categories seem to do quite well, with average installs of 15,588,015 and 8,767,811 respectively. This is useful because the earlier recommendation for the App Store could also apply to the Google Play Store, meaning the company could launch the same mobile app on both markets.
for app in data_final_google:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])
Publishing a free mobile app in the Books and Reference genre could be promising for the company, since the objective is to significantly increase the user base. The genre appears to be doing very well and keeping up with the popularity of genres geared towards fun and entertainment, like games, social networking, photography, and video. This pattern is observed in both sample datasets, from the Google Play Store and the Apple App Store.
Publishing the mobile version of a highly popular book could be a successful way to bring in more users. One possibility is a book geared towards both the family and fun categories that integrates puzzles, mini games, and quizzes, building elements of the most popular genres on the market into a single app. These are possible ideas for the company's upcoming mobile apps.