My objective as a data analyst on this project is to analyse data that helps my organisation understand which types of apps are likely to attract more users. This could significantly grow the mobile app user base and, in turn, increase mobile revenue.
Our mobile apps are free to download on the Google Play Store and the Apple App Store, and the main source of revenue is in-app ads. This means the organisation's revenue is mostly influenced by the number of users who use the apps: the more users that engage with the ads, the better.
The analysis is based on two datasets published on the web: googleplaystore.csv for Google Play Store apps and AppleStore.csv for Apple App Store apps.
The CSV datasets downloaded from the web are stored in separate lists for Google apps and Apple apps. The header row of each dataset is kept in its own variable and excluded from the data rows.
# Read the data
import csv

# The files are opened with an explicit encoding because app names
# contain non-ASCII characters; `with` closes each file after reading
with open('googleplaystore.csv', encoding='utf8') as file_google:
    data_google = list(csv.reader(file_google))
data_google_header = data_google[0]
data_google = data_google[1:]

with open('AppleStore.csv', encoding='utf8') as file_apple:
    data_apple = list(csv.reader(file_apple))
data_apple_header = data_apple[0]
data_apple = data_apple[1:]
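Since the same steps (open the file, read it, split off the header) are repeated for each dataset, they could be wrapped in a small helper. The sketch below is one possible way to do this. `load_csv` is a hypothetical name that is not used elsewhere in this notebook, and UTF-8 encoding is assumed for both files.

```python
import csv

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Example usage, assuming the files sit next to the notebook:
# data_google_header, data_google = load_csv('googleplaystore.csv')
# data_apple_header, data_apple = load_csv('AppleStore.csv')
```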
The `data_explore` function defined below helps with sampling rows from the Google and Apple datasets.
# This function prints rows from the dataset
def data_explore(data, start, end, rows_and_columns=False):
    data_slice = data[start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(data))
        print('Number of columns:', len(data[0]))
There are 10,841 rows in the Play Store dataset. Based on the Google sample data and headers displayed, the following columns will be most useful for finding out which categories and genres tend to be popular on the Play Store, with price also taken into account: `App`, `Category`, `Rating`, `Reviews`, `Installs`, and `Price`.
# Preview google playstore data
print(data_google_header)
data_explore(data_google, 1, 6, True)
There are 7,197 rows in the App Store dataset. The following columns will be most useful for finding out which genres tend to be popular on the Apple App Store, with price also taken into account: `track_name`, `price`, `rating_count_tot`, `user_rating`, and `prime_genre`.
# Preview apple app store data
print(data_apple_header)
data_explore(data_apple, 1, 6, True)
Before progressing further, the datasets are checked to detect and remove inaccurate rows, duplicate entries, non-English apps, and paid apps.
For the time being, my organisation is only interested in developing free mobile apps for an English-speaking audience, so the data has to be narrowed down accordingly.
# Based on published information on the discussion
# section of the google dataset, the row index [10472]
# (excluding header) is missing its category datapoint.
# This row is deleted using the `del` python command
print(data_google[10472])
del data_google[10472]
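Rather than relying only on a hard-coded index, malformed rows could also be located by comparing each row's length with the header's. The sketch below assumes that a missing value makes the row shorter than the header, which is an assumption about how the row is malformed. `find_malformed_rows` is a hypothetical helper, and the sample rows are made up.

```python
def find_malformed_rows(data, header):
    """Return the indices of rows whose column count differs from the header's."""
    return [i for i, row in enumerate(data) if len(row) != len(header)]

# Made-up sample: the second row is missing one value
sample_header = ['App', 'Category', 'Rating']
sample_data = [
    ['App A', 'GAME', '4.5'],
    ['App B', '1.9'],
    ['App C', 'TOOLS', '4.0'],
]
print(find_malformed_rows(sample_data, sample_header))  # [1]
```

If several indices are returned, deleting them from the highest index to the lowest avoids shifting the positions of rows that still need to be removed.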
The `duplicate` function defined below helps with finding duplicate entries in the Google and Apple datasets.
# This function checks for duplicates
def duplicate(data, index, dataset):
    duplicate_list = []
    unique_list = []
    for app in data:
        name = app[index]
        if name in unique_list:
            duplicate_list.append(name)
        else:
            unique_list.append(name)
    print('There are', len(duplicate_list), 'duplicate apps in the ' + dataset)
    print('\n')
    print('These are duplicate instances in the ' + dataset, ':', duplicate_list[:16])
    print('\n')

duplicate(data_google, 0, 'playstore dataset')
duplicate(data_apple, 1, 'appstore dataset')
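As an alternative, duplicates can be counted with `collections.Counter`, which avoids searching a list for every row. This is only a sketch on made-up rows; `count_duplicates` is a hypothetical name and it returns the counts rather than printing them.

```python
from collections import Counter

def count_duplicates(data, index):
    """Return {name: occurrences} for names that appear more than once."""
    counts = Counter(row[index] for row in data)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up sample rows: (name, reviews)
sample = [['Instagram', '100'], ['Slack', '50'], ['Instagram', '120'], ['Instagram', '90']]
dupes = count_duplicates(sample, 0)
print(dupes)  # {'Instagram': 3}

# Number of surplus rows, comparable to len(duplicate_list) in duplicate()
print(sum(n - 1 for n in dupes.values()))  # 2
```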
Now that the duplicates have been detected, the next step is to delete them.
For the Google Play Store dataset, the duplicate entries differ mainly in their review counts. It is assumed that a higher review count means more recent data, so only the row with the highest number of reviews is kept for each app and the other duplicate entries are removed.
For the Apple App Store dataset, there seems to be a consensus on the discussion forum that the duplicate entries detected by the `duplicate` function are actually distinct apps, so these entries will be kept.
The `data_google_clean` list will not contain duplicate entries, because only the row with the highest review count is kept for each app.
reviews_max = {}
for app in data_google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

data_google_clean = []
already_added = []
for app in data_google:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        data_google_clean.append(app)
        already_added.append(name)

data_explore(data_google_clean, 1, 6, True)
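The same rule (keep the row with the most reviews for each app) can also be expressed in a single pass by storing the best row seen so far for each name. The sketch below uses made-up rows. `keep_max_reviews` is a hypothetical helper, and the column positions (name at index 0, reviews at index 3) mirror the ones used above.

```python
def keep_max_reviews(data, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in data:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: (name, category, rating, reviews)
sample = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]
for row in keep_max_reviews(sample):
    print(row)
```

One difference to be aware of: the rows come out ordered by the first appearance of each name, which can differ from the order produced by the two-loop version.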
Remember, my organisation is focused on developing English-language apps. Python's built-in `ord` function will be used to keep only the apps whose names are made up of characters in the ASCII (American Standard Code for Information Interchange) range, 0 to 127.
# Preview data on non-english apps
print(data_apple[813])
print(data_apple[6731])
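The `data_english` function helps to remove non-English mobile apps. It counts the characters in a name whose code point is above 127 and treats the name as non-English only when there are more than three of them, so that English names containing an emoji or a symbol such as ™ are not discarded.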
# This function checks for non-english characters
def data_english(string):
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

print(data_english('Instagram'))
print(data_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(data_english('Docs To Go™ Free Office Suite'))
print(data_english('Instachat 😜'))
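To make the role of `ord` more concrete, the sketch below prints the code points of a few characters and counts the non-ASCII characters in the example names. `count_non_ascii` is a hypothetical helper added purely for illustration.

```python
# Code points of a plain letter and of some non-ASCII characters
for char in ['a', '™', '😜', '爱']:
    print(repr(char), ord(char))

def count_non_ascii(string):
    """Number of characters whose code point is above the ASCII range."""
    return sum(1 for char in string if ord(char) > 127)

print(count_non_ascii('Docs To Go™ Free Office Suite'))  # 1
print(count_non_ascii('Instachat 😜'))  # 1
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

The first two names contain a single non-ASCII character each, so they stay within the three-character allowance, while the last name exceeds it.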
The `data_english` function behaves as expected on these examples, so it will now be used to remove the non-English apps. The resulting data will be stored in the `data_eng_google` and `data_eng_apple` lists.
data_eng_google = []
data_eng_apple = []

for app in data_google_clean:
    name = app[0]
    if data_english(name):
        data_eng_google.append(app)

for app in data_apple:
    name = app[1]
    if data_english(name):
        data_eng_apple.append(app)
The final part of the data cleaning process is to detect and delete the paid mobile apps, since we are only interested in data relating to free apps in both markets.
data_final_google = []
data_final_apple = []

for app in data_eng_google:
    price = app[7]
    if price == '0':
        data_final_google.append(app)

for app in data_eng_apple:
    price = app[4]
    if price == '0.0':
        data_final_apple.append(app)

print(len(data_final_google), 'mobile apps to be analysed in the google playstore dataset')
print(len(data_final_apple), 'mobile apps to be analysed in the apple app store dataset')
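The two comparisons above depend on the exact text used for a zero price in each dataset (`'0'` and `'0.0'`). A slightly more tolerant check could parse the price as a number instead. The sketch below is one possibility. `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about how prices are formatted.

```python
def is_free(price):
    """Return True when a price string represents zero cost."""
    # Assumption: paid prices may carry a leading '$' (e.g. '$4.99')
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values are treated as not free, so they are excluded
        return False

for value in ['0', '0.0', '0.00', '$4.99', 'Everyone']:
    print(value, is_free(value))
```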
The data is now ready to be analysed. As stated at the outset, the objective is to analyse the sample data on mobile apps in the Google Play Store and Apple App Store to help my organisation understand which genres and categories of apps are likely to attract more users. This could help the organisation significantly grow its mobile app user base and, in turn, increase mobile revenue.
# This function builds a frequency table for the
# prime_genre data column of the app store dataset
# and the genres and category column of the playstore
# dataset which will be displayed in percentages
def data_freq(data, index):
    datafreq = {}
    total = 0
    for app in data:
        total += 1
        datac = app[index]
        if datac in datafreq:
            datafreq[datac] += 1
        else:
            datafreq[datac] = 1
    data_perc = {}
    for app in datafreq:
        perc = (datafreq[app] / total) * 100
        data_perc[app] = perc
    return data_perc

# This function will display the percentages
# developed in data_freq by descending order
def display_table(data, index):
    dataf = data_freq(data, index)
    data_display = []
    for key in dataf:
        key_val_as_tuple = (dataf[key], key)
        data_display.append(key_val_as_tuple)
    data_sorted = sorted(data_display, reverse=True)
    for entry in data_sorted:
        print(entry[1], ':', entry[0])
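For comparison, the same percentage table can be built with `collections.Counter`, whose `most_common()` method already returns values from most to least frequent. This is a sketch on made-up rows, and `freq_table_percent` is a hypothetical name.

```python
from collections import Counter

def freq_table_percent(data, index):
    """Return [(value, percentage)] sorted from most to least frequent."""
    counts = Counter(row[index] for row in data)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample rows: (name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Weather']]
for genre, percent in freq_table_percent(sample, 1):
    print(genre, ':', percent)
# Games : 75.0
# Weather : 25.0
```

Ties are ordered differently in the two versions: `display_table` sorts `(percentage, value)` tuples in reverse, whereas `most_common()` keeps tied values in the order they were first seen.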
Based on the data below, generated from the Apple App Store dataset, the `Games` genre is far ahead in representation with 58%, while the `Entertainment` genre takes a distant second place with 7.8%. The general impression is that mobile apps designed for entertainment, such as games and social networking, are more widely available on the App Store than mobile apps designed for practical purposes (for instance productivity, news, weather, finance).
display_table(data_final_apple, -5)
Based on the data below, generated from the Google Play Store dataset, the `Family` and `Game` categories have the highest availability, with a combined share of 28.64%. As with the App Store dataset, the general impression is that mobile apps designed for entertainment are more highly represented than mobile apps designed for practical purposes.
However, while the top two genres in the App Store data have a combined share of 66%, the top two categories in the Google Play Store data account for only 28.64%. This implies that practical apps tend to be better represented on the Google Play Store. Looking more closely, the tools, business, productivity, finance and medical categories have a combined share of 24.11%.
These figures suggest a more balanced landscape of practical and fun apps on the Google Play Store than on the Apple App Store.
display_table(data_final_google, 1)
The code below finds the most popular genres in the App Store dataset based on the number of users. Remember, the objective is to find the genres with the most users in both datasets. For the Play Store data, the `Installs` column can be used as-is, but for the App Store data the total number of user ratings (the `rating_count_tot` column) will be used as a proxy, since there is no column that directly shows the number of installs.
data_genres_apple = data_freq(data_final_apple, -5)
for genre in data_genres_apple:
    total = 0
    len_genre = 0
    for app in data_final_apple:
        data_genre = app[-5]
        if data_genre == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
    ratings_avg = total / len_genre
    print(genre, ':', ratings_avg)
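The nested loops above scan the whole dataset once per genre. The same averages can be computed in a single pass by keeping a running total and count for each genre. The sketch below illustrates this on made-up rows; `average_by_group` is a hypothetical helper and the column positions are passed in as arguments.

```python
from collections import defaultdict

def average_by_group(data, group_index, value_index):
    """Return {group: mean of the numeric column} in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in data:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: (name, rating count, genre)
sample = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]
print(average_by_group(sample, group_index=2, value_index=1))
# {'Games': 200.0, 'Weather': 50.0}
```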
The data shows that the `Navigation` genre has the highest average number of user ratings, but this figure is heavily influenced by Google Maps and Waze, which have close to half a million user ratings combined. The same pattern holds for the `Social Networking` genre, where the average is heavily influenced by apps from large organisations like Facebook and Microsoft.
for app in data_final_apple:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])
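The effect of a few very large apps on an average can be illustrated by comparing the mean with the median, which is much less sensitive to outliers. The numbers below are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: one very large app and several small ones
rating_counts = [350000, 1200, 900, 400, 150]

print('mean  :', mean(rating_counts))
print('median:', median(rating_counts))
```

Here the mean is 70,530 while the median is 900, so the single large app accounts for almost all of the average.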
Besides the `Navigation` and `Social Networking` genres, whose results may be skewed by a few very popular apps, the `Reference` genre seems to be doing quite well on the Apple App Store, with 74,942 user ratings on average. Looking further into this genre below, it can be observed that the Bible and Dictionary.com apps pull up the average rating count for the `Reference` genre. Despite this, publishing a free mobile app in the `Reference` genre could be promising for the organisation, as the genre appears to keep pace with the popularity of genres like social networking.
Publishing mobile versions of highly popular books could prove to be a success for bringing in more mobile users. Possibly, this could be a book which could be geared towards family / fun by integrating within the app elements such as puzzles, mini games, and quizzes.
for app in data_final_apple:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])
The install counts in the Play Store data are not precise. For instance, we cannot tell whether an app listed with 100,000+ installs has 100,000, 200,000, or 350,000 installs.
# Preview data based on installs
display_table(data_final_google, 5)
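The code below computes the average number of installs for each category in the Play Store data. Each install value is converted to a number by removing the commas and the plus sign, so an app listed with 100,000+ installs is counted as having 100,000 installs.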
google_categories = data_freq(data_final_google, 1)
for category in google_categories:
    total = 0
    len_category = 0
    for app in data_final_google:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
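The string clean-up applied to the installs column can be isolated in a small function, which makes it easy to check on individual values. `parse_installs` is a hypothetical name used only for this sketch.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the number 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))

print(parse_installs('100,000+'))  # 100000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
```

Because the plus sign is simply dropped, each parsed value is a lower bound on the real number of installs, so the resulting averages are best read as conservative estimates.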
The Google Play Store market seems to be dominated by popular apps from high-profile organisations. The data shows that communication apps have the highest average number of installs (38,456,119), but, as previously observed in the App Store data, this number is skewed by popular social networking apps that tend to have over one billion installs, for example WhatsApp, Facebook Messenger, YouTube, and Instagram.
for app in data_final_google:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
Besides the `Communication` category, further analysis shows that the `Game` and `Books and Reference` categories also do quite well, with average installs of 15,588,015 and 8,767,811 respectively. This is insightful because the earlier recommendation for the Apple App Store could potentially apply to the Google Play Store market as well. This would mean my organisation could publish the same mobile app in versions compatible with both markets.
for app in data_final_google:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])
Publishing a free mobile app in the `Books and Reference` genre could be promising for the organisation. This genre seems to keep pace with the popularity of the top genres on the app store markets, which are mostly geared towards fun and entertainment, such as games, social networking, photography and video. This pattern is observed in both sample datasets, from the Google Play Store and the Apple App Store.
Publishing a compatible mobile version of a highly popular book on both app store markets could prove successful in bringing in more mobile users. Possibly, this could be a book geared towards both the family and fun categories, integrating puzzles, mini games and quizzes so that elements of the most popular genres on the market are built into one mobile app.