This project showcases how to clean and analyze mobile app data using functions, conditional statements, for-loops, and dictionaries.
The scope of the project is free, English-language apps currently in the Apple Store and Google Play markets.
Goal of this project: Recognize what makes a mobile app profitable. With free applications, revenue streams are created when users interact with ads or purchase in-game items. Ergo, more users means more revenue.
Information like genre, user rating, installs, and reviews is used to find which apps are most popular.
Initial findings show that 'Games' is the most common genre in the Apple Store, accounting for ~58% of the free English apps. Google Play has a more even distribution across its genres, with 'Tools' being the most common at about 8%.
Apple Store data source direct link: https://dq-content.s3.amazonaws.com/350/AppleStore.csv
Google Play Store data source direct link: https://dq-content.s3.amazonaws.com/350/googleplaystore.csv
The helper function reader is imported and used to read the CSV files for the Apple and Google datasets.
# import the reader function from the csv module and read in the files
from csv import reader
applefile = open('AppleStore.csv')
googlefile = open('googleplaystore.csv')
read_file1 = reader(applefile)
read_file2 = reader(googlefile)
# Create list of lists for each dataset
appdata = list(read_file1)
googdata = list(read_file2)
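The cell above opens both files without an explicit encoding and never closes them. A minimal sketch of an alternative follows, assuming the files are UTF-8 encoded (an assumption, not something the data source states). `load_csv` is a hypothetical helper name, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    # The with-block closes the file automatically; newline='' is the
    # setting the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as f:
        return list(reader(f))

# Usage, assuming both files sit in the working directory:
# appdata = load_csv('AppleStore.csv')
# googdata = load_csv('googleplaystore.csv')
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.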
A function explore_data was created to quickly gain insight into a dataset, showing a chosen slice of data along with the number of rows and columns.
# Takes a dataset, the start and end of the slice to show, and a boolean flag.
# Prints the slice, followed by the number of rows and columns when the flag is True.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset)) # length of the entire dataset, i.e. rows
        print('Number of columns:', len(dataset[0])) # length of the first row, i.e. columns
Let's use our previously written explore_data function to print a few lines of each dataset.
explore_data(appdata, 0,2,True)
print('\n')
explore_data(googdata, 0,2,True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
Number of rows: 10842
Number of columns: 13
Begin by searching for errors that need deleting or correcting.
These can take the form of duplicate rows, rows with blanks, etc.
One method would be to iterate over the dataset, find blanks, and return the index of each affected row so that the blank can be replaced or the row deleted. This will be done at a later time.
Fortuitously, a community discussion on Kaggle found an error in the Google Play dataset: row 10473 (header included) contains a blank where a string should be.
For now, let us delete this row and assume that it is the only blank error.
del googdata[10473]
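The approach described earlier (iterate over the dataset, find blanks, return the index) could look roughly like the sketch below. `find_bad_rows` is a hypothetical helper, and flagging rows whose length differs from the header is an added assumption about what counts as an error. The sample rows are made up for illustration.

```python
def find_bad_rows(dataset):
    """Return indices (header counted as index 0) of rows that contain a
    blank cell or have a different number of columns than the header."""
    n_cols = len(dataset[0])
    bad = []
    for i, row in enumerate(dataset[1:], start=1):
        has_blank = any(cell.strip() == '' for cell in row)
        if len(row) != n_cols or has_blank:
            bad.append(i)
    return bad

# Made-up sample rows, not taken from the dataset
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Broken app', '', '1.9'],   # blank cell
    ['Short row', 'TOOLS'],      # missing column
]
print(find_bad_rows(sample))  # [2, 3]
```

If several indices come back, deleting them in descending order keeps the remaining indices valid.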
def find_duplicates(dataset, index): # Assumes dataset with header
    duplicate_apps = [] # Empty lists are created to isolate both duplicated and unique apps
    unique_apps = []

    for row in dataset[1:]: # Iterate over dataset without header and read the value at the desired index
        name = row[index]
        if name in unique_apps: # If the value is already in unique_apps, it is a duplicate
            duplicate_apps.append(name)
        else:
            unique_apps.append(name) # Otherwise it is seen for the first time and counts as unique
    total = len(duplicate_apps) + len(unique_apps)
    print('Total entries: ' , total,'\n',
          'No. of dup. apps: ', len(duplicate_apps),'\n',
          'No. of unique apps: ',len(unique_apps),'\n')
    if len(duplicate_apps) >= 1:
        print('These are some of the duplicated apps: ',duplicate_apps[:3], '\n')
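Because `name in unique_apps` scans a list, the function above slows down as the dataset grows. Below is a sketch of the same counts using `collections.Counter` from the standard library. `count_duplicates` is a hypothetical name, and it returns the numbers instead of printing them.

```python
from collections import Counter

def count_duplicates(dataset, index):
    """Return (total, n_duplicates, n_unique) for one column; assumes a header row."""
    counts = Counter(row[index] for row in dataset[1:])
    total = sum(counts.values())
    n_unique = len(counts)
    # Every occurrence after the first one of a value counts as a duplicate
    return total, total - n_unique, n_unique

# Made-up sample: 'Box' appears three times, so two entries are duplicates
sample = [['App'], ['Box'], ['Instagram'], ['Box'], ['Box']]
print(count_duplicates(sample, 0))  # (4, 2, 2)
```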
Check the datasets for any duplicates. If there are any, the duplicate and unique counts should add up to the number of data rows (header excluded) in the dataset. *Note: The find_duplicates function assumes a header row.*
find_duplicates(appdata, 0)
find_duplicates(googdata,0)
Total entries: 7197
No. of dup. apps: 0
No. of unique apps: 7197
Total entries: 10840
No. of dup. apps: 1181
No. of unique apps: 9659
These are some of the duplicated apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
The header row was not included when finding the duplicates, so the totals count data rows only.
The Apple Store dataset returns no duplicates, whereas the Google Play dataset returns 1,181 duplicated rows.
While the duplicate rows could be removed at random, it is better to understand why there are duplicates and to remove them based on a clear criterion.
For example, the Google dataset has duplicate entries for the app Instagram. Because the data was saved at different times, the number of user reviews differs between the copies.
Keep the latest data by setting a rule that retains only the entry with the largest number of reviews, i.e. the highest number of reviews implies the most recently updated entry.
# If the app is already in the dictionary and the stored review count is lower than the current one, update the value
# If the app is not in the dictionary, store the current review count
reviews_max = {}
for row in googdata[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Total apps: ', len(googdata[1:]),'\n',
      'Unique: ', len(reviews_max), '\n',
      'Copies: ', len(googdata[1:]) - len(reviews_max))
Total apps: 10840
Unique: 9659
Copies: 1181
Remove the duplicates by initializing two lists, android_clean and already_added, in which to store the clean dataset and the names already added, respectively.
Iterate over the Google dataset without the header row.
android_clean = [] # This will store the updated, unique app dataset
already_added = [] # This list will only store the names

for row in googdata[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # Secondary condition used in case of actual duplicates rather than larger values
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
Explore the data and check that the number of rows is correct.
explore_data(android_clean, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] Number of rows: 9659 Number of columns: 13
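The removal above takes two passes: one to build reviews_max and one to build android_clean. As a sketch of an alternative, the same "keep the entry with the most reviews" rule can be applied in a single pass by storing the best row per name. `dedupe_by_max_reviews` is a hypothetical helper, and the sample values are made up for illustration.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """For each app name, keep the first row that has the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        # Strict '>' means an exact duplicate never replaces the stored row
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Made-up sample rows (name, category, rating, reviews)
sample = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Box', 'BUSINESS', '4.2', '50'],
    ['Instagram', 'SOCIAL', '4.5', '120'],  # newer copy with more reviews
    ['Box', 'BUSINESS', '4.2', '50'],       # exact duplicate
]
for row in dedupe_by_max_reviews(sample):
    print(row[0], row[3])
# Instagram 120
# Box 50
```

One difference to keep in mind: the result is ordered by the first appearance of each name, whereas android_clean is ordered by the position of each retained row.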
English Apps
We previously stated that our scope is free, English apps.
Let's create a function that removes non-English apps from the datasets.
The ASCII range used by English text (code points 0-127) will be used to decide which app names to filter out.
def is_eng(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))
True
False
False
False
Emojis and symbols such as ™ have code points greater than 127, so the function wrongly rejects some English app names.
To limit data loss, an app name is rejected only if it has more than three characters above 127, which should be sufficient for this project.
def is_eng(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127: # ord() returns the code point of that character
            non_ascii += 1 # increments by 1 for every character above the value 127

    if non_ascii > 3:
        return False
    else:
        return True
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))
True
False
True
True
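To see why the first version of is_eng rejected two of the English names, it can help to print the characters that fall outside the ASCII range. The short sketch below uses only the built-in `ord()`, which returns a character's Unicode code point.

```python
for name in ['Docs To Go™ Free Office Suite', 'Instachat 😜']:
    # Collect every character whose code point lies above the ASCII range
    outside = [(ch, ord(ch)) for ch in name if ord(ch) > 127]
    print(name, '->', outside)
# Docs To Go™ Free Office Suite -> [('™', 8482)]
# Instachat 😜 -> [('😜', 128540)]
```

Each name has a single character above 127, which is why a threshold of three lets both through.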
Using this updated function, non-English apps will be filtered out of both datasets.
Apply the filter to the Apple dataset and explore the data.
ios_eng = []

for row in appdata[1:]:
    name = row[1]
    if is_eng(name):
        ios_eng.append(row)

explore_data(ios_eng, 0, 2, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] Number of rows: 6183 Number of columns: 16
Now do the same for the google dataset.
android_eng = []

for row in android_clean:
    name = row[0]
    if is_eng(name):
        android_eng.append(row)

explore_data(android_eng, 0, 2, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] Number of rows: 9614 Number of columns: 13
Free Apps
The scope of this project is free, english apps.
For now, the final cleaning of the data will be to extract all free apps.
free_eng_ios = []
free_eng_android = []

for row in ios_eng:
    price = row[4]
    if price == '0.0': # The apple dataset has the string of 0.0 in place of free apps
        free_eng_ios.append(row)

for row in android_eng:
    price = row[7]
    if price == '0': # Unlike the apple dataset, google dataset uses the string of 0 for free apps
        free_eng_android.append(row)

explore_data(free_eng_android, 0, 3, True)
explore_data(free_eng_ios, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] Number of rows: 8864 Number of columns: 13 ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] Number of rows: 3222 Number of columns: 16
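The two loops above compare against different strings ('0.0' and '0') because the stores format a zero price differently. One way to avoid depending on the exact string is to compare numerically, as in the sketch below. `is_free` is a hypothetical helper, and the leading '$' on paid prices is an assumption that should be checked against the data.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes paid prices may carry a leading '$' (for example '$4.99');
    that format is an assumption to verify against the data.
    """
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # values that cannot be parsed are treated as not free

print(is_free('0.0'), is_free('0'), is_free('$4.99'), is_free('Everyone'))
# True True False False
```

With such a helper, both filtering loops could share the same condition, for example `if is_free(row[4])` for the Apple rows and `if is_free(row[7])` for the Google rows.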
Our goal is to determine the kinds of apps that are likely to attract more users: the more users who download and interact with our applications, the more revenue is made.
Explore the header rows from the datasets and check which columns can help in our analysis.
explore_data(googdata, 0, 1, True)
print('\n')
explore_data(appdata, 0, 1, True)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Number of rows: 10841
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Number of rows: 7198
Number of columns: 16
The columns below (with their index in brackets) can be scrutinized in our quest to find what makes an app popular.
Google data: categories [1], user ratings [2], no. of reviews [3], installs [5], content rating [8], genres [9];
Apple data: no. of ratings [5], user rating [7], content rating [10], genres [11]
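Hard-coded indices are easy to mistype. As a small sketch, the header row can be turned into a name-to-index lookup. The header below is copied from the Google Play output shown earlier, and `android_cols` is a hypothetical variable name.

```python
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
                  'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
                  'Android Ver']

# Map each column name to its position in a row
android_cols = {name: i for i, name in enumerate(android_header)}

print(android_cols['Installs'])  # 5
print(android_cols['Genres'])    # 9
```

A row value can then be read as `row[android_cols['Installs']]` instead of `row[5]`.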
Create a function freq_table that will create a frequency table for any column.
def freq_table(dataset, index):
    fq_table = {}
    total = len(dataset)

    for row in dataset:
        column = row[index]
        if column not in fq_table:
            fq_table[column] = 1
        else:
            fq_table[column] += 1

    for key in fq_table: # Convert the counts into percentages using the total length of the dataset
        fq_table[key] /= total
        fq_table[key] *= 100
    return fq_table
freq_table(free_eng_ios, 11)
{'Social Networking': 3.2898820608317814, 'Photo & Video': 4.9658597144630665, 'Games': 58.16263190564867, 'Music': 2.0484171322160147, 'Reference': 0.5586592178770949, 'Health & Fitness': 2.0173805090006205, 'Weather': 0.8690254500310366, 'Utilities': 2.5139664804469275, 'Travel': 1.2414649286157666, 'Shopping': 2.60707635009311, 'News': 1.3345747982619491, 'Navigation': 0.186219739292365, 'Lifestyle': 1.5828677839851024, 'Entertainment': 7.883302296710118, 'Food & Drink': 0.8069522036002483, 'Sports': 2.1415270018621975, 'Book': 0.4345127250155183, 'Finance': 1.1173184357541899, 'Education': 3.662321539416512, 'Productivity': 1.7380509000620732, 'Business': 0.5276225946617008, 'Catalogs': 0.12414649286157665, 'Medical': 0.186219739292365}
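For comparison, the same percentages can be produced with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the notebook's method. `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count / total * 100 for key, count in counts.items()}

# Made-up sample: three of the four rows are Games
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```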
The frequency table contains the data we want, but not in a readable format.
Create a function display_table that takes the output of the freq_table function and prints the table sorted in descending order.
def display_table(dataset, index):
    table = freq_table(dataset, index) # Assigning the output dictionary to the variable table
    table_display = [] # initializing an empty list
    for key in table: # iterating over a dictionary yields its keys
        key_val_as_tuple = (table[key], key) # creating a (value, key) tuple, reversed so that sorting uses the value
        table_display.append(key_val_as_tuple) # append the tuple to the list

    # Since sorted compares tuples by their first element, the key:value pair needed reversing
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
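The tuple reversal above works because `sorted` compares tuples element by element. A sketch of an alternative is to sort the dictionary items directly with a `key` function, which avoids building the reversed tuples. `sorted_items` is a hypothetical helper and the sample percentages are rounded for illustration.

```python
def sorted_items(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Rounded sample values for illustration
sample_table = {'Music': 2.0, 'Games': 58.2, 'Entertainment': 7.9}
for key, value in sorted_items(sample_table):
    print(f'{key} : {value:.2f}')
# Games : 58.20
# Entertainment : 7.90
# Music : 2.00
```

Ties are handled differently: the tuple approach breaks ties by comparing the keys, while this version keeps tied entries in their original dictionary order.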
Apple Store
Using the display_table function, we can analyze which genres are most common in the free_eng_ios dataset.
The frequency table shows that Games is the most common genre, accounting for 58% of that dataset. Entertainment comes second at nearly 8%, far behind Games.
The general impression is that games and entertainment applications dominate the free_eng_ios dataset, while practical genres like Finance, Weather and Navigation are far less common.
display_table(free_eng_ios, 11) # Primary genre for ios apps
Games : 58.16263190564867 Entertainment : 7.883302296710118 Photo & Video : 4.9658597144630665 Education : 3.662321539416512 Social Networking : 3.2898820608317814 Shopping : 2.60707635009311 Utilities : 2.5139664804469275 Sports : 2.1415270018621975 Music : 2.0484171322160147 Health & Fitness : 2.0173805090006205 Productivity : 1.7380509000620732 Lifestyle : 1.5828677839851024 News : 1.3345747982619491 Travel : 1.2414649286157666 Finance : 1.1173184357541899 Weather : 0.8690254500310366 Food & Drink : 0.8069522036002483 Reference : 0.5586592178770949 Business : 0.5276225946617008 Book : 0.4345127250155183 Navigation : 0.186219739292365 Medical : 0.186219739292365 Catalogs : 0.12414649286157665
Google Play
Let's now use the display_table function on the free_eng_android dataset for the columns Category and Genres.
The top category, Family, accounts for 18.9% of apps, while the second, Game, is at 9.7%. The gap between first and second is about 9 percentage points for this dataset, compared with about 50 percentage points for free_eng_ios.
Similarly, the most common entry in the Genres column is Tools at 8.4%, followed by Entertainment at 6.1%, a gap of only about 2 percentage points.
free_eng_android shows a more balanced distribution, in which both fun and practical apps are common.
display_table(free_eng_android,1) # categories for google apps
print('____________________','\n')
display_table(free_eng_android,9) # secondary genres for google apps
FAMILY : 18.907942238267147 GAME : 9.724729241877256 TOOLS : 8.461191335740072 BUSINESS : 4.591606498194946 LIFESTYLE : 3.9034296028880866 PRODUCTIVITY : 3.892148014440433 FINANCE : 3.7003610108303246 MEDICAL : 3.531137184115524 SPORTS : 3.395758122743682 PERSONALIZATION : 3.3167870036101084 COMMUNICATION : 3.2378158844765346 HEALTH_AND_FITNESS : 3.0798736462093865 PHOTOGRAPHY : 2.944494584837545 NEWS_AND_MAGAZINES : 2.7978339350180503 SOCIAL : 2.6624548736462095 TRAVEL_AND_LOCAL : 2.33528880866426 SHOPPING : 2.2450361010830324 BOOKS_AND_REFERENCE : 2.1435018050541514 DATING : 1.861462093862816 VIDEO_PLAYERS : 1.7937725631768955 MAPS_AND_NAVIGATION : 1.3989169675090252 FOOD_AND_DRINK : 1.2409747292418771 EDUCATION : 1.1620036101083033 ENTERTAINMENT : 0.9589350180505415 LIBRARIES_AND_DEMO : 0.9363718411552346 AUTO_AND_VEHICLES : 0.9250902527075812 HOUSE_AND_HOME : 0.8235559566787004 WEATHER : 0.8009927797833934 EVENTS : 0.7107400722021661 PARENTING : 0.6543321299638989 ART_AND_DESIGN : 0.6430505415162455 COMICS : 0.6204873646209386 BEAUTY : 0.5979241877256317 ____________________ Tools : 8.449909747292418 Entertainment : 6.069494584837545 Education : 5.347472924187725 Business : 4.591606498194946 Productivity : 3.892148014440433 Lifestyle : 3.892148014440433 Finance : 3.7003610108303246 Medical : 3.531137184115524 Sports : 3.463447653429603 Personalization : 3.3167870036101084 Communication : 3.2378158844765346 Action : 3.1024368231046933 Health & Fitness : 3.0798736462093865 Photography : 2.944494584837545 News & Magazines : 2.7978339350180503 Social : 2.6624548736462095 Travel & Local : 2.3240072202166067 Shopping : 2.2450361010830324 Books & Reference : 2.1435018050541514 Simulation : 2.0419675090252705 Dating : 1.861462093862816 Arcade : 1.8501805054151623 Video Players & Editors : 1.7712093862815883 Casual : 1.7599277978339352 Maps & Navigation : 1.3989169675090252 Food & Drink : 1.2409747292418771 Puzzle : 1.128158844765343 Racing : 0.9927797833935018 Role 
Playing : 0.9363718411552346 Libraries & Demo : 0.9363718411552346 Auto & Vehicles : 0.9250902527075812 Strategy : 0.9138086642599278 House & Home : 0.8235559566787004 Weather : 0.8009927797833934 Events : 0.7107400722021661 Adventure : 0.6768953068592057 Comics : 0.6092057761732852 Beauty : 0.5979241877256317 Art & Design : 0.5979241877256317 Parenting : 0.4963898916967509 Card : 0.45126353790613716 Casino : 0.42870036101083037 Trivia : 0.41741877256317694 Educational;Education : 0.39485559566787 Board : 0.3835740072202166 Educational : 0.3722924187725632 Education;Education : 0.33844765342960287 Word : 0.2594765342960289 Casual;Pretend Play : 0.236913357400722 Music : 0.2030685920577617 Racing;Action & Adventure : 0.16922382671480143 Puzzle;Brain Games : 0.16922382671480143 Entertainment;Music & Video : 0.16922382671480143 Casual;Brain Games : 0.13537906137184114 Casual;Action & Adventure : 0.13537906137184114 Arcade;Action & Adventure : 0.12409747292418773 Action;Action & Adventure : 0.10153429602888085 Educational;Pretend Play : 0.09025270758122744 Simulation;Action & Adventure : 0.078971119133574 Parenting;Education : 0.078971119133574 Entertainment;Brain Games : 0.078971119133574 Board;Brain Games : 0.078971119133574 Parenting;Music & Video : 0.06768953068592057 Educational;Brain Games : 0.06768953068592057 Casual;Creativity : 0.06768953068592057 Art & Design;Creativity : 0.06768953068592057 Education;Pretend Play : 0.056407942238267145 Role Playing;Pretend Play : 0.04512635379061372 Education;Creativity : 0.04512635379061372 Role Playing;Action & Adventure : 0.033844765342960284 Puzzle;Action & Adventure : 0.033844765342960284 Entertainment;Creativity : 0.033844765342960284 Entertainment;Action & Adventure : 0.033844765342960284 Educational;Creativity : 0.033844765342960284 Educational;Action & Adventure : 0.033844765342960284 Education;Music & Video : 0.033844765342960284 Education;Brain Games : 0.033844765342960284 Education;Action & Adventure : 
0.033844765342960284 Adventure;Action & Adventure : 0.033844765342960284 Video Players & Editors;Music & Video : 0.02256317689530686 Sports;Action & Adventure : 0.02256317689530686 Simulation;Pretend Play : 0.02256317689530686 Puzzle;Creativity : 0.02256317689530686 Music;Music & Video : 0.02256317689530686 Entertainment;Pretend Play : 0.02256317689530686 Casual;Education : 0.02256317689530686 Board;Action & Adventure : 0.02256317689530686 Video Players & Editors;Creativity : 0.01128158844765343 Trivia;Education : 0.01128158844765343 Travel & Local;Action & Adventure : 0.01128158844765343 Tools;Education : 0.01128158844765343 Strategy;Education : 0.01128158844765343 Strategy;Creativity : 0.01128158844765343 Strategy;Action & Adventure : 0.01128158844765343 Simulation;Education : 0.01128158844765343 Role Playing;Brain Games : 0.01128158844765343 Racing;Pretend Play : 0.01128158844765343 Puzzle;Education : 0.01128158844765343 Parenting;Brain Games : 0.01128158844765343 Music & Audio;Music & Video : 0.01128158844765343 Lifestyle;Pretend Play : 0.01128158844765343 Lifestyle;Education : 0.01128158844765343 Health & Fitness;Education : 0.01128158844765343 Health & Fitness;Action & Adventure : 0.01128158844765343 Entertainment;Education : 0.01128158844765343 Communication;Creativity : 0.01128158844765343 Comics;Creativity : 0.01128158844765343 Casual;Music & Video : 0.01128158844765343 Card;Action & Adventure : 0.01128158844765343 Books & Reference;Education : 0.01128158844765343 Art & Design;Pretend Play : 0.01128158844765343 Art & Design;Action & Adventure : 0.01128158844765343 Arcade;Pretend Play : 0.01128158844765343 Adventure;Education : 0.01128158844765343
Calculate the average number of installs for each app genre to find what kinds are most popular.
Apple Store
The column rating_count_tot (the total number of user ratings) will be used as a proxy for installs, since the Apple Store dataset does not contain install counts.
Let's use nested loops to do the following:
Isolate the apps of each genre
Add up the number of user ratings for the apps of that genre
Divide the sum by the number of apps belonging to that genre (not by the total number of apps)
genres_ios = freq_table(free_eng_ios, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in free_eng_ios:
        genre_app = app[11]
        if genre_app == genre:
            no_of_ratings = float(app[5])
            total += no_of_ratings
            len_genre += 1
    average_no = total / len_genre
    print(f"{genre}: {average_no:,.2f}")
Social Networking: 71,548.35 Photo & Video: 28,441.54 Games: 22,788.67 Music: 57,326.53 Reference: 74,942.11 Health & Fitness: 23,298.02 Weather: 52,279.89 Utilities: 18,684.46 Travel: 28,243.80 Shopping: 26,919.69 News: 21,248.02 Navigation: 86,090.33 Lifestyle: 16,485.76 Entertainment: 14,029.83 Food & Drink: 33,333.92 Sports: 23,008.90 Book: 39,758.50 Finance: 31,467.94 Education: 7,003.98 Productivity: 21,028.41 Business: 7,491.12 Catalogs: 4,004.00 Medical: 612.00
The output above shows the average number of user ratings for each genre.
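The nested loops above rescan the whole dataset once per genre. A sketch of a single-pass alternative keeps a running total and count per group using `collections.defaultdict`. `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: average of the numeric column} in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows (genre, number of ratings)
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample, 0, 1))  # {'Games': 200.0, 'Music': 50.0}
```

For the Apple data the call would be `average_by_group(free_eng_ios, 11, 5)`, using the genre and rating-count indices from the loop above.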
While Games and Entertainment are the most common genres for apps, the more popular genres seem to be Navigation, Reference, and Social Networking.
for app in free_eng_ios:
    if app[11] == 'Reference': # index 11 is prime_genre
        print(app[1], ':', app[5]) # print name and number of ratings
Bible : 985920 Dictionary.com Dictionary & Thesaurus : 200047 Dictionary.com Dictionary & Thesaurus for iPad : 54175 Google Translate : 26786 Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588 Merriam-Webster Dictionary : 16849 Night Sky : 12122 City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535 LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693 GUNS MODS for Minecraft PC Edition - Mods Tools : 1497 Guides for Pokémon GO - Pokemon GO News and Cheats : 826 WWDC : 762 Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718 VPN Express : 14 Real Bike Traffic Rider Virtual Reality Glasses : 8 教えて!goo : 0 Jishokun-Japanese English Dictionary & Translator : 0
The average number of user ratings is heavily skewed by giants like the Bible and Dictionary.com apps. A similar pattern holds for the Navigation and Social Networking genres, where Google Maps and Waze dominate Navigation, and Facebook and Instagram dominate Social Networking.
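One way to quantify this skew is to compare the mean with the median, which is much less sensitive to a few very large values. The sketch below uses the `statistics` module from the standard library. The rating counts are copied from the Reference output above, and using the median here is a suggestion rather than part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the Reference apps, copied from the output above
reference_ratings = [985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122,
                     8535, 4693, 1497, 826, 762, 718, 14, 8, 0, 0]

print(f'mean:   {mean(reference_ratings):,.2f}')    # mean:   74,942.11
print(f'median: {median(reference_ratings):,.2f}')  # median: 6,614.00
```

The mean matches the Reference average reported earlier, while the median is more than ten times smaller, which shows how strongly the largest apps pull the average up.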
Google Play
The Installs column does not contain precise install numbers; rather, it contains ranges such as 10,000+.
We can work with the ranges by removing the unwanted characters and converting the result into float values.
Let's combine this conversion with the logic applied to the Apple Store dataset in the same loop.
categories_android = freq_table(free_eng_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_eng_android:
        category_app = app[1]
        if category == category_app:
            no_of_installs = app[5]
            no_of_installs = no_of_installs.replace('+','')
            no_of_installs = no_of_installs.replace(',','')
            total += float(no_of_installs)
            len_category += 1
    avg_installs = total / len_category
    print(f"{category}: {avg_installs:,.2f}")
ART_AND_DESIGN: 1,986,335.09 AUTO_AND_VEHICLES: 647,317.82 BEAUTY: 513,151.89 BOOKS_AND_REFERENCE: 8,767,811.89 BUSINESS: 1,712,290.15 COMICS: 817,657.27 COMMUNICATION: 38,456,119.17 DATING: 854,028.83 EDUCATION: 1,833,495.15 ENTERTAINMENT: 11,640,705.88 EVENTS: 253,542.22 FINANCE: 1,387,692.48 FOOD_AND_DRINK: 1,924,897.74 HEALTH_AND_FITNESS: 4,188,821.99 HOUSE_AND_HOME: 1,331,540.56 LIBRARIES_AND_DEMO: 638,503.73 LIFESTYLE: 1,437,816.27 GAME: 15,588,015.60 FAMILY: 3,695,641.82 MEDICAL: 120,550.62 SOCIAL: 23,253,652.13 SHOPPING: 7,036,877.31 PHOTOGRAPHY: 17,840,110.40 SPORTS: 3,638,640.14 TRAVEL_AND_LOCAL: 13,984,077.71 TOOLS: 10,801,391.30 PERSONALIZATION: 5,201,482.61 PRODUCTIVITY: 16,787,331.34 PARENTING: 542,603.62 WEATHER: 5,074,486.20 VIDEO_PLAYERS: 24,727,872.45 NEWS_AND_MAGAZINES: 9,549,178.47 MAPS_AND_NAVIGATION: 4,056,941.77
The output above shows the average number of installs for each category in the Google Play store.
It shows that Communication is the most popular category, with over 38 million installs per app on average.
However, we can infer that these numbers may be skewed by a few giants, as in the Apple Store.
for app in free_eng_android:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '5,000,000+'):
        print(app[0], ':', app[5]) # print name and number of installs
WhatsApp Messenger : 1,000,000,000+ My Tele2 : 5,000,000+ Call Free – Free Call : 5,000,000+ Web Browser & Explorer : 5,000,000+ Skype Lite - Free Video Call & Chat : 5,000,000+ Google Duo - High Quality Video Calls : 500,000,000+ My Vodacom SA : 5,000,000+ Microsoft Edge : 5,000,000+ Messenger – Text and Video Chat for Free : 1,000,000,000+ imo free video calls and chat : 500,000,000+ Calls & Text by Mo+ : 5,000,000+ Skype - free IM & video calls : 1,000,000,000+ LINE: Free Calls & Messages : 500,000,000+ Google Chrome: Fast & Secure : 1,000,000,000+ UC Browser - Fast Download Private & Secure : 500,000,000+ Full Screen Caller ID : 5,000,000+ CIA - Caller ID & Call Blocker : 5,000,000+ Call Control - Call Blocker : 5,000,000+ Sync.ME – Caller ID & Block : 5,000,000+ Gmail : 1,000,000,000+ K-9 Mail : 5,000,000+ Daum Mail - Next Mail : 5,000,000+ Hangouts : 1,000,000,000+ JusTalk - Free Video Calls and Fun Video Chat : 5,000,000+ AT&T Call Protect : 5,000,000+ Viber Messenger : 500,000,000+ Brave Browser: Fast AdBlocker : 5,000,000+ Ear Agent: Super Hearing : 5,000,000+ Bluetooth Auto Connect : 5,000,000+ Chrome Dev : 5,000,000+ CM Transfer - Share any files with friends nearby : 5,000,000+ Your Freedom VPN Client : 5,000,000+ Caller ID & Call Block - DU Caller : 5,000,000+
Confirming our assumption, we see that apps like Gmail, WhatsApp and Skype skew the genre data.
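To get a feel for how much the giants move the average, the average can be recomputed with the largest apps left out. The sketch below does this for a handful of apps taken from the output above. `parse_installs` and `average_installs` are hypothetical helpers, and the cap of 100 million installs is an arbitrary threshold chosen for illustration.

```python
def parse_installs(text):
    """Turn a string such as '1,000,000+' into the float 1000000.0."""
    return float(text.replace('+', '').replace(',', ''))

def average_installs(rows, installs_idx, cap=None):
    """Average the installs column, optionally ignoring apps at or above `cap`."""
    values = [parse_installs(row[installs_idx]) for row in rows]
    if cap is not None:
        values = [v for v in values if v < cap]
    return sum(values) / len(values) if values else 0.0

# A few (name, installs) pairs copied from the output above
sample = [
    ['WhatsApp Messenger', '1,000,000,000+'],
    ['K-9 Mail', '5,000,000+'],
    ['Google Duo - High Quality Video Calls', '500,000,000+'],
    ['My Tele2', '5,000,000+'],
]
print(f'{average_installs(sample, 1):,.0f}')                   # 377,500,000
print(f'{average_installs(sample, 1, cap=100_000_000):,.0f}')  # 5,000,000
```

On this small sample, two large apps raise the average from 5 million to over 377 million. Running the same comparison per category on free_eng_android would show which categories stay popular once the giants are set aside.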
This project showcased an introductory process for collecting, cleaning and analyzing a subset of app data from the Apple and Google Play stores.
The goal was to find what type of app profiles could be profitable for both markets.
At surface level, we find that in both data subsets, Utilities, Social Networking and Entertainment apps are among the most common as well as the most popular apps.
One possible idea could be to create a mobile game that allows for interactions between users, offering in-app purchases for better items and so increasing the likelihood that a user will want to spend to win. (This approach is already very common, however.)
Another idea could be to combine Social Networking with something like a translator app. A repository of text translation could be at the ready for any user accessing that information. The more common translations could be analyzed by a community discussion to give confirmation on the accuracy of formal, informal or slang phrases.