Data analysis project by Heather Gray
This project explores free applications on the Google Play and Apple App stores to determine which types of apps attract the most users, so that we can decide which app genre could be the most profitable for our company.
The following function was created to make exploring the app store data easier.
It takes four parameters: dataset (a list of lists), start and end (the slice of rows to print), and rows_and_columns (when True, the row and column counts are also printed).
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print()  # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print()  # empty line at end of output
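As a quick sanity check, the function can be exercised on a toy list of lists (hypothetical rows, not the real store data); the definition is repeated here only so the snippet runs standalone:

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print each row in the requested slice, followed by a blank line.
    for row in dataset[start:end]:
        print(row)
        print()
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Toy data set: a header row plus two app rows.
toy = [['name', 'price'], ['AppA', '0.0'], ['AppB', '1.99']]
explore_data(toy, 1, 3, rows_and_columns=True)
```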
from csv import reader
ios_file = open('AppleStore.csv', encoding='utf8')
android_file = open('googleplaystore.csv', encoding='utf8')
ios = list(reader(ios_file))
android = list(reader(android_file))
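One caveat: the files opened above are never explicitly closed. A with-based variant (a sketch; the file names are the same ones assumed above) closes them automatically:

```python
from csv import reader

def load_csv(path):
    # The context manager closes the file even if reading fails.
    with open(path, encoding='utf8') as f:
        return list(reader(f))

# ios = load_csv('AppleStore.csv')
# android = load_csv('googleplaystore.csv')
```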
Using the explore_data() function, print the first two data rows of each data set (row 0 is the header, so the slice starts at 1):
print('='*3, 'IOS', '='*3)
explore_data(ios, 1, 3)
print('='*3, 'ANDROID', '='*3)
explore_data(android, 1, 3)
=== IOS ===
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

=== ANDROID ===
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
A function is defined which will give us a list of headers in any data set:
def get_headers(file):
    return list(file[0])

def print_formatted_headers(store_name, headers):
    print('='*3, store_name.upper(), '='*3)
    for i, header in enumerate(headers):
        print(i, ':', header)
    print('\n')
ios_headers = get_headers(ios)
android_headers = get_headers(android)
print_formatted_headers('ios', ios_headers)
print_formatted_headers('android', android_headers)
=== IOS ===
0 : id
1 : track_name
2 : size_bytes
3 : currency
4 : price
5 : rating_count_tot
6 : rating_count_ver
7 : user_rating
8 : user_rating_ver
9 : ver
10 : cont_rating
11 : prime_genre
12 : sup_devices.num
13 : ipadSc_urls.num
14 : lang.num
15 : vpp_lic

=== ANDROID ===
0 : App
1 : Category
2 : Rating
3 : Reviews
4 : Size
5 : Installs
6 : Type
7 : Price
8 : Content Rating
9 : Genres
10 : Last Updated
11 : Current Ver
12 : Android Ver
Our data sets should exclude non-free apps and apps with non-English names, since we are only concerned with free apps in English.
The discussion forum for the Google Play data set reveals a row with a missing category. We could fill in the category to keep this item, but a single row is not that important, and we have more apps in the Google data set than in the iOS set anyway. So we will delete the row to fix this error:
print(android[10473])
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
del(android[10473])
print(android[10473])
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
According to the discussion boards for the data sets, there are numerous duplicate rows (for example, there are four entries for Instagram in the Android set). Where the duplicated info varies, we should keep the row with the latest data. The number of reviews is a decent proxy for this: the row with the higher review count is assumed to be the more recent one.
To get the count of duplicates, we will iterate over each data set. Each new application name is appended to a list of unique names, and that list is checked before adding another; if the name is already present, it is appended to a list of duplicate entries instead. The function find_duplicates accomplishes this:
def find_duplicates(name_col_index, dataset):
    duplicates = []
    unique = []
    for app in dataset:
        name = app[name_col_index]
        if name in unique:
            duplicates.append(name)
        else:
            unique.append(name)
    return duplicates
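A toy run (hypothetical names) shows the behavior: the second occurrence of a name lands in the duplicates list. The function is repeated so the snippet runs standalone:

```python
def find_duplicates(name_col_index, dataset):
    duplicates = []
    unique = []
    for app in dataset:
        name = app[name_col_index]
        if name in unique:          # already seen: record as duplicate
            duplicates.append(name)
        else:                       # first sighting: record as unique
            unique.append(name)
    return duplicates

toy = [['Instagram'], ['Slack'], ['Instagram']]
print(find_duplicates(0, toy))  # ['Instagram']
```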
and_name_col_num = 0
ios_name_col_num = 1
print('ANDROID: ', len(find_duplicates(and_name_col_num, android)))
print('IOS: \t ', len(find_duplicates(ios_name_col_num, ios)))
ANDROID:  1181
IOS:      2
The iOS data set is pretty clean: there are only two duplicates. The Android set has many more, so we only print a sample of the first 15 duplicates.
print('='*3, 'ANDROID', '='*3)
print(find_duplicates(0, android)[:15])
print('\n')
print('='*3, 'IOS', '='*3)
print(find_duplicates(1, ios))
=== ANDROID ===
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']

=== IOS ===
['Mannequin Challenge', 'VR Roller Coaster']
Display the expected length of each data set after removing duplicates, so we can validate the cleaning later:
print('Android expected length: ', len(android) - 1181)
print('iOS expected length: \t ', len(ios) - 2)
Android expected length:  9660
iOS expected length:      7196
For any duplicate, we need to identify the most recent entry. The best way to do that is to go by review count: the row with the highest count is considered the most recent and will be kept.
To do this, we will create a dictionary holding each unique app along with its highest review count:
def create_review_count_dictionary(name_column_num, review_column_num, dataset):
    review_count_dict = {}
    for app in dataset[1:]:
        app_name = app[name_column_num]
        n_reviews = float(app[review_column_num])
        if app_name not in review_count_dict:
            review_count_dict[app_name] = n_reviews
        elif review_count_dict[app_name] < n_reviews:
            review_count_dict[app_name] = n_reviews
    return review_count_dict
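On a toy data set (hypothetical rows), the dictionary keeps only the highest review count seen for each name. The function logic is repeated so the snippet runs standalone:

```python
def create_review_count_dictionary(name_column_num, review_column_num, dataset):
    review_count_dict = {}
    for app in dataset[1:]:  # skip the header row
        app_name = app[name_column_num]
        n_reviews = float(app[review_column_num])
        # keep the largest review count seen for this name
        if app_name not in review_count_dict or review_count_dict[app_name] < n_reviews:
            review_count_dict[app_name] = n_reviews
    return review_count_dict

toy = [['App', 'Reviews'],
       ['Instagram', '577'],
       ['Instagram', '967'],
       ['Slack', '120']]
print(create_review_count_dictionary(0, 1, toy))  # {'Instagram': 967.0, 'Slack': 120.0}
```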
We now validate the length of each dictionary. The dictionaries skip the header row, so each should hold one less than the expected length above: 9,659 Android records and 7,195 iOS records.
and_review_count_col_num = 3
ios_review_count_col_num = 5
print('Length of dict for Android: ',
len(create_review_count_dictionary(and_name_col_num, and_review_count_col_num, android)))
print('Length of dict for iOS: ',
len(create_review_count_dictionary(ios_name_col_num, ios_review_count_col_num, ios)))
Length of dict for Android:  9659
Length of dict for iOS:  7195
The next function will use the dictionary above to create a new, clean data set with no duplicate records:
def remove_duplicate_entries(name_column_num, review_column_num, dataset):
    clean_data = []
    already_added = []
    reviews_max = create_review_count_dictionary(name_column_num, review_column_num, dataset)
    for app in dataset[1:]:
        name = app[name_column_num]
        n_reviews = float(app[review_column_num])
        if (reviews_max[name] == n_reviews) and (name not in already_added):
            clean_data.append(app)
            already_added.append(name)
    return clean_data
android = remove_duplicate_entries(and_name_col_num, and_review_count_col_num, android)
ios = remove_duplicate_entries(ios_name_col_num, ios_review_count_col_num, ios)
Let's test the lengths of the clean data sets:
print('Length of new data set: ', len(android))
print('Length of new data set: ', len(ios))
Length of new data set:  9659
Length of new data set:  7195
Because our app is not yet translated into other languages, we will filter any apps with non-English characters, assuming they would not be competing with our brand.
We define a function is_english() which will return False if a string has more than two non-English characters. The function checks the code point of each character in the string; standard English characters fall within the ASCII range (0-127).
def is_english(s):
    count = 0
    for character in s:
        if ord(character) > 127:
            count += 1
    return count <= 2
Test if is_english() works:
print('英漢字典 is English?\t\t', is_english('英漢字典'))
print('é spéciaux français is English?\t', is_english('é spéciaux français'))
print('English 123 is English?\t\t', is_english('English 123'))
print('Test. punctuation?!, is English?\t', is_english('Test. punctuation?!,'))
print('Instachat 😜 is English?\t', is_english('Instachat 😜'))
英漢字典 is English? False
é spéciaux français is English? False
English 123 is English? True
Test. punctuation?!, is English? True
Instachat 😜 is English? True
One drawback: we will also filter out English applications whose titles happen to contain more than two non-ASCII characters. We assume this will be very rare, so we leave the function as is.
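The same check can also be condensed to a one-line sum, behaviorally equivalent to is_english() above:

```python
def is_english(s):
    # Count characters outside the ASCII range and allow at most two,
    # so names with a stray emoji or symbol still pass.
    return sum(ord(ch) > 127 for ch in s) <= 2

print(is_english('Instachat 😜'))  # True  (one non-ASCII character)
print(is_english('英漢字典'))       # False (four non-ASCII characters)
```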
is_english() appears to work, so now we will run every app name through the function and create new data sets excluding non-English applications.
def clean_non_english(name_column_num, dataset):
    english_only = []
    non_english = []
    for app in dataset[1:]:
        if is_english(app[name_column_num]):
            english_only.append(app)
        else:
            non_english.append(app)
    return (english_only, non_english)
Let's check a sample non-English app to make sure the function worked as expected.
WARNING: Running the next cells repeatedly will produce an error, because the ios and android variables are overwritten in place. If you wish to re-run them, you must restart & run all.
non_english_sample = clean_non_english(ios_name_col_num, ios)[1][1][1]
print(non_english_sample)
聚力视频HD-人民的名义,跨界歌王全网热播
ios = clean_non_english(ios_name_col_num, ios)[0]
android = clean_non_english(and_name_col_num, android)[0]
print('iOS: \t ', len(ios))
print('Android: ', len(android))
iOS:      6152
Android:  9596
Because our application is free, we only want to compare to other free applications.
The isolate_free_apps function iterates through the given data set. The iOS and Android data sets format the price differently ('0.0' vs. '0'), so there is a check for which data set is in use.
def isolate_free_apps(price_column_num, dataset):
    free_apps = []
    # the iOS set stores a free price as '0.0', the Android set as '0'
    free_price = '0.0' if dataset is ios else '0'
    for app in dataset[1:]:
        if app[price_column_num] == free_price:
            free_apps.append(app)
    return free_apps
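A slightly more general variant (a sketch, not what this notebook runs) takes the free-price marker as a parameter instead of checking object identity against the ios variable:

```python
def isolate_free_apps_by_marker(price_column_num, dataset, free_price='0'):
    # free_price is '0.0' for the iOS set and '0' for the Android set.
    return [app for app in dataset[1:] if app[price_column_num] == free_price]

# Toy data set (hypothetical rows): one free app, one paid.
toy = [['App', 'Price'], ['AppA', '0'], ['AppB', '2.99']]
print(len(isolate_free_apps_by_marker(1, toy)))  # 1
```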
ios_price_col_num = 4
and_price_col_num = 7
free_ios = isolate_free_apps(ios_price_col_num, ios)
free_android = isolate_free_apps(and_price_col_num, android)
print('iOS: \t', len(free_ios))
print('Android: ', len(free_android))
iOS:      3199
Android:  8846
There are nearly three times as many free Android apps as free iOS apps. Out of pure curiosity, let's see what percentage of each data set is free:
print('Free iOS: \t', round(len(free_ios) / len(ios) * 100), '%')
print('Free Android: \t', round(len(free_android) / len(android) * 100), '%')
Free iOS:      52 %
Free Android:  92 %
Now that we have satisfied our curiosity about free vs. non-free counts, we can go ahead and overwrite our ios and android variables with the free-app data sets:
ios = isolate_free_apps(ios_price_col_num, ios)
android = isolate_free_apps(and_price_col_num, android)
The goal of our company is to release an app first on Android (due to the prevalence of Android phones) and, if it does well, later release it on iOS. But which genres are most prevalent on each platform, and how do those genres perform? To find out, we will create app profiles.
The following function uses the handy Counter object from the collections module. A Counter is a dictionary mapping values to their counts.
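A tiny demonstration of Counter on a hypothetical genre list:

```python
from collections import Counter

# Counter tallies occurrences; most_common(n) returns the n most
# frequent (value, count) pairs, highest count first.
genres = ['Games', 'Games', 'Tools', 'Games', 'Tools', 'Social']
print(Counter(genres).most_common(2))  # [('Games', 3), ('Tools', 2)]
```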
and_category_col_num = 1
ios_category_col_num = 11
from collections import Counter
def top_10_category(category_column_num, dataset):
    return Counter([app[category_column_num] for app in dataset[1:]]).most_common(10)
top_10_ios = top_10_category(ios_category_col_num, ios)
top_10_android = top_10_category(and_category_col_num, android)
def print_categories(store_name, top_10):
    print('='*3, store_name.upper(), '='*3)
    for i, category in enumerate(top_10, start=1):
        print(i, ':', category)
    print()
The top 10 categories for iOS and Android are below
print_categories('iOS', top_10_ios)
print_categories('Android', top_10_android)
=== IOS ===
1 : ('Games', 1863)
2 : ('Entertainment', 251)
3 : ('Photo & Video', 159)
4 : ('Education', 118)
5 : ('Social Networking', 105)
6 : ('Shopping', 83)
7 : ('Utilities', 79)
8 : ('Sports', 69)
9 : ('Music', 66)
10 : ('Health & Fitness', 65)

=== ANDROID ===
1 : ('FAMILY', 1676)
2 : ('GAME', 858)
3 : ('TOOLS', 748)
4 : ('BUSINESS', 407)
5 : ('PRODUCTIVITY', 345)
6 : ('LIFESTYLE', 344)
7 : ('FINANCE', 328)
8 : ('MEDICAL', 313)
9 : ('SPORTS', 300)
10 : ('PERSONALIZATION', 294)
The Android data set has an 'Installs' column but the iOS data set doesn't, so we will use the review count as a proxy for the install base of each category.
To do this, we will need a list of the unique categories in each data set. Collecting a "set" of category names is probably the way to go:
def get_category_set(dataset, category_column):
    category_list = []
    for app in dataset[1:]:
        category = app[category_column]
        if category not in category_list:
            category_list.append(category)
    return category_list
android_categories = get_category_set(android, and_category_col_num)
ios_categories = get_category_set(ios, ios_category_col_num)
def get_category_installs(category_column_num, review_count_column_num, dataset):
    category_reviews = {}
    for app in dataset[1:]:
        category = app[category_column_num]
        review_count = int(app[review_count_column_num])
        if category in category_reviews:
            category_reviews[category] += review_count
        else:
            category_reviews[category] = review_count
    return Counter(category_reviews)
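On toy rows (hypothetical data), the same per-category summing can be sketched with a Counter directly:

```python
from collections import Counter

toy = [['App', 'Category', 'Reviews'],
       ['AppA', 'GAME', '100'],
       ['AppB', 'GAME', '50'],
       ['AppC', 'TOOLS', '30']]

totals = Counter()
for app in toy[1:]:                 # skip the header row
    totals[app[1]] += int(app[2])   # add this app's review count to its category
print(totals.most_common())         # [('GAME', 150), ('TOOLS', 30)]
```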
ios_category_review_count = get_category_installs(ios_category_col_num, ios_review_count_col_num, ios)
and_category_review_count = get_category_installs(and_category_col_num,
and_review_count_col_num, android)
Now we will print the top 10 most popular categories by install base.
To format the output in a readable way, the length of the category string determines how many tabs are printed:
def print_category_review_count(store_name, category_review_count):
    print('='*3, store_name, '='*3)
    for (cat, count) in category_review_count.most_common(10):
        if len(cat) > 20:
            print(cat, ': ', count)
        elif len(cat) > 12:
            print(cat, ': \t', count)
        elif len(cat) > 4:
            print(cat, ': \t\t', count)
        else:
            print(cat, ': \t\t\t', count)
    print()
print_category_review_count('iOS', ios_category_review_count)
print_category_review_count('Android', and_category_review_count)
=== iOS ===
Games : 40574984
Social Networking : 4609449
Music : 3783551
Entertainment : 3563035
Photo & Video : 2389089
Shopping : 2260151
Sports : 1587614
Health & Fitness : 1514371
Utilities : 1513363
Weather : 1463837

=== Android ===
GAME : 587120874
COMMUNICATION : 285739629
TOOLS : 229299627
SOCIAL : 227936113
FAMILY : 189627665
PHOTOGRAPHY : 105465239
VIDEO_PLAYERS : 67630663
PRODUCTIVITY : 55418917
PERSONALIZATION : 53249961
SHOPPING : 44553582
According to our analysis of both install base and prevalence of each category, games come out on top every time. This might indicate that creating a game for these platforms could be a good choice. It could also indicate that the gaming market is oversaturated, causing our game to be lost in the abundance.
Social networking is also quite popular, but it might be harder to break into: social media users are set in their ways, and the major companies are established with loyal user bases.
A good safe bet appears to be photo & video apps. They are reasonably popular on both platforms, but not to the point of being too common.