The aim of this project is to find mobile app profiles that are profitable. We assume we are working for a company that builds both iOS and Android applications and makes them available on the Apple App Store and Google Play.
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.
To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store. As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead.
To avoid spending resources on collecting new data ourselves, we first looked for relevant existing data available at no cost. Luckily, there are two datasets that seem suitable for our goals:
A dataset containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the dataset directly from this link.
A dataset containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the dataset directly from this link.
Let's start by opening and exploring these two datasets.
def explore_data(dataset, start, end, rows_and_columns = False):
    for i in dataset[start:end]:
        print(i)
        print('\n')  # adds a new (empty) line after each row
    if rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of Columns: ", len(dataset[0]))
from csv import reader
opened_android = open("C:/Users/ifediorah.kenechukwu/Documents/PythonDA/Datasets/googleplaystore.csv", encoding = "utf-8")
opened_apple = open("C:/Users/ifediorah.kenechukwu/Documents/PythonDA/Datasets/AppleStore.csv", encoding = "utf-8")
apple_data = list(reader(opened_apple))
android_data = list(reader(opened_android))
ios_header = apple_data[0]
ios = apple_data[1:]
android_header = android_data[0]
android = android_data[1:]
opened_android.close()
opened_apple.close()
print("Apple Dataset")
explore_data(ios,0,5,True)
print("\n\n\n")
print("Android Dataset")
explore_data(android,0,5,True)
Apple Dataset
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']
['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']
Number of rows: 7197
Number of Columns: 16

Android Dataset
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
Number of rows: 10841
Number of Columns: 13
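As a side note, the open/read/close sequence above can also be written with a context manager, so the file is closed even if reading fails. This is only a minimal sketch: `load_csv` is a hypothetical helper name, not part of the original notebook, and the file name in the usage comment is a placeholder.

```python
import csv

def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is the setting the csv module documentation recommends
    # for files handed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    # The first row is the header; everything after it is data.
    return rows[0], rows[1:]

# Example usage (placeholder path, adjust to where the dataset is stored):
# ios_header, ios = load_csv("AppleStore.csv")
```

The function returns the header and the data rows separately, which matches the way `ios_header`/`ios` and `android_header`/`android` are split above.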
print("Headers for Apple dataset")
print(ios_header)
print("Headers for Android dataset")
print(android_header)
Headers for Apple dataset
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Headers for Android dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
According to the dataset documentation, users have reported that one row in the Android dataset has incomplete data: its category value is missing. Below, I'll confirm that and delete the row to avoid inaccuracies.
print(android[10472])
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
del android[10472]
print(android[10472])
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
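Rather than relying on a row index reported by other users, a short row like this one can also be found by comparing each row's length with the header's. The sketch below assumes that a missing value shows up as a row with fewer fields than the header; `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose field count differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data for illustration only (not taken from the real dataset).
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["Foo", "TOOLS", "4.1"],
    ["Bar", "3.9"],          # the category is missing, so the row is short
    ["Baz", "GAME", "4.5"],
]
print(find_malformed_rows(sample_header, sample_rows))
```

Run against `android_header` and `android` before the deletion, I would expect this to report index 10472, provided that row is the only one with a missing field.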
From inspecting the Android data and its documentation, it appears that the data contains duplicate entries.
Below, I'll write code to identify the duplicate entries and then remove them. The removal will not be done at random: I'll inspect the data and keep only the most recent entry for each app. I should be able to infer which entry is most recent from a column such as the number of installs.
# For Android
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(len(unique_apps))
print(len(duplicate_apps))
9659
1181
print(duplicate_apps[:10])
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
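The loop above tests `name in unique_apps` against a list, which scans the list on every iteration. A possible variant, sketched here with a hypothetical function name, keeps a set alongside the lists so each membership test is a hash lookup, while producing the same two lists.

```python
def split_unique_and_duplicates(rows, name_index=0):
    """Return (unique_names, duplicate_names) in the order they are encountered."""
    seen = set()            # fast membership test
    unique_names = []
    duplicate_names = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicate_names.append(name)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names

# Toy rows for illustration only.
toy_rows = [["Box"], ["Slack"], ["Box"], ["Box"]]
print(split_unique_and_duplicates(toy_rows))
```

For the iOS dataset the same function could be called with `name_index=1`, since the app name is in the second column there.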
# Finding one duplicate app in order to inspect the differences
duplicate_apps1 = []
unique_apps1 = []
for app in android:
    name = app[0]
    if name in duplicate_apps and name in unique_apps1:
        duplicate_apps1.append(name)
    elif name in duplicate_apps and name not in unique_apps1:
        unique_apps1.append(name)
print(len(unique_apps1))
print(len(duplicate_apps1))
798
1181
print(duplicate_apps1[:20])
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
count = 0
for app in android:
    count += 1
    if app[0] == 'Google My Business':
        print(app)
        print(count)
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
194
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
240
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
269
print(android_header)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
count = 0
for app in android:
    count += 1
    if app[0] == 'Instagram':
        print(app)
        print(count)
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2546
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2605
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2612
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
3910
print(android[249])
['Google Analytics', 'BUSINESS', '4.5', '78662', '22M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 13, 2018', '3.7.5', '4.4 and up']
From the output above, we can see that what varies between duplicate rows is mostly the number of reviews, which suggests the data was scraped at different times.
I'm going to remove the duplicates and keep, for each app, the entry with the highest number of reviews, since that one is most likely the most recently scraped.
# Getting the highest number of reviews for each app
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))
9659
Now that I have the highest review count for each app, I will loop through the Android dataset again and keep only a single copy of each duplicated entry: the one with the highest number of reviews. This creates a new dataset called android_clean.
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in already_added and n_reviews == reviews_max[name]:
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))
print(len(already_added))
9659
9659
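The two passes above (first find the maximum review count per app, then collect the matching rows) can also be folded into one pass that stores the best row per name in a dictionary. This is a sketch of an alternative, not the method used in this notebook; `keep_most_reviewed` is a hypothetical name, and the resulting row order can differ from `android_clean` because rows come out in order of each name's first appearance.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # The strict ">" keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened rows (name, category, rating, reviews) based on the Instagram output above.
toy_android = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]
print(keep_most_reviewed(toy_android, name_index=0, reviews_index=3))
```

For the iOS data the equivalent call would use `name_index=1` and `reviews_index=5`.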
print(android_clean[:5],'\n')
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
explore_data(android_clean,0,5)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']
def check_ascii(a_string):
    count = 0
    for i in a_string:
        if ord(i) > 127:
            count += 1
    # Some English app names contain a few characters outside the ASCII range
    # (emoji, ™, etc.), so we only classify a name as non-English if it has
    # more than 3 such characters.
    if count > 3:
        return False
    return True
print(check_ascii("Instagram"))
print(check_ascii("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(check_ascii("Docs To Go™ Free Office Suite"))
print(check_ascii("Instachat 😜"))
True
False
True
True
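For comparison, the same heuristic can be written more compactly with a generator expression and a configurable threshold. This is only a sketch: `is_probably_english` and `max_non_ascii` are names I am introducing here, and the rule remains a rough heuristic rather than real language detection (an English name with four emoji would be rejected, for example).

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    # ord() gives a character's Unicode code point; ASCII covers 0 to 127.
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

for name in ["Instagram",
             "爱奇艺PPS -《欢乐颂2》电视剧热播",
             "Docs To Go™ Free Office Suite",
             "Instachat 😜"]:
    print(name, "->", is_probably_english(name))
```

With the default threshold this gives the same answers as `check_ascii` on the four test names used above.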
android_filtered = []
android_nonenglish = []
for app in android_clean:
    if check_ascii(app[0]):
        android_filtered.append(app)
    else:
        android_nonenglish.append(app[0])
print(len(android_filtered))
print(len(android_nonenglish))
9614
45
print(android_nonenglish)
['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости', 'صور حرف H', 'L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'AJ렌터카 법인 카셰어링', 'Al Quran Free - القرآن (Islam)', '中国語 AQリスニング', '日本AV历史', 'Ay Yıldız Duvar Kağıtları', 'বাংলা টিভি প্রো BD Bangla TV', 'Cъновник BG', 'CSCS BG (в български)', '뽕티비 - 개인방송, 인터넷방송, BJ방송', 'BL 女性向け恋愛ゲーム◆俺プリクロス', 'SecondSecret ‐「恋を読む」BLノベルゲーム‐', 'BL 女性向け恋愛ゲーム◆ごくメン', 'あなカレ【BL】無料ゲーム', '감성학원 BL 첫사랑', 'BQ-መጽሐፍ ቅዱሳዊ ጥያቄዎች', 'BS Calendar / Patro / पात्रो', 'Vip视频免费看-BT磁力搜索', 'Билеты ПДД CD 2019 PRO', 'Offline Jízdní řády CG Transit', 'Bonjour 2017 Abidjan CI ❤❤❤❤❤', 'CK 初一 十五', 'الفاتحون Conquerors', 'DG ग्राम / Digital Gram Panchayat', 'DM הפקות', 'DW فارسی By dw-arab.com', 'لعبة تقدر تربح DZ', 'বাংলাflix', 'RPG ブレイジング ソウルズ アクセレイト', '英漢字典 EC Dictionary', 'ECナビ×シュフー', 'أحداث وحقائق | خبر عاجل في اخبار العالم', 'EG SIM CARD (EGSIMCARD, 이지심카드)', 'パーリーゲイツ公式通販|EJ STYLE(イージェイスタイル)', 'FAHREDDİN er-RÂZİ TEFSİRİ', "I'm Rich/Eu sou Rico/أنا غني/我很有錢", 'AÖF Ev İdaresi 1. Sınıf', 'Ey Sey Storytime រឿងនិទានតាឥសី', '哈哈姆特不EY', 'FP Разбитый дисплей']
print(android_header)
print(android_filtered[0])
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
android_final = []
for app in android_filtered:
    if app[6] == "Free":
        android_final.append(app)
print(len(android_final))
8863
So far we have been working on the Android dataset. Now that it is clean, let's go ahead and clean the iOS dataset.
# This is to make sure that all items in the iOS dataset have complete rows
for row in ios:
    if len(row) != len(ios_header):
        print(row[1])
# Let's loop through for duplicate entries
# For iOS
duplicate_ios = []
unique_ios = []
for app in ios:
    name = app[1]
    if name in unique_ios:
        duplicate_ios.append(name)
    else:
        unique_ios.append(name)
print(duplicate_ios)
['Mannequin Challenge', 'VR Roller Coaster']
We can see that two app names each appear twice.
count = 0
for app in ios:
    count += 1
    if app[1] == 'VR Roller Coaster':
        print(app)
        print(count)
print(ios_header)
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
4443
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']
4832
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
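As before, we are going to remove the duplicates and retain the entry with the higher rating count. Note that the two 'VR Roller Coaster' rows have different ids, so they may be distinct apps that share a name rather than true duplicates.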
# Getting the highest rating count for each app
ios_reviews_max = {}
for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    if name in ios_reviews_max and ios_reviews_max[name] < n_reviews:
        ios_reviews_max[name] = n_reviews
    elif name not in ios_reviews_max:
        ios_reviews_max[name] = n_reviews
print(len(ios_reviews_max))
print(len(ios))
7195
7197
ios_clean = []
ios_already_added = []
for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    if name not in ios_already_added and n_reviews == ios_reviews_max[name]:
        ios_clean.append(app)
        ios_already_added.append(name)
len(ios_clean)
7195
ios_filtered = []
ios_nonenglish = []
for app in ios_clean:
    if check_ascii(app[1]):
        ios_filtered.append(app)
    else:
        ios_nonenglish.append(app[1])
print(len(ios_filtered))
print(len(ios_nonenglish))
6181
1014
print(ios_nonenglish[:50])
['爱奇艺PPS -《欢乐颂2》电视剧热播', '聚力视频HD-人民的名义,跨界歌王全网热播', '优酷视频', '网易新闻 - 精选好内容,算出你的兴趣', '淘宝 - 随时随地,想淘就淘', '搜狐视频HD-欢乐颂2 全网首播', '阴阳师-全区互通现世集结', '百度贴吧-全球最大兴趣交友社区', '百度网盘', '爱奇艺HD -《欢乐颂2》电视剧热播', '乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播', '万年历-值得信赖的日历黄历查询工具', '新浪新闻-阅读最新时事热门头条资讯视频', '喜马拉雅FM(听书社区)电台有声小说相声英语', '央视影音-海量央视内容高清直播', '腾讯视频HD-楚乔传,明日之子6月全网首播', '手机百度 - 百度一下你就得到', '百度视频HD-高清电视剧、电影在线观看神器', 'MOMO陌陌-开启视频社交,用直播分享生活', 'QQ 浏览器-搜新闻、选小说漫画、看视频', '同花顺-炒股、股票', '聚力视频-蓝光电视剧电影在线热播', '快看漫画', '乐视视频-白鹿原,欢乐颂,奔跑吧全网热播', '酷我音乐HD-无损在线播放', '随手记(专业版)-好用的记账理财工具', 'Dictionary ( قاموس عربي / انجليزي + ودجيت الترجمة)', '滴滴出行', '高德地图(精准专业的手机地图)', '百度HD-极速安全浏览器', '美丽说-潮流穿搭快人一步', '百度地图-智能的手机导航,公交地铁出行必备', 'Majiang Mahjong(单机+川麻+二人+武汉+国标)', '土豆视频HD—高清影视综艺视频播放器', '360手机卫士-超安全的来电防骚扰助手', 'QQ浏览器HD-极速搜索浏览器', '搜狗输入法-Sogou Keyboard', '百度网盘 HD', '大众点评-发现品质生活', '讯飞输入法-智能语音输入和表情斗图神器', '美柚 - 女生助手', '爱奇艺 - 电视剧电影综艺娱乐视频播放器', '搜狐视频-欢乐颂2 全网首播', '百度地图HD', 'QQ同步助手-新机一键换机必备工具', 'QQ音乐-来这里“发现・音乐”', '腾讯新闻-头条新闻热点资讯掌上阅读软件', '土豆(短视频分享平台)', '风行视频+ HD - 电影电视剧体育视频播放器', '仙劍奇俠傳5 - 劍傲丹楓']
Now to get the free apps from the ios dataset
ios_final = []
for app in ios_filtered:
    if float(app[4]) == 0.0:
        ios_final.append(app)
print(len(ios_final))
3220
explore_data(android_final,0,3)
explore_data(ios_final,0,3)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Our objective is to find an app profile that fits both the App Store and Google Play: we want to find apps that are successful in both markets.
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.
To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.
Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.
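To make the idea of a frequency table concrete, here is a small sketch that builds one as percentages using `collections.Counter` from the standard library. It is an illustration on toy data, and `percentage_table` is a hypothetical name rather than one of the functions used in this notebook.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs for column `index`, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return [(value, round(n / total * 100, 2)) for value, n in counts.most_common()]

# Toy rows (name, genre) for illustration only.
toy_apps = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
print(percentage_table(toy_apps, 1))
```

Here three of the four toy apps are games, so Games gets 75.0 and Music 25.0.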
explore_data(android_final,0,3)
print(android_header)
print('\n')
print('\n')
print('\n')
explore_data(ios_final,0,3)
print(ios_header)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
def freq_table(dataset, index):
    table = {}
    for app in dataset:
        freq = app[index]
        if freq in table:
            table[freq] += 1
        else:
            table[freq] = 1
    total = sum(list(table.values()))
    for item in table:
        table[item] = round((table[item]/total) * 100, 2)
    return table
ios_genre = freq_table(ios_final,11)
android_genre = freq_table(android_final, 9)
android_category = freq_table(android_final, 1)
print(ios_genre)
print("\n")
print("\n")
print(android_genre)
print("\n")
print("\n")
print(android_category)
{'Social Networking': 3.29, 'Photo & Video': 4.97, 'Games': 58.14, 'Music': 2.05, 'Reference': 0.56, 'Health & Fitness': 2.02, 'Weather': 0.87, 'Utilities': 2.52, 'Travel': 1.24, 'Shopping': 2.61, 'News': 1.34, 'Navigation': 0.19, 'Lifestyle': 1.58, 'Entertainment': 7.89, 'Food & Drink': 0.81, 'Sports': 2.14, 'Book': 0.43, 'Finance': 1.12, 'Education': 3.66, 'Productivity': 1.74, 'Business': 0.53, 'Catalogs': 0.12, 'Medical': 0.19} {'Art & Design': 0.6, 'Art & Design;Creativity': 0.07, 'Auto & Vehicles': 0.93, 'Beauty': 0.6, 'Books & Reference': 2.14, 'Business': 4.59, 'Comics': 0.61, 'Comics;Creativity': 0.01, 'Communication': 3.24, 'Dating': 1.86, 'Education': 5.35, 'Education;Creativity': 0.05, 'Education;Education': 0.34, 'Education;Pretend Play': 0.06, 'Education;Brain Games': 0.03, 'Entertainment': 6.07, 'Entertainment;Brain Games': 0.08, 'Entertainment;Creativity': 0.03, 'Entertainment;Music & Video': 0.17, 'Events': 0.71, 'Finance': 3.7, 'Food & Drink': 1.24, 'Health & Fitness': 3.08, 'House & Home': 0.82, 'Libraries & Demo': 0.94, 'Lifestyle': 3.89, 'Lifestyle;Pretend Play': 0.01, 'Card': 0.45, 'Arcade': 1.85, 'Puzzle': 1.13, 'Racing': 0.99, 'Sports': 3.46, 'Casual': 1.76, 'Simulation': 2.04, 'Adventure': 0.68, 'Trivia': 0.42, 'Action': 3.1, 'Word': 0.26, 'Role Playing': 0.94, 'Strategy': 0.9, 'Board': 0.38, 'Music': 0.2, 'Action;Action & Adventure': 0.1, 'Casual;Brain Games': 0.14, 'Educational;Creativity': 0.03, 'Puzzle;Brain Games': 0.17, 'Educational;Education': 0.39, 'Casual;Pretend Play': 0.24, 'Educational;Brain Games': 0.07, 'Art & Design;Pretend Play': 0.01, 'Educational;Pretend Play': 0.09, 'Entertainment;Education': 0.01, 'Casual;Education': 0.02, 'Casual;Creativity': 0.07, 'Casual;Action & Adventure': 0.14, 'Music;Music & Video': 0.02, 'Arcade;Pretend Play': 0.01, 'Adventure;Action & Adventure': 0.03, 'Role Playing;Action & Adventure': 0.03, 'Simulation;Pretend Play': 0.02, 'Puzzle;Creativity': 0.02, 'Simulation;Action & Adventure': 0.08, 
'Racing;Action & Adventure': 0.17, 'Sports;Action & Adventure': 0.02, 'Educational;Action & Adventure': 0.03, 'Arcade;Action & Adventure': 0.12, 'Entertainment;Action & Adventure': 0.03, 'Art & Design;Action & Adventure': 0.01, 'Puzzle;Action & Adventure': 0.03, 'Education;Action & Adventure': 0.03, 'Strategy;Action & Adventure': 0.01, 'Music & Audio;Music & Video': 0.01, 'Health & Fitness;Education': 0.01, 'Board;Action & Adventure': 0.02, 'Board;Brain Games': 0.08, 'Casual;Music & Video': 0.01, 'Education;Music & Video': 0.03, 'Role Playing;Pretend Play': 0.05, 'Entertainment;Pretend Play': 0.02, 'Medical': 3.53, 'Social': 2.66, 'Shopping': 2.25, 'Photography': 2.94, 'Travel & Local': 2.32, 'Travel & Local;Action & Adventure': 0.01, 'Tools': 8.45, 'Tools;Education': 0.01, 'Personalization': 3.32, 'Productivity': 3.89, 'Parenting': 0.5, 'Parenting;Music & Video': 0.07, 'Parenting;Education': 0.08, 'Parenting;Brain Games': 0.01, 'Weather': 0.8, 'Video Players & Editors': 1.77, 'Video Players & Editors;Music & Video': 0.02, 'Video Players & Editors;Creativity': 0.01, 'News & Magazines': 2.8, 'Maps & Navigation': 1.4, 'Health & Fitness;Action & Adventure': 0.01, 'Educational': 0.37, 'Casino': 0.43, 'Trivia;Education': 0.01, 'Lifestyle;Education': 0.01, 'Card;Action & Adventure': 0.01, 'Books & Reference;Education': 0.01, 'Simulation;Education': 0.01, 'Puzzle;Education': 0.01, 'Adventure;Education': 0.01, 'Role Playing;Brain Games': 0.01, 'Strategy;Education': 0.01, 'Racing;Pretend Play': 0.01, 'Communication;Creativity': 0.01, 'Strategy;Creativity': 0.01} {'ART_AND_DESIGN': 0.64, 'AUTO_AND_VEHICLES': 0.93, 'BEAUTY': 0.6, 'BOOKS_AND_REFERENCE': 2.14, 'BUSINESS': 4.59, 'COMICS': 0.62, 'COMMUNICATION': 3.24, 'DATING': 1.86, 'EDUCATION': 1.16, 'ENTERTAINMENT': 0.96, 'EVENTS': 0.71, 'FINANCE': 3.7, 'FOOD_AND_DRINK': 1.24, 'HEALTH_AND_FITNESS': 3.08, 'HOUSE_AND_HOME': 0.82, 'LIBRARIES_AND_DEMO': 0.94, 'LIFESTYLE': 3.9, 'GAME': 9.73, 'FAMILY': 18.9, 'MEDICAL': 3.53, 
'SOCIAL': 2.66, 'SHOPPING': 2.25, 'PHOTOGRAPHY': 2.94, 'SPORTS': 3.4, 'TRAVEL_AND_LOCAL': 2.34, 'TOOLS': 8.46, 'PERSONALIZATION': 3.32, 'PRODUCTIVITY': 3.89, 'PARENTING': 0.65, 'WEATHER': 0.8, 'VIDEO_PLAYERS': 1.79, 'NEWS_AND_MAGAZINES': 2.8, 'MAPS_AND_NAVIGATION': 1.4}
sum(list(android_genre.values()))
99.92000000000009
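The total is slightly below 100 because each percentage is rounded to two decimals before being summed, so the individual rounding errors accumulate (the long tail of digits is ordinary floating-point noise). A tiny worked example, using made-up counts, shows the same effect:

```python
# Three categories with one app each: every share is exactly one third.
counts = {"a": 1, "b": 1, "c": 1}
total = sum(counts.values())

# Rounding each share to two decimals turns 33.333... into 33.33.
rounded = {key: round(value / total * 100, 2) for key, value in counts.items()}

exact_sum = sum(value / total * 100 for value in counts.values())
rounded_sum = sum(rounded.values())

print(rounded)
print(exact_sum)    # very close to 100
print(rounded_sum)  # about 99.99, because 0.01 was lost to rounding
```

With over a hundred genres in the Android table, many small rounding errors of this kind add up to the 0.08 shortfall seen above.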
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
ios_genre_freq = display_table(ios_final,11)
print("\n")
print("\n")
android_genre_freq = display_table(android_final,9)
print("\n")
print("\n")
android_cat_freq = display_table(android_final,1)
Games : 58.14 Entertainment : 7.89 Photo & Video : 4.97 Education : 3.66 Social Networking : 3.29 Shopping : 2.61 Utilities : 2.52 Sports : 2.14 Music : 2.05 Health & Fitness : 2.02 Productivity : 1.74 Lifestyle : 1.58 News : 1.34 Travel : 1.24 Finance : 1.12 Weather : 0.87 Food & Drink : 0.81 Reference : 0.56 Business : 0.53 Book : 0.43 Navigation : 0.19 Medical : 0.19 Catalogs : 0.12 Tools : 8.45 Entertainment : 6.07 Education : 5.35 Business : 4.59 Productivity : 3.89 Lifestyle : 3.89 Finance : 3.7 Medical : 3.53 Sports : 3.46 Personalization : 3.32 Communication : 3.24 Action : 3.1 Health & Fitness : 3.08 Photography : 2.94 News & Magazines : 2.8 Social : 2.66 Travel & Local : 2.32 Shopping : 2.25 Books & Reference : 2.14 Simulation : 2.04 Dating : 1.86 Arcade : 1.85 Video Players & Editors : 1.77 Casual : 1.76 Maps & Navigation : 1.4 Food & Drink : 1.24 Puzzle : 1.13 Racing : 0.99 Role Playing : 0.94 Libraries & Demo : 0.94 Auto & Vehicles : 0.93 Strategy : 0.9 House & Home : 0.82 Weather : 0.8 Events : 0.71 Adventure : 0.68 Comics : 0.61 Beauty : 0.6 Art & Design : 0.6 Parenting : 0.5 Card : 0.45 Casino : 0.43 Trivia : 0.42 Educational;Education : 0.39 Board : 0.38 Educational : 0.37 Education;Education : 0.34 Word : 0.26 Casual;Pretend Play : 0.24 Music : 0.2 Racing;Action & Adventure : 0.17 Puzzle;Brain Games : 0.17 Entertainment;Music & Video : 0.17 Casual;Brain Games : 0.14 Casual;Action & Adventure : 0.14 Arcade;Action & Adventure : 0.12 Action;Action & Adventure : 0.1 Educational;Pretend Play : 0.09 Simulation;Action & Adventure : 0.08 Parenting;Education : 0.08 Entertainment;Brain Games : 0.08 Board;Brain Games : 0.08 Parenting;Music & Video : 0.07 Educational;Brain Games : 0.07 Casual;Creativity : 0.07 Art & Design;Creativity : 0.07 Education;Pretend Play : 0.06 Role Playing;Pretend Play : 0.05 Education;Creativity : 0.05 Role Playing;Action & Adventure : 0.03 Puzzle;Action & Adventure : 0.03 Entertainment;Creativity : 0.03 Entertainment;Action & 
Adventure : 0.03 Educational;Creativity : 0.03 Educational;Action & Adventure : 0.03 Education;Music & Video : 0.03 Education;Brain Games : 0.03 Education;Action & Adventure : 0.03 Adventure;Action & Adventure : 0.03 Video Players & Editors;Music & Video : 0.02 Sports;Action & Adventure : 0.02 Simulation;Pretend Play : 0.02 Puzzle;Creativity : 0.02 Music;Music & Video : 0.02 Entertainment;Pretend Play : 0.02 Casual;Education : 0.02 Board;Action & Adventure : 0.02 Video Players & Editors;Creativity : 0.01 Trivia;Education : 0.01 Travel & Local;Action & Adventure : 0.01 Tools;Education : 0.01 Strategy;Education : 0.01 Strategy;Creativity : 0.01 Strategy;Action & Adventure : 0.01 Simulation;Education : 0.01 Role Playing;Brain Games : 0.01 Racing;Pretend Play : 0.01 Puzzle;Education : 0.01 Parenting;Brain Games : 0.01 Music & Audio;Music & Video : 0.01 Lifestyle;Pretend Play : 0.01 Lifestyle;Education : 0.01 Health & Fitness;Education : 0.01 Health & Fitness;Action & Adventure : 0.01 Entertainment;Education : 0.01 Communication;Creativity : 0.01 Comics;Creativity : 0.01 Casual;Music & Video : 0.01 Card;Action & Adventure : 0.01 Books & Reference;Education : 0.01 Art & Design;Pretend Play : 0.01 Art & Design;Action & Adventure : 0.01 Arcade;Pretend Play : 0.01 Adventure;Education : 0.01 FAMILY : 18.9 GAME : 9.73 TOOLS : 8.46 BUSINESS : 4.59 LIFESTYLE : 3.9 PRODUCTIVITY : 3.89 FINANCE : 3.7 MEDICAL : 3.53 SPORTS : 3.4 PERSONALIZATION : 3.32 COMMUNICATION : 3.24 HEALTH_AND_FITNESS : 3.08 PHOTOGRAPHY : 2.94 NEWS_AND_MAGAZINES : 2.8 SOCIAL : 2.66 TRAVEL_AND_LOCAL : 2.34 SHOPPING : 2.25 BOOKS_AND_REFERENCE : 2.14 DATING : 1.86 VIDEO_PLAYERS : 1.79 MAPS_AND_NAVIGATION : 1.4 FOOD_AND_DRINK : 1.24 EDUCATION : 1.16 ENTERTAINMENT : 0.96 LIBRARIES_AND_DEMO : 0.94 AUTO_AND_VEHICLES : 0.93 HOUSE_AND_HOME : 0.82 WEATHER : 0.8 EVENTS : 0.71 PARENTING : 0.65 ART_AND_DESIGN : 0.64 COMICS : 0.62 BEAUTY : 0.6
For free English Apps on AppStore
For free English Apps on Google PlayStore
To find out which genres have the most users, we can use the Installs column for Google Play apps. The App Store dataset has no installs column, so we will use the total rating count (rating_count_tot) as a proxy.
We will calculate the average number of user ratings per app genre on Appstore and the average number of installs per app genre on PlayStore.
ios_genre_unique = freq_table(ios_final, 11)
sorting_ios = []
for genre in ios_genre_unique:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            len_genre += 1
            total += float(app[5])
    avg_user_rating = total/len_genre
    sorting_ios.append((avg_user_rating, genre))
avg_ratings_sorted = sorted(sorting_ios, reverse = True)
for item in avg_ratings_sorted:
    print(item[1], " : ", item[0])
Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22812.92467948718
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0
We can see that the genres with the highest average number of user ratings on the App Store are Navigation, Reference and Social Networking. Cross-referencing this with the earlier breakdown of apps by genre, I would recommend building a social media app.
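The averages above were computed by looping over the whole dataset once per genre. A possible single-pass variant, sketched below with a hypothetical function name, accumulates a running total and a count for each group and divides at the end.

```python
def average_by_group(rows, group_index, value_index):
    """Return (average, group) tuples sorted from highest to lowest average."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    averages = [(totals[group] / counts[group], group) for group in totals]
    return sorted(averages, reverse=True)

# Toy rows (name, genre, rating count) for illustration only.
toy_ios = [["x", "Games", "10"], ["y", "Games", "30"], ["z", "Music", "5"]]
for avg, genre in average_by_group(toy_ios, group_index=1, value_index=2):
    print(genre, ":", avg)
```

On the real data the call would be `average_by_group(ios_final, 11, 5)`, which I would expect to reproduce the ranking shown above.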
First, let's inspect the Installs column.
display_table(android_final,5)
1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
We can see that most values are open-ended (100+, 1,000+, 5,000+, etc.).
For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users.
We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and cause an error.
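The string-to-float conversion described above can be isolated in a small helper, which makes it easy to check on a few values before applying it to the whole dataset. `parse_installs` is a hypothetical name used only for this sketch.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    # Remove the thousands separators and the trailing plus sign first;
    # float('1,000,000+') would raise a ValueError.
    return float(text.replace(",", "").replace("+", ""))

for raw in ["0+", "100+", "1,000,000+"]:
    print(raw, "->", parse_installs(raw))
```

As discussed, this treats each open-ended bucket as its lower bound, so "1,000,000+" becomes 1000000.0.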
android_cat_unique = freq_table(android_final, 1)
sorting = []
for cat in android_cat_unique:
    total_android = 0
    len_cat_android = 0
    for app in android_final:
        cat_app = app[1]
        n_installs = app[5].replace('+','')
        n_installs = float(n_installs.replace(',',''))
        if cat_app == cat:
            len_cat_android += 1
            total_android += n_installs
    avg_installs = total_android/len_cat_android
    sorting.append((avg_installs, cat))
avg_installs_sorted = sorted(sorting, reverse = True)
for item in avg_installs_sorted:
    print(item[1], " : ", item[0])
COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513151.88679245283
EVENTS : 253542.22222222222
MEDICAL : 120550.61980830671
We can see that the app categories with the highest average number of installs are communication, video players and social.
Notice that social ranks highly for both iOS and Android applications.