Profitable App Profiles for the App Store and Google Play Markets¶

The aim of this project is to find mobile app profiles that are profitable. For this project, we assume that we work for a company that builds both iOS and Android applications and makes them available on the App Store and Google Play.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

Opening and Exploring the Data¶

We aim to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store. As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead.

To avoid spending resources on collecting new data ourselves, we first looked for relevant existing data available at no cost. Luckily, there are two datasets that seem suitable for our goals:

A dataset containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the dataset directly from this link.

A dataset containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the dataset directly from this link.

Let's start by opening and exploring these two datasets.

In [1]:

def explore_data(dataset, start, end, rows_and_columns = False):
    for i in dataset[start:end]:
        print(i)
        print('\n')  # adds blank space after each row
    if rows_and_columns:
        print("Number of rows: ",len(dataset))
        print("Number of Columns: ",len(dataset[0]))

In [2]:

from csv import reader
opened_android = open("C:/Users/ifediorah.kenechukwu/Documents/PythonDA/Datasets/googleplaystore.csv", encoding = "utf-8")
opened_apple = open("C:/Users/ifediorah.kenechukwu/Documents/PythonDA/Datasets/AppleStore.csv", encoding = "utf-8")
apple_data = list(reader(opened_apple))
android_data = list(reader(opened_android))
ios_header = apple_data[0]
ios = apple_data[1:] 
android_header = android_data[0]
android = android_data[1:]
opened_android.close()
opened_apple.close()
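The open, read and close sequence above can also be wrapped in a small helper that uses a with block, so the file is closed even if reading fails. This is a minimal sketch: load_csv is a hypothetical helper name that does not appear in the notebook, and the path passed to it is wherever the CSV files live on your machine.

```python
from csv import reader

def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The csv module documentation recommends newline="" for files given to csv.reader
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    # The first row is the header; everything after it is data
    return rows[0], rows[1:]
```

With such a helper, each dataset could be loaded in one line, for example android_header, android = load_csv(path_to_googleplaystore_csv), where path_to_googleplaystore_csv is a placeholder for the real file location.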

Now that we have read in the data, let's explore the datasets¶

In [3]:

print("Apple Dataset")
explore_data(ios,0,5,True)
print("\n\n\n")
print("Android Dataset")
explore_data(android,0,5,True)

Apple Dataset
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows:  7197
Number of Columns:  16




Android Dataset
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows:  10841
Number of Columns:  13

In [4]:

print("Headers for Apple dataset")
print(ios_header)
print("Headers for Android dataset")
print(android_header)

Headers for Apple dataset
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Headers for Android dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Now we need to clean our data to suit our analytic purpose¶

According to the dataset documentation, users reported that one row in the Android dataset has incomplete data: its Category value is missing, which leaves the row one field short. Below, I'll confirm that and delete the row to avoid inaccuracies.

In [5]:

print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

In [6]:

del android[10472]

In [7]:

print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
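Rather than relying on a row index taken from the documentation, a malformed row can also be located by comparing each row's length with the header's length (the row deleted above had 12 fields, one fewer than the 13 columns in the header). A minimal sketch with made-up sample rows; find_bad_rows is a hypothetical helper name:

```python
def find_bad_rows(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]

# Made-up rows for illustration only
header = ["App", "Category", "Rating"]
rows = [
    ["Box", "BUSINESS", "4.2"],
    ["Broken app", "1.9"],        # Category is missing, so the row is short
    ["Slack", "BUSINESS", "4.4"],
]

bad = find_bad_rows(rows, header)
print(bad)

# Building a new list avoids index shifts that come from deleting while iterating
good_rows = [row for row in rows if len(row) == len(header)]
print(len(good_rows))
```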

From inspecting the Android data and its documentation, it appears that the data contains duplicate entries.

Below, I'll write code to identify and count the duplicate entries, and then remove them. The removal will not be random: I'll inspect the data and keep only the most recent entry for each app. I should be able to get that information from the number of installs.

In [8]:

# For Android
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

In [9]:

print(len(unique_apps))
print(len(duplicate_apps))

9659
1181

In [10]:

print(duplicate_apps[:10])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
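Because the test name in unique_apps scans a list, the loop above does work that grows with the square of the number of rows. The same counts can be obtained with collections.Counter from the standard library. This is a sketch on a few made-up names rather than the real dataset:

```python
from collections import Counter

# Made-up app names for illustration only
names = ["Box", "Slack", "Box", "Zenefits", "Box", "Slack"]

counts = Counter(names)

# Names that occur more than once, with how many times each occurs
repeated = {name: n for name, n in counts.items() if n > 1}

# Number of extra rows, comparable to len(duplicate_apps) in the cell above
n_extra = sum(n - 1 for n in repeated.values())

print(len(counts))   # number of unique names
print(repeated)
print(n_extra)
```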

In [11]:

# finding one duplicate app in order to inspect the differences
duplicate_apps1 = []
unique_apps1 = []

for app in android:
    name = app[0]
    if name in duplicate_apps and name in unique_apps1:
        duplicate_apps1.append(name)
    elif name in duplicate_apps and name not in unique_apps1:
        unique_apps1.append(name)
        
print(len(unique_apps1))
print(len(duplicate_apps1))

798
1181

In [12]:

print(duplicate_apps1[:20])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']

In [13]:

count = 0
for app in android:
    count += 1
    if app[0] == 'Google My Business':
        print(app)
        print(count)

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
194
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
240
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
269

In [14]:

print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

In [15]:

count = 0
for app in android:
    count += 1
    if app[0] == 'Instagram':
        print(app)
        print(count)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2546
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2605
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2612
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
3910

In [16]:

print(android[249])

['Google Analytics', 'BUSINESS', '4.5', '78662', '22M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 13, 2018', '3.7.5', '4.4 and up']

From the output above, we can see that the duplicate rows differ mostly in the number of reviews, which suggests that the data was scraped at different times.

I am going to remove the duplicates and keep, for each app, the entry with the highest number of reviews, since it is likely the most recently scraped.

In [17]:

# Getting the highest review count for each app

reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [18]:

print(len(reviews_max))

What next?¶

Now that I have the highest review count for each app, I will loop through the Android dataset again and keep only a single copy of each duplicated entry: the one with the highest number of reviews. This creates a new dataset called android_clean.

In [19]:

android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in already_added and n_reviews == reviews_max[name]:
        android_clean.append(app)
        already_added.append(name)

In [20]:

print(len(android_clean))
print(len(already_added))

9659
9659
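The two loops above can also be folded into a single pass that stores the best row seen so far for each name. This is a sketch on made-up rows that hold only a name and a review count; keep_most_reviewed is a hypothetical helper. One difference from the cells above: the result follows the order in which each name first appears, which may not match the order of android_clean.

```python
def keep_most_reviewed(rows, name_idx, reviews_idx):
    """Keep one row per name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Made-up rows: [name, reviews]
sample = [
    ["Instagram", "66577313"],
    ["Box", "159"],
    ["Instagram", "66577446"],
    ["Instagram", "66509917"],
]
deduped = keep_most_reviewed(sample, 0, 1)
print(deduped)
```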

In [21]:

print(android_clean[:5],'\n')

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]

In [22]:

explore_data(android_clean,0,5)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']

Now we are going to remove apps whose names contain non-English characters. Characters commonly used in English text have codes in the ASCII range (0 to 127), so a character with a code greater than 127 is treated as non-English¶

In [23]:

def check_ascii(a_string):
    # Some English app names contain a few characters outside the ASCII range
    # (for example emoji or the trademark sign), so we count them and only
    # classify the name as non-English if it has more than 3 such characters.
    count = 0
    for i in a_string:
        if ord(i) > 127:
            count += 1
        if count > 3:
            return False
    return True

In [24]:

print(check_ascii("Instagram"))
print(check_ascii("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(check_ascii("Docs To Go™ Free Office Suite"))
print(check_ascii("Instachat 😜"))

True
False
True
True
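To see why the last two names pass while the second one does not, it helps to count the characters outside the ASCII range in each name. A small sketch; count_non_ascii is a hypothetical helper that is not part of the notebook:

```python
def count_non_ascii(a_string):
    """Count characters whose code point is above 127."""
    return sum(1 for ch in a_string if ord(ch) > 127)

names = [
    "Instagram",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
]
for name in names:
    # A name is kept by check_ascii when this count is 3 or less
    print(name, count_non_ascii(name))
```

The threshold of three is a heuristic: a name with four or more emoji or symbols would still be dropped, and a non-English name written entirely in ASCII letters would still be kept.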

Now, we will apply this filter function to our dataset¶

In [25]:

android_filtered = []
android_nonenglish = []
for app in android_clean:
    if check_ascii(app[0]):
        android_filtered.append(app)
    else:
        android_nonenglish.append(app[0])
print(len(android_filtered))
print(len(android_nonenglish))

9614
45

In [26]:

print(android_nonenglish)

['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости', 'صور حرف H', 'L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'AJ렌터카 법인 카셰어링', 'Al Quran Free - القرآن (Islam)', '中国語 AQリスニング', '日本AV历史', 'Ay Yıldız Duvar Kağıtları', 'বাংলা টিভি প্রো BD Bangla TV', 'Cъновник BG', 'CSCS BG (в български)', '뽕티비 - 개인방송, 인터넷방송, BJ방송', 'BL 女性向け恋愛ゲーム◆俺プリクロス', 'SecondSecret ‐「恋を読む」BLノベルゲーム‐', 'BL 女性向け恋愛ゲーム◆ごくメン', 'あなカレ【BL】無料ゲーム', '감성학원 BL 첫사랑', 'BQ-መጽሐፍ ቅዱሳዊ ጥያቄዎች', 'BS Calendar / Patro / पात्रो', 'Vip视频免费看-BT磁力搜索', 'Билеты ПДД CD 2019 PRO', 'Offline Jízdní řády CG Transit', 'Bonjour 2017 Abidjan CI ❤❤❤❤❤', 'CK 初一 十五', 'الفاتحون Conquerors', 'DG ग्राम / Digital Gram Panchayat', 'DM הפקות', 'DW فارسی By dw-arab.com', 'لعبة تقدر تربح DZ', 'বাংলাflix', 'RPG ブレイジング ソウルズ アクセレイト', '英漢字典 EC Dictionary', 'ECナビ×シュフー', 'أحداث وحقائق | خبر عاجل في اخبار العالم', 'EG SIM CARD (EGSIMCARD, 이지심카드)', 'パーリーゲイツ公式通販｜EJ STYLE（イージェイスタイル）', 'FAHREDDİN er-RÂZİ TEFSİRİ', "I'm Rich/Eu sou Rico/أنا غني/我很有錢", 'AÖF Ev İdaresi 1. Sınıf', 'Ey Sey Storytime រឿងនិទានតាឥសី', '哈哈姆特不EY', 'FP Разбитый дисплей']

In [27]:

print(android_header)
print(android_filtered[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

In [28]:

android_final = []
for app in android_filtered:
    if app[6] == "Free":
        android_final.append(app)
print(len(android_final))

We have been working on the Android dataset so far. Now that we are done cleaning it, let's go ahead and clean the iOS dataset.

In [29]:

# Print the name of any iOS row whose length differs from the header's
for row in ios:
    if len(row) != len(ios_header):
        print(row[1])

In [30]:

#Let's loop through for duplicate entries
# For iOS
duplicate_ios = []
unique_ios = []

for app in ios:
    name = app[1]
    if name in unique_ios:
        duplicate_ios.append(name)
    else:
        unique_ios.append(name)

In [31]:

print(duplicate_ios)

['Mannequin Challenge', 'VR Roller Coaster']

We can see that two app names appear twice.

In [32]:

count = 0
for app in ios:
    count += 1
    if app[1] == 'VR Roller Coaster':
        print(app)
        print(count)
        
print(ios_header)

['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
4443
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']
4832
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
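Before treating these two rows as duplicates, note that their id values differ in the output above, so they may be two separate apps that happen to share a name. A sketch of checking for repeats by id instead of by name, using the first two columns of rows printed earlier in this notebook:

```python
from collections import Counter

# First two columns (id, track_name) of rows shown in the outputs above
sample = [
    ["952877179", "VR Roller Coaster"],
    ["1089824278", "VR Roller Coaster"],
    ["284882215", "Facebook"],
]

id_counts = Counter(row[0] for row in sample)
name_counts = Counter(row[1] for row in sample)

repeated_ids = [app_id for app_id, n in id_counts.items() if n > 1]
repeated_names = [name for name, n in name_counts.items() if n > 1]

print(repeated_ids)
print(repeated_names)
```

Running the same check over the full ios list would show whether any id is truly repeated; that result is not shown here.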

As we did before, we are going to remove the duplicates and retain the entry with the higher rating count.

In [33]:

# Getting the highest rating count for each app

ios_reviews_max = {}
for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    if name in ios_reviews_max and ios_reviews_max[name] < n_reviews:
        ios_reviews_max[name] = n_reviews
    elif name not in ios_reviews_max:
        ios_reviews_max[name] = n_reviews

In [34]:

print(len(ios_reviews_max))
print(len(ios))

7195
7197

In [35]:

ios_clean = []
ios_already_added = []

for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    if name not in ios_already_added and n_reviews == ios_reviews_max[name]:
        ios_clean.append(app)
        ios_already_added.append(name)

In [36]:

len(ios_clean)

Out[36]:

Now we remove non-English apps from ios_clean¶

In [37]:

ios_filtered = []
ios_nonenglish = []
for app in ios_clean:
    if check_ascii(app[1]):
        ios_filtered.append(app)
    else:
        ios_nonenglish.append(app[1])
print(len(ios_filtered))
print(len(ios_nonenglish))

6181
1014

In [38]:

print(ios_nonenglish[:50])

['爱奇艺PPS -《欢乐颂2》电视剧热播', '聚力视频HD-人民的名义,跨界歌王全网热播', '优酷视频', '网易新闻 - 精选好内容，算出你的兴趣', '淘宝 - 随时随地，想淘就淘', '搜狐视频HD-欢乐颂2 全网首播', '阴阳师-全区互通现世集结', '百度贴吧-全球最大兴趣交友社区', '百度网盘', '爱奇艺HD -《欢乐颂2》电视剧热播', '乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播', '万年历-值得信赖的日历黄历查询工具', '新浪新闻-阅读最新时事热门头条资讯视频', '喜马拉雅FM（听书社区）电台有声小说相声英语', '央视影音-海量央视内容高清直播', '腾讯视频HD-楚乔传,明日之子6月全网首播', '手机百度 - 百度一下你就得到', '百度视频HD-高清电视剧、电影在线观看神器', 'MOMO陌陌-开启视频社交,用直播分享生活', 'QQ 浏览器-搜新闻、选小说漫画、看视频', '同花顺-炒股、股票', '聚力视频-蓝光电视剧电影在线热播', '快看漫画', '乐视视频-白鹿原,欢乐颂,奔跑吧全网热播', '酷我音乐HD-无损在线播放', '随手记（专业版）-好用的记账理财工具', 'Dictionary ( قاموس عربي / انجليزي + ودجيت الترجمة)', '滴滴出行', '高德地图（精准专业的手机地图）', '百度HD-极速安全浏览器', '美丽说-潮流穿搭快人一步', '百度地图-智能的手机导航，公交地铁出行必备', 'Majiang Mahjong（单机+川麻+二人+武汉+国标）', '土豆视频HD—高清影视综艺视频播放器', '360手机卫士-超安全的来电防骚扰助手', 'QQ浏览器HD-极速搜索浏览器', '搜狗输入法-Sogou Keyboard', '百度网盘 HD', '大众点评-发现品质生活', '讯飞输入法-智能语音输入和表情斗图神器', '美柚 - 女生助手', '爱奇艺 - 电视剧电影综艺娱乐视频播放器', '搜狐视频-欢乐颂2 全网首播', '百度地图HD', 'QQ同步助手-新机一键换机必备工具', 'QQ音乐-来这里“发现・音乐”', '腾讯新闻-头条新闻热点资讯掌上阅读软件', '土豆（短视频分享平台）', '风行视频+ HD - 电影电视剧体育视频播放器', '仙劍奇俠傳5 - 劍傲丹楓']

Now let's get the free apps from the iOS dataset.

In [39]:

ios_final = []
for app in ios_filtered:
    if float(app[4]) == 0.0:
        ios_final.append(app)
print(len(ios_final))

In [40]:

explore_data(android_final,0,3)
explore_data(ios_final,0,3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

Now that we have our clean data, we can go ahead to start our analysis¶

Our objective is to find an app profile that fits both the App Store and Google Play. We want to find apps that are successful in both markets.

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

Build a minimal Android version of the app, and add it to Google Play.
If the app has a good response from users, we develop it further.
If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.

In [41]:

explore_data(android_final,0,3)
print(android_header)
print('\n')
print('\n')
print('\n')
explore_data(ios_final,0,3)
print(ios_header)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']






['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Let's build frequency tables for the most common genre in the ios and android datasets¶

In [42]:

def freq_table(dataset, index):
    table = {}
    for app in dataset:
        value = app[index]  # the value in the column being counted
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    total = sum(list(table.values()))
    for item in table:
        table[item] = round((table[item]/total) * 100, 2)
    return table
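For comparison, the same percentage table can be built with collections.Counter from the standard library. This is an alternative sketch, not the notebook's function: freq_table_counter is a hypothetical name and the rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: round(n / total * 100, 2) for value, n in counts.items()}

# Made-up rows: [name, genre]
rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
shares = freq_table_counter(rows, 1)
print(shares)
```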

In [43]:

ios_genre = freq_table(ios_final,11)
android_genre = freq_table(android_final, 9)
android_category = freq_table(android_final, 1)

print(ios_genre)
print("\n")
print("\n")
print(android_genre)
print("\n")
print("\n")
print(android_category)

{'Social Networking': 3.29, 'Photo & Video': 4.97, 'Games': 58.14, 'Music': 2.05, 'Reference': 0.56, 'Health & Fitness': 2.02, 'Weather': 0.87, 'Utilities': 2.52, 'Travel': 1.24, 'Shopping': 2.61, 'News': 1.34, 'Navigation': 0.19, 'Lifestyle': 1.58, 'Entertainment': 7.89, 'Food & Drink': 0.81, 'Sports': 2.14, 'Book': 0.43, 'Finance': 1.12, 'Education': 3.66, 'Productivity': 1.74, 'Business': 0.53, 'Catalogs': 0.12, 'Medical': 0.19}




{'Art & Design': 0.6, 'Art & Design;Creativity': 0.07, 'Auto & Vehicles': 0.93, 'Beauty': 0.6, 'Books & Reference': 2.14, 'Business': 4.59, 'Comics': 0.61, 'Comics;Creativity': 0.01, 'Communication': 3.24, 'Dating': 1.86, 'Education': 5.35, 'Education;Creativity': 0.05, 'Education;Education': 0.34, 'Education;Pretend Play': 0.06, 'Education;Brain Games': 0.03, 'Entertainment': 6.07, 'Entertainment;Brain Games': 0.08, 'Entertainment;Creativity': 0.03, 'Entertainment;Music & Video': 0.17, 'Events': 0.71, 'Finance': 3.7, 'Food & Drink': 1.24, 'Health & Fitness': 3.08, 'House & Home': 0.82, 'Libraries & Demo': 0.94, 'Lifestyle': 3.89, 'Lifestyle;Pretend Play': 0.01, 'Card': 0.45, 'Arcade': 1.85, 'Puzzle': 1.13, 'Racing': 0.99, 'Sports': 3.46, 'Casual': 1.76, 'Simulation': 2.04, 'Adventure': 0.68, 'Trivia': 0.42, 'Action': 3.1, 'Word': 0.26, 'Role Playing': 0.94, 'Strategy': 0.9, 'Board': 0.38, 'Music': 0.2, 'Action;Action & Adventure': 0.1, 'Casual;Brain Games': 0.14, 'Educational;Creativity': 0.03, 'Puzzle;Brain Games': 0.17, 'Educational;Education': 0.39, 'Casual;Pretend Play': 0.24, 'Educational;Brain Games': 0.07, 'Art & Design;Pretend Play': 0.01, 'Educational;Pretend Play': 0.09, 'Entertainment;Education': 0.01, 'Casual;Education': 0.02, 'Casual;Creativity': 0.07, 'Casual;Action & Adventure': 0.14, 'Music;Music & Video': 0.02, 'Arcade;Pretend Play': 0.01, 'Adventure;Action & Adventure': 0.03, 'Role Playing;Action & Adventure': 0.03, 'Simulation;Pretend Play': 0.02, 'Puzzle;Creativity': 0.02, 'Simulation;Action & Adventure': 0.08, 'Racing;Action & Adventure': 0.17, 'Sports;Action & Adventure': 0.02, 'Educational;Action & Adventure': 0.03, 'Arcade;Action & Adventure': 0.12, 'Entertainment;Action & Adventure': 0.03, 'Art & Design;Action & Adventure': 0.01, 'Puzzle;Action & Adventure': 0.03, 'Education;Action & Adventure': 0.03, 'Strategy;Action & Adventure': 0.01, 'Music & Audio;Music & Video': 0.01, 'Health & Fitness;Education': 0.01, 'Board;Action & Adventure': 0.02, 'Board;Brain Games': 0.08, 'Casual;Music & Video': 0.01, 'Education;Music & Video': 0.03, 'Role Playing;Pretend Play': 0.05, 'Entertainment;Pretend Play': 0.02, 'Medical': 3.53, 'Social': 2.66, 'Shopping': 2.25, 'Photography': 2.94, 'Travel & Local': 2.32, 'Travel & Local;Action & Adventure': 0.01, 'Tools': 8.45, 'Tools;Education': 0.01, 'Personalization': 3.32, 'Productivity': 3.89, 'Parenting': 0.5, 'Parenting;Music & Video': 0.07, 'Parenting;Education': 0.08, 'Parenting;Brain Games': 0.01, 'Weather': 0.8, 'Video Players & Editors': 1.77, 'Video Players & Editors;Music & Video': 0.02, 'Video Players & Editors;Creativity': 0.01, 'News & Magazines': 2.8, 'Maps & Navigation': 1.4, 'Health & Fitness;Action & Adventure': 0.01, 'Educational': 0.37, 'Casino': 0.43, 'Trivia;Education': 0.01, 'Lifestyle;Education': 0.01, 'Card;Action & Adventure': 0.01, 'Books & Reference;Education': 0.01, 'Simulation;Education': 0.01, 'Puzzle;Education': 0.01, 'Adventure;Education': 0.01, 'Role Playing;Brain Games': 0.01, 'Strategy;Education': 0.01, 'Racing;Pretend Play': 0.01, 'Communication;Creativity': 0.01, 'Strategy;Creativity': 0.01}




{'ART_AND_DESIGN': 0.64, 'AUTO_AND_VEHICLES': 0.93, 'BEAUTY': 0.6, 'BOOKS_AND_REFERENCE': 2.14, 'BUSINESS': 4.59, 'COMICS': 0.62, 'COMMUNICATION': 3.24, 'DATING': 1.86, 'EDUCATION': 1.16, 'ENTERTAINMENT': 0.96, 'EVENTS': 0.71, 'FINANCE': 3.7, 'FOOD_AND_DRINK': 1.24, 'HEALTH_AND_FITNESS': 3.08, 'HOUSE_AND_HOME': 0.82, 'LIBRARIES_AND_DEMO': 0.94, 'LIFESTYLE': 3.9, 'GAME': 9.73, 'FAMILY': 18.9, 'MEDICAL': 3.53, 'SOCIAL': 2.66, 'SHOPPING': 2.25, 'PHOTOGRAPHY': 2.94, 'SPORTS': 3.4, 'TRAVEL_AND_LOCAL': 2.34, 'TOOLS': 8.46, 'PERSONALIZATION': 3.32, 'PRODUCTIVITY': 3.89, 'PARENTING': 0.65, 'WEATHER': 0.8, 'VIDEO_PLAYERS': 1.79, 'NEWS_AND_MAGAZINES': 2.8, 'MAPS_AND_NAVIGATION': 1.4}

In [44]:

sum(list(android_genre.values()))

Out[44]:

99.92000000000009
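The total is slightly below 100 because each percentage was rounded to two decimals before being summed, so many small rounding errors accumulate (floating-point addition accounts for the long tail of digits). A tiny illustration with three equal shares:

```python
# Three equal shares of a whole
shares = [1 / 3, 1 / 3, 1 / 3]

exact = [s * 100 for s in shares]
rounded = [round(p, 2) for p in exact]   # each share becomes 33.33

print(sum(rounded))                       # a little under 100
print(abs(sum(exact) - 100) < 1e-9)       # the unrounded shares still add up
```

Summing the raw counts and rounding only when displaying the table would avoid this drift.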

In [45]:

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [46]:

ios_genre_freq = display_table(ios_final,11)
print("\n")
print("\n")
android_genre_freq = display_table(android_final,9)
print("\n")
print("\n")
android_cat_freq = display_table(android_final,1)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12




Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;Brain Games : 0.14
Casual;Action & Adventure : 0.14
Arcade;Action & Adventure : 0.12
Action;Action & Adventure : 0.1
Educational;Pretend Play : 0.09
Simulation;Action & Adventure : 0.08
Parenting;Education : 0.08
Entertainment;Brain Games : 0.08
Board;Brain Games : 0.08
Parenting;Music & Video : 0.07
Educational;Brain Games : 0.07
Casual;Creativity : 0.07
Art & Design;Creativity : 0.07
Education;Pretend Play : 0.06
Role Playing;Pretend Play : 0.05
Education;Creativity : 0.05
Role Playing;Action & Adventure : 0.03
Puzzle;Action & Adventure : 0.03
Entertainment;Creativity : 0.03
Entertainment;Action & Adventure : 0.03
Educational;Creativity : 0.03
Educational;Action & Adventure : 0.03
Education;Music & Video : 0.03
Education;Brain Games : 0.03
Education;Action & Adventure : 0.03
Adventure;Action & Adventure : 0.03
Video Players & Editors;Music & Video : 0.02
Sports;Action & Adventure : 0.02
Simulation;Pretend Play : 0.02
Puzzle;Creativity : 0.02
Music;Music & Video : 0.02
Entertainment;Pretend Play : 0.02
Casual;Education : 0.02
Board;Action & Adventure : 0.02
Video Players & Editors;Creativity : 0.01
Trivia;Education : 0.01
Travel & Local;Action & Adventure : 0.01
Tools;Education : 0.01
Strategy;Education : 0.01
Strategy;Creativity : 0.01
Strategy;Action & Adventure : 0.01
Simulation;Education : 0.01
Role Playing;Brain Games : 0.01
Racing;Pretend Play : 0.01
Puzzle;Education : 0.01
Parenting;Brain Games : 0.01
Music & Audio;Music & Video : 0.01
Lifestyle;Pretend Play : 0.01
Lifestyle;Education : 0.01
Health & Fitness;Education : 0.01
Health & Fitness;Action & Adventure : 0.01
Entertainment;Education : 0.01
Communication;Creativity : 0.01
Comics;Creativity : 0.01
Casual;Music & Video : 0.01
Card;Action & Adventure : 0.01
Books & Reference;Education : 0.01
Art & Design;Pretend Play : 0.01
Art & Design;Action & Adventure : 0.01
Arcade;Pretend Play : 0.01
Adventure;Education : 0.01




FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6

From the analysis above, we can deduce the following¶

For free English apps on the App Store

The most common genre is Games and the next most common is Entertainment
There are comparatively few Medical, Navigation and Catalogs apps
Most of the dominant apps are designed to pass the time
Most of the apps are built for entertainment
I recommend we build an entertainment app, either a social media app or a game

For free English apps on Google Play

The most common genre is Tools and the next most common is Entertainment
The most common category is Family and the next most common is Game
Google Play shows a more even balance between apps developed for fun and apps developed for productivity
I recommend we build a fun educational app. However, these frequencies alone are not enough to settle on a recommendation for Google Play, so let's explore further

Let's determine the most popular apps by genre¶

For Google Play, we can do this using the installs column for each app. The App Store dataset has no installs column, so we will use the total rating count (rating_count_tot) column as a proxy.

We will calculate the average number of user ratings per app genre on the App Store and the average number of installs per app category on Google Play.

In [54]:

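# Frequency table keyed by genre (column 11); only its keys are used here.
ios_genre_unique = freq_table(ios_final, 11)

sorting_ios = []
for genre in ios_genre_unique:
    total = 0       # sum of rating counts for this genre
    len_genre = 0   # number of apps in this genre
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            len_genre += 1
            total += float(app[5])  # rating_count_tot
    # Average number of user ratings per app, not the average star rating.
    avg_n_ratings = total/len_genre
    sorting_ios.append((avg_n_ratings, genre))
# Sort by average number of ratings, highest first.
avg_ratings_sorted = sorted(sorting_ios, reverse = True)
for item in avg_ratings_sorted:
    print(item[1], " : ", item[0])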

Navigation  :  86090.33333333333
Reference  :  74942.11111111111
Social Networking  :  71548.34905660378
Music  :  57326.530303030304
Weather  :  52279.892857142855
Book  :  39758.5
Food & Drink  :  33333.92307692308
Finance  :  31467.944444444445
Photo & Video  :  28441.54375
Travel  :  28243.8
Shopping  :  26919.690476190477
Health & Fitness  :  23298.015384615384
Sports  :  23008.898550724636
Games  :  22812.92467948718
News  :  21248.023255813954
Productivity  :  21028.410714285714
Utilities  :  18684.456790123455
Lifestyle  :  16485.764705882353
Entertainment  :  14029.830708661417
Business  :  7491.117647058823
Education  :  7003.983050847458
Catalogs  :  4004.0
Medical  :  612.0
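
As an aside, the same per-genre average can be computed in a single pass over the rows by accumulating totals and counts in dictionaries, instead of re-scanning the dataset once per genre. The sketch below is only an illustration under stated assumptions: average_by_group and sample_rows are hypothetical names and made-up data, not part of this notebook or of the App Store dataset.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} using one pass over rows."""
    totals = {}  # running sum of values per group
    counts = {}  # number of rows seen per group
    for row in rows:
        key = row[group_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical two-column rows: [genre, rating count as a string].
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '50'],
]

averages = average_by_group(sample_rows, group_index=0, value_index=1)

# Print the groups from highest to lowest average.
for genre, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', avg)
```

For the real data, the call would presumably be average_by_group(ios_final, 11, 5), assuming the column positions used in the cell above.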

Now we can see that the genres with the highest average number of user ratings on the App Store are Navigation, Reference and Social Networking. Cross-referencing this with the earlier analysis of the number of apps per genre, I would recommend building a social media app.

Now let's do the same for the Android apps¶

First, let's inspect the column holding the number of installs.

In [49]:

display_table(android_final,5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05

We can see that most values are open-ended (100+, 1,000+, 5,000+, etc.).

For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and cause an error.
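
The string cleanup described above can be sketched as a small helper. This is a minimal illustration, assuming every value has the form shown in the table (digits, optional commas, a trailing plus); parse_installs is a hypothetical name and is not defined elsewhere in this notebook.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    # Drop the thousands separators and the trailing plus sign.
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


# A few values taken from the frequency table above.
for raw in ['0+', '100+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Calling float() directly on a value like '1,000,000+' raises a ValueError, which is why both characters are stripped first.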

In [55]:

# Frequency table keyed by category (column 1); only its keys are used here.
android_cat_unique = freq_table(android_final, 1)

sorting = []
for cat in android_cat_unique:
    total_android = 0     # sum of installs for this category
    len_cat_android = 0   # number of apps in this category
    for app in android_final:
        cat_app = app[1]
        if cat_app == cat:
            # Strip '+' and ',' so the install string converts to a float.
            n_installs = app[5].replace('+','')
            n_installs = float(n_installs.replace(',',''))
            len_cat_android += 1
            total_android += n_installs
    avg_installs = total_android/len_cat_android
    sorting.append((avg_installs, cat))
# Sort by average installs, highest first.
avg_installs_sorted = sorted(sorting, reverse = True)
for item in avg_installs_sorted:
    print(item[1], " : ", item[0])

COMMUNICATION  :  38456119.167247385
VIDEO_PLAYERS  :  24727872.452830188
SOCIAL  :  23253652.127118643
PHOTOGRAPHY  :  17840110.40229885
PRODUCTIVITY  :  16787331.344927534
GAME  :  15588015.603248259
TRAVEL_AND_LOCAL  :  13984077.710144928
ENTERTAINMENT  :  11640705.88235294
TOOLS  :  10801391.298666667
NEWS_AND_MAGAZINES  :  9549178.467741935
BOOKS_AND_REFERENCE  :  8767811.894736841
SHOPPING  :  7036877.311557789
PERSONALIZATION  :  5201482.6122448975
WEATHER  :  5074486.197183099
HEALTH_AND_FITNESS  :  4188821.9853479853
MAPS_AND_NAVIGATION  :  4056941.7741935486
FAMILY  :  3697848.1731343283
SPORTS  :  3638640.1428571427
ART_AND_DESIGN  :  1986335.0877192982
FOOD_AND_DRINK  :  1924897.7363636363
EDUCATION  :  1833495.145631068
BUSINESS  :  1712290.1474201474
LIFESTYLE  :  1437816.2687861272
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1331540.5616438356
DATING  :  854028.8303030303
COMICS  :  817657.2727272727
AUTO_AND_VEHICLES  :  647317.8170731707
LIBRARIES_AND_DEMO  :  638503.734939759
PARENTING  :  542603.6206896552
BEAUTY  :  513151.88679245283
EVENTS  :  253542.22222222222
MEDICAL  :  120550.61980830671

We can see that the app categories with the highest average number of installs are communication, video players and social.

Notice that social apps rank near the top on both iOS and Android.