Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to analyze data to understand what types of apps are likely to attract more users on Google Play and the App Store. To do this, I'll need to collect, explore, and analyze data about mobile apps available on these platforms.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. The datasets used in my analysis were found on Kaggle (here and here) and were scraped at roughly the same time, in 2018.

Datasets

  • A data set containing data about approximately 10,000 Android apps from Google Play. You can download the data set directly from this link.
  • A data set containing data about approximately 7,000 iOS apps from the App Store. You can download the data set directly from this link.

Open datasets

The first step is to open the CSV files 'AppleStore.csv' and 'googleplaystore.csv'. For that task I created a function read_csv().

In [1]:
from csv import reader

# create a function to read csv files
def read_csv(path):
    # utf8 encoding is assumed here; the with-block closes the file after reading
    with open(path, encoding='utf8') as opened_csv:
        dataset = list(reader(opened_csv))
    return dataset

# read Apple Store
ios = read_csv('AppleStore.csv')
header_ios = ios[0]
ios = ios[1:]

# read Google Play Store
android = read_csv('googleplaystore.csv')
header_android = android[0]
android = android[1:]

Explore datasets with explore_data() function

To make it easier to explore the datasets, I use a function named explore_data() that:

  • prints the number of rows and columns
  • prints the column names (or headers)
  • prints the rows in a given slice (here, the first three).

Next, I apply the function to both datasets.

In [2]:
# define the function
def explore_data(dataset, start, end, header=None):
    
    print('Number of rows in dataset: {}.'.format(len(dataset)))
    print('Number of columns in dataset: {}.'.format(len(dataset[0])))
    
    if header is None:
        print('Column names are:', dataset[0])
    else:
        print('Column names are', header)
          
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print() # adds a new (empty) line after each row
        
# explore ios dataset       
explore_data(ios, 0, 3, header_ios)

# explore android dataset
explore_data(android, 0, 3, header_android)
Number of rows in dataset: 7197.
Number of columns in dataset: 16.
Column names are ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

Number of rows in dataset: 10841.
Number of columns in dataset: 13.
Column names are ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

Data cleaning

To make sure the data we analyze is accurate we have to:

  • detect inaccurate data, and correct or remove it.
  • detect duplicate data, and remove the duplicates.

Remove paid apps

The first step is to remove apps that aren't free.

In [3]:
# find free apps in Apple Store dataset
ios_free = []
for row in ios:
    if row[4] == '0.0':
        ios_free.append(row)
print('Apple Store free apps dataset includes', len(ios_free), 'rows', '\n')

# find free apps in Google Play Store dataset
android_free = []
for row in android:
    if row[6] == 'Free':
        android_free.append(row)
print('Google Play Store free apps dataset includes', len(android_free), 'rows', '\n')
Apple Store free apps dataset includes 4056 rows 

Google Play Store free apps dataset includes 10039 rows 
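
The string comparisons above work because free apps are recorded as '0.0' and 'Free' in these two files. A minimal sketch of a more tolerant check, assuming prices may also appear as '0' or with a leading '$' (the helper name is_free_price is hypothetical and not part of the original notebook):

```python
def is_free_price(price):
    """Return True if a price string represents zero.

    Assumption: prices look like '0', '0.0' or '$4.99'; anything that
    cannot be parsed as a number is treated as not free.
    """
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


sample_prices = ['0.0', '0', '$4.99', '2.99', 'Everyone']
print([is_free_price(p) for p in sample_prices])
# [True, True, False, False, False]
```

Treating unparseable values as not free is a design choice: a malformed row is then dropped together with the paid apps instead of raising an error.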

Check for duplicates

First, I check both datasets for duplicate app names and print the rows for one duplicated app to see how closely they match each other.

In [4]:
# check for duplicates in Apple Store dataset
unique_names_ios = []
duplicate_names_ios = []
for row in ios_free:
    name = row[1]
    if name in unique_names_ios:
        duplicate_names_ios.append(name)
    else: 
        unique_names_ios.append(name)
print('There are', len(duplicate_names_ios), 'duplicates in the Apple Store dataset.', '\n')
for row in ios_free:
    name = row[1]
    if name in duplicate_names_ios[:1]:
        print(row)
There are 2 duplicates in the Apple Store dataset. 

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
In [5]:
# check for duplicates in Google Play Store dataset
unique_names_android = []
duplicate_names_android = []
for row in android_free:
    name = row[0]
    if name in unique_names_android:
        duplicate_names_android.append(name)
    else: 
        unique_names_android.append(name)
print('There are', len(duplicate_names_android), 'duplicates in the Google Play Store dataset.', '\n')
print('Here are duplicate rows for one application. We can see how they differ from each other.', '\n')
for row in android_free:
    name = row[0]
    if name in duplicate_names_android[:1]:# print only duplicates for one app
        print(row)
There are 1135 duplicates in the Google Play Store dataset. 

Here are duplicate rows for one application. We can see how they differ from each other. 

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
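
The duplicate counts above can also be obtained in a single pass. A minimal sketch using collections.Counter on made-up rows (the function name count_duplicates and the toy rows are illustrative assumptions, not part of the original analysis):

```python
from collections import Counter


def count_duplicates(rows, name_index):
    """Count rows whose name has already appeared earlier in the list."""
    name_counts = Counter(row[name_index] for row in rows)
    # Each name contributes (occurrences - 1) duplicate rows.
    return sum(count - 1 for count in name_counts.values())


# Toy rows in the Google Play layout (name at index 0), not real data.
toy_rows = [
    ['App A', 'TOOLS'],
    ['App B', 'GAME'],
    ['App A', 'TOOLS'],
    ['App A', 'TOOLS'],
]
print(count_duplicates(toy_rows, 0))  # 2
```

This gives the same number as appending every repeated name to a list, but avoids searching a growing list for every row.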

Remove duplicates from datasets

The duplicate rows differ mainly in the number of user ratings or reviews (column 6 in the Apple Store dataset and column 4 in the Google Play Store dataset). The different numbers show that the data was collected at different times.

The next step in my data cleaning process is to keep, for each app, only the row with the maximum number of reviews or ratings and add it to the cleaned dataset.

Finally, I check the number of rows in the cleaned datasets against the expected number (the number of free apps minus the number of duplicates) to confirm that both counting methods agree.

In [6]:
# select only rows with maximum number of ratings in Apple Store dataset
clean_data_ios_dict = {}

for row in ios_free:
    name = row[1]
    n_ratings = float(row[5])
    
    if name not in clean_data_ios_dict or n_ratings > float(clean_data_ios_dict[name][5]):
        clean_data_ios_dict[name] = row
        
# convert dictionary with Apple Store data to list of lists
clean_data_ios = list(clean_data_ios_dict.values())

# check the result
print('Expected length of cleaned Apple Store dataset is', len(ios_free)-len(duplicate_names_ios))
print('Length of final clean Apple Store dataset is:', len(clean_data_ios), '\n')
Expected length of cleaned Apple Store dataset is 4054
Length of final clean Apple Store dataset is: 4054 

In [7]:
# select only rows with maximum number of reviews in Google Play Store dataset
clean_data_android_dict = {}

for row in android_free:
    name = row[0]
    n_reviews = float(row[3])
    
    if name not in clean_data_android_dict or n_reviews > float(clean_data_android_dict[name][3]):
        clean_data_android_dict[name] = row

# convert dictionary with Google Play Store data to list of lists
clean_data_android = list(clean_data_android_dict.values())

# check the result
print('Expected length of cleaned Google Play Store dataset is', len(android_free)-len(duplicate_names_android))
print('Length of final clean Google Play Store dataset is:', len(clean_data_android), '\n')
Expected length of cleaned Google Play Store dataset is 8904
Length of final clean Google Play Store dataset is: 8904 
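
The keep-the-maximum rule used in the two cells above can be checked on a tiny example. This is a sketch on made-up rows; the function name keep_most_reviewed and its index parameters are hypothetical:

```python
def keep_most_reviewed(rows, name_index, count_index):
    """Keep, for each app name, the row with the highest count."""
    best = {}
    for row in rows:
        name = row[name_index]
        count = float(row[count_index])
        if name not in best or count > float(best[name][count_index]):
            best[name] = row
    return list(best.values())


# Toy rows: name, category, rating, number of reviews (not real data).
toy_rows = [
    ['Some Scanner App', 'BUSINESS', '4.2', '80805'],
    ['Some Scanner App', 'BUSINESS', '4.2', '80804'],
    ['Other App', 'TOOLS', '4.0', '12'],
]
cleaned = keep_most_reviewed(toy_rows, 0, 3)
print(len(cleaned))   # 2
print(cleaned[0][3])  # 80805
```

When two rows for the same app have identical counts, the first one seen is kept, which matches the strict `>` comparison.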

Remove non-English apps

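Since my target markets are English-speaking, I'd like to remove applications with non-English names from both datasets. To do that, I write a function is_english() that checks whether the name of an application is English, and apply it to the Apple Store and Google Play Store datasets. A name is treated as English if it contains fewer than four characters outside the ASCII range (code point above 127), so names with an occasional emoji or a symbol such as ™ are kept.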

In [8]:
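# define a function to check the name
def is_english(app):
    number_of_false = 0

    for letter in app:
        # characters with a code point above 127 are outside the ASCII range
        if ord(letter) > 127:
            number_of_false += 1

    # allow up to 3 non-ASCII characters (emoji, ™ and similar symbols)
    return number_of_false < 4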
        
# iterate over Apple Store dataset
apple_store = []

for row in clean_data_ios:
    name = row[1]
    if is_english(name):
        apple_store.append(row)
        
print("Final list of Apple Store apps has {} rows".format(len(apple_store)))

# iterate over Google Play Store
google_play_store = []

for row in clean_data_android:
    name = row[0]
    if is_english(name):
        google_play_store.append(row)
        
print("Final list of Google Play Store apps has {} rows".format(len(google_play_store)))
Final list of Apple Store apps has 3220 rows
Final list of Google Play Store apps has 8863 rows
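
To see how the threshold of three non-ASCII characters behaves, here is a small sketch on a few illustrative names. The helper non_ascii_count is a hypothetical name introduced only for this example, and the names are chosen for illustration:

```python
def non_ascii_count(name):
    """Number of characters outside the ASCII range (code point > 127)."""
    return sum(1 for char in name if ord(char) > 127)


examples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in examples:
    count = non_ascii_count(name)
    verdict = 'kept' if count < 4 else 'dropped'
    print(name, '->', count, 'non-ASCII characters,', verdict)
```

The first three names are kept and the last one is dropped. The rule is only a heuristic: an English name with four or more emoji would be dropped, and a non-English name written in ASCII letters would be kept.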

Data analysis

Identify the most common apps by genre

The goal of the analysis is to find an idea for an application that could be successful in both markets, the Apple Store and the Google Play Store. According to the assignment, the app will first be developed for the Google Play Store and, if it is successful there, rolled out to the Apple Store. My first step is to identify the most common genres for applications. I will use the column prime_genre for the Apple Store (12th position) and the columns Genres (10th position) and Category (2nd position) for the Google Play Store to count the most common genres.

In [9]:
# display once more names of the columns
print('Column names for Apple Store dataset are: ', header_ios, '\n')
print('Column names for Google Play Store are: ', header_android, '\n')

# define function to create a frequency table
def freq_table(dataset, index):
    # raw counts per value; named counts to avoid shadowing the function name
    counts = {}
    total = len(dataset)

    for row in dataset:
        token = row[index]
        if token in counts:
            counts[token] += 1
        else:
            counts[token] = 1
    # calculate percentages
    freq_percentages = {}
    for key in counts:
        percentage = (counts[key]/total)*100
        freq_percentages[key] = round(percentage, 2)
    # print result in descending order
    for key in sorted(freq_percentages, key=freq_percentages.get, reverse=True):
        print(key,':', freq_percentages[key])

    return freq_percentages
Column names for Apple Store dataset are:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

Column names for Google Play Store are:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

In [10]:
# iterate over Apple Store dataset
prime_genres = freq_table(apple_store, 11)
print('The column prime genre includes {} genres total.'.format(len(prime_genres)), '\n')
Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12
The column prime genre includes 23 genres total. 

In [11]:
# iterate over Google Play Store dataset
# review column 'Genres'
genres = freq_table(google_play_store, 9)
print('The column Genres includes {} genres total.'.format(len(genres)), '\n')

# review column 'Category'
category = freq_table(google_play_store, 1)
print('The column Category includes {} genres total.'.format(len(category)))
Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Lifestyle : 3.89
Productivity : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Libraries & Demo : 0.94
Role Playing : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Art & Design : 0.6
Beauty : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Entertainment;Music & Video : 0.17
Puzzle;Brain Games : 0.17
Racing;Action & Adventure : 0.17
Casual;Action & Adventure : 0.14
Casual;Brain Games : 0.14
Arcade;Action & Adventure : 0.12
Action;Action & Adventure : 0.1
Educational;Pretend Play : 0.09
Entertainment;Brain Games : 0.08
Simulation;Action & Adventure : 0.08
Board;Brain Games : 0.08
Parenting;Education : 0.08
Art & Design;Creativity : 0.07
Casual;Creativity : 0.07
Educational;Brain Games : 0.07
Parenting;Music & Video : 0.07
Education;Pretend Play : 0.06
Education;Creativity : 0.05
Role Playing;Pretend Play : 0.05
Education;Music & Video : 0.03
Education;Action & Adventure : 0.03
Education;Brain Games : 0.03
Entertainment;Creativity : 0.03
Adventure;Action & Adventure : 0.03
Educational;Creativity : 0.03
Role Playing;Action & Adventure : 0.03
Educational;Action & Adventure : 0.03
Entertainment;Action & Adventure : 0.03
Puzzle;Action & Adventure : 0.03
Casual;Education : 0.02
Music;Music & Video : 0.02
Simulation;Pretend Play : 0.02
Puzzle;Creativity : 0.02
Sports;Action & Adventure : 0.02
Board;Action & Adventure : 0.02
Entertainment;Pretend Play : 0.02
Video Players & Editors;Music & Video : 0.02
Art & Design;Pretend Play : 0.01
Art & Design;Action & Adventure : 0.01
Comics;Creativity : 0.01
Lifestyle;Pretend Play : 0.01
Entertainment;Education : 0.01
Arcade;Pretend Play : 0.01
Strategy;Action & Adventure : 0.01
Music & Audio;Music & Video : 0.01
Health & Fitness;Education : 0.01
Adventure;Education : 0.01
Casual;Music & Video : 0.01
Video Players & Editors;Creativity : 0.01
Travel & Local;Action & Adventure : 0.01
Tools;Education : 0.01
Parenting;Brain Games : 0.01
Health & Fitness;Action & Adventure : 0.01
Trivia;Education : 0.01
Lifestyle;Education : 0.01
Card;Action & Adventure : 0.01
Books & Reference;Education : 0.01
Simulation;Education : 0.01
Puzzle;Education : 0.01
Role Playing;Brain Games : 0.01
Strategy;Education : 0.01
Racing;Pretend Play : 0.01
Communication;Creativity : 0.01
Strategy;Creativity : 0.01
The column Genres includes 114 genres total. 

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6
The column Category includes 33 genres total.

What observations can we make based on most common genres?

Apple Store dataset

  • Data from the Apple Store shows less variety in genres (23 in total).
  • The most common genre in the Apple Store is Games (more than half of all apps belong to it).
  • Four of the top five genres are oriented toward entertainment rather than practical purposes.

Google Play Store

  • Data from the Google Play Store is harder to analyze because the column Genres can hold several genres in one value (separated by ';'), so the large number of distinct values (114 in total) makes the data noisy. It could be used for a finer-grained analysis only after parsing.
  • The top 5 genres, as well as the categories, show that practically oriented apps are more numerous.
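
The parsing mentioned for the Genres column could look roughly like the sketch below. It assumes that ';' is the only separator and that each app should count once per genre; the function name count_single_genres and the sample list are illustrative, not part of the original analysis:

```python
from collections import Counter


def count_single_genres(genre_strings):
    """Count individual genres, splitting combined values on ';'."""
    counts = Counter()
    for value in genre_strings:
        # A set keeps values like 'Education;Education' from counting twice.
        for genre in {part.strip() for part in value.split(';')}:
            counts[genre] += 1
    return counts


# Values in the style of the Genres column, used here as toy input.
sample = [
    'Art & Design',
    'Art & Design;Pretend Play',
    'Education;Education',
    'Casual;Pretend Play',
]
for genre, count in sorted(count_single_genres(sample).items()):
    print(genre, ':', count)
```

With this counting an app can contribute to more than one genre, so the percentages would no longer sum to 100.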

Comparison

  • The most numerous apps in the Apple Store are entertainment-oriented, whereas in the Google Play Store apps for practical use prevail.
  • A large number of apps in a given genre still doesn't mean that all of them are commercially successful. Ideally I should also explore the number of users for every app as well as the user retention rate (that would require historical data, so it goes beyond the scope of the current project).

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play Store dataset, I can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, I'll take the total number of user ratings as a proxy, which can be found in the rating_count_tot column.

In [12]:
# collect the number of user ratings of every app, grouped by genre, for Apple Store dataset
ratings_dict = {}

for row in apple_store:
    genre = row[11]
    rating = float(row[5])
    ratings_dict.setdefault(genre, []).append(rating)
    
# calculate average

for genre, rating in ratings_dict.items():
    avg_rating = round(sum(rating)/len(rating))
    ratings_dict[genre] = avg_rating
    
# sort and print resulting dictionary

for genre in sorted(ratings_dict, key=ratings_dict.get, reverse=True):
    print(genre, ratings_dict[genre])   
Navigation 86090
Reference 74942
Social Networking 71548
Music 57327
Weather 52280
Book 39758
Food & Drink 33334
Finance 31468
Photo & Video 28442
Travel 28244
Shopping 26920
Health & Fitness 23298
Sports 23009
Games 22813
News 21248
Productivity 21028
Utilities 18684
Lifestyle 16486
Entertainment 14030
Business 7491
Education 7004
Catalogs 4004
Medical 612

To use the column Installs from the Google Play Store dataset, I first need to explore how the data is formatted and represented. After printing a few values from the column, I see that:

  • the values are strings
  • they contain '+' signs and ',' delimiters
  • they represent ranges of values (lower bounds), not exact numbers.

I will clean the values and convert them to integers, and use these numbers as a rough approximation of the real number of installs.

In [13]:
# check the format of the column Installs
for row in google_play_store[:5]:
    print(row[5])
# convert the column Installs from Google Play Store dataset to integers
for row in google_play_store:
    row[5] = int(row[5].replace('+', '').replace(',', ''))
10,000+
500,000+
5,000,000+
50,000,000+
100,000+
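
The conversion above changes the rows in place, so running the cell a second time would fail because the values are no longer strings. A minimal sketch of a conversion that is safe to repeat (parse_installs is a hypothetical helper name, not part of the original notebook):

```python
def parse_installs(value):
    """Convert an Installs value such as '10,000+' to an integer.

    Values that are already integers are returned unchanged, so applying
    the function twice is harmless.
    """
    if isinstance(value, int):
        return value
    return int(value.replace('+', '').replace(',', ''))


print(parse_installs('5,000,000+'))                  # 5000000
print(parse_installs(parse_installs('5,000,000+')))  # 5000000
```

The result is still only the lower bound of the range: '5,000,000+' means at least five million installs.
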
In [14]:
# collect install counts of every app, grouped by category, for Google Play Store
installs_dict = {}

for row in google_play_store:
    category = row[1]
    installs = row[5]
    installs_dict.setdefault(category, []).append(installs)
    
# calculate average

for category, installs in installs_dict.items():
    avg_installs = round(sum(installs)/len(installs))
    installs_dict[category] = avg_installs
    
# sort and print resulting dictionary

for category in sorted(installs_dict, key=installs_dict.get, reverse=True):
    print(category, installs_dict[category])
COMMUNICATION 38456119
VIDEO_PLAYERS 24727872
SOCIAL 23253652
PHOTOGRAPHY 17840110
PRODUCTIVITY 16787331
GAME 15588016
TRAVEL_AND_LOCAL 13984078
ENTERTAINMENT 11640706
TOOLS 10801391
NEWS_AND_MAGAZINES 9549178
BOOKS_AND_REFERENCE 8767812
SHOPPING 7036877
PERSONALIZATION 5201483
WEATHER 5074486
HEALTH_AND_FITNESS 4188822
MAPS_AND_NAVIGATION 4056942
FAMILY 3697848
SPORTS 3638640
ART_AND_DESIGN 1986335
FOOD_AND_DRINK 1924898
EDUCATION 1833495
BUSINESS 1712290
LIFESTYLE 1437816
FINANCE 1387692
HOUSE_AND_HOME 1331541
DATING 854029
COMICS 817657
AUTO_AND_VEHICLES 647318
LIBRARIES_AND_DEMO 638504
PARENTING 542604
BEAUTY 513152
EVENTS 253542
MEDICAL 120551

For the Apple Store dataset, the top-3 most popular genres are Navigation, Reference and Social Networking. For the Google Play Store, the top-3 categories are Communication, Video Players and Social. But there is a high probability that these averages are driven by a few leaders in each category, e.g. Facebook with millions of users in the Social category.
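
One hedged way to see how strongly a few leaders can pull up a category average is to compare the mean with the median. The sketch below uses made-up install counts for a hypothetical category, not values from the dataset:

```python
from statistics import mean, median

# Made-up install counts: two giants and a long tail of small apps.
toy_installs = [1_000_000_000, 500_000_000, 100_000, 50_000, 10_000, 5_000, 1_000]

print('mean:  ', round(mean(toy_installs)))
print('median:', median(toy_installs))
```

Here the mean is in the hundreds of millions while the median is 50,000, so the mean alone says little about what a typical app in such a category can expect.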

To check that, I'll first list the apps with more than 100 million installs in each of the top-3 categories of the Google Play Store dataset.

In [15]:
# check apps with more than 100 million installs at Google Play Store in top-3 categories
# communication
print('The most popular apps in communication category are:', '\n')

for row in google_play_store:
    if row[1] == 'COMMUNICATION' and row[5] > 100000000:
        print(row[0], ':', row[5])
print('\n')
# video players
print('The most popular apps in video players category are:', '\n')
for row in google_play_store:
    if row[1] == 'VIDEO_PLAYERS' and row[5] > 100000000:
        print(row[0], ':', row[5])
print('\n')
# social
print('The most popular apps in social category are:', '\n')
for row in google_play_store:
    if row[1] == 'SOCIAL' and row[5] > 100000000:
        print(row[0], ':', row[5])
The most popular apps in communication category are: 

Messenger – Text and Video Chat for Free : 1000000000
WhatsApp Messenger : 1000000000
Google Chrome: Fast & Secure : 1000000000
Gmail : 1000000000
Hangouts : 1000000000
Viber Messenger : 500000000
imo free video calls and chat : 500000000
Google Duo - High Quality Video Calls : 500000000
UC Browser - Fast Download Private & Secure : 500000000
Skype - free IM & video calls : 1000000000
LINE: Free Calls & Messages : 500000000


The most popular apps in video players category are: 

YouTube : 1000000000
Google Play Movies & TV : 1000000000
MX Player : 500000000


The most popular apps in social category are: 

Facebook : 1000000000
Instagram : 1000000000
Facebook Lite : 500000000
Snapchat : 500000000
Google+ : 1000000000

Indeed, the top-3 categories are dominated by a few apps attracting hundreds of millions of users. If we would like to compete with them successfully, the only way is to offer a fundamentally new approach or functionality, which is certainly not an easy task. A better approach is probably to choose categories from the middle of the list. Candidates from the top 10 include Travel_and_local and Game.

In [17]:
# calculate the mean of the per-category average installs for Google Play Store
total = 0
for category in installs_dict:
    total += int(installs_dict[category])
mean_installs = total/len(installs_dict)
print(round(mean_installs))
7281600

The mean of the category averages is about 7,281,600. Two categories in our list have an average number of installs in that range: Books_and_reference (8,767,812) and Shopping (7,036,877). Next I will explore them.

In [18]:
# sort apps by installs; a separate name avoids shadowing the built-in sorted()
sorted_apps = sorted(google_play_store, key=lambda x: x[5], reverse=True)
# books_and_reference
print('The most popular apps in Books_and_reference category are:', '\n')

for row in sorted_apps:
    if row[1] == 'BOOKS_AND_REFERENCE' and row[5] > 10000000:
        print(row[0], ':', row[5])
print('\n')
# shopping
print('The most popular apps in Shopping category are:', '\n')
for row in sorted_apps:
    if row[1] == 'SHOPPING' and row[5] > 10000000:
        print(row[0], ':', row[5])
The most popular apps in Books_and_reference category are: 

Google Play Books : 1000000000
Wattpad 📖 Free Books : 100000000
Amazon Kindle : 100000000
Bible : 100000000
Audiobooks from Audible : 100000000


The most popular apps in Shopping category are: 

Wish - Shopping Made Fun : 100000000
AliExpress - Smarter Shopping, Better Living : 100000000
eBay: Buy & Sell this Summer - Discover Deals Now! : 100000000
Amazon Shopping : 100000000
Flipkart Online Shopping App : 100000000
letgo: Buy & Sell Used Stuff, Cars & Real Estate : 50000000
Lazada - Online Shopping & Deals : 50000000
OLX - Buy and Sell : 50000000
The birth : 50000000
Mercado Libre: Find your favorite brands : 50000000
Myntra Online Shopping App : 50000000
Groupon - Shop Deals, Discounts & Coupons : 50000000

The category Books_and_reference looks more interesting, mostly because, apart from a few leaders like Google Play Books or Amazon Kindle that sell all sorts of books, it contains an app with free books (Wattpad) as well as a religious text (Bible). I'd like to check which apps are in the middle range in terms of total number of installs.

In [20]:
# check middle range
print('Apps of the middle range in Books_and_reference category:', '\n')

for row in sorted_apps:
    if row[1] == 'BOOKS_AND_REFERENCE' and 4000000 < row[5] < 10000000:
        print(row[0], ':', row[5])
print('\n')
Apps of the middle range in Books_and_reference category: 

AlReader -any text book reader : 5000000
Ebook Reader : 5000000
Read books online : 5000000
Ancestry : 5000000
Dictionary - WordWeb : 5000000
50000 Free eBooks & Free AudioBooks : 5000000
Al Quran : EAlim - Translations & MP3 Offline : 5000000
Bible KJV : 5000000
English to Hindi Dictionary : 5000000


Apps in the middle range also include religious texts, not only large book stores or dictionaries. The next step is to find out whether the same genres are popular among Apple Store users. Let's check the top-3 genres in the Apple Store dataset.

In [21]:
# list apps and their number of ratings in the top-3 Apple Store genres
# navigation
print('There are few apps at Navigation genre:')
for row in apple_store:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])
print('\n')
# reference
print('There are more apps at Reference genre dominated by dictionaries and religious texts:')
for row in apple_store:
    if row[11] == 'Reference':
        print(row[1], ':', row[5])
print('\n')
# social networking
print('There are plenty apps at Social Networking genre but dominated by few of them:')
for row in apple_store:
    if row[11] == 'Social Networking':
        print(row[1], ':', row[5])
There are few apps at Navigation genre:
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


There are more apps at Reference genre dominated by dictionaries and religious texts:
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


There are plenty apps at Social Networking genre but dominated by few of them:
Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 23965
SimSimi : 23530
Grindr - Gay and same sex guys chat, meet and date : 23201
Wishbone - Compare Anything : 20649
imo video calls and chat : 18841
After School - Funny Anonymous School News : 18482
Quick Reposter - Repost, Regram and Reshare Photos : 17694
Weibo HD : 16772
Repost for Instagram : 15185
Live.me – Live Video Chat & Make Friends Nearby : 14724
Nextdoor : 14402
Followers Analytics for Instagram - InstaReport : 13914
YouNow: Live Stream Video Chat : 12079
FollowMeter for Instagram - Followers Tracking : 11976
LINE : 11437
eHarmony™ Dating App - Meet Singles : 11124
Discord - Chat for Gamers : 9152
QQ : 9109
Telegram Messenger : 7573
Weibo : 7265
Periscope - Live Video Streaming Around the World : 6062
Chat for Whatsapp - iPad Version : 5060
QQ HD : 5058
Followers Analysis Tool For Instagram App Free : 4253
live.ly - live video streaming : 4145
Houseparty - Group Video Chat : 3991
SOMA Messenger : 3232
Monkey : 3060
Down To Lunch : 2535
Flinch - Video Chat Staring Contest : 2134
Highrise - Your Avatar Community : 2011
LOVOO - Dating Chat : 1985
PlayStation®Messages : 1918
BOO! - Video chat camera with filters & stickers : 1805
Qzone : 1649
Chatous - Chat with new people : 1609
Kiwi - Q&A : 1538
GhostCodes - a discovery app for Snapchat : 1313
Jodel : 1193
FireChat : 1037
Google Duo - simple video calling : 1033
Fiesta by Tango - Chat & Meet New People : 885
Google Allo — smart messaging : 862
Peach — share vividly : 727
Hey! VINA - Where Women Meet New Friends : 719
Battlefield™ Companion : 689
All Devices for WhatsApp - Messenger for iPad : 682
Chat for Pokemon Go - GoChat : 500
IAmNaughty – Dating App to Meet New People Online : 463
Qzone HD : 458
Zenly - Locate your friends in realtime : 427
League of Legends Friends : 420
豆瓣 : 407
Candid - Speak Your Mind Freely : 398
知乎 : 397
Selfeo : 366
Fake-A-Location Free ™ : 354
Popcorn Buzz - Free Group Calls : 281
Fam — Group video calling for iMessage : 279
QQ International : 274
Ameba : 269
SoundCloud Pulse: for creators : 240
Tantan : 235
Cougar Dating & Life Style App for Mature Women : 213
Rawr Messenger - Dab your chat : 180
WhenToPost: Best Time to Post Photos for Instagram : 158
Inke—Broadcast an amazing life : 147
Mustknow - anonymous video Q&A : 53
CTFxCmoji : 39
Lobi : 36
Chain: Collaborate On MyVideo Story/Group Video : 35
botman - Real time video chat : 7
BestieBox : 0
MATCH ON LINE chat : 0
niconico ch : 0
LINE BLOG : 0
bit-tube - Live Stream Video Chat : 0

After exploring the Apple Store dataset, I see that the Reference genre is quite diverse: it consists mostly of dictionaries and religious texts and, unlike Social Networking, is not dominated by just a few leaders. So it could be an option to build an app in the same genre.

Final recommendation

In this project, I analyzed datasets from the Apple Store and the Google Play Store with the goal of recommending an app genre that can be profitable in both markets. My final recommendation is to build an application based around a niche but commonly used text. A perfect candidate would be some sort of religious or quasi-religious cult text (Marie Kondo, Harry Potter or the Satanic Bible?). The user experience can be enhanced with audio versions or even gamification. Of course, this recommendation should be backed by further study.