Profitable App Profiles for the App Store and Google Play Markets

As data scientists at a company that builds apps for the App Store and Google Play markets, our goal is to help developers make data-driven decisions about new app ideas.

In this project our purpose is to build a profitable app that will be free to download and install. Our main revenue comes from in-app ads, so we aim to reach more users and want them to spend more time in our app. That way, more users will see and engage with the ads.

To find out what kinds of apps might interest users, we need to analyze data that helps our developers understand which types of apps attract more users.

For this project we're going to study App Store data and Google Play data in order to find the best direction to follow when building a new app. We want to make our app available on both markets.

We will focus on free apps aimed at English-speaking users.

In this task we're going to use a sample of data that is already available online.

Fortunately, data is available for both platforms:

  • A data set with approximately 10,000 Android apps from Google Play. This data set was collected in August 2018. It can be found here.
  • A data set with approximately 7,000 iOS apps from the App Store. This data set was collected in July 2017. It can be found here.

In the code below we:

  1. Open our working data set and check for the presence of a header.
  2. Transform the data set into a list of lists, list_data.

After these steps we want to find out what columns can be potentially interesting for our data analysis.

In [1]:
from csv import reader

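def open_file(dataset, has_header=True):
    # The with block closes the file once all rows have been read.
    with open(dataset, encoding='utf8') as opened_file:
        list_data = list(reader(opened_file))
    if has_header:
        # Split the header row from the data rows.
        return list_data[0], list_data[1:]
    else:
        # No header: every row is data, so nothing is dropped.
        return list_data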
    
ios_header, ios = open_file('AppleStore.csv', has_header=True)
android_header, android = open_file('googleplaystore.csv', has_header=True)
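
The loading step can be tried without the real CSV files. The sketch below is a hypothetical variant (the name `read_rows` and the sample rows are made up for illustration) that reads from any file-like object and always returns a `(header, rows)` pair, so callers can unpack the result the same way whether or not a header is present.

```python
from csv import reader
from io import StringIO

def read_rows(file_obj, has_header=True):
    """Return (header, rows) from an open CSV file-like object.

    If has_header is False, header is None and every line is treated as data.
    """
    rows = list(reader(file_obj))
    if has_header:
        return rows[0], rows[1:]
    return None, rows

# Tiny in-memory sample standing in for a real CSV file (made-up rows).
sample = StringIO("App,Price\nDemo One,0\nDemo Two,1.99\n")
header, rows = read_rows(sample)
print(header)     # ['App', 'Price']
print(len(rows))  # 2
```

With a real file, the same function could be called on the object returned by `open('AppleStore.csv', encoding='utf8')`.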

The function explore_data lets us explore our datasets from different perspectives:

  1. It prints a given slice of rows.
  2. It optionally prints the number of rows and columns.
In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
In [3]:
print(ios_header)
print('\n')
explore_data(ios, 2, 5, rows_and_columns=True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16

After a quick glance at the header, we can assume that we might be interested in columns such as:

  • price - Price amount
  • rating_count_tot - User rating count (and maybe rating_count_ver)
  • user_rating - Average user rating value (and maybe user_rating_ver)
  • cont_rating - Content rating
  • prime_genre - Primary genre

For more detailed info about columns check here.

In [4]:
print(android_header)
print('\n')
explore_data(android, 2, 5, rows_and_columns=True)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13

After a quick glance at the header, we can assume that we might be interested in columns such as:

  • Category - App Category
  • Type - Free or Paid
  • Price - Price amount
  • Content Rating - Content Rating
  • Reviews - User Rating counts
  • Rating - Average User Rating value
  • Genres - Primary Genre.

For more detailed info about columns check here.

From the previous outputs we can see that the datasets have:

1. iOS apps:
    * 16 columns
    * 7197 rows
2. Android apps:
    * 13 columns
    * 10841 rows

So we might think that there are many more apps on the Google Play market and that we should build our app for that platform. However, the discrepancy might come:

  • from the collection dates: the two datasets were collected about 13 months apart (July 2017 and August 2018)
  • we also need to think about the proportion of iOS users vs. Google Play users.

Cleaning process

Before getting started with our analysis we need to check that our data sets are clean, contain no errors, and that all the data points correspond to our requirements.

Checking row length

The function below checks whether all rows in a data set have the same length. More precisely, it compares the length of each row to the length of the header and returns the number and content of the first row that does not match.

In [5]:
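def length_row(header, dataset):
    header_length = len(header)
    count = 0
    for row in dataset:
        count += 1
        if header_length != len(row):
            # Return the 1-based row number and the first mismatching row.
            # If every row matches, the function implicitly returns None.
            return count, row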
            
ios_error = length_row(ios_header, ios)
android_error_row, android_error_line = length_row(android_header, android)
In [6]:
print(ios_error)
None

As we can see, our iOS data set doesn't have this length problem.

Below we print the number of the faulty row as well as the row itself, our header, and another row for comparison. This helps us see where the problem is.

In [7]:
print(android_error_row)
print(android_error_line)
print()
print(android_header)
print()
print(android[0])
10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

Now we can see that the Google Play dataset has a row with a missing cell in the 'Category' column (index 1). The function reports it as row number 10473, counting from 1, which corresponds to index 10472 in the list. We can delete this row to avoid any problems.

In [8]:
print(len(android))
del android[10472]
print(len(android))
10841
10840
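
Deleting by a fixed index works once, but re-running the cell would delete a different, valid row. A minimal sketch of an alternative that can be run repeatedly, assuming the only problem is rows whose length differs from the header: keep the matching rows and collect the others for inspection. The helper name `split_by_length` and the sample rows are illustrative, not part of the original notebook.

```python
def split_by_length(header, dataset):
    """Split rows into (valid, malformed) by comparing length to the header."""
    expected = len(header)
    valid, malformed = [], []
    for index, row in enumerate(dataset):
        if len(row) == expected:
            valid.append(row)
        else:
            # Keep the 0-based index so the bad row can be inspected later.
            malformed.append((index, row))
    return valid, malformed

# Made-up sample: the second row is missing one cell.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Demo One', 'TOOLS', '4.1'],
    ['Demo Two', '1.9'],
    ['Demo Three', 'GAME', '4.5'],
]
valid, malformed = split_by_length(sample_header, sample_rows)
print(len(valid))   # 2
print(malformed)    # [(1, ['Demo Two', '1.9'])]
```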

Removing Duplicate Entries

Before moving on we need to make sure that our datasets contain no duplicates. We want to keep just one entry per app; otherwise duplicates could distort our conclusions.

Below we write a function that checks whether a dataset contains duplicate apps and how many.

In [9]:
def if_duplicate(dataset, index):

    unique_apps =[]
    duplicate_apps = []

    for app in dataset:
        name = app[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    return unique_apps, duplicate_apps

ios_unique, ios_duplicate = if_duplicate(ios, 1)
android_unique, android_duplicate = if_duplicate(android, 0)

print('Number of unique iOS apps is ', len(ios_unique))
print('Number of duplicate iOS apps is ', len(ios_duplicate))
print('\n')
print('Number of unique Android apps is ', len(android_unique))
print('Number of duplicate Android apps is ', len(android_duplicate))
Number of unique iOS apps is  7195
Number of duplicate iOS apps is  2


Number of unique Android apps is  9659
Number of duplicate Android apps is  1181

These duplicates should be removed, but we want to choose deliberately which entries to drop. For this reason we're going to inspect the rows of duplicated apps in order to find a good criterion for deleting or keeping them.

First of all we are going to find the most frequent duplicates. More occurrences give us a better view of how the values differ.

For this we create a frequency table with the app name as key and its frequency as value. We then check which apps appear most often in the Google Play data set, printing those that occur at least 7 times.

In [10]:
duplicates_freq = {}

for app in android:
    name = app[0]
    if name not in duplicates_freq:
        duplicates_freq[name] = 1
    else:
        duplicates_freq[name] += 1

for app in duplicates_freq:
    if duplicates_freq[app] >= 7:
        print(app)
Duolingo: Learn Languages Free
ROBLOX
Candy Crush Saga
8 Ball Pool
ESPN
CBS Sports App - Scores, News, Stats & Watch Live
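
The frequency table above can also be built with `collections.Counter` from the standard library. The following is a minimal sketch on made-up rows; the helper name `name_counts` is hypothetical.

```python
from collections import Counter

def name_counts(dataset, index):
    """Count how many rows share each value found at the given column index."""
    return Counter(row[index] for row in dataset)

# Made-up rows: [app name, review count]
sample_rows = [
    ['Demo One', '10'],
    ['Demo Two', '5'],
    ['Demo One', '12'],
    ['Demo One', '12'],
]
counts = name_counts(sample_rows, 0)

# Names that occur more than once, with their number of occurrences.
repeated = {name: n for name, n in counts.items() if n > 1}
print(repeated)               # {'Demo One': 3}
print(counts.most_common(1))  # [('Demo One', 3)]
```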

In the code below we print all rows for the app named Duolingo: Learn Languages Free.

In [11]:
print(android_header)
for app in android:
    name = app[0]
    if name == 'Duolingo: Learn Languages Free':
        print('\n')
        print(app)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Duolingo: Learn Languages Free', 'EDUCATION', '4.7', '6289924', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 1, 2018', 'Varies with device', 'Varies with device']


['Duolingo: Learn Languages Free', 'EDUCATION', '4.7', '6290507', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 1, 2018', 'Varies with device', 'Varies with device']


['Duolingo: Learn Languages Free', 'EDUCATION', '4.7', '6290507', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 1, 2018', 'Varies with device', 'Varies with device']


['Duolingo: Learn Languages Free', 'EDUCATION', '4.7', '6290507', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 1, 2018', 'Varies with device', 'Varies with device']


['Duolingo: Learn Languages Free', 'FAMILY', '4.7', '6294400', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 1, 2018', 'Varies with device', 'Varies with device']


['Duolingo: Learn Languages Free', 'FAMILY', '4.7', '6294397', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 1, 2018', 'Varies with device', 'Varies with device']


['Duolingo: Learn Languages Free', 'FAMILY', '4.7', '6297590', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Education;Education', 'August 6, 2018', 'Varies with device', 'Varies with device']

We have 7 occurrences with a few small but important differences:

  • The app belongs to 2 different categories, 'EDUCATION' and 'FAMILY', even though the genre is the same for all occurrences ('Education;Education')
  • The number of reviews differs
  • The date differs for only one occurrence

This explains why we cannot delete duplicates at random. In this case we're dealing with the same app, but the rows may reflect different versions, or it may just be a data scraping issue (the data was collected at different times).

That's why it makes more sense to keep the entry with the highest number of reviews, which is probably the most recent one.

In [12]:
print(ios_duplicate)
['Mannequin Challenge', 'VR Roller Coaster']

iOS app dataset has only two duplicates.

In [13]:
for app in ios:
    name = app[1]
    if name == 'VR Roller Coaster':
        print('\n')
        print(app)
        
print('\n')
print(ios_header)
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']


['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

The main difference between the duplicate rows lies in the rating count (rating_count_tot) and the rating count for the current version (rating_count_ver). For this reason we are going to use the rating count as the criterion for removing duplicates from the iOS dataset as well.

To do so we will:

  • Create a dictionary with the app name as key and the highest review count as value.
  • Use the dictionary to create a new dataset with only one entry per app.
In [14]:
def highest_review_app(dataset, ind_app, ind_review):

    max_reviews = {}

    for row in dataset:
        app = row[ind_app]
        review = float(row[ind_review])
        
        if app in max_reviews and max_reviews[app] < review:
            max_reviews[app] = review
        elif app not in max_reviews:
            max_reviews[app] = review
            
    return max_reviews
In [15]:
ios_max_review = highest_review_app(ios, 1, 5)
android_max_review = highest_review_app(android, 0, 3)
In [16]:
print('Expected Google Market dataset length : ', len(android) - len(android_duplicate))
print('Actual Google Market dataset length : ', len(android_max_review))
print('\n')
print('Expected Apple Store dataset length : ', len(ios) - len(ios_duplicate))
print('Actual Apple Store dataset length : ', len(ios_max_review))
Expected Google Market dataset length :  9659
Actual Google Market dataset length :  9659


Expected Apple Store dataset length :  7195
Actual Apple Store dataset length :  7195

The function below:

  • filters the original dataset using the dictionary created previously (app name to highest review count)
  • keeps only the rows whose review count matches the value in the dictionary
  • keeps track of the apps that have already been added

We need the last step because several rows may share the same name and the same number of reviews, so we double-check against an already_added list. Without this condition we could still end up with duplicates.

In [17]:
def cleaning(dataset, ind_app, ind_review):

    apps_clean = []
    already_added = []
    max_review = highest_review_app(dataset, ind_app, ind_review)

    for app in dataset:
        name = app[ind_app]
        n_review = float(app[ind_review])

        if (name not in already_added) and (n_review == max_review[name]):
            apps_clean.append(app)
            already_added.append(name)
            
    return apps_clean
In [18]:
android_clean = cleaning(android, 0, 3)
explore_data(android_clean, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
In [19]:
ios_clean = cleaning(ios, 1, 5)
explore_data(ios_clean, 0, 3, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7195
Number of columns: 16
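
The cleaning function checks `name not in already_added` against a list, which is rescanned for every row. A possible single-pass variant is sketched below, assuming the review column can be parsed with `float()`. It stores the best row per name in one dictionary. The name `dedupe_by_reviews` and the sample rows are hypothetical, and the resulting row order may differ from that of `cleaning`, since rows come out in the order each name was first seen.

```python
def dedupe_by_reviews(dataset, ind_app, ind_review):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[ind_app]
        reviews = float(row[ind_review])
        # Strict '>' means an earlier row wins a tie.
        if name not in best or reviews > float(best[name][ind_review]):
            best[name] = row
    # Dictionaries preserve insertion order (Python 3.7+).
    return list(best.values())

# Made-up rows: [app name, review count, marker]
sample_rows = [
    ['Demo One', '10', 'a'],
    ['Demo Two', '5', 'b'],
    ['Demo One', '12', 'c'],
    ['Demo One', '12', 'd'],
]
deduped = dedupe_by_reviews(sample_rows, 0, 1)
print(deduped)  # [['Demo One', '12', 'c'], ['Demo Two', '5', 'b']]
```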

Removing Non-English Apps

A quick reminder: our goal is to create an app for English-speaking users. This is why our second step is to remove non-English apps from the dataset. Unfortunately, we cannot simply delete every app whose name contains non-ASCII characters.

Some app names contain non-ASCII characters but still belong to English-language apps. For example, a name may contain emojis or other symbols whose code points fall outside the ASCII range (0 to 127).

In [20]:
print(ord('™'))
print(ord('😜'))
8482
128540

In our case we allow an app name to contain at most 2 non-ASCII characters; a name with 3 or more is treated as non-English. To handle very short names, we also treat a name as non-English if all of its characters are non-ASCII.

In [23]:
def is_english(string):
    non_ascii = 0
    
    for ch in string:
        if ord(ch) > 127:
            non_ascii += 1
    
    if non_ascii == len(string) or non_ascii >= 3:
        return False
    else:
        return True


print(is_english('Instachat'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
True
True
False
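
The same rule can be written more compactly with the threshold exposed as a parameter, which makes it easier to experiment with stricter or looser filters. This is only a sketch: the name `looks_english` and the parameter `max_non_ascii` are hypothetical, and the default of 2 is chosen to mirror the `non_ascii >= 3` test above.

```python
def looks_english(name, max_non_ascii=2):
    """Return True if the name has at most max_non_ascii non-ASCII characters
    and is not made up entirely of non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    if non_ascii > max_non_ascii:
        return False
    # A name consisting only of non-ASCII characters is rejected,
    # even if it is shorter than the threshold.
    return non_ascii != len(name)

print(looks_english('Instachat'))                      # True
print(looks_english('Docs To Go™ Free Office Suite'))  # True
print(looks_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(looks_english('™'))                              # False
```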

Now, using this function, we can go through our datasets and filter out non-English apps (by our definition), keeping only the English ones.

In [24]:
android_en = []
ios_en = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_en.append(app)

for app in ios_clean:
    name = app[1]
    if is_english(name):
        ios_en.append(app)
        
In [25]:
explore_data(android_en, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9597
Number of columns: 13
In [26]:
explore_data(ios_en, 0, 3, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6147
Number of columns: 16

Isolating Free Apps

As mentioned above, we're planning to build only free apps, which is why we want to keep only free apps for our decision-making process.

In the code below we are creating new lists that will contain only free apps.

In [27]:
android_final = []
ios_final = []

for app in android_en:
    price = app[7]
    if price == '0':
        android_final.append(app)
    
for app in ios_en:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))
8848
3196

We're left with 8848 Android apps and 3196 iOS apps.
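
The comparisons above rely on the exact strings '0' and '0.0'. A more tolerant check is sketched below under the assumption that some prices may carry a leading '$' or be written with a different number of decimals. The helper name `is_free` is hypothetical, and the sample values are made up for illustration.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values are treated as not free rather than guessed at.
        return False

for value in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(value, is_free(value))
```

Filtering would then read `[app for app in android_en if is_free(app[7])]`, using the same column index as the loop above.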

Most Common Apps by Genre

After the cleaning process we can study our data in order to find the category that is likely to be the most profitable to develop. Our low-risk strategy is the following:

  1. Develop an app for the Google Play market.
  2. If the app gets good feedback, develop it further.
  3. If it is still profitable after 6 months on the market, build an iOS version.

We will study both datasets because in the future we're planning to engage with both markets and we need to find the most successful niche.

Our next step will be discovering the most popular genres within each dataset. In the code below we are going to create a frequency table for each genre.

In [32]:
def freq_table(dataset, index):
    table ={}
    count = 0
    #creating a frequency table
    for row in dataset:
        elem = row[index]
        count += 1
        if elem in table:
            table[elem] += 1
        else:
            table[elem] = 1
    
    #calculating percentage of each app category
    for elem in table:
        percentage = (table[elem] / count) * 100
        table[elem] = percentage
        
    return table

print(freq_table(ios_final, -5))
{'Social Networking': 3.254067584480601, 'Photo & Video': 5.006257822277847, 'Games': 58.260325406758454, 'Music': 2.065081351689612, 'Reference': 0.5319148936170213, 'Health & Fitness': 2.033792240300375, 'Weather': 0.8760951188986232, 'Utilities': 2.471839799749687, 'Travel': 1.2202753441802252, 'Shopping': 2.5969962453066335, 'News': 1.3454317897371715, 'Navigation': 0.18773466833541927, 'Lifestyle': 1.5644555694618274, 'Entertainment': 7.853566958698373, 'Food & Drink': 0.8135168961201502, 'Sports': 2.1589486858573217, 'Book': 0.37546933667083854, 'Finance': 1.0951188986232792, 'Education': 3.6921151439299122, 'Productivity': 1.7521902377972465, 'Business': 0.5319148936170213, 'Catalogs': 0.1251564455569462, 'Medical': 0.18773466833541927}

These results are not very readable, so we are going to sort our genres/categories by percentage in descending order.

To do so we're going to:

  1. Convert each key-value pair into a tuple with the positions switched (value, key). This lets us sort the categories by percentage.
  2. Sort the tuples in descending order.
In [33]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        val_to_append = (table[key], key)
        table_display.append(val_to_append)
        
    table_sort = sorted(table_display, reverse = True)
        
    for elem in table_sort:
        print(elem[1], ' : ', elem[0])
In [34]:
print(display_table(ios_final, -5)) #exploring prime_genre
Games  :  58.260325406758454
Entertainment  :  7.853566958698373
Photo & Video  :  5.006257822277847
Education  :  3.6921151439299122
Social Networking  :  3.254067584480601
Shopping  :  2.5969962453066335
Utilities  :  2.471839799749687
Sports  :  2.1589486858573217
Music  :  2.065081351689612
Health & Fitness  :  2.033792240300375
Productivity  :  1.7521902377972465
Lifestyle  :  1.5644555694618274
News  :  1.3454317897371715
Travel  :  1.2202753441802252
Finance  :  1.0951188986232792
Weather  :  0.8760951188986232
Food & Drink  :  0.8135168961201502
Reference  :  0.5319148936170213
Business  :  0.5319148936170213
Book  :  0.37546933667083854
Navigation  :  0.18773466833541927
Medical  :  0.18773466833541927
Catalogs  :  0.1251564455569462
None
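
The trailing None in the output above appears because display_table prints its rows and returns nothing, while the call itself is wrapped in print. One way around this, sketched here with a hypothetical helper named `sorted_table` and made-up percentages, is to return the sorted pairs and leave the printing to the caller. It sorts the dictionary items directly by value, so the tuples do not need to be switched first.

```python
def sorted_table(table):
    """Return (key, value) pairs from a frequency table, largest value first."""
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)

# Made-up percentages for illustration only.
sample = {'Music': 2.0, 'Games': 58.0, 'Education': 4.0}
for genre, percentage in sorted_table(sample):
    print(f'{genre} : {percentage:.2f}')
```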

Observations on the iOS dataset:

- Games is by far the largest category, with more than half of the apps (58.26%)
- Entertainment takes second place with almost 8%
- Photo & Video reaches 5%
- Education reaches 3.69%

This shows that most apps in our dataset are designed for fun rather than for practical purposes (education, productivity, utilities, weather, business, etc.).

In this dataset app categories are not distributed equally.

Nevertheless, this information alone does not let us assume that these apps have the greatest number of users. Demand might not match supply.

In [35]:
print(display_table(android_final, -4))#exploring Genre
Tools  :  8.44258589511754
Entertainment  :  6.080470162748644
Education  :  5.357142857142857
Business  :  4.599909584086799
Productivity  :  3.899186256781193
Lifestyle  :  3.8765822784810124
Finance  :  3.7070524412296564
Medical  :  3.5375226039783
Sports  :  3.4584086799276674
Personalization  :  3.322784810126582
Communication  :  3.2323688969258586
Action  :  3.096745027124774
Health & Fitness  :  3.0854430379746836
Photography  :  2.949819168173599
News & Magazines  :  2.802893309222423
Social  :  2.667269439421338
Travel & Local  :  2.328209764918626
Shopping  :  2.2490958408679926
Books & Reference  :  2.1360759493670884
Simulation  :  2.0456600361663653
Dating  :  1.8648282097649187
Arcade  :  1.842224231464738
Video Players & Editors  :  1.7744122965641953
Casual  :  1.763110307414105
Maps & Navigation  :  1.3901446654611211
Food & Drink  :  1.2432188065099457
Puzzle  :  1.1301989150090417
Racing  :  0.9945750452079566
Role Playing  :  0.9380650994575045
Libraries & Demo  :  0.9380650994575045
Auto & Vehicles  :  0.9267631103074141
Strategy  :  0.9154611211573236
House & Home  :  0.8024412296564195
Weather  :  0.7911392405063291
Events  :  0.7120253164556962
Adventure  :  0.6668173598553345
Comics  :  0.599005424954792
Beauty  :  0.599005424954792
Art & Design  :  0.599005424954792
Parenting  :  0.4972875226039783
Card  :  0.45207956600361665
Trivia  :  0.4181735985533454
Casino  :  0.4181735985533454
Educational;Education  :  0.39556962025316456
Board  :  0.3842676311030741
Educational  :  0.3729656419529837
Education;Education  :  0.33905967450271246
Word  :  0.25994575045207957
Casual;Pretend Play  :  0.23734177215189875
Music  :  0.2034358047016275
Racing;Action & Adventure  :  0.16952983725135623
Puzzle;Brain Games  :  0.16952983725135623
Entertainment;Music & Video  :  0.16952983725135623
Casual;Brain Games  :  0.13562386980108498
Casual;Action & Adventure  :  0.13562386980108498
Arcade;Action & Adventure  :  0.12432188065099457
Action;Action & Adventure  :  0.10171790235081375
Educational;Pretend Play  :  0.09041591320072333
Simulation;Action & Adventure  :  0.07911392405063292
Parenting;Education  :  0.07911392405063292
Entertainment;Brain Games  :  0.07911392405063292
Board;Brain Games  :  0.07911392405063292
Parenting;Music & Video  :  0.06781193490054249
Educational;Brain Games  :  0.06781193490054249
Casual;Creativity  :  0.06781193490054249
Art & Design;Creativity  :  0.06781193490054249
Education;Pretend Play  :  0.05650994575045208
Role Playing;Pretend Play  :  0.045207956600361664
Education;Creativity  :  0.045207956600361664
Role Playing;Action & Adventure  :  0.033905967450271246
Puzzle;Action & Adventure  :  0.033905967450271246
Entertainment;Creativity  :  0.033905967450271246
Entertainment;Action & Adventure  :  0.033905967450271246
Educational;Creativity  :  0.033905967450271246
Educational;Action & Adventure  :  0.033905967450271246
Education;Music & Video  :  0.033905967450271246
Education;Brain Games  :  0.033905967450271246
Education;Action & Adventure  :  0.033905967450271246
Adventure;Action & Adventure  :  0.033905967450271246
Video Players & Editors;Music & Video  :  0.022603978300180832
Sports;Action & Adventure  :  0.022603978300180832
Simulation;Pretend Play  :  0.022603978300180832
Puzzle;Creativity  :  0.022603978300180832
Music;Music & Video  :  0.022603978300180832
Entertainment;Pretend Play  :  0.022603978300180832
Casual;Education  :  0.022603978300180832
Board;Action & Adventure  :  0.022603978300180832
Video Players & Editors;Creativity  :  0.011301989150090416
Trivia;Education  :  0.011301989150090416
Travel & Local;Action & Adventure  :  0.011301989150090416
Tools;Education  :  0.011301989150090416
Strategy;Education  :  0.011301989150090416
Strategy;Creativity  :  0.011301989150090416
Strategy;Action & Adventure  :  0.011301989150090416
Simulation;Education  :  0.011301989150090416
Role Playing;Brain Games  :  0.011301989150090416
Racing;Pretend Play  :  0.011301989150090416
Puzzle;Education  :  0.011301989150090416
Parenting;Brain Games  :  0.011301989150090416
Music & Audio;Music & Video  :  0.011301989150090416
Lifestyle;Pretend Play  :  0.011301989150090416
Lifestyle;Education  :  0.011301989150090416
Health & Fitness;Education  :  0.011301989150090416
Health & Fitness;Action & Adventure  :  0.011301989150090416
Entertainment;Education  :  0.011301989150090416
Communication;Creativity  :  0.011301989150090416
Comics;Creativity  :  0.011301989150090416
Casual;Music & Video  :  0.011301989150090416
Card;Action & Adventure  :  0.011301989150090416
Books & Reference;Education  :  0.011301989150090416
Art & Design;Pretend Play  :  0.011301989150090416
Art & Design;Action & Adventure  :  0.011301989150090416
Arcade;Pretend Play  :  0.011301989150090416
Adventure;Education  :  0.011301989150090416
None

These percentages show that the Google Play market has more practical apps than fun apps.

Among the top genres we have:

  • Tools
  • Education
  • Business
  • Productivity
  • Finance

Other practical apps are more evenly spread across the whole set of genres. This was not the case for iOS apps.

In [36]:
print(display_table(android_final, 1)) #exploring category
FAMILY  :  18.942133815551536
GAME  :  9.697106690777577
TOOLS  :  8.453887884267631
BUSINESS  :  4.599909584086799
PRODUCTIVITY  :  3.899186256781193
LIFESTYLE  :  3.887884267631103
FINANCE  :  3.7070524412296564
MEDICAL  :  3.5375226039783
SPORTS  :  3.390596745027125
PERSONALIZATION  :  3.322784810126582
COMMUNICATION  :  3.2323688969258586
HEALTH_AND_FITNESS  :  3.0854430379746836
PHOTOGRAPHY  :  2.949819168173599
NEWS_AND_MAGAZINES  :  2.802893309222423
SOCIAL  :  2.667269439421338
TRAVEL_AND_LOCAL  :  2.3395117540687163
SHOPPING  :  2.2490958408679926
BOOKS_AND_REFERENCE  :  2.1360759493670884
DATING  :  1.8648282097649187
VIDEO_PLAYERS  :  1.7970162748643763
MAPS_AND_NAVIGATION  :  1.3901446654611211
FOOD_AND_DRINK  :  1.2432188065099457
EDUCATION  :  1.164104882459313
ENTERTAINMENT  :  0.9606690777576853
LIBRARIES_AND_DEMO  :  0.9380650994575045
AUTO_AND_VEHICLES  :  0.9267631103074141
HOUSE_AND_HOME  :  0.8024412296564195
WEATHER  :  0.7911392405063291
EVENTS  :  0.7120253164556962
PARENTING  :  0.6555153707052441
ART_AND_DESIGN  :  0.6442133815551537
COMICS  :  0.6103074141048824
BEAUTY  :  0.599005424954792
None

From this breakdown we might assume that most apps were designed for practical purposes. All the top categories, except GAME with about 10%, consist of practical apps.

However, the FAMILY category is a bit vague. We are going to look at what kinds of apps it contains and which genres are assigned to it.

In the code below we've:

  1. Extracted genres that have the same category FAMILY.
  2. Created a frequency table for these genres.
  3. Sorted these genres in descending order of frequency.
In [37]:
def category_genre(dataset, elem, ind_imp, ind_comp):
    table = {}
    l_comp = []
    
    for row in dataset:
        if elem == row[ind_imp]:
            l_comp.append(row[ind_comp])
    

    for el in l_comp:
        if el in table:
            table[el] += 1
        else:
            table[el] = 1
            
    return table, len(l_comp)

cat_to_gen, len_category = category_genre(android_final, 'FAMILY' , 1, -4)
In [38]:
genre_sort = []
for genre in cat_to_gen:
    freq_genre = (cat_to_gen[genre], genre)
    genre_sort.append(freq_genre)
    
genre_sort = sorted(genre_sort, reverse = True)

print('Google Play Market contains ', len_category, ' apps which category is FAMILY')
print()
for genre in genre_sort:
    print(genre[1], ' : ', genre[0])
Google Play Market contains  1676  apps which category is FAMILY

Entertainment  :  458
Education  :  382
Simulation  :  174
Casual  :  134
Puzzle  :  78
Role Playing  :  72
Strategy  :  66
Educational;Education  :  35
Educational  :  33
Education;Education  :  24
Casual;Pretend Play  :  21
Racing;Action & Adventure  :  15
Puzzle;Brain Games  :  15
Entertainment;Music & Video  :  12
Casual;Action & Adventure  :  12
Casual;Brain Games  :  11
Arcade;Action & Adventure  :  11
Educational;Pretend Play  :  8
Action;Action & Adventure  :  8
Simulation;Action & Adventure  :  7
Board;Brain Games  :  7
Entertainment;Brain Games  :  6
Educational;Brain Games  :  6
Casual;Creativity  :  6
Role Playing;Pretend Play  :  4
Education;Pretend Play  :  4
Role Playing;Action & Adventure  :  3
Puzzle;Action & Adventure  :  3
Entertainment;Action & Adventure  :  3
Educational;Creativity  :  3
Educational;Action & Adventure  :  3
Education;Music & Video  :  3
Education;Action & Adventure  :  3
Adventure;Action & Adventure  :  3
Sports;Action & Adventure  :  2
Simulation;Pretend Play  :  2
Puzzle;Creativity  :  2
Music;Music & Video  :  2
Entertainment;Pretend Play  :  2
Entertainment;Creativity  :  2
Education;Creativity  :  2
Casual;Education  :  2
Board;Action & Adventure  :  2
Art & Design;Creativity  :  2
Video Players & Editors;Music & Video  :  1
Trivia;Education  :  1
Strategy;Education  :  1
Strategy;Creativity  :  1
Strategy;Action & Adventure  :  1
Simulation;Education  :  1
Role Playing;Brain Games  :  1
Racing;Pretend Play  :  1
Puzzle;Education  :  1
Music & Audio;Music & Video  :  1
Lifestyle;Education  :  1
Health & Fitness;Education  :  1
Health & Fitness;Action & Adventure  :  1
Entertainment;Education  :  1
Education;Brain Games  :  1
Communication;Creativity  :  1
Casual;Music & Video  :  1
Card;Action & Adventure  :  1
Books & Reference;Education  :  1
Art & Design;Pretend Play  :  1
Art & Design;Action & Adventure  :  1
Arcade;Pretend Play  :  1
Adventure;Education  :  1

As expected, most apps in the FAMILY category are made for fun: the Entertainment genre, its sub-types and the various game genres make up the largest part of this category.
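
To make the weight of each genre easier to read, the combined labels such as `Casual;Pretend Play` can be folded onto their primary genre. The block below is a minimal sketch under that assumption; `group_by_primary_genre` is a hypothetical helper, and the dictionary holds only a few of the counts printed above, not the full table.

```python
from collections import Counter

# Illustrative subset of the (genre, count) pairs printed above for FAMILY.
family_genres = {
    'Entertainment': 458,
    'Education': 382,
    'Simulation': 174,
    'Casual': 134,
    'Education;Education': 24,
    'Casual;Pretend Play': 21,
    'Entertainment;Music & Video': 12,
}

def group_by_primary_genre(genre_counts):
    """Collapse 'Primary;Secondary' labels onto their primary genre."""
    grouped = Counter()
    for label, count in genre_counts.items():
        primary = label.split(';')[0]
        grouped[primary] += count
    return grouped

grouped = group_by_primary_genre(family_genres)
total = sum(grouped.values())
# Print each primary genre with its share of the sampled apps.
for genre, count in grouped.most_common():
    print(genre, ':', count, f'({count / total:.1%})')
```

Passing the full `cat_to_gen` table instead of the subset would give the shares for the whole FAMILY category.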

Nevertheless, the Google Play app distribution across categories is more balanced than in the iOS dataset.

We need to find out which genres/categories are the most popular among users. For this we need to calculate the average number of installs for each app genre. There is no information about the number of installs in our iOS dataset, so we're going to use the rating_count_tot column (the total number of user ratings) as a proxy.

In the code below we:

  1. Extract all the genres.
  2. Use a nested loop to:
    • sum up the rating counts of all the apps belonging to the same genre
    • count the number of apps belonging to this genre
  3. Calculate the average rating count per genre.
In [39]:
freq_genre_ios = freq_table(ios_final, -5)

for genre in freq_genre_ios:
    total = 0
    len_genre = 0
    for row in ios_final:
        genre_app = row[-5]
        if genre == genre_app:
            raiting = float(row[5])
            total += raiting
            len_genre += 1
    average_raiting = total / len_genre
    print(genre, ' : ', average_raiting)
    
Social Networking  :  72916.54807692308
Photo & Video  :  28441.54375
Games  :  22935.43984962406
Music  :  57326.530303030304
Reference  :  79350.4705882353
Health & Fitness  :  23298.015384615384
Weather  :  52279.892857142855
Utilities  :  19156.493670886077
Travel  :  28964.05128205128
Shopping  :  27230.734939759037
News  :  21248.023255813954
Navigation  :  86090.33333333333
Lifestyle  :  16815.48
Entertainment  :  14195.358565737051
Food & Drink  :  33333.92307692308
Sports  :  23008.898550724636
Book  :  46384.916666666664
Finance  :  32367.02857142857
Education  :  7003.983050847458
Productivity  :  21028.410714285714
Business  :  7491.117647058823
Catalogs  :  4004.0
Medical  :  612.0
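
The nested loop above rescans the whole dataset once per genre. As an alternative, the same averages can be accumulated in a single pass. This is only a sketch: `average_by_key` is a hypothetical helper and the rows are toy data, not the real dataset.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Return {key: mean of the value column}, computed in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy rows shaped like (name, genre, rating count); hypothetical values.
sample_rows = [
    ['App A', 'Navigation', '100'],
    ['App B', 'Navigation', '300'],
    ['App C', 'Medical', '50'],
]

averages = average_by_key(sample_rows, key_index=1, value_index=2)
# Show the genres from the highest to the lowest average.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

With the column layout used in this notebook, the equivalent call would be `average_by_key(ios_final, -5, 5)`.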

It seems that the apps that received the most ratings on average belong to Navigation, Reference, Social Networking, Music and Weather.

These categories are probably dominated by a few giants: Navigation by Waze, Google Maps, etc.; Social Networking by Facebook, Pinterest, Skype, etc.; Music by Spotify, Shazam, Pandora, etc.

These giants have a big impact on the average of their whole category.

In [41]:
def most_comm(dataset, pattern):
    for app in dataset:
        name = app[1]
        genre = app[-5]
        if genre == pattern:
            print(name, ':', app[5])

print('Most common apps within Navigation category :')
print(most_comm(ios_final, 'Navigation'))
Most common apps within Navigation category :
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
None
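
One way to see how much the two largest apps pull the Navigation average upwards is to compare the mean with the median. The sketch below uses the six rating counts printed above and the standard `statistics` module.

```python
from statistics import mean, median

# Rating counts copied from the Navigation listing above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# The mean is pulled up by Waze and Google Maps; the median is not.
print('mean  :', mean(navigation_ratings))
print('median:', median(navigation_ratings))
```

The mean is the 86,090 reported earlier, while the median is only about 8,200, so the typical Navigation app receives far fewer ratings than the average suggests.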
In [42]:
print('Most common apps within Social Networking category')
print(most_comm(ios_final, 'Social Networking'))
Most common apps within Social Networking category
Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 23965
SimSimi : 23530
Grindr - Gay and same sex guys chat, meet and date : 23201
Wishbone - Compare Anything : 20649
imo video calls and chat : 18841
After School - Funny Anonymous School News : 18482
Quick Reposter - Repost, Regram and Reshare Photos : 17694
Weibo HD : 16772
Repost for Instagram : 15185
Live.me – Live Video Chat & Make Friends Nearby : 14724
Nextdoor : 14402
Followers Analytics for Instagram - InstaReport : 13914
YouNow: Live Stream Video Chat : 12079
FollowMeter for Instagram - Followers Tracking : 11976
LINE : 11437
eHarmony™ Dating App - Meet Singles : 11124
Discord - Chat for Gamers : 9152
QQ : 9109
Telegram Messenger : 7573
Weibo : 7265
Periscope - Live Video Streaming Around the World : 6062
Chat for Whatsapp - iPad Version : 5060
QQ HD : 5058
Followers Analysis Tool For Instagram App Free : 4253
live.ly - live video streaming : 4145
Houseparty - Group Video Chat : 3991
SOMA Messenger : 3232
Monkey : 3060
Down To Lunch : 2535
Flinch - Video Chat Staring Contest : 2134
Highrise - Your Avatar Community : 2011
LOVOO - Dating Chat : 1985
PlayStation®Messages : 1918
BOO! - Video chat camera with filters & stickers : 1805
Qzone : 1649
Chatous - Chat with new people : 1609
Kiwi - Q&A : 1538
GhostCodes - a discovery app for Snapchat : 1313
Jodel : 1193
FireChat : 1037
Google Duo - simple video calling : 1033
Fiesta by Tango - Chat & Meet New People : 885
Google Allo — smart messaging : 862
Peach — share vividly : 727
Hey! VINA - Where Women Meet New Friends : 719
Battlefield™ Companion : 689
All Devices for WhatsApp - Messenger for iPad : 682
Chat for Pokemon Go - GoChat : 500
IAmNaughty – Dating App to Meet New People Online : 463
Qzone HD : 458
Zenly - Locate your friends in realtime : 427
League of Legends Friends : 420
Candid - Speak Your Mind Freely : 398
Selfeo : 366
Fake-A-Location Free ™ : 354
Popcorn Buzz - Free Group Calls : 281
Fam — Group video calling for iMessage : 279
QQ International : 274
Ameba : 269
SoundCloud Pulse: for creators : 240
Tantan : 235
Cougar Dating & Life Style App for Mature Women : 213
Rawr Messenger - Dab your chat : 180
WhenToPost: Best Time to Post Photos for Instagram : 158
Inke—Broadcast an amazing life : 147
Mustknow - anonymous video Q&A : 53
CTFxCmoji : 39
Lobi : 36
Chain: Collaborate On MyVideo Story/Group Video : 35
botman - Real time video chat : 7
BestieBox : 0
MATCH ON LINE chat : 0
niconico ch : 0
LINE BLOG : 0
bit-tube - Live Stream Video Chat : 0
None

As for the Reference category, its average is inflated by the Bible app and Dictionary.com.

In [43]:
print('Most common apps within Reference category')
print(most_comm(ios_final, 'Reference'))
Most common apps within Reference category
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0
None
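
The weight of the two leading apps can be quantified as their share of all ratings in the category. The block below is a sketch using the counts printed above; `top_share` is a hypothetical helper, not part of the original analysis.

```python
def top_share(values, k):
    """Fraction of the total held by the k largest values."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Rating counts copied from the Reference listing above.
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0,
]

print(f'Top 2 apps hold {top_share(reference_ratings, 2):.1%} of all ratings')
```

Under this measure, Bible and Dictionary.com together account for close to nine tenths of the ratings in the Reference category.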

Nevertheless, this category is worth exploring further for a few reasons:

  • it's not 'overcrowded' with apps, so we have fewer competitors
  • the iOS market seems to be saturated with 'for fun' apps
  • apps in this category can serve a practical purpose while still being fun to use.

For example, we might:

  1. Create an app dedicated to a popular book.
  2. Integrate some features into this app: quizzes, a dictionary, daily quotes, etc.
In [44]:
print('Most common apps within Education category')
print(most_comm(ios_final, 'Education'))
Most common apps within Education category
Duolingo - Learn Spanish, French and more : 162701
Guess My Age  Math Magic : 123190
Lumosity - Brain Training : 96534
Elevate - Brain Training and Games : 58092
Fit Brains Trainer : 46363
ClassDojo : 35440
Memrise: learn languages : 20383
Peak - Brain Training : 20322
Canvas by Instructure : 19981
ABCmouse.com - Early Learning Academy : 18749
Quizlet: Study Flashcards, Languages & Vocabulary : 16683
Photomath - Camera Calculator : 16523
iTunes U : 15801
Blackboard Mobile Learn™ : 13567
Star Chart : 13482
Remind: Fast, Efficient School Messaging : 9796
PBS KIDS Video : 8651
Toca Kitchen Monsters : 8062
Toca Hair Salon - Christmas Gift : 8049
Edmodo : 7197
Prodigy Math Game : 6683
Epic! - Unlimited Books for Kids : 6676
ChineseSkill -Learn Mandarin Chinese Language Free : 6077
Google Classroom : 5942
TED : 5782
Khan Academy: you can learn anything : 5459
Got It - Homework Help Math, Chem, Physics Solver : 4903
PowerSchool Mobile : 4547
SkyView® Free - Explore the Universe : 4188
Hopscotch : 4057
IXL - Math and English : 3546
Simply Piano by JoyTunes - Learn & play piano : 2925
Kids A-Z : 2887
Infinite Campus Mobile Portal : 2286
PlayKids - Educational Cartoons and Games for Kids : 2196
Memorado Brain Training for Memory & Mindfulness : 2067
Bookshelf : 2064
Mathway - Math Problem Solver : 1854
Schoology : 1777
HelloTalk Language Exchange Learning App : 1619
SpellingCity : 1566
Nick Jr. : 1541
Babbel – Learn Languages Spanish, French & more : 1533
Yup - Homework Help with Math & Science Tutors : 1424
Mondly: Learn 33 Languages: Spanish English French : 1395
WWF Together : 1385
Tinycards - Learn with Fun, Free Flashcards : 1131
Nearpod : 1057
Starfall FREE : 1019
Reflex Student : 1010
GoldieBlox and the Movie Machine : 1000
Pearson eText : 981
codeSpark Academy with The Foos - coding for kids : 977
Dr. Panda Restaurant Asia : 853
NOGGIN - Preschool Shows & Educational Kids Videos : 782
Tynker - Learn to Code. Programming Made Easy. : 771
BrainHQ - Brain Training : 684
Top Hat Lecture : 668
Pearson eText for Schools : 609
Curious World: Games, Videos, Books for Children : 604
McGraw-Hill K-12 ConnectED Mobile : 594
Socrative Student : 581
Swift Playgrounds : 578
MarcoPolo Ocean : 529
TestNav : 491
Starfall Learn to Read : 474
Speakaboos Reading App: Stories & Songs for Kids : 440
Bloxels: Build, Play & Share Your Own Video Games : 382
GoNoodle Kids : 372
Global Shark Tracker : 336
The Robot Factory by Tinybop : 335
Daniel Tiger’s Day & Night : 314
Kahoot! Play Fun Learning Games : 300
Spanish SOLO: Learn Spanish With Lessons On The Go : 275
Math 42 : 248
Star Walk 2 Ads+ Night Sky Map - Stars and Planets : 161
Toca Dance Free : 149
Endless Learning Academy : 143
270toWin : 141
Win the White House : 123
Sago Mini Babies Dress Up : 115
Nancy Drew Codes and Clues Mystery Coding Game : 110
1600 : 110
BEAKER by THIX : 94
Highlights™ Shapes - Preschool Learning Puzzles : 90
Little Panda's Candy Shop - Lollipop Factory : 84
Mathpix - Solve and graph math using pictures : 83
Blue Apprentice Elementary Science Game by Galxyz : 79
PINKFONG Birthday Party : 70
Hopster: Kids TV, Nursery Rhymes, Music, Fun Games : 58
Sago Mini Holiday Trucks and Diggers : 56
Dr. Panda Toy Cars Free : 51
Virry Educational. Play, learn with real animals : 50
Highlights Monster Day : 49
PlayKids Learn - Learning through play : 49
PBS KIDS ScratchJr : 38
Lemon Lumberjack's Letter Mill : 34
Ready Jet Go! Space Explorer : 34
Chinese Recipes - Asian cuisine : 32
Nature Cat's Great Outdoors : 31
Show My Homework : 17
PINKFONG 123 Numbers : 17
Aquarium VR : 12
Little Panda Mini Games-3D : 9
Stylish School Timetable : 7
Merry Christmas -Activities : 7
Mastering the piano with Lang Lang : 6
Baby Panda's Carnival : 6
Driving test 2017 : 5
Cutie Patootie - Xmas Surprise : 5
Free IQ Test: Calculate your IQ : 5
Beautiful Japanese Handwriting for iPhone : 0
GhostCallDX : 0
Baby Learns Transportation : 0
Baby Panda's Bath Time : 0
Beautiful Japanese Handwriting : 0
Spring Festival by BabyBus : 0
Dinosaur Planet : 0
None

For example, we can notice that the Education category is dominated by language-learning and brain-training apps. So we could build an app at the intersection of three categories: Reference, Education and Book.

Other categories are of less interest to us:

  • Weather apps: people do not spend a lot of time checking the weather forecast, so our chances of profiting from in-app ads are quite low. We would also need reliable weather data, which may require connecting to non-free APIs.
  • Food & Drink is dominated by huge businesses (Starbucks, Dunkin' Donuts, McDonald's, etc.), and we might need an actual cooking and delivery service.
  • Finance apps would require us to hire domain experts to set up banking, payment systems, transfers, etc.

In the Google Play dataset we have the number of installs, which we can use to find the most popular apps.

However, the values in this column are not precise:

In [45]:
print(display_table(android_final, 5)) #installs column
1,000,000+  :  15.75497287522604
100,000+  :  11.539330922242314
10,000,000+  :  10.567359855334539
10,000+  :  10.194394213381555
1,000+  :  8.39737793851718
100+  :  6.928119349005425
5,000,000+  :  6.826401446654612
500,000+  :  5.560578661844485
50,000+  :  4.769439421338156
5,000+  :  4.486889692585895
10+  :  3.5375226039783
500+  :  3.2436708860759493
50,000,000+  :  2.2830018083182644
100,000,000+  :  2.1360759493670884
50+  :  1.9213381555153706
5+  :  0.7911392405063291
1+  :  0.5085895117540687
500,000,000+  :  0.27124773960216997
1,000,000,000+  :  0.22603978300180833
0+  :  0.045207956600361664
0  :  0.011301989150090416
None

Presented this way, the data doesn't tell us whether an app was downloaded 1,000,000 times or 4,000,000 times.

However, we don't need that level of precision, so we'll take the numbers as they are: 100,000+ will correspond to 100000 downloads.

Thus, we will convert the entries into numbers.
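
The conversion described above can be sketched as a small helper. `parse_installs` is a hypothetical name; it returns an `int`, whereas the notebook cell that follows does the same clean-up inline and calls `float`.

```python
def parse_installs(raw):
    """Turn a Google Play install bucket such as '1,000,000+' into an int."""
    # Strip the thousands separators and the trailing plus sign.
    return int(raw.replace(',', '').replace('+', ''))

for bucket in ['1,000,000+', '100,000+', '0+', '0']:
    print(bucket, '->', parse_installs(bucket))
```

Each bucket is mapped to its lower bound, so the resulting averages underestimate the real install counts.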

In [46]:
freq_category_android = freq_table(android_final, 1)

sort_list = []
for category in freq_category_android:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_instals = app[5]
            n_instals = n_instals.replace('+', '')
            n_instals = n_instals.replace(',', '')
            total += float(n_instals)
            len_category += 1
            #print(n_instals)
    average_installs = total / len_category
    install_cat = (average_installs, category)
    sort_list.append(install_cat)
    
    #print(category, ' : ', average_installs)
sort_list = sorted(sort_list, reverse = True)
for el in sort_list:
    print(el[1], ' : ', el[0])
    
COMMUNICATION  :  38590581.08741259
VIDEO_PLAYERS  :  24727872.452830188
SOCIAL  :  23253652.127118643
PHOTOGRAPHY  :  17840110.40229885
PRODUCTIVITY  :  16787331.344927534
GAME  :  15544014.51048951
TRAVEL_AND_LOCAL  :  13984077.710144928
ENTERTAINMENT  :  11640705.88235294
TOOLS  :  10830251.970588235
NEWS_AND_MAGAZINES  :  9549178.467741935
BOOKS_AND_REFERENCE  :  8814199.78835979
SHOPPING  :  7036877.311557789
PERSONALIZATION  :  5201482.6122448975
WEATHER  :  5145550.285714285
HEALTH_AND_FITNESS  :  4188821.9853479853
MAPS_AND_NAVIGATION  :  4049274.6341463416
FAMILY  :  3695641.8198090694
SPORTS  :  3650602.276666667
ART_AND_DESIGN  :  1986335.0877192982
FOOD_AND_DRINK  :  1924897.7363636363
EDUCATION  :  1833495.145631068
BUSINESS  :  1712290.1474201474
LIFESTYLE  :  1446158.2238372094
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1360598.042253521
DATING  :  854028.8303030303
COMICS  :  832613.8888888889
AUTO_AND_VEHICLES  :  647317.8170731707
LIBRARIES_AND_DEMO  :  638503.734939759
PARENTING  :  542603.6206896552
BEAUTY  :  513151.88679245283
EVENTS  :  253542.22222222222
MEDICAL  :  120550.61980830671

As we can see, the most installed apps on average belong to the COMMUNICATION category. This category is dominated by WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail and Hangouts, each with over a billion installs, while a number of other apps have over 100 or 500 million installs.

In [47]:
def highest_android_installs(dataset, category, ind_instal):
    # Install buckets that we treat as "giant" apps (100 million installs or more).
    top_buckets = ('1,000,000,000+', '500,000,000+', '100,000,000+')

    for app in dataset:
        if app[1] == category and app[ind_instal] in top_buckets:
            print(app[0], ' : ', app[ind_instal])
In [48]:
print(highest_android_installs(android_final, 'COMMUNICATION', 5))
WhatsApp Messenger  :  1,000,000,000+
imo beta free calls and text  :  100,000,000+
Android Messages  :  100,000,000+
Google Duo - High Quality Video Calls  :  500,000,000+
Messenger – Text and Video Chat for Free  :  1,000,000,000+
imo free video calls and chat  :  500,000,000+
Skype - free IM & video calls  :  1,000,000,000+
Who  :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji  :  100,000,000+
LINE: Free Calls & Messages  :  500,000,000+
Google Chrome: Fast & Secure  :  1,000,000,000+
Firefox Browser fast & private  :  100,000,000+
UC Browser - Fast Download Private & Secure  :  500,000,000+
Gmail  :  1,000,000,000+
Hangouts  :  1,000,000,000+
Messenger Lite: Free Calls & Messages  :  100,000,000+
Kik  :  100,000,000+
KakaoTalk: Free Calls & Text  :  100,000,000+
Opera Mini - fast web browser  :  100,000,000+
Opera Browser: Fast and Secure  :  100,000,000+
Telegram  :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer  :  100,000,000+
UC Browser Mini -Tiny Fast Private & Secure  :  100,000,000+
Viber Messenger  :  500,000,000+
WeChat  :  100,000,000+
Yahoo Mail – Stay Organized  :  100,000,000+
BBM - Free Calls & Messages  :  100,000,000+
None

If we remove the giants (apps with 100 million installs or more) from this category, the average drops considerably.

In [49]:
under_100_m = []

for app in android_final:
    category = 'COMMUNICATION'
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = n_installs.replace(',', '')
    if category == app[1] and float(n_installs) < 100000000:
        under_100_m.append(float(n_installs))
        
print(sum(under_100_m) / len(under_100_m))
3617398.420849421
In [50]:
#difference between average in Communication category with giant apps and without
print(sort_list[0][0] - (sum(under_100_m) / len(under_100_m)))
34973182.66656317

The same pattern can be observed for the VIDEO_PLAYERS category, which is largely dominated by 9 apps.

In [51]:
print(highest_android_installs(android_final, 'VIDEO_PLAYERS', 5))
YouTube  :  1,000,000,000+
Motorola Gallery  :  100,000,000+
VLC for Android  :  100,000,000+
Google Play Movies & TV  :  1,000,000,000+
MX Player  :  500,000,000+
Dubsmash  :  100,000,000+
VivaVideo - Video Editor & Photo Movie  :  100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera  :  100,000,000+
Motorola FM Radio  :  100,000,000+
None

These few apps may create the misleading impression that their categories are extremely popular and deserve our attention.
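
The with/without comparison made for COMMUNICATION can be repeated for any category with a small helper. This is a sketch under stated assumptions: `average_installs` and its `cap` parameter are hypothetical, and the rows are toy data that only mimic the column positions used in this notebook (category at index 1, installs at index 5).

```python
def parse_installs(raw):
    """Turn an install bucket such as '1,000,000+' into an int."""
    return int(raw.replace(',', '').replace('+', ''))

def average_installs(rows, category, cap=None, cat_index=1, installs_index=5):
    """Mean installs for a category; apps with installs >= cap are left out when cap is set."""
    values = [
        parse_installs(row[installs_index])
        for row in rows
        if row[cat_index] == category
    ]
    if cap is not None:
        values = [v for v in values if v < cap]
    return sum(values) / len(values) if values else 0.0

# Hypothetical rows: only the category and installs columns are filled in.
toy_rows = [
    ['Giant Player', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Small Player A', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Small Player B', 'VIDEO_PLAYERS', '', '', '', '500,000+'],
    ['Some Game', 'GAME', '', '', '', '10,000+'],
]

with_giants = average_installs(toy_rows, 'VIDEO_PLAYERS')
without_giants = average_installs(toy_rows, 'VIDEO_PLAYERS', cap=100_000_000)
print('with giants   :', with_giants)
print('without giants:', without_giants)
```

On the real data the call would be `average_installs(android_final, 'VIDEO_PLAYERS', cap=100_000_000)`.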

The BOOKS_AND_REFERENCE category seems to be popular as well, and it is not dominated by just a few apps.

In [52]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])
E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
English translation from Bengali : 100,000+
Pdf Book Download - Read Pdf Book : 100,000+
Free Book Reader : 100,000+
eBoox new: Reader for fb2 epub zip books : 50,000+
Only 30 days in English, the guideline is guaranteed : 500,000+
Moon+ Reader : 10,000,000+
SH-02J Owner's Manual (Android 8.0) : 50,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Azpen eReader : 500,000+
URBANO V 02 instruction manual : 100,000+
Bible : 100,000,000+
C Programs and Reference : 50,000+
C Offline Tutorial : 1,000+
C Programs Handbook : 50,000+
Amazon Kindle : 100,000,000+
Aab e Hayat Full Novel : 100,000+
Aldiko Book Reader : 10,000,000+
Google I/O 2018 : 500,000+
R Language Reference Guide : 10,000+
Learn R Programming Full : 5,000+
R Programing Offline Tutorial : 1,000+
Guide for R Programming : 5+
Learn R Programming : 10+
R Quick Reference Big Data : 1,000+
V Made : 100,000+
Wattpad 📖 Free Books : 100,000,000+
Dictionary - WordWeb : 5,000,000+
Guide (for X-MEN) : 100,000+
AC Air condition Troubleshoot,Repair,Maintenance : 5,000+
AE Bulletins : 1,000+
Ae Allah na Dai (Rasa) : 10,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Ag PhD Field Guide : 10,000+
Ag PhD Deficiencies : 10,000+
Ag PhD Planting Population Calculator : 1,000+
Ag PhD Soybean Diseases : 1,000+
Fertilizer Removal By Crop : 50,000+
A-J Media Vault : 50+
Al-Quran (Free) : 10,000,000+
Al Quran (Tafsir & by Word) : 500,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al-Muhaffiz : 50,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Al-Quran 30 Juz free copies : 500,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
Hafizi Quran 15 lines per page : 1,000,000+
Quran for Android : 10,000,000+
Surah Al-Waqiah : 100,000+
Hisnul Al Muslim - Hisn Invocations & Adhkaar : 100,000+
Satellite AR : 1,000,000+
Audiobooks from Audible : 100,000,000+
Kinot & Eichah for Tisha B'Av : 10,000+
AW Tozer Devotionals - Daily : 5,000+
Tozer Devotional -Series 1 : 1,000+
The Pursuit of God : 1,000+
AY Sing : 5,000+
Ay Hasnain k Nana Milad Naat : 10,000+
Ay Mohabbat Teri Khatir Novel : 10,000+
Arizona Statutes, ARS (AZ Law) : 1,000+
Oxford A-Z of English Usage : 1,000,000+
BD Fishpedia : 1,000+
BD All Sim Offer : 10,000+
Youboox - Livres, BD et magazines : 500,000+
B&H Kids AR : 10,000+
B y H Niños ES : 5,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
Bible KJV : 5,000,000+
Borneo Bible, BM Bible : 10,000+
MOD Black for BM : 100+
BM Box : 1,000+
Anime Mod for BM : 100+
NOOK: Read eBooks & Magazines : 10,000,000+
NOOK Audiobooks : 500,000+
NOOK App for NOOK Devices : 500,000+
Browsery by Barnes & Noble : 5,000+
bp e-store : 1,000+
Brilliant Quotes: Life, Love, Family & Motivation : 1,000,000+
BR Ambedkar Biography & Quotes : 10,000+
BU Alsace : 100+
Catholic La Bu Zo Kam : 500+
Khrifa Hla Bu (Solfa) : 10+
Kristian Hla Bu : 10,000+
SA HLA BU : 1,000+
Learn SAP BW : 500+
Learn SAP BW on HANA : 500+
CA Laws 2018 (California Laws and Codes) : 5,000+
Bootable Methods(USB-CD-DVD) : 10,000+
cloudLibrary : 100,000+
SDA Collegiate Quarterly : 500+
Sabbath School : 100,000+
Cypress College Library : 100+
Stats Royale for Clash Royale : 1,000,000+
GATE 21 years CS Papers(2011-2018 Solved) : 50+
Learn CT Scan Of Head : 5,000+
Easy Cv maker 2018 : 10,000+
How to Write CV : 100,000+
CW Nuclear : 1,000+
CY Spray nozzle : 10+
BibleRead En Cy Zh Yue : 5+
CZ-Help : 5+
Guide for DB Xenoverse : 10,000+
Guide for DB Xenoverse 2 : 10,000+
Guide for IMS DB : 10+
DC HSEMA : 5,000+
DC Public Library : 1,000+
Painting Lulu DC Super Friends : 1,000+
Dictionary : 10,000,000+
Fix Error Google Playstore : 1,000+
D. H. Lawrence Poems FREE : 1,000+
Bilingual Dictionary Audio App : 5,000+
DM Screen : 10,000+
wikiHow: how to do anything : 1,000,000+
Dr. Doug's Tips : 1,000+
Bible du Semeur-BDS (French) : 50,000+
La citadelle du musulman : 50,000+
DV 2019 Entry Guide : 10,000+
DV 2019 - EDV Photo & Form : 50,000+
DV 2018 Winners Guide : 1,000+
EB Annual Meetings : 1,000+
EC - AP & Telangana : 5,000+
TN Patta Citta & EC : 10,000+
AP Stamps and Registration : 10,000+
CompactiMa EC pH Calibration : 100+
EGW Writings 2 : 100,000+
EGW Writings : 1,000,000+
Bible with EGW Comments : 100,000+
My Little Pony AR Guide : 1,000,000+
SDA Sabbath School Quarterly : 500,000+
Duaa Ek Ibaadat : 5,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+
English to Hindi Dictionary : 5,000,000+
EP Research Service : 1,000+
Hymnes et Louanges : 100,000+
EU Charter : 1,000+
EU Data Protection : 1,000+
EU IP Codes : 100+
EW PDF : 5+
BakaReader EX : 100,000+
EZ Quran : 50,000+
FA Part 1 & 2 Past Papers Solved Free – Offline : 5,000+
La Fe de Jesus : 1,000+
La Fe de Jesús : 500+
Le Fe de Jesus : 500+
Florida - Pocket Brainbook : 1,000+
Florida Statutes (FL Code) : 1,000+
English To Shona Dictionary : 10,000+
Greek Bible FP (Audio) : 1,000+
Golden Dictionary (FR-AR) : 500,000+
Fanfic-FR : 5,000+
Bulgarian French Dictionary Fr : 10,000+
Chemin (fr) : 1,000+
The SCP Foundation DB fr nn5n : 1,000+
In [53]:
print(highest_android_installs(android_final, 'BOOKS_AND_REFERENCE', 5))
Google Play Books  :  1,000,000,000+
Bible  :  100,000,000+
Amazon Kindle  :  100,000,000+
Wattpad 📖 Free Books  :  100,000,000+
Audiobooks from Audible  :  100,000,000+
None
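
To judge how the installs are spread within the category, the apps can be counted per install bucket. The block below is a sketch on a handful of apps copied from the listing above; `bucket_value` is a hypothetical helper used only to order the buckets.

```python
from collections import Counter

# A few (name, installs) pairs copied from the BOOKS_AND_REFERENCE listing above.
sample_apps = [
    ('Google Play Books', '1,000,000,000+'),
    ('Bible', '100,000,000+'),
    ('Amazon Kindle', '100,000,000+'),
    ('Wikipedia', '10,000,000+'),
    ('Cool Reader', '10,000,000+'),
    ('FBReader: Favorite Book Reader', '10,000,000+'),
    ('Book store', '1,000,000+'),
    ('C Offline Tutorial', '1,000+'),
]

def bucket_value(bucket):
    """Numeric lower bound of an install bucket, used as a sort key."""
    return int(bucket.replace(',', '').replace('+', ''))

bucket_counts = Counter(installs for _, installs in sample_apps)
# List the buckets from the largest to the smallest.
for bucket in sorted(bucket_counts, key=bucket_value, reverse=True):
    print(bucket, ':', bucket_counts[bucket])
```

Running the same count over every BOOKS_AND_REFERENCE row of `android_final` would show how many apps sit in the middle buckets rather than at the very top.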

This niche seems to be dominated by software for reading and managing books and libraries (e-book readers, stores and audiobook apps), so this category may be very interesting for us.

Creating a similar app would be risky, but we could create an app based on a specific book and include various fun features.

Conclusion

In this project we analyzed data about App Store and Google Play mobile apps in order to advise our development team on a future app.

We concluded that the best idea would be to take a popular modern book and build an app around it with fun features. This idea seems to be profitable for both markets.