My First Data Analysis Project

The goal of this project is to analyze data that will help developers understand which types of apps are likely to attract more users.

In [1]:
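### The Apple dataset ###
from csv import reader

# Read the App Store data; the with block closes the file when done.
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios_app = list(reader(open_file))
ios_app_header = ios_app[0]  # first row holds the column names
ios_app = ios_app[1:]        # remaining rows are the apps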
In [2]:
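### The Google dataset ###
from csv import reader

# Read the Google Play data; the with block closes the file when done.
with open('googleplaystore.csv', encoding='utf8') as open_file:
    android_app = list(reader(open_file))
android_app_header = android_app[0]  # first row holds the column names
android_app = android_app[1:]        # remaining rows are the apps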
In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(ios_app_header)
print('\n')
explore_data(ios_app, 0, 3, True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16

The App Store dataset 'ios_app' has 7197 rows and 16 columns. The columns price, rating_count_tot, user_rating, and prime_genre will be useful for our analysis. Readers can find the full documentation here
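
The useful columns can be located by name instead of by counting positions in the header. The following is a minimal sketch that assumes the header row printed above; column_index is a hypothetical helper and is not part of the original notebook.

```python
# Header row copied from the output above.
ios_app_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                  'rating_count_tot', 'rating_count_ver', 'user_rating',
                  'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                  'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_index(header, name):
    """Return the position of column `name` in `header` (hypothetical helper)."""
    return header.index(name)


useful = ['price', 'rating_count_tot', 'user_rating', 'prime_genre']
positions = {name: column_index(ios_app_header, name) for name in useful}
print(positions)
```

The resulting positions (4, 5, 7, and 11) are the indexes that the later cells use when they read app[4], app[5], and app[11].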

In [4]:
print(android_app_header)
print('\n')
explore_data(android_app, 0,3,True)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13

The Google Play dataset 'android_app' has 10841 rows and 13 columns. For our analysis, the columns App, Category, Reviews, Installs, Price, and Genres will be useful. Readers can find the full documentation here

In [5]:
for row in android_app:
    if len(row) != len(android_app_header):
        print(row)
        print(android_app.index(row))
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472
In [6]:
del android_app[10472]
In [7]:
for app in android_app:
    name = app[0]
    if name =="Facebook":
        print(app)
['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']

The output shows that the Facebook app appears in two rows, so it is a duplicate entry. For an accurate analysis, we have to filter out all duplicate apps.
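
To see how widespread the problem is, we can count how often each app name occurs. The sketch below uses collections.Counter from the standard library on a small made-up sample (the rows are illustrative, not taken from the dataset); with the real data one would pass android_app instead of sample_rows.

```python
from collections import Counter

# Illustrative rows: only the first column (the app name) matters here.
sample_rows = [
    ['Facebook', 'SOCIAL'],
    ['Box', 'BUSINESS'],
    ['Facebook', 'SOCIAL'],
    ['Slack', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
]

name_counts = Counter(row[0] for row in sample_rows)

# Names that appear more than once, with the number of rows for each.
duplicated = {name: n for name, n in name_counts.items() if n > 1}
# Rows that would have to go if exactly one row per name is kept.
extra_rows = sum(n - 1 for n in duplicated.values())

print(duplicated)
print('Rows to remove:', extra_rows)
```

Counting with a dictionary-like structure avoids repeated `name in list` lookups, which become slow as the list of names grows.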

In [8]:
duplicate_apps = []
unique_apps = []

for app in android_app:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print("Number of duplicate apps: ", len(duplicate_apps))
print('\n')
print("Examples of duplicate apps: ", duplicate_apps[:10])
Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']

The output shows that we have 1181 duplicate rows. We will not remove the duplicates at random. Instead, the choice will be based on a criterion: for each app, we keep the row with the highest number of reviews.

In [9]:
print("Expected length: ", len(android_app) - len(duplicate_apps))
Expected length:  9659
In [10]:
reviews_max = {}

for app in android_app:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print("Actual length: ", len(reviews_max))
Actual length:  9659
In [11]:
android_clean = []
already_added = []

for app in android_app:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
In [12]:
print("The clean data Length : ", len(android_clean))
The clean data Length :  9659

In the first step, I created an empty dictionary, reviews_max, and filled it with key-value pairs that map each unique app name to its highest number of reviews. As mentioned above, the number of reviews is the criterion for our data cleaning: I removed duplicates by keeping, for each app, only the row with the most reviews. In the second step, I used the dictionary from step one to remove the duplicate rows. The android_clean list stores the cleaned data set, and the already_added list keeps track of the apps that have already been added, so that an app whose duplicate rows share the same highest review count is not added twice.
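
The same cleaning can also be done in one pass by storing the whole row, not just the review count, in the dictionary. This is a sketch of an alternative, not the code used above: dedupe_by_reviews is a hypothetical name, and the sample rows are partly made up (app name at index 0, review count at index 3, as in the Google Play data).

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

clean = dedupe_by_reviews(sample_rows)
for row in clean:
    print(row)
```

The rows kept are the same as with the two-step approach, although their order may differ because the dictionary is ordered by the first appearance of each name.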

In [13]:
def is_english(string):
    for s in string:
        if ord(s) > 127:
            return False
    return True
In [14]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
True
False
False
False
In [15]:
def is_english(string):
    non_Ascii = 0
    
    for s in string:
        if ord(s) > 127:
            non_Ascii += 1
            
    if non_Ascii > 3:
        return False
    else:
        return True
    

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
    
True
True
False
In [16]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios_app:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
        
print("Remaining Android app rows: ", len(android_english))
print("Remaining iOS app rows: ", len(ios_english))
Remaining Android app rows:  9614
Remaining iOS app rows:  6183
In [17]:
ios_final = []
android_final = []

for app in android_english:
    app_price = app[7]
    if app_price == '0':
        android_final.append(app)
        
for app in ios_english:
    app_price = app[4]
    if app_price =='0.0':
        ios_final.append(app)
        
print("length of android final is: ", len(android_final))
print("length of ios_final is : ", len(ios_final))
length of android final is:  8864
length of ios_final is :  3222

Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of users. To achieve this, we developed a validation strategy that minimizes overhead and risk. The strategy comprises three steps:

  1. Build a minimal Android version of the app, and add it to Google Play.
  2. If the app has a good response from users, we develop it further.
  3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store
In [18]:
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
            
    frequency_table_percentage = {}
    for t in frequency_table:
        percentage =(frequency_table[t] / total) * 100
        frequency_table_percentage[t] = percentage
    return frequency_table_percentage


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
In [20]:
display_table(ios_final, 11)
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

The results show that Games is by far the most common genre among the free English iOS apps (about 58%), followed by Entertainment and Photo & Video. Education and Social Networking are close to each other in frequency. The general impression is that most of the apps are designed for entertainment. For the App Store I would lean toward recommending an entertainment app, but we cannot decide based on this result alone: a large number of apps in a genre does not imply a large number of users.

In [21]:
display_table(android_final, -4)
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075812
Strategy : 0.9138086642599278
House & Home : 0.8235559566787004
Weather : 0.8009927797833934
Events : 0.7107400722021661
Adventure : 0.6768953068592057
Comics : 0.6092057761732852
Beauty : 0.5979241877256317
Art & Design : 0.5979241877256317
Parenting : 0.4963898916967509
Card : 0.45126353790613716
Casino : 0.42870036101083037
Trivia : 0.41741877256317694
Educational;Education : 0.39485559566787
Board : 0.3835740072202166
Educational : 0.3722924187725632
Education;Education : 0.33844765342960287
Word : 0.2594765342960289
Casual;Pretend Play : 0.236913357400722
Music : 0.2030685920577617
Racing;Action & Adventure : 0.16922382671480143
Puzzle;Brain Games : 0.16922382671480143
Entertainment;Music & Video : 0.16922382671480143
Casual;Brain Games : 0.13537906137184114
Casual;Action & Adventure : 0.13537906137184114
Arcade;Action & Adventure : 0.12409747292418773
Action;Action & Adventure : 0.10153429602888085
Educational;Pretend Play : 0.09025270758122744
Simulation;Action & Adventure : 0.078971119133574
Parenting;Education : 0.078971119133574
Entertainment;Brain Games : 0.078971119133574
Board;Brain Games : 0.078971119133574
Parenting;Music & Video : 0.06768953068592057
Educational;Brain Games : 0.06768953068592057
Casual;Creativity : 0.06768953068592057
Art & Design;Creativity : 0.06768953068592057
Education;Pretend Play : 0.056407942238267145
Role Playing;Pretend Play : 0.04512635379061372
Education;Creativity : 0.04512635379061372
Role Playing;Action & Adventure : 0.033844765342960284
Puzzle;Action & Adventure : 0.033844765342960284
Entertainment;Creativity : 0.033844765342960284
Entertainment;Action & Adventure : 0.033844765342960284
Educational;Creativity : 0.033844765342960284
Educational;Action & Adventure : 0.033844765342960284
Education;Music & Video : 0.033844765342960284
Education;Brain Games : 0.033844765342960284
Education;Action & Adventure : 0.033844765342960284
Adventure;Action & Adventure : 0.033844765342960284
Video Players & Editors;Music & Video : 0.02256317689530686
Sports;Action & Adventure : 0.02256317689530686
Simulation;Pretend Play : 0.02256317689530686
Puzzle;Creativity : 0.02256317689530686
Music;Music & Video : 0.02256317689530686
Entertainment;Pretend Play : 0.02256317689530686
Casual;Education : 0.02256317689530686
Board;Action & Adventure : 0.02256317689530686
Video Players & Editors;Creativity : 0.01128158844765343
Trivia;Education : 0.01128158844765343
Travel & Local;Action & Adventure : 0.01128158844765343
Tools;Education : 0.01128158844765343
Strategy;Education : 0.01128158844765343
Strategy;Creativity : 0.01128158844765343
Strategy;Action & Adventure : 0.01128158844765343
Simulation;Education : 0.01128158844765343
Role Playing;Brain Games : 0.01128158844765343
Racing;Pretend Play : 0.01128158844765343
Puzzle;Education : 0.01128158844765343
Parenting;Brain Games : 0.01128158844765343
Music & Audio;Music & Video : 0.01128158844765343
Lifestyle;Pretend Play : 0.01128158844765343
Lifestyle;Education : 0.01128158844765343
Health & Fitness;Education : 0.01128158844765343
Health & Fitness;Action & Adventure : 0.01128158844765343
Entertainment;Education : 0.01128158844765343
Communication;Creativity : 0.01128158844765343
Comics;Creativity : 0.01128158844765343
Casual;Music & Video : 0.01128158844765343
Card;Action & Adventure : 0.01128158844765343
Books & Reference;Education : 0.01128158844765343
Art & Design;Pretend Play : 0.01128158844765343
Art & Design;Action & Adventure : 0.01128158844765343
Arcade;Pretend Play : 0.01128158844765343
Adventure;Education : 0.01128158844765343
In [23]:
display_table(android_final, 1)
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0.6430505415162455
COMICS : 0.6204873646209386
BEAUTY : 0.5979241877256317

The Category and Genres outputs for Android show a different picture: many of the most common genres are designed for practical purposes, such as tools, education, and business, although games are also fairly common. I would suggest entertainment apps for both Google Play and the App Store, although the frequency of a genre still does not tell us the number of users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
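
As a sketch of this calculation, the averages can be accumulated in a single pass over the rows with two dictionaries. Here average_by_group is a hypothetical helper and the rows are toy data; for the App Store data the group would be column 11 (prime_genre) and the value would be column 5 (rating_count_tot).

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column} for the given rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: genre, total number of ratings (stored as strings, as in the CSV).
toy_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

averages = average_by_group(toy_rows, 0, 1)
print(averages)
```

A single pass visits each row once, while a nested loop scans the whole data set once per genre. On data sets of this size both approaches are fast enough.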

In [24]:
genre_ios = freq_table(ios_final, 11)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
Travel : 28243.8
Navigation : 86090.33333333333
Education : 7003.983050847458
Catalogs : 4004.0
Photo & Video : 28441.54375
Games : 22788.6696905016
Finance : 31467.944444444445
Business : 7491.117647058823
Social Networking : 71548.34905660378
Book : 39758.5
Utilities : 18684.456790123455
Productivity : 21028.410714285714
News : 21248.023255813954
Sports : 23008.898550724636
Weather : 52279.892857142855
Shopping : 26919.690476190477
Music : 57326.530303030304
Medical : 612.0
Food & Drink : 33333.92307692308
Entertainment : 14029.830708661417
Lifestyle : 16485.764705882353
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384

The output shows that Navigation apps have the highest average number of user ratings, although people do not normally spend much time in navigation apps. Social Networking and Finance apps also have high numbers of user ratings, and these are apps that most people spend a lot of time in and use every day.

In [25]:
display_table(android_final, 5)
1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343

The purpose of the code above is to help determine which genre of app we can recommend for Google Play, but the output is ambiguous: the install counts are open-ended, so we do not know whether 10,000+ means exactly 10,000 installs or many more. For our computation we remove the , and + characters with the str.replace() method, convert the value to float, and compute the average number of installs for each category. The In [29] cell below does this.
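
A minimal sketch of that conversion on its own, assuming every value looks like the ones printed above (digits, optional commas, and an optional trailing +). The name parse_installs is a hypothetical helper and does not appear in the notebook.

```python
def parse_installs(text):
    """Convert an Installs string such as '10,000+' to a float.

    The result is the lower bound of the bucket, not an exact install count.
    """
    cleaned = text.replace(',', '').replace('+', '')
    return float(cleaned)


for raw in ['0', '0+', '10,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Because every app is counted at the lower bound of its bucket, the averages computed from these values should be read as rough indicators rather than precise install counts.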

In [29]:
android_category = freq_table(android_final, 1)

for category in android_category:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
            
LIBRARIES_AND_DEMO : 638503.734939759
TOOLS : 10801391.298666667
GAME : 15588015.603248259
PARENTING : 542603.6206896552
SOCIAL : 23253652.127118643
COMMUNICATION : 38456119.167247385
MEDICAL : 120550.61980830671
TRAVEL_AND_LOCAL : 13984077.710144928
EDUCATION : 1833495.145631068
SPORTS : 3638640.1428571427
MAPS_AND_NAVIGATION : 4056941.7741935486
ART_AND_DESIGN : 1986335.0877192982
PRODUCTIVITY : 16787331.344927534
BEAUTY : 513151.88679245283
FAMILY : 3695641.8198090694
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
COMICS : 817657.2727272727
FOOD_AND_DRINK : 1924897.7363636363
HOUSE_AND_HOME : 1331540.5616438356
BUSINESS : 1712290.1474201474
FINANCE : 1387692.475609756
AUTO_AND_VEHICLES : 647317.8170731707
NEWS_AND_MAGAZINES : 9549178.467741935
PHOTOGRAPHY : 17840110.40229885
EVENTS : 253542.22222222222
HEALTH_AND_FITNESS : 4188821.9853479853
SHOPPING : 7036877.311557789
DATING : 854028.8303030303
PERSONALIZATION : 5201482.6122448975
BOOKS_AND_REFERENCE : 8767811.894736841
ENTERTAINMENT : 11640705.88235294
LIFESTYLE : 1437816.2687861272

The output suggests that the Social, Game, Finance, and Entertainment genres have the potential to be profitable on both the App Store and Google Play. They are highly recommendable.