The goal of this project is to analyze data that will help developers understand the types of apps that are likely to attract more users.
###The apple dataset###
from csv import reader
open_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(open_file)
ios_app = list(read_file)
ios_app_header = ios_app[0]
ios_app = ios_app[1:]
###The Google dataset###
from csv import reader
open_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(open_file)
android_app = list(read_file)
android_app_header = android_app[0]
android_app = android_app[1:]
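The two loading cells above repeat the same steps and never close the files. A minimal sketch of a shared helper follows; `load_dataset` is a hypothetical name that is not part of the original notebook, and it assumes the first row of each file is the header.

```python
from csv import reader

def load_dataset(path):
    """Return (header, rows) for a CSV file whose first row is the header."""
    # The with-block closes the file even if reading fails.
    # newline='' is what the csv module documentation recommends for reader().
    with open(path, encoding='utf8', newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Intended use (assumes the CSV files sit next to the notebook):
# ios_app_header, ios_app = load_dataset('AppleStore.csv')
# android_app_header, android_app = load_dataset('googleplaystore.csv')
```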
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print(ios_app_header)
print('\n')
explore_data(ios_app, 0, 3, True)
The iOS dataset has 7197 rows and 16 columns. The columns price, rating_count_tot, user_rating, and prime_genre will be useful for our analysis. Readers can get the full documentation here
print(android_app_header)
print('\n')
explore_data(android_app, 0,3,True)
The Android dataset has 10841 rows and 13 columns. For our analysis, the columns App, Category, Reviews, Installs, Price, and Genres will be useful. Readers can get the full documentation here
for row in android_app:
    if len(row) != len(android_app_header):
        print(row)
        print(android_app.index(row))
del android_app[10472]
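Deleting by a hard-coded index works once, but running the cell a second time would delete a different, valid row. As an alternative sketch, malformed rows can be filtered out by length instead. `drop_malformed_rows` is a hypothetical helper, and the toy rows are made up for illustration.

```python
def drop_malformed_rows(rows, header):
    """Keep only rows that have as many fields as the header."""
    return [row for row in rows if len(row) == len(header)]

header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # a field is missing, so the columns are shifted
    ['App C', 'GAME', '4.5'],
]
clean = drop_malformed_rows(rows, header)
print(len(clean))  # 2
```

Because the filter builds a new list, it can be re-run safely: applying it to already clean data returns the same rows.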
for app in android_app:
    name = app[0]
    if name == "Facebook":
        print(app)
The output shows that the Facebook app appears in two rows, so it is a duplicate entry. For an accurate analysis, we have to filter out all duplicate apps.
duplicate_apps = []
unique_apps = []
for app in android_app:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print("Number of duplicate apps: ", len(duplicate_apps))
print('\n')
print("Examples of duplicate apps: ", duplicate_apps[:10])
The output shows 1181 duplicate entries. We will not remove duplicates at random; instead, for each app we keep the row with the highest number of reviews.
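The duplicate count can be cross-checked with `collections.Counter`, which avoids the repeated list lookups in the loop above. This is a sketch on made-up rows; `count_duplicates` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return how many rows are surplus copies of an already-seen name."""
    counts = Counter(row[name_index] for row in rows)
    # An app listed n times contributes n - 1 duplicates.
    return sum(n - 1 for n in counts.values())

rows = [['Facebook'], ['Instagram'], ['Facebook'], ['Facebook']]
print(count_duplicates(rows))  # 2
```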
print("Expected length: ", len(android_app) - 1181)
reviews_max = {}
for app in android_app:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print("Actual length: ", len(reviews_max))
android_clean = []
already_added = []
for app in android_app:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print("The clean data Length : ", len(android_clean))
The deduplication works in two steps. In the first step, I created an empty dictionary, reviews_max, and filled it with key-value pairs that map each app name to the highest number of reviews found for that app. As mentioned above, the number of reviews is the criterion for our data cleaning: for each duplicated app, only the row with the most reviews is kept. In the second step, I used the dictionary from step one to remove the duplicate rows. The android_clean list stores the cleaned dataset, and the already_added list keeps track of the apps already added, so that rows tied on the maximum review count are not added twice.
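The two steps described above can be traced on a toy dataset. The sketch below wraps them in a hypothetical function, `dedupe_by_max_reviews`, and uses a set rather than a list for the already-added names (an assumption made here for faster membership checks). The rows are invented for illustration.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=1):
    """Keep one row per app name: the one with the most reviews."""
    # Step 1: record the highest review count seen for each name.
    reviews_max = {}
    for row in rows:
        name, n_reviews = row[name_index], float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Step 2: keep the first row that matches that maximum.
    clean, already_added = [], set()
    for row in rows:
        name, n_reviews = row[name_index], float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

rows = [
    ['Facebook', '100'],
    ['Facebook', '250'],
    ['Facebook', '250'],   # tie on the maximum: only the first is kept
    ['Instagram', '80'],
]
print(dedupe_by_max_reviews(rows))
# [['Facebook', '250'], ['Instagram', '80']]
```

Without the already-added check, both tied Facebook rows would survive step 2, which is why the second condition is needed.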
def is_english(string):
    for s in string:
        if ord(s) > 127:
            return False
    return True
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
def is_english(string):
    non_Ascii = 0
    for s in string:
        if ord(s) > 127:
            non_Ascii += 1
    if non_Ascii > 3:
        return False
    else:
        return True
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
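The second version of the function tolerates up to three non-ASCII characters, so names containing a trademark sign or an emoji are no longer rejected. A compact sketch of the same idea, with the threshold exposed as a parameter, is shown here. `is_mostly_ascii` and `max_non_ascii` are hypothetical names chosen so that the notebook's `is_english` is not overwritten.

```python
def is_mostly_ascii(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in string if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Docs To Go™ Free Office Suite'))   # True
print(is_mostly_ascii('Instachat 😜'))                     # True
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # False
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))    # False
```

The filter is a heuristic: it can still admit a few non-English names and reject English names with many symbols.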
android_english = []
ios_english = []
for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
for app in ios_app:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
print("Remaining Android app rows: ", len(android_english))
print("Remaining iOS app rows: ", len(ios_english))
ios_final = []
android_final = []
for app in android_english:
    app_price = app[7]
    if app_price == '0':
        android_final.append(app)
for app in ios_english:
    app_price = app[4]
    if app_price == '0.0':
        ios_final.append(app)
print("length of android final is: ", len(android_final))
print("length of ios_final is : ", len(ios_final))
Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by this. To achieve this, we developed a validation strategy that minimizes overhead and risk. Our validation strategy comprises these three steps:
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    frequency_table_percentage = {}
    for t in frequency_table:
        percentage = (frequency_table[t] / total) * 100
        frequency_table_percentage[t] = percentage
    return frequency_table_percentage
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
display_table(ios_final, 11)  # prime_genre column; display_table prints and returns None
The results show that Games is the most common genre among the free English iOS apps, followed by Entertainment and Photo & Video. The frequencies of the Education and Social Networking genres are close to each other. The general impression is that most of the apps are designed for entertainment. For the App Store I would recommend an entertainment app, but we cannot decide on this result alone: a large number of apps in a genre does not imply a large number of users.
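The percentages discussed above come from `freq_table` and `display_table`. As a sketch of the same computation in fewer lines, `collections.Counter` can count the values and return them already sorted. `percentage_table` is a hypothetical helper and the rows are made up for illustration.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return [(value, percentage)] for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() sorts by descending count.
    return [(value, count / total * 100) for value, count in counts.most_common()]

rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
for genre, pct in percentage_table(rows, 1):
    print(genre, ':', pct)
# Games : 75.0
# Education : 25.0
```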
display_table(android_final, -4)  # Genres column
display_table(android_final, 1)   # Category column
The output for the Android Category and Genres columns shows that the most common apps are designed for practical purposes such as shopping, education, and utilities, although the share of games is also fairly high. I would suggest entertainment apps for both Google Play and the App Store, although frequency alone still does not tell us the number of users.
One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing from the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
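As a toy illustration of the per-genre average described above, the sketch below accumulates totals and counts in a single pass over the rows instead of rescanning the dataset once per genre. `average_by_group` is a hypothetical helper and the numbers are invented.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} in a single pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Navigation', '50'],
]
print(average_by_group(rows, 0, 1))
# {'Games': 200.0, 'Navigation': 50.0}
```

A mean is sensitive to outliers: a single app with a very large rating count can dominate its genre's average.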
genre_ios = freq_table(ios_final, 11)
for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
The output shows that Navigation apps have the highest average number of user ratings, although people do not normally spend much time in navigation apps. Social Networking and Finance apps also have high numbers of user ratings, and these are the apps that most people use every day and spend the most time on.
display_table(android_final, 5)
The purpose of the code above is to determine the genre of app we can recommend for Google Play, but the output is ambiguous: we do not know whether 10,000+ means exactly 10,000 installs or more. For our computation we use the str.replace() method to remove the , and + characters, convert the value to a float, and compute the average number of installs for each category. This is done in the code below:
android_category = freq_table(android_final, 1)
for category in android_category:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
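The install-count cleaning used in the loop above can be isolated and checked on its own. A minimal sketch follows; `parse_installs` is a hypothetical helper, and it assumes every value consists only of digits, commas, and an optional plus sign.

```python
def parse_installs(text):
    """Convert a string such as '10,000+' to the float 10000.0."""
    return float(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))      # 10000.0
print(parse_installs('1,000,000+'))   # 1000000.0
print(parse_installs('0'))            # 0.0
```

The parsed number is a lower bound: an app listed as 10,000+ is counted as exactly 10,000 installs, so the category averages understate the true figures.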
The output shows that the Social, Game, Finance, and Entertainment genres have the potential to be profitable on both the App Store and Google Play. They are highly recommendable.