In this project we help our employer, an app developer, analyze data on free applications in the Google Play store and the Apple App Store to see which ones attract users.
*Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store.*
- Import the reader() function from the csv module.
- Use the open() function to open each file: pass the file name to the function as a string and store the result in a variable, using "and" in the name for Android and "ios" for Apple.
- Pass each opened file to the reader() function and store the result in a variable, again using "and" for Android and "ios" for Apple.
- Pass each reader object to the list() function to read the file into a list of rows, then store each list in a variable (android and apple in the code that follows).
from csv import reader
and_file = open('googleplaystore.csv', encoding="utf8")
ios_file = open('AppleStore.csv', encoding="utf8")
r_ios_file = reader(ios_file)
r_and_file = reader(and_file)
apple = list(r_ios_file)
android = list(r_and_file)
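The loading steps above leave both files open. As a minimal sketch, assuming each file has a header row, the same steps can be wrapped in a helper that closes the file automatically. The name load_csv is hypothetical and is not used elsewhere in this notebook.

```python
from csv import reader


def load_csv(path):
    """Hypothetical helper: read a CSV file and return (header, rows)."""
    # The with-block closes the file even if reading fails part-way.
    # newline="" is what the csv module recommends for files passed to reader().
    with open(path, encoding="utf8", newline="") as f:
        data = list(reader(f))
    # Assumes the file is not empty and its first row is the header.
    return data[0], data[1:]


# Example usage (file names as used in this notebook):
# android_header, android_rows = load_csv('googleplaystore.csv')
# apple_header, apple_rows = load_csv('AppleStore.csv')
```

Returning the header separately is a design choice; the rest of this notebook keeps the header as the first row of android and apple and slices it off with [1:].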
*Creating a function called explore_data() that will be used to look through each dataset.*
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')
*PRINTING THE FIRST TWO ROWS USING THE explore_data() FUNCTION.*
# explore_data() prints its output and returns None, so there is nothing to assign
explore_data(dataset=android, start=1, end=3, rows_and_columns=True)
explore_data(dataset=apple, start=1, end=3, rows_and_columns=True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
Number of rows: 10842
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
Number of rows: 7198
Number of columns: 16
*PRINTING THE COLUMN NAMES FOR BOTH DATASETS*
apple_column_names = apple[0]
android_column_names = android[0]
print(apple_column_names)
print("\n")
print(android_column_names)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
We do data cleaning before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.
Iterate through the Google Play list with a for loop to check each row for missing or wrong entries: if the length (len()) of the row is not equal (!=) to the length of the first row, which is the header row, print() the index of the row (name_of_list.index(row)) so that it is easy to locate and so that we know how many such rows there are.
After finding a row with a missing or wrong entry, you can either correct it or delete it with the del statement and the row's index, that is, del list_name[index].
for row in android:
    if len(row) != len(android[0]):
        print(android.index(row))
        print(len(row))
        print("\n")
# The loop above finds rows whose length differs from the header row
print(len(android[0])) # the length each row is supposed to have
print("\n")
print(android[10473]) # inspect the row to verify the problem
# The next step is to delete this row with the del statement
10473
12
13
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Deleting the row with the wrong entry
print(len(android)) # initial length of the list
del android[10473] # run this only once, or a valid row will be deleted too
print(len(android)) # final length of the list after deleting the row with the wrong entry
10842
10841
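The row-length check described above can also be sketched with enumerate(), which yields each row's position directly. This avoids list.index(row), which returns the position of the first equal row and so could report the wrong index if two malformed rows were identical. The function name find_malformed_rows and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected length
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the row at index 2 is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(sample))  # [(2, 2)]
```

If several rows had to be deleted, deleting from the highest index to the lowest would keep the remaining indexes valid.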
As part of data cleaning, we check each dataset for applications that appear more than once and remove the duplicates, keeping only one entry per app.
We start by creating two empty lists, one for duplicated apps (duplicate = []) and one for unique apps (unique = []). We then iterate through the dataset: if the app name is not yet in the unique list we append it there, and if it is already there we append it to the duplicate list. This uses an if/else statement together with the in operator, which checks for membership.
Next we choose a criterion for removing the duplicates. For this analysis we use a value that differs between the duplicate rows of an app, the number of reviews: the row with the most reviews carries the most up-to-date data, so we keep that row and delete the others. This ensures that the analysis uses the most recent and valid data for every application that has duplicates.
unique = []
duplicate = []
for column in android[1:]:
    name = column[0]
    if name in unique:
        duplicate.append(name)
    else:
        unique.append(name)
# print(unique)
# print("\n")
# print(duplicate)
# print("\n")
print(len(duplicate))
1181
The first step in removing the duplicates is to create a dictionary that stores each app's name as the key and its highest number of reviews as the value.
reviews_max = {}
for column in android[1:]:
    name = column[0]
    n_reviews = float(column[3])
    # keep the highest review count seen so far for this app
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))
9659
Using the dictionary created above to remove the duplicate rows.
Here we use the reviews_max dictionary, which holds exactly one entry per app. We create two empty lists, android_clean and already_added. Then we iterate through the Android dataset and store the name and the number of reviews of each row in the variables name and n_reviews. With an if statement we check whether n_reviews equals the value stored for that name in reviews_max and whether the name is not yet in already_added; if both hold, we append the row to android_clean and the name to already_added. The already_added check is needed because several duplicate rows of the same app can share the same highest number of reviews, and without it all of them would be kept.
android_clean = []
already_added = []
for row in android[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
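The two-pass approach described above can be sketched on a small made-up dataset to show which rows survive. The function name remove_duplicates, its parameters and the sample rows are assumptions for illustration; the sketch also swaps the already_added list for a set, since membership checks on a set do not slow down as it grows.

```python
def remove_duplicates(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row holding that app's highest review count."""
    # Pass 1: highest review count per app name.
    reviews_max = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row that matches the maximum for its name.
    clean = []
    already_added = set()
    for row in rows:
        name = row[name_idx]
        if float(row[reviews_idx]) == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean


# Made-up rows (header already removed): 'App A' appears three times,
# twice with the same highest review count.
sample = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App A', 'TOOLS', '4.1', '120'],
    ['App A', 'TOOLS', '4.1', '120'],
    ['App B', 'GAME', '3.9', '50'],
]
for row in remove_duplicates(sample):
    print(row)
# ['App A', 'TOOLS', '4.1', '120']
# ['App B', 'GAME', '3.9', '50']
```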
Exploring the data in the android_clean list with the explore_data() function.
explore_data(dataset= android_clean, start=0, end=3)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
Creating a function that detects whether an app name is made up mostly of common English characters.
Every character has a corresponding number, which the ord() function returns. In the ASCII system, the characters commonly used in English text have numbers in the range 0 to 127, so a character with a number greater than 127 falls outside that set.
- Write a function named check_if_english() that takes a string, a_string, as its argument.
- Create a variable called non_ascii and assign it a value of zero.
- Iterate over the string, find each character's number with the ord() function and store it in a variable, value. With a conditional statement, increment non_ascii by one if value is greater than 127.
- After the for loop, use a conditional statement to return False if non_ascii is greater than 3, and True otherwise.
Allowing up to three such characters keeps names that are mostly English but contain a few characters with values above 127, such as emoji or the ™ symbol.
def check_if_english(a_string):
    non_ascii = 0
    for character in a_string:
        value = ord(character)
        if value > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True
Testing our function on some strings
print(check_if_english(a_string="instagram"))
print(check_if_english(a_string='爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_if_english(a_string='Docs To Go™ Free Office Suite'))
print(check_if_english(a_string='Instachat 😜'))
True
False
True
True
Filtering the android_clean and apple datasets with the new function to keep only the apps with English names
android_eng_list = []
apple_eng_list = []
#android
for row in android_clean:
    name = row[0]
    if check_if_english(a_string=name):
        android_eng_list.append(row)
#apple
for row in apple[1:]:
    name = row[1]
    if check_if_english(a_string=name):
        apple_eng_list.append(row)
print(android_eng_list[:3])
print("\n")
print(apple_eng_list[:3])
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]
# Keep only the free apps: the price is stored as the string "0" in the Google Play data and "0.0" in the App Store data
clean_free_android = []
clean_free_apple = []
#android
for row in android_eng_list:
    price = row[7]
    if price == "0":
        clean_free_android.append(row)
#apple
for row in apple_eng_list:
    price = row[4]
    if price == "0.0":
        clean_free_apple.append(row)
print(len(clean_free_android))
print(len(clean_free_apple))
8862
3222
Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue. Our validation strategy for an app idea has three steps:
- We start by building a test version and adding it to Google Play to see how people respond to the beta.
- If the reviews and ratings show a good response, we develop the app further, adding upgrades and features based on the analysis done.
- Once the application is profitable on Google Play, we also make it available in other application markets, starting with the App Store.
Because our end goal is to have the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets.
Creating a frequency table to determine the most common genre of apps from our datasets
def freq_table(dataset, index):
    table = {}
    total = 0
    for column in dataset:
        genre = column[index]
        total += 1
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    table_percent = {}
    for key in table:
        value = table[key]
        percentage = (value/total)*100
        table_percent[key] = percentage
    return table_percent
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
display_table(dataset=clean_free_android, index=1)
print("\n")
display_table(dataset=clean_free_apple, index=11)
FAMILY : 18.788083953960733
GAME : 9.636650868878357
TOOLS : 8.440532611148726
BUSINESS : 4.581358609794629
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5319341006544795
SPORTS : 3.419092755585647
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.2498307379823967
HEALTH_AND_FITNESS : 3.0692845858722633
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2525389302640486
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.0381403746332656
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0.6544798013992327
COMICS : 0.6206273978785828
BEAUTY : 0.598059128864816

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
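For comparison, a percentage frequency table like the one above can be sketched with collections.Counter from the standard library. This is an alternative under the same assumptions (each row is a list and the value of interest sits at a fixed index); the name freq_table_pct and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Return {value: percentage of rows with that value} for the given column index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Made-up sample: three of the four apps are in the Games genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Travel'], ['d', 'Games']]
table = freq_table_pct(sample, 1)
# Sort by percentage, largest first, as display_table() does.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
# Games : 75.0
# Travel : 25.0
```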
ANALYSIS FROM THE APP STORE FREQUENCY TABLE OBTAINED FROM THE prime_genre COLUMN
The most common genre is Games, followed by Entertainment.
The least common genres are Catalogs, Medical and Navigation.
Most of the apps are created for entertainment: the combined percentage of the entertainment genres (games, photo and video, social networking, sports, music) is far higher than that of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle).
Based on the frequency table, I would recommend an app profile under the Travel genre. The large number of apps in the Games genre shows that the competition for good gaming apps is high.
ANALYSIS FROM THE GOOGLE PLAY FREQUENCY TABLE OBTAINED FROM THE Category COLUMN
The most common genre here is Family.
The least common are the Comics and Beauty genres, while the second most common is the Game genre.
Comparing the two, I see similarities between Google Play and the App Store in that most apps are also created for entertainment. The difference is that the games genre ranks first on the App Store but second on Google Play.
I would recommend an app profile under the Finance genre. The frequency table shows which genres occur most often.
Analysis of the rating_count_tot column of the App Store dataset
We use rating_count_tot because, unlike the Google Play dataset, the App Store dataset has no column stating the number of installs. We use a nested for loop along with a conditional statement to compute avg_user_rating, the average number of user ratings per app, for each genre in genre_ios, the frequency table returned by our freq_table() function.
- The frequency table stored in the genre_ios variable is a dictionary whose keys are the genres; we iterate over it with a for loop.
- Inside the for loop we create two variables and assign zero to them: total, which accumulates the number of user ratings of the apps in the current genre, and len_genre, which counts the apps in that genre.
- We then iterate over the App Store dataset and read each app's genre into genre_app. If genre_app is equal to genre, we read user_rating from the rating_count_tot column, add it to total and increase len_genre by 1.
- After the inner loop, we calculate avg_user_rating by dividing total by len_genre.
genre_ios = freq_table(dataset=clean_free_apple, index=11)
for genre in genre_ios:
    total = 0
    len_genre = 0
    for column in clean_free_apple:
        genre_app = column[11]
        if genre_app == genre:
            user_rating = float(column[5])
            total += user_rating
            len_genre += 1
    avg_user_rating = total/len_genre
    print(genre, ":", avg_user_rating)
Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
I would recommend an app profile in the Navigation genre, which has the highest average number of user ratings.
Printing out some applications under the Navigation genre.
for column in clean_free_apple:
    if column[11] == 'Navigation':
        print(column[1], ':', column[5])
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
Analysis of the Installs column of the Google Play dataset, following a process similar to the one described for the App Store dataset.
The difference is that the values in the Installs column are strings such as '10,000+'. Before converting them to floats we have to remove the non-numeric characters, the '+' symbol and the ',', which we do with the str.replace(old, new) method.
android_genre = freq_table(clean_free_android, 1)
for category in android_genre:
    total = 0
    len_category = 0
    for column in clean_free_android:
        category_app = column[1]
        if category_app == category:
            installs = column[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1
    avg_installs = total/len_category
    print(category, ':', avg_installs)
ART_AND_DESIGN : 1905351.6666666667
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1704192.3399014778
COMICS : 817657.2727272727
COMMUNICATION : 38326063.197916664
DATING : 854028.8303030303
EDUCATION : 3057207.207207207
ENTERTAINMENT : 19428913.04347826
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4167457.3602941176
HOUSE_AND_HOME : 1313681.9054054054
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 13006872.892271662
FAMILY : 4371709.123123123
MEDICAL : 107167.23322683707
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 4274688.722772277
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10695245.286096256
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16772838.591304347
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24790074.17721519
NEWS_AND_MAGAZINES : 9549178.467741935
MAPS_AND_NAVIGATION : 4056941.7741935486
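The per-category averages above loop over the whole dataset once for every category. As a rough sketch, the same averages can be accumulated in a single pass with two dictionaries. The helper names parse_installs and average_by_key, and the sample rows, are assumptions made for illustration.

```python
from collections import defaultdict


def parse_installs(text):
    """Turn an installs string such as '10,000+' into a float."""
    return float(text.replace('+', '').replace(',', ''))


def average_by_key(dataset, key_idx, value_idx, convert=float):
    """Average the converted values at value_idx, grouped by the value at key_idx."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        key = row[key_idx]
        totals[key] += convert(row[value_idx])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up rows: name, category, installs.
sample = [
    ['App A', 'TOOLS', '1,000+'],
    ['App B', 'TOOLS', '3,000+'],
    ['App C', 'GAME', '500+'],
]
print(average_by_key(sample, key_idx=1, value_idx=2, convert=parse_installs))
# {'TOOLS': 2000.0, 'GAME': 500.0}
```

With the default convert=float, the same helper would also fit the App Store case, where the values in the rating_count_tot column are plain numeric strings.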
I would recommend an app profile under the COMMUNICATION genre, which has the highest average number of installs.
Printing some apps under the BOOKS_AND_REFERENCE genre
for column in clean_free_android:
    if column[1] == 'BOOKS_AND_REFERENCE' and (column[5] == '5,000,000+'):
        print(column[0], ':', column[5])
AlReader -any text book reader : 5,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
Ancestry : 5,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Bible KJV : 5,000,000+
English to Hindi Dictionary : 5,000,000+
From what we observe in both the App Store and Google Play datasets, I would recommend an application profile in the BOOKS_AND_REFERENCE genre. The data show the Reference genre also thriving on the App Store. A likely reason is that many smartphone users are students and workers who read and look things up when presented with new information and want to expand their knowledge of it.
Fully aware of the competition here, we can step into the BOOKS_AND_REFERENCE genre by bringing new features that other applications do not have. Examples of these features are:
- Built-in dictionaries.
- An active room for questions and feedback on numerous topics, common and uncommon, linking people with common interests.
- A time and history tracker.