Audible Book Recommender

Finding similar audiobooks with a simple CountVectorizer text model on the Audible dataset


In [1]:

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:

audible_data = pd.read_csv("https://github.com/sparsh-ai/reco-data/raw/audible/audible/audible.csv",
                           encoding='latin1')
audible_data.head()

Out[2]:

	Book Title	Book Subtitle	Book Author	Book Narrator	Audio Runtime	Audiobook_Type	Categories	Rating	Total No. of Ratings	Price	Review 1	Review 2	Review 3	Review 4	Review 5	Review 6	Review 7	Review 8	Review 9	Review 10	Review 11	Review 12	Review 13	Review 14	Review 15	Review 16	Review 17	Review 18	Review 19	Review 20	Review 21	Review 22	Review 23	Review 24	Review 25	Review 26	Review 27	Review 28	Review 29	Review 30	...	Review 61	Review 62	Review 63	Review 64	Review 65	Review 66	Review 67	Review 68	Review 69	Review 70	Review 71	Review 72	Review 73	Review 74	Review 75	Review 76	Review 77	Review 78	Review 79	Review 80	Review 81	Review 82	Review 83	Review 84	Review 85	Review 86	Review 87	Review 88	Review 89	Review 90	Review 91	Review 92	Review 93	Review 94	Review 95	Review 96	Review 97	Review 98	Review 99	Review100
0	Bamboozled by Jesus	How God Tricked Me into the Life of My Dreams	Yvonne Orji	Yvonne Orji	6 hrs and 31 mins	Unabridged Audiobook	Biographies & Memoirs	5	47.0	$29.65	Thank you for being obedient and sharing your ...	This book was amazing. What made it amazing wa...	The narration of the book by the author was a ...	I'm sending Yvonne a tilth because this was th...	Yvonne is truly amazing at blending scripture ...	I enjoyed this book immensely. Thank you for m...	This book really blessed my life. I pray that ...	I have enjoyed Yvonnes work on Insecure and he...	to quote my wife "I feel so seen!" Yvonne must...	This content was amazing and being a fan of Yv...	Already surrendered my life to Jesus but this!...	I loved this book. I finished it in 2 days. I ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Building Bridges	NaN	Marie Dunlop	Diane Books, Natalie Moore Williams, John Scou...	1 hr and 41 mins	Unabridged Audiobook	Literature & Fiction, Genre Fiction	5	1.0	$0.00	Recent old times brought to life	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	King of Scotland	Modern Plays	Iain Heggie	Liam Brennan	52 mins	Unabridged Audiobook	Literature & Fiction, Drama & Plays	Not rated yet	NaN	$0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Mrs G	NaN	Mike Tibbetts	Sarah Rose Graber, Brett Whitted	34 mins	Unabridged Audiobook	Literature & Fiction	5	1.0	$0.00	great story in 30 mins. you wont know who's si...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Signature	NaN	Bob Davidson	Sakshi Sharma, Lucy Goldie	36 mins	Unabridged Audiobook	Mystery, Thriller & Suspense, Mystery	Not rated yet	NaN	$0.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 110 columns

In [3]:

audible_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2275 entries, 0 to 2274
Columns: 110 entries, Book Title to Review100
dtypes: float64(1), object(109)
memory usage: 1.9+ MB

In [4]:

# Keep four columns: title, author, narrator and categories (genre)
audible_data = audible_data[['Book Title', 'Book Author', 'Book Narrator', 'Categories']]

# Drop records with a missing 'Categories' or 'Book Narrator' value;
# .copy() avoids SettingWithCopyWarning on the assignments below
audible_data = audible_data.dropna(subset=['Categories', 'Book Narrator']).copy()

# Categories: lowercase, treat ' &' as a comma, drop the word 'genre', split on commas
audible_data['Categories'] = audible_data['Categories'].map(
    lambda x: x.lower().replace(' &', ',').replace('genre', '').split(','))
# Book Author: lowercase and remove spaces so each full name becomes a single token (kept as a string)
audible_data['Book Author'] = audible_data['Book Author'].map(lambda x: x.lower().replace(' ', ''))
# Book Narrator: same normalization, then split on commas into a list of narrators
audible_data['Book Narrator'] = audible_data['Book Narrator'].map(
    lambda x: x.lower().replace(' ', '').split(','))
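Removing the spaces inside names matters because CountVectorizer splits text into word tokens. The following is a minimal sketch of that effect; the two names are arbitrary placeholders rather than values from the dataset, and `vocabulary` is a hypothetical helper introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer


def vocabulary(docs):
    """Return the sorted list of tokens CountVectorizer learns from docs."""
    return sorted(CountVectorizer().fit(docs).vocabulary_)


# With spaces, the shared first name becomes a token common to both people
with_spaces = vocabulary(["jane doe", "jane roe"])
# Without spaces, each person is one distinct token
joined = vocabulary(["janedoe", "janeroe"])

print(with_spaces)  # ['doe', 'jane', 'roe']
print(joined)       # ['janedoe', 'janeroe']
```

With the spaces kept, two unrelated people who share a first name would look partly similar; joining the name avoids that.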

In [5]:

# Use 'Book Title' as the index
audible_data.set_index('Book Title', inplace=True)

def to_text(value):
    # List-valued fields are joined with spaces; plain strings pass through unchanged
    return ' '.join(value) if isinstance(value, list) else value

# Concatenate the author, narrator and category tokens into one string per book
feature_cols = ['Book Author', 'Book Narrator', 'Categories']
audible_data['bag_of_words'] = audible_data[feature_cols].apply(
    lambda row: ' '.join(to_text(row[col]) for col in feature_cols), axis=1)

# Keep only the 'bag_of_words' column
audible_data = audible_data[['bag_of_words']]

In [8]:

# Turn each book's bag of words into a vector of token counts (English stop words removed)
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(audible_data['bag_of_words'])

# Pairwise cosine similarity: cosine_sim2[i, j] is the similarity between books i and j
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
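To make the similarity score concrete, here is a small hand-rolled sketch of cosine similarity on word-count vectors. It is an illustration under simplifying assumptions, not the scikit-learn implementation: it tokenizes on whitespace only and removes no stop words, and the three bag-of-words strings are made-up examples.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    # Dot product over the words the two strings share
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up bags of words: author token, narrator token, category tokens
book_a = "authorone narratorone fantasy fiction"
book_b = "authortwo narratorone fantasy fiction"
book_c = "authorthree narratortwo business money"

print(cosine(book_a, book_b))  # 3 shared tokens / (2 * 2) = 0.75
print(cosine(book_a, book_c))  # no shared tokens -> 0.0
```

Books that share a narrator and categories score high even when the authors differ, which is the behavior the recommender relies on.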

In [37]:

def recommend(k=5):
    # Pick a random audiobook; `indices` maps row position -> title
    indices = pd.Series(audible_data.index)
    idx = indices.sample(1)
    pos = idx.index[0]

    # Similarity scores for that book in descending order, excluding the book itself
    score_series = pd.Series(cosine_sim2[pos]).drop(pos).sort_values(ascending=False)

    # Row positions of the k most similar audiobooks
    top_k_indexes = list(score_series.iloc[:k].index)

    topk = indices[top_k_indexes].tolist()

    print("For '{}', Top {} similar audiobooks are {}".format(idx.values[0], k, topk))
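The function above samples a random book. A variant that looks up a given title could be sketched as follows. Everything here is an assumption made for illustration: `recommend_for` is a hypothetical helper, and `toy` is a made-up stand-in for `audible_data` so that the block runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for audible_data: titles as index, one 'bag_of_words' column (made-up values)
toy = pd.DataFrame(
    {"bag_of_words": [
        "authorone narratorone fantasy fiction",
        "authortwo narratorone fantasy fiction",
        "authorthree narratortwo business money",
        "authorfour narratortwo business careers",
    ]},
    index=["Book A", "Book B", "Book C", "Book D"],
)
sim = cosine_similarity(CountVectorizer().fit_transform(toy["bag_of_words"]))


def recommend_for(title, k=2):
    """Return the k titles most similar to `title` (hypothetical helper)."""
    titles = pd.Series(toy.index)
    matches = titles[titles == title]
    if matches.empty:
        raise KeyError(f"unknown title: {title!r}")
    pos = matches.index[0]  # first row position with this title
    # Drop the query book itself, then rank the rest by similarity
    scores = pd.Series(sim[pos]).drop(pos).sort_values(ascending=False)
    return titles[scores.iloc[:k].index].tolist()


print(recommend_for("Book A", k=1))  # ['Book B']
```

If a title occurs more than once in the data, this sketch uses the first match; the other copies can still show up among the results.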

In [38]:

recommend()

For 'The Hobbit', Top 5 similar audiobooks are ['A Wizard of Earthsea', 'Harold & the Purple Crayon', 'The Green Ember', 'Harry Potter and the Chamber of Secrets, Book 2', 'Harry Potter and the Prisoner of Azkaban, Book 3']

In [39]:

recommend()

For 'How to Win Friends and Influence People in the Digital Age', Top 5 similar audiobooks are ['How to Remember Names and Faces', '10万円から始める! 小型株集中投資で1億円 実践バイブル', '1,001 Ways to Engage Employees', 'Getting to Yes', '#1 Best Seller']

In [40]:

recommend()

For 'The Power Of: M.I.N.D', Top 5 similar audiobooks are ['The Power Of: M.I.N.D', 'Breaking the Habit of Being Yourself', 'Law of Attraction, Get Your Ex Back', '10 Things Every Woman Needs to Know About Men', 'The Battlefield of the Mind']

In [41]:

recommend()

For 'Acts of Omission', Top 5 similar audiobooks are ['21st Birthday', 'Black Ice', 'Sycamore Row', 'A Lady Compromised', 'Deadly Cross']

In [42]:

recommend(10)

For 'Broken (in the Best Possible Way)', Top 10 similar audiobooks are ['Andrew Cunanan: Short Spree Killer and Versace Nemesis', 'Steve Jobs', 'Say Nothing', 'Billion Dollar Loser', 'Unbroken', 'Nothing Personal', 'The Immortal Life of Henrietta Lacks', 'The Splendid and the Vile', 'Red Notice', 'We Few']