Merging datasets

cross-index all the things

  • toc: true
  • badges: true
  • comments: true
  • categories: [canon]
  • hide: false

There are too many different standard book numbering systems. Therefore, we need a new universal standard book numbering system...

But right now what I need is a list of ISBNs for book lists from different eras. I opened all these up on the work computer and then stupidly closed them and went home; I should have just done this here entirely. Basically I want a list for prehistory-1700, 1700-1900, 1900-1950, 1950-1990, 1990-2010 and 2010-2021. I don't know if I'll get equal amounts from each, but this distribution roughly matches the goodbooks-10k numbers and my intuitive sense of what people like to buy. It also plays into the idea of a narrowing of canon over time, as I wrote about in Canon Curation.
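To make the era split concrete, here is a minimal sketch of how the boundaries listed above could be turned into labels with `pd.cut`. The names `ERA_EDGES`, `ERA_LABELS` and `assign_era` are made up for illustration, the years are toy values rather than rows from my data, and treating each right-hand boundary as inclusive is an assumption.

```python
import pandas as pd

# Era boundaries from the list above; label strings are illustrative.
ERA_EDGES = [float("-inf"), 1700, 1900, 1950, 1990, 2010, 2021]
ERA_LABELS = ["prehistory-1700", "1700-1900", "1900-1950",
              "1950-1990", "1990-2010", "2010-2021"]

def assign_era(years):
    """Map publication years to era labels (right edge inclusive, an assumption)."""
    return pd.cut(years, bins=ERA_EDGES, labels=ERA_LABELS, right=True)

# Toy years, not taken from the real dataset.
toy_years = pd.Series([-800, 1605, 1813, 1925, 1955, 1997, 2015])
toy_eras = assign_era(toy_years)
```

Years that are missing, or later than 2021, come out as NaN, so they would need handling before counting books per era.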

I will need the ranked fiction canon, for the book titles, and my large books dataset for getting their ISBN by title. There might be more than one ISBN, but I think I can grab whichever edition has the most ratings and assume that's the one to get (if a buyer wants a less popular edition, they can figure that out). While I'm doing that I might as well grab the same type of data that I did for graphing goodbooks-10k, so I don't have to repeat all these steps later if I want it.
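The "most popular edition per title" idea can be sketched in a few lines. This is not the code used further down, just a toy illustration: the column names mirror the Goodreads dataset, but the rows, the placeholder ISBN strings and the helper name `most_rated_edition` are invented.

```python
import pandas as pd

# Toy editions table; the isbn values are placeholders, not real ISBNs.
toy_editions = pd.DataFrame({
    "title": ["Ulysses", "Ulysses", "Lolita", "Lolita"],
    "isbn": ["isbn-a", "isbn-b", "isbn-c", "isbn-d"],
    "ratings_count": [120.0, 78309.0, 500.0, 20.0],
})

def most_rated_edition(editions):
    """Keep, for each title, the row with the largest ratings_count."""
    # idxmax gives the row label of the maximum within each title group.
    best_rows = editions.groupby("title")["ratings_count"].idxmax()
    return editions.loc[best_rows].set_index("title")

best = most_rated_edition(toy_editions)
```

Grouping by title alone will lump together unrelated books that happen to share a name, so this is only as good as the title matching.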

In [1]:
import pandas as pd
In [2]:
large_df = pd.read_csv('../../records/cleaned_goodreads_books.csv')
In [3]:
large_df.tail()
Out[3]:
Unnamed: 0 Unnamed: 0.1 Unnamed: 0.1.1 isbn text_reviews_count series language_code popular_shelves asin average_rating ... publication_month publication_year url image_url book_id ratings_count work_id title top_genre author_name
1215978 1215978 1215978 2360645 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 ... 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts biography Beatrice Gormley
1215979 1215979 1215979 2360647 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ... 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride harlequin Trish Morey
1215980 1215980 1215980 2360651 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ... 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis mystery Arthur Conan Doyle
1215981 1215981 1215981 2360652 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 ... 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection poetry Nicola Baxter
1215982 1215982 1215982 2360653 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ... 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3) erotica S.E. Reign

5 rows × 28 columns

In [4]:
ranked_df = pd.read_csv('../assets/2021-07-27-found-canon.csv')
In [5]:
ranked_df
Out[5]:
Unnamed: 0 Title Listed count Author
0 1 Ulysses 51.0 Joyce, James
1 2 The Great Gatsby 50.0 F. Scott Fitzgerald
2 3 One Hundred Years of Solitude 44.0 Gabriel Garcia Marquez
3 4 Lolita 43.0 Vladimir Nabokov
4 5 Nineteen Eighty Four 42.0 Orwell, George
... ... ... ... ...
4409 4410 Decline of the West 1.0 O
4410 4411 The History of the Standard Oil Company 1.0 I
4411 4412 Theory of Games and Economic Behavior 1.0 J
4412 4413 AA Big Book 1.0 B
4413 4414 Behaviorism 1.0 J

4414 rows × 4 columns

Ooh, looking at the end of this dataframe it looks like my method for extracting Author information got messed up somehow. Better just delete that column and reconstruct it from the data in large_df anyway.

In [6]:
no_authors = ranked_df.drop(columns='Author')
no_authors
Out[6]:
Unnamed: 0 Title Listed count
0 1 Ulysses 51.0
1 2 The Great Gatsby 50.0
2 3 One Hundred Years of Solitude 44.0
3 4 Lolita 43.0
4 5 Nineteen Eighty Four 42.0
... ... ... ...
4409 4410 Decline of the West 1.0
4410 4411 The History of the Standard Oil Company 1.0
4411 4412 Theory of Games and Economic Behavior 1.0
4412 4413 AA Big Book 1.0
4413 4414 Behaviorism 1.0

4414 rows × 3 columns

In [23]:
# isbns = 
large_df['isbn']
Out[23]:
0          0312853122
1                 NaN
2          0743294297
3          0850308712
4          1599150603
              ...    
1215978    0689852959
1215979    0373126476
1215980    178092870X
1215981    0765197456
1215982    162378140X
Name: isbn, Length: 1215983, dtype: object
In [61]:
ulysseses = large_df.loc[large_df['title']=='Ulysses']

sorted_ulysseses = ulysseses.sort_values(by='ratings_count', ascending=False)

sorted_ulysseses.index = range(len(sorted_ulysseses))
In [138]:
sorted_ulysseses.loc[0]
Out[138]:
Unnamed: 0                                                       805796
Unnamed: 0.1                                                     805796
Unnamed: 0.1.1                                                  1563735
isbn                                                         0679722769
text_reviews_count                                              3674.00
series                                                               []
language_code                                                       eng
popular_shelves       [{'count': '3692', 'name': 'to-read'}, {'count...
asin                                                                NaN
average_rating                                                     3.74
similar_books         ['595038', '10543', '164434', '76527', '16111'...
description           The revised edition follows the complete and u...
format                                                        Paperback
link                  https://www.goodreads.com/book/show/338798.Uly...
authors                             [{'author_id': '5144', 'role': ''}]
publisher                                                       Vintage
num_pages                                                        810.00
isbn13                                                    9780679722762
publication_month                                                   NaN
publication_year                                                1990.00
url                   https://www.goodreads.com/book/show/338798.Uly...
image_url             https://images.gr-assets.com/books/1428891345m...
book_id                                                          338798
ratings_count                                                  78309.00
work_id                                                      2368224.00
title                                                           Ulysses
top_genre                                                      classics
author_name                                                 James Joyce
Name: 0, dtype: object

In [102]:
total_ulysses = sorted_ulysseses[['author_name', 'top_genre', 'format', 'publisher', 'description', 'num_pages', 'average_rating', 'ratings_count','text_reviews_count', 'publication_year','work_id', 'book_id', 'isbn', 'isbn13', 'asin']]
total_ulysses.loc[0]
Out[102]:
author_name                                                 James Joyce
top_genre                                                      classics
format                                                        Paperback
publisher                                                       Vintage
description           The revised edition follows the complete and u...
num_pages                                                        810.00
average_rating                                                     3.74
ratings_count                                                  78309.00
text_reviews_count                                              3674.00
publication_year                                                1990.00
work_id                                                      2368224.00
book_id                                                          338798
isbn                                                         0679722769
isbn13                                                    9780679722762
asin                                                                NaN
Name: 0, dtype: object

I can get most of the data I need here, including the ISBN of the most popular edition in cases like Ulysses where there are many printings. But the publication_year isn't the same as the original year of printing -- for that, we'll need to find the oldest copy with the same work_id as the popular one.
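As a rough sketch of that idea, the earliest `publication_year` per `work_id` can be computed in one pass with a groupby and joined back on. The frame below is toy data with invented values; only the column names come from the dataset.

```python
import pandas as pd

# Toy editions: three editions of one work (one with no year) and one of another.
toy = pd.DataFrame({
    "work_id": [1.0, 1.0, 1.0, 2.0],
    "publication_year": [1990.0, 1934.0, float("nan"), 2004.0],
    "ratings_count": [78309.0, 12.0, 5.0, 44.0],
})

# min() skips NaN, so editions with no recorded year are ignored.
first_year = (
    toy.groupby("work_id")["publication_year"].min().rename("first_pub_year")
)

# Attach the earliest year to every edition of the same work.
with_first = toy.join(first_year, on="work_id")
```

This only finds the earliest edition that happens to be in the dataset, which is not necessarily the true first publication of the work.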

In [100]:
u = total_ulysses.loc[total_ulysses['work_id']==2368224.0]

su = u.sort_values(by='publication_year', ascending=True)

su.index = range(len(su))
In [101]:
su.loc[0].publication_year
Out[101]:
1934.0

There it is! Now to capture all this logic in a simple function. I don't want to access large_df too often, and when I do it will be much faster to access by index rather than searching through every time. So I'll set the index to be titles, make a function for grabbing the right data by the title, then build a new dataframe with the titles in the same order as ranked_df and merge them.
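A minimal sketch of the index-based lookup, on toy data: the helper name `editions_for` is hypothetical, and sorting the index first is my assumption about what makes repeated lookups cheap.

```python
import pandas as pd

# Toy stand-in for large_df, with a repeated title.
toy = pd.DataFrame({
    "title": ["Ulysses", "Lolita", "Ulysses"],
    "ratings_count": [78309.0, 476744.0, 12.0],
})

# Index by title once; a sorted index lets pandas search it efficiently.
by_title = toy.set_index("title").sort_index()

def editions_for(title, indexed):
    """Return every edition with this exact title (an empty frame if none)."""
    if title not in indexed.index:
        return indexed.iloc[0:0]
    # The list form of .loc always returns a DataFrame, even for a single row.
    return indexed.loc[[title]]

ulysses_toy = editions_for("Ulysses", by_title)
```

Compared with filtering on `df['title'] == title` for every query, this builds the lookup structure once and reuses it.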

In [434]:
def get_best_from_editions_df(eds):
    """Return the most-rated edition's values as a list, with the earliest
    publication year among editions of the same work appended at the end."""
    # Most-rated edition first.
    sorted_eds = eds.sort_values(by='ratings_count', ascending=False).reset_index(drop=True)
    best = sorted_eds.loc[0].copy()

    # Earliest publication year among editions sharing the best edition's work_id.
    same_work_eds = sorted_eds.loc[sorted_eds['work_id'] == best['work_id']]
    first_pub_year = same_work_eds['publication_year'].min()

    best['first_pub_year'] = first_pub_year
    return list(best.values)
In [435]:
get_best_from_editions_df(ulysseses)
Out[435]:
[805796,
 805796,
 1563735,
 '0679722769',
 3674.0,
 '[]',
 'eng',
 "[{'count': '3692', 'name': 'to-read'}, {'count': '3424', 'name': 'classics'}, {'count': '2412', 'name': 'fiction'}, {'count': '621', 'name': 'literature'}, {'count': '365', 'name': 'currently-reading'}, {'count': '352', 'name': 'books-i-own'}, {'count': '305', 'name': 'novels'}, {'count': '299', 'name': 'irish'}, {'count': '297', 'name': 'ireland'}, {'count': '234', 'name': 'to-buy'}, {'count': '223', 'name': 'abandoned'}, {'count': '175', 'name': 'unfinished'}, {'count': '175', 'name': 'modernism'}, {'count': '167', 'name': '1001-books'}, {'count': '163', 'name': '20th-century'}, {'count': '146', 'name': 'default'}, {'count': '145', 'name': 'classics-to-read'}, {'count': '141', 'name': 'irish-literature'}, {'count': '137', 'name': 'library'}, {'count': '133', 'name': 'literary-fiction'}, {'count': '125', 'name': 'rory-gilmore-reading-challenge'}, {'count': '121', 'name': 'banned-books'}, {'count': '107', 'name': '1001'}, {'count': '97', 'name': 'classic-literature'}, {'count': '96', 'name': 'ebook'}, {'count': '94', 'name': 'english'}, {'count': '94', 'name': 'my-library'}, {'count': '94', 'name': 'to-read-classics'}, {'count': '93', 'name': 'my-ebooks'}, {'count': '88', 'name': 'did-not-finish'}, {'count': '87', 'name': 'modern-classics'}, {'count': '84', 'name': 'wish-list'}, {'count': '81', 'name': 'favourites'}, {'count': '78', 'name': 'literary'}, {'count': '78', 'name': 'ebooks'}, {'count': '73', 'name': 'joyce'}, {'count': '71', 'name': 'rory-gilmore-challenge'}, {'count': '71', 'name': 'my-books'}, {'count': '70', 'name': '50-books-to-read-before-you-die'}, {'count': '68', 'name': 'to-read-fiction'}, {'count': '66', 'name': 'on-hold'}, {'count': '65', 'name': 'classic-fiction'}, {'count': '64', 'name': 'irish-lit'}, {'count': '63', 'name': 'james-joyce'}, {'count': '63', 'name': '1001-books-to-read-before-you-die'}, {'count': '62', 'name': 'lit'}, {'count': '62', 'name': 'didn-t-finish'}, {'count': '58', 'name': 'rory-gilmore'}, {'count': '58', 'name': 
'must-read'}, {'count': '57', 'name': 'bbc-big-read'}, {'count': '56', 'name': 'gave-up-on'}, {'count': '55', 'name': 'school'}, {'count': '53', 'name': 'audiobooks'}, {'count': '53', 'name': 'classic-lit'}, {'count': '52', 'name': 'never-finished'}, {'count': '51', 'name': 'english-literature'}, {'count': '50', 'name': 'dnf'}, {'count': '50', 'name': 'top-100'}, {'count': '49', 'name': 'audio'}, {'count': '49', 'name': 'stream-of-consciousness'}, {'count': '49', 'name': 'modernist'}, {'count': '48', 'name': 'general-fiction'}, {'count': '47', 'name': '100-books-to-read-before-you-die'}, {'count': '47', 'name': 'modern-library-100-best-novels'}, {'count': '47', 'name': 'modern-library-100'}, {'count': '46', 'name': '1920s'}, {'count': '46', 'name': 're-read'}, {'count': '45', 'name': 'ℱavorites'}, {'count': '45', 'name': 'rory-gilmore-reading-list'}, {'count': '45', 'name': 'gilmore-girls'}, {'count': '44', 'name': 'audible'}, {'count': '44', 'name': 'e-book'}, {'count': '44', 'name': 'e-books'}, {'count': '43', 'name': 'gave-up'}, {'count': '42', 'name': 'i-own'}, {'count': '42', 'name': 'modern-library-top-100'}, {'count': '41', 'name': 'fiction-to-read'}, {'count': '41', 'name': 'bbc-top-100'}, {'count': '40', 'name': 'british'}, {'count': '40', 'name': 'partially-read'}, {'count': '40', 'name': 'books'}, {'count': '39', 'name': 'audiobook'}, {'count': '39', 'name': 'couldn-t-finish'}, {'count': '39', 'name': 'college'}, {'count': '38', 'name': 'university'}, {'count': '37', 'name': 'europe'}, {'count': '37', 'name': 'adult'}, {'count': '36', 'name': 'owned'}, {'count': '36', 'name': 'own-to-read'}, {'count': '36', 'name': 'to-re-read'}, {'count': '35', 'name': 'adult-fiction'}, {'count': '35', 'name': 'banned'}, {'count': '34', 'name': 'bookshelf'}, {'count': '34', 'name': 'modern-library'}, {'count': '34', 'name': 'owned-to-read'}, {'count': '34', 'name': 'unread'}, {'count': '34', 'name': 'modern'}, {'count': '33', 'name': 'stopped-reading'}, {'count': '33', 
'name': '1001-books-you-must-read-before-you'}, {'count': '33', 'name': 'all-time-favorites'}]",
 nan,
 3.74,
 "['595038', '10543', '164434', '76527', '16111', '126512', '115476', '447884', '446542', '261441', '446103', '13368', '97333', '778463', '80890', '243381']",
 'The revised edition follows the complete and unabridged text of ULYSSESas corrected and reset in 1961. Like the first American edition of 1934, it also contains the original foreword by the author and the historic court ruling by Judge John M. Woolsey to remove the federal ban on ULYSSES. It also contains page references to the 1934 edition, which are indicated in the margins.',
 'Paperback',
 'https://www.goodreads.com/book/show/338798.Ulysses',
 "[{'author_id': '5144', 'role': ''}]",
 'Vintage',
 810.0,
 '9780679722762',
 nan,
 1990.0,
 'https://www.goodreads.com/book/show/338798.Ulysses',
 'https://images.gr-assets.com/books/1428891345m/338798.jpg',
 338798,
 78309.0,
 2368224.0,
 'Ulysses',
 'classics',
 'James Joyce',
 1934.0]
In [378]:
large_df = large_df.set_index('title')
In [127]:
large_df.index
Out[127]:
Index([                                         'W.C. Fields: A Life on Film',
                        'The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)',
                                                       'Best Friends Forever',
       'Runic Astrology: Starcraft and Timekeeping in the Northern Tradition',
                                              'The Aeneid for Boys and Girls',
                                                      'The Wanting of Levine',
                     'All's Fairy in Love and War (Avalon: Web of Magic, #8)',
                                                       'The Devil's Notebook',
                                  'Crowner Royal (Crowner John Mystery, #13)',
                                                           'The Te Of Piglet',
       ...
                  'The Sensible Necktie and Other Stories of Sherlock Holmes',
                                                    'Mrs. Hudson in New York',
                                         'Contracted: A Wife for the Bedroom',
                                     'The Brazilian Boss's Innocent Mistress',
                              'North Country Cutthroats (The Trailsman #314)',
                             'Jacqueline Kennedy Onassis: Friend of the Arts',
                                           'The Spaniard's Blackmailed Bride',
                                        'Sherlock Holmes and the July Crisis',
                                   'The Children's Classic Poetry Collection',
                                  '101 Nights: Volume One (101 Nights, #1-3)'],
      dtype='object', name='title', length=1215983)
In [131]:
len([title for title in ranked_df['Title'] if title not in large_df.index])
Out[131]:
1262

That's kind of annoying... :thinking:

Some of the titles don't match any of the titles in my Goodreads dataset. That's probably not because they're missing from the data, but because they've been named something slightly different. I think I can use Levenshtein distance to measure the difference between strings and find the closest match in the dataset, a process called "fuzzy matching".
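For intuition, here is a small pure-Python sketch of Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another), plus a 0-100 similarity derived from it. The function names are my own, and the normalization is one common choice; I'm not claiming it is the exact score any particular library computes.

```python
def levenshtein(a, b):
    """Edit distance between a and b, using a row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)

# Two spellings of the same surname differ by two edits.
spelling_distance = levenshtein("chekhov", "tchekov")
```

A cutoff on a score like this is what decides whether a near-miss title counts as a match or gets dropped.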

In [180]:
!pip install rapidfuzz
Defaulting to user installation because normal site-packages is not writeable
Collecting rapidfuzz
  Downloading rapidfuzz-1.4.1-cp37-cp37m-manylinux2010_x86_64.whl (749 kB)
     |████████████████████████████████| 749 kB 750 kB/s eta 0:00:01
Installing collected packages: rapidfuzz
Successfully installed rapidfuzz-1.4.1
WARNING: You are using pip version 20.3.3; however, version 21.2.1 is available.
You should consider upgrading via the '/usr/bin/python3.7 -m pip install --upgrade pip' command.
In [181]:
from tqdm import tqdm
from rapidfuzz import process, utils
In [198]:
large_titles_list = [str(o) for o in large_df.index.array]
In [199]:
processed_titles = [utils.default_process(o) for o in large_titles_list]
In [202]:
large_titles_list[7680]
Out[202]:
"Harry Potter and the Philosopher's Stone"
In [205]:
titles_to_drop = []
for title in tqdm(ranked_df['Title']):
    # Only fuzzy-match titles that have no exact match in the index.
    if title.strip() not in large_df.index:
        processed_query = utils.default_process(title)
        # The choices are already processed, so no processor is applied here.
        match = process.extractOne(processed_query, processed_titles, processor=None, score_cutoff=90.1)
        if match is None:
            titles_to_drop.append(title)
len(titles_to_drop)
100%|██████████| 4414/4414 [09:08<00:00,  8.05it/s]
Out[205]:
578
In [258]:
drop_df = ranked_df[ranked_df['Title'].isin(titles_to_drop)]


drop_df
Out[258]:
Unnamed: 0 Title Listed count Author
93 94 First Folio 16.0 William Shakespeare
200 201 U.S.A. Trilogy 10.0 John Dos Passos
239 240 The Histories of Herodotus 9.0 Herodotus
243 244 Household Tales 9.0 Brothers Grimm
283 284 The Naked Dead 8.0 Norman Mailer
... ... ... ... ...
4394 4395 Encyclicals of Pope John XXIII 1.0 P
4395 4396 Psychology of the Unconscious 1.0 C
4401 4402 A Critique of the Theory of Evolution 1.0 T
4404 4405 Judgement and Reasoning in the Child 1.0 J
4413 4414 Behaviorism 1.0 J

578 rows × 4 columns

In [259]:
drop_df.describe()
Out[259]:
Unnamed: 0 Listed count
count 578.000000 578.000000
mean 2763.669550 1.346021
std 1083.393657 1.199939
min 94.000000 1.000000
25% 1933.250000 1.000000
50% 2941.500000 1.000000
75% 3560.000000 1.000000
max 4414.000000 16.000000
In [255]:
ranked_df.describe()
Out[255]:
Unnamed: 0 Listed count
count 4414.000000 4414.000000
mean 2207.500000 2.543045
std 1274.356373 4.267015
min 1.000000 1.000000
25% 1104.250000 1.000000
50% 2207.500000 1.000000
75% 3310.750000 2.000000
max 4414.000000 51.000000

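So there are 578 titles with no good match in my large Goodreads dataset. They also sit further down the list on average (a mean rank of about 2764, against 2208 for the full list, and a mean listed count of 1.35 against 2.54), so I think it will be okay to simply drop them from the dataset.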

Then I should rerank ranked_df and iterate through it again, replacing titles with the form they are found in large_df. At that point I can build my final dataset with info from both of those dataframes.
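On a toy frame, the plan looks roughly like this. Everything here is illustrative: the titles, the `unmatched` list and the `title_map` dictionary are invented stand-ins for the real fuzzy-matching results.

```python
import pandas as pd

# Toy ranked list, already sorted from most to least listed.
toy_ranked = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "listed_count": [5.0, 4.0, 2.0, 1.0],
})

unmatched = ["B"]                         # titles with no good match (hypothetical)
title_map = {"C": "C: A Longer Subtitle"}  # ranked title -> dataset title (hypothetical)

# Drop the unmatched titles, then renumber the ranks so they stay consecutive.
kept = toy_ranked[~toy_ranked["Title"].isin(unmatched)].copy()
kept["rank"] = range(1, len(kept) + 1)

# Swap in the dataset's spelling where one was found; otherwise keep the title as is.
kept["work_title"] = kept["Title"].map(lambda t: title_map.get(t, t))
```

Because the frame is already in rank order, renumbering with `range` preserves the relative ordering of the surviving titles.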

In [264]:
titles_to_drop[0]
Out[264]:
'First Folio'
In [283]:
ranked_titles_df = no_authors.set_index('Title')
In [285]:
ranked_titles_df.columns = ['rank', 'listed_count']
In [286]:
ranked_titles_df
Out[286]:
rank listed_count
Title
Ulysses 1 51.0
The Great Gatsby 2 50.0
One Hundred Years of Solitude 3 44.0
Lolita 4 43.0
Nineteen Eighty Four 5 42.0
... ... ...
Decline of the West 4410 1.0
The History of the Standard Oil Company 4411 1.0
Theory of Games and Economic Behavior 4412 1.0
AA Big Book 4413 1.0
Behaviorism 4414 1.0

4414 rows × 2 columns

In [287]:
dropped_df = ranked_titles_df.drop(titles_to_drop)
dropped_df.describe()
Out[287]:
rank listed_count
count 3836.000000 3836.000000
mean 2123.697602 2.723410
std 1280.040730 4.526196
min 1.000000 1.000000
25% 999.750000 1.000000
50% 2092.500000 1.000000
75% 3199.250000 2.000000
max 4413.000000 51.000000
In [288]:
dropped_df['rank'] = range(1, len(dropped_df) + 1)
In [289]:
dropped_df
Out[289]:
rank listed_count
Title
Ulysses 1 51.0
The Great Gatsby 2 50.0
One Hundred Years of Solitude 3 44.0
Lolita 4 43.0
Nineteen Eighty Four 5 42.0
... ... ...
Happiness in Marriage 3832 1.0
Decline of the West 3833 1.0
The History of the Standard Oil Company 3834 1.0
Theory of Games and Economic Behavior 3835 1.0
AA Big Book 3836 1.0

3836 rows × 2 columns

In [326]:
def get_corrected_title_index(processed_query, processed_index):
    # extractOne returns a (match, score, index) tuple, or None if nothing clears the cutoff.
    match = process.extractOne(processed_query, processed_index, processor=None, score_cutoff=90.1)
    return match[2]
In [327]:
get_corrected_title_index(utils.default_process('''The Stories of Anton Chekhov
'''), processed_titles)
Out[327]:
217359
In [328]:
large_titles_list[217359]
Out[328]:
'the stories of anton tchekov'
In [329]:
def find_correct_titles(questionable_titles, correct_titles):
    processed_titles = [utils.default_process(o) for o in correct_titles]
    
    for title in tqdm(questionable_titles):
        if title.strip() not in correct_titles:
#             print(title)
            processed_query = utils.default_process(title)
            i = get_corrected_title_index(processed_query, processed_titles)
            new_title = correct_titles[i]
            yield(new_title)
        else:
            yield(title.strip())
In [330]:
corrected_titles = [o for o in find_correct_titles(dropped_df.index, large_titles_list)]
100%|██████████| 3836/3836 [04:04<00:00, 15.66it/s]
In [331]:
len(corrected_titles)
Out[331]:
3836
In [342]:
dropped_df['work_title'] = corrected_titles
In [473]:
correct_df = dropped_df.set_index('rank')
In [474]:
correct_df
Out[474]:
listed_count work_title
rank
1 51.0 Ulysses
2 50.0 The Great Gatsby
3 44.0 One Hundred Years of Solitude
4 43.0 Lolita
5 42.0 Nineteen Eighty Four
... ... ...
3832 1.0 Happiness in Marriage
3833 1.0 Decline Of The West
3834 1.0 The History of the Standard Oil Company: Brief...
3835 1.0 Theory of Games and Economic Behavior
3836 1.0 Book Book

3836 rows × 2 columns

In [388]:
large_df['title'] = large_df.index
total_df = large_df.set_index('Unnamed: 0')[['title',
          'author_name',
           'top_genre',
           'format', 
           'publisher', 
           'description',
           'num_pages', 
           'average_rating', 
           'ratings_count',
           'text_reviews_count', 
           'publication_year',
           'work_id', 
           'book_id', 
           'isbn', 
           'isbn13',
           'asin']]
In [401]:
def get_editions_by_title(title, total_df):
    eds = total_df.loc[total_df['title']==title]
    return(eds)
In [438]:
scraped = correct_df['work_title'].apply(lambda x: get_best_from_editions_df(get_editions_by_title(x, total_df)))
In [475]:
scraped
Out[475]:
rank
1       [Ulysses, James Joyce, classics, Paperback, Vi...
2       [The Great Gatsby, F. Scott Fitzgerald, classi...
3       [One Hundred Years of Solitude, Gabriel Garcia...
4       [Lolita, Vladimir Nabokov, fiction, Paperback,...
5       [Nineteen Eighty Four, George Orwell, classics...
                              ...                        
3832    [Happiness in Marriage, Margaret Sanger, wishl...
3833    [Decline Of The West, Oswald Spengler, history...
3834    [The History of the Standard Oil Company: Brie...
3835    [Theory of Games and Economic Behavior, John v...
3836    [Book Book, Fiona Farrell, fiction, Paperback,...
Name: work_title, Length: 3836, dtype: object
In [476]:
scraped_df = pd.DataFrame([v for v in scraped.values])
In [477]:
scraped_df.columns = ['title',
          'author_name',
           'top_genre',
           'format', 
           'publisher', 
           'description',
           'num_pages', 
           'average_rating', 
           'ratings_count',
           'text_reviews_count', 
           'publication_year',
           'work_id', 
           'book_id', 
           'isbn', 
           'isbn13',
           'asin',
            'first_pub_year']
In [495]:
correct_df
Out[495]:
listed_count work_title
rank
1 51.0 Ulysses
2 50.0 The Great Gatsby
3 44.0 One Hundred Years of Solitude
4 43.0 Lolita
5 42.0 Nineteen Eighty Four
... ... ...
3832 1.0 Happiness in Marriage
3833 1.0 Decline Of The West
3834 1.0 The History of the Standard Oil Company: Brief...
3835 1.0 Theory of Games and Economic Behavior
3836 1.0 Book Book

3836 rows × 2 columns

In [496]:
scraped_df.index = correct_df.index
In [512]:
cat_df = pd.concat([correct_df,scraped_df], axis=1)
In [513]:
cat_df
Out[513]:
listed_count work_title title author_name top_genre format publisher description num_pages average_rating ratings_count text_reviews_count publication_year work_id book_id isbn isbn13 asin first_pub_year
rank
1 51.0 Ulysses Ulysses James Joyce classics Paperback Vintage The revised edition follows the complete and u... 810.0 3.74 78309.0 3674.0 1990.0 2368224.0 338798 0679722769 9780679722762 NaN 1934.0
2 50.0 The Great Gatsby The Great Gatsby F. Scott Fitzgerald classics Paperback Scribner THE GREAT GATSBY, F. Scott Fitzgerald's third ... 180.0 3.89 2758812.0 43881.0 2004.0 245494.0 4671 0743273567 9780743273565 NaN 1925.0
3 44.0 One Hundred Years of Solitude One Hundred Years of Solitude Gabriel Garcia Marquez favorites Hardcover Harper Probably Garcia Marquez finest and most famous... 457.0 4.04 497852.0 13886.0 2003.0 3295655.0 320 0060531045 9780060531041 NaN 1970.0
4 43.0 Lolita Lolita Vladimir Nabokov fiction Paperback Penguin Humbert Humbert - scholar, aesthete and romant... 331.0 3.88 476744.0 12538.0 1995.0 1268631.0 7604 NaN NaN NaN 1955.0
5 42.0 Nineteen Eighty Four Nineteen Eighty Four George Orwell classics Hardcover Methuen Publishing Winston Smith is a low-rung member of the Part... 326.0 4.14 198.0 15.0 1993.0 153313.0 6214262 074931723X 9780749317232 NaN 1954.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3832 1.0 Happiness in Marriage Happiness in Marriage Margaret Sanger wishlist-gender-women-history Paperback Pierides Press Margaret Sanger, mother of the birth control m... NaN 3.00 6.0 2.0 2006.0 1391616.0 1401421 1406795747 9781406795745 NaN 2006.0
3833 1.0 Decline Of The West Decline Of The West Oswald Spengler history NaN NaN Since its first publication in two volumes bet... NaN 4.07 1.0 1.0 NaN 1081173.0 6194776 0049010085 9780049010086 NaN NaN
3834 1.0 The History of the Standard Oil Company: Brief... The History of the Standard Oil Company: Brief... Ida Minerva Tarbell history Paperback Dover Publications Muckrakers -- a term coined in 1906 by Preside... 272.0 3.88 44.0 8.0 2003.0 659882.0 673868 0486428214 9780486428215 NaN 2003.0
3835 1.0 Theory of Games and Economic Behavior Theory of Games and Economic Behavior John von Neumann mathematics NaN NaN NaN NaN 4.17 165.0 4.0 NaN 471402.0 483055 0691130612 9780691130613 NaN NaN
3836 1.0 Book Book Book Book Fiona Farrell fiction Paperback Vintage Books As war is waged in the Middle East, a woman in... 367.0 3.60 44.0 9.0 2004.0 1833136.0 1833201 1869416198 9781869416195 NaN 2004.0

3836 rows × 19 columns

In [514]:
final_df = cat_df.drop('work_title', axis=1)
In [515]:
final_df.describe()
Out[515]:
listed_count num_pages average_rating ratings_count text_reviews_count publication_year work_id book_id first_pub_year
count 3836.000000 3378.000000 3836.000000 3.836000e+03 3836.000000 3218.000000 3.836000e+03 3.836000e+03 3614.000000
mean 2.723410 364.333629 3.917122 4.447718e+04 1441.705422 1998.267247 5.124938e+06 3.057160e+06 1983.232706
std 4.526196 589.408612 0.281635 1.963760e+05 5373.247895 12.648351 1.094187e+07 6.603514e+06 24.966777
min 1.000000 0.000000 2.330000 0.000000e+00 1.000000 1912.000000 1.140000e+02 2.100000e+01 1759.000000
25% 1.000000 209.000000 3.760000 1.820000e+02 17.000000 1994.000000 6.224172e+05 5.185325e+04 1968.000000
50% 1.000000 307.000000 3.930000 1.656500e+03 108.000000 2001.000000 1.388363e+06 2.979740e+05 1989.000000
75% 2.000000 444.750000 4.110000 1.178900e+04 648.250000 2006.000000 2.931549e+06 1.399795e+06 2002.000000
max 51.000000 32000.000000 5.000000 3.255518e+06 129572.000000 2018.000000 5.812730e+07 3.621645e+07 2018.000000
In [516]:
final_df.head()
Out[516]:
listed_count title author_name top_genre format publisher description num_pages average_rating ratings_count text_reviews_count publication_year work_id book_id isbn isbn13 asin first_pub_year
rank
1 51.0 Ulysses James Joyce classics Paperback Vintage The revised edition follows the complete and u... 810.0 3.74 78309.0 3674.0 1990.0 2368224.0 338798 0679722769 9780679722762 NaN 1934.0
2 50.0 The Great Gatsby F. Scott Fitzgerald classics Paperback Scribner THE GREAT GATSBY, F. Scott Fitzgerald's third ... 180.0 3.89 2758812.0 43881.0 2004.0 245494.0 4671 0743273567 9780743273565 NaN 1925.0
3 44.0 One Hundred Years of Solitude Gabriel Garcia Marquez favorites Hardcover Harper Probably Garcia Marquez finest and most famous... 457.0 4.04 497852.0 13886.0 2003.0 3295655.0 320 0060531045 9780060531041 NaN 1970.0
4 43.0 Lolita Vladimir Nabokov fiction Paperback Penguin Humbert Humbert - scholar, aesthete and romant... 331.0 3.88 476744.0 12538.0 1995.0 1268631.0 7604 NaN NaN NaN 1955.0
5 42.0 Nineteen Eighty Four George Orwell classics Hardcover Methuen Publishing Winston Smith is a low-rung member of the Part... 326.0 4.14 198.0 15.0 1993.0 153313.0 6214262 074931723X 9780749317232 NaN 1954.0
In [509]:
final_df.to_csv('../../records/invisible-canon-draft-1.csv')