# Data collection

How do books become classics?

• toc: true
• categories: [canon]
• image: images/popularity-prestige.png

# Data

## What data can I gather?

I have acquired some datasets to explore and combine into the table of correspondences:

Some of these are whopping huge datasets. It would be cool to put every single field into a dataframe and run a tabular neural net, just to see what happens. But the wise thing is to build up from the minimum necessary data, and add things as they seem useful. To do that I need an ontology of what books are, and how they are organized.

### What is the ontology of a book?

If I have learned anything about the book business, it's that only two things are real: people, and books. Everything else is an abstraction: genres, series, formats, publishing houses, deadlines, paychecks, prestige, canons, they're all just convenient abstractions we use to simplify the incredible world of books and the people who read and write them.

In the datasets we can find Authors, and Users who review them. We can also find their books, either individually for each unique format (Books) or all formats for each title (Works). Making sense of different Books and Works is a difficult job. Fortunately, someone has already done this for us.

Unfortunately, someone else also did it, differently, and we have to reconcile the two standards. It's the classic XKCD problem:

In this case we have the following standards:

• ISBN: International Standard Book Number
• EAN: European Article Number
• ASIN: Amazon Standard Identification Number
• goodreads_book_id: Specific edition of a book
• goodreads_work_id: Wrapper for all editions of a title

For more fun information about book labeling standards, a good place to start is the Wikipedia page for Bookland:

(There are fewer standards for authors and users, but probably similar problems.)
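
Two of these standards are mechanically related: an ISBN-13 is an EAN-13 in the "Bookland" 978 or 979 prefix range, and an ISBN-10 converts to an ISBN-13 by prepending 978 and recomputing the check digit. Here is a minimal sketch of that conversion. The function names are my own, not from any of the datasets, and the zero-padding step assumes that an ISBN stored as an integer has lost only its leading zeros.

```python
def isbn13_check_digit(first12: str) -> str:
    """Return the EAN-13 check digit for the first 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... starting from the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_is_valid(isbn10: str) -> bool:
    """Check an ISBN-10 using weights 10..1 modulo 11 ('X' means 10)."""
    if len(isbn10) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), isbn10.upper()):
        if ch == 'X' and weight == 1:
            value = 10  # 'X' is only legal as the final check character
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to an ISBN-13 in the 978 prefix."""
    first12 = '978' + isbn10[:9]  # drop the old check character
    return first12 + isbn13_check_digit(first12)


# Hypothetical repair step: an ISBN read as an integer loses leading zeros.
raw = 439023483
isbn10 = str(raw).zfill(10)
print(isbn10, isbn10_is_valid(isbn10), isbn10_to_isbn13(isbn10))
```

Validating before converting is a cheap way to catch identifiers that were mangled on import rather than silently producing a plausible-looking ISBN-13.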

Ultimately, the specifics of any given edition of a book are abstractions that lie on top of the text itself. The most sensible way to organize books is by work id, if possible. Works can have author and genre information associated, and each edition can be a child of the work, with its own id associations and review characteristics. Go from general to specific detail as you descend into the object.

note: this is starting to sound like a network graph. keep that tactic in mind in case this gets unwieldy

Plus, if we want to find classicness in books, we need to keep track of them through different historical eras of publishing. The UCSD Book Graph data has aggregate data for works and authors, so that's probably a good place to start.

But perhaps for testing purposes, we should use a smaller subset of books. Let's start by finding the ten thousand top books, according to Goodbooks-10k, and collating their work, author and genre data.

In [1]:
import pandas as pd

In [2]:
goodbooks = pd.read_csv('/run/media/mage/INDESTRUCTIBLESLIME/Replaceable/datasets/goodbooks-10k/books.csv')

In [3]:
goodbooks.head()

Out[3]:
book_id goodreads_book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title ... ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
0 1 2767052 2767052 2792775 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games ... 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s...
1 2 3 3 4640799 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone ... 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s...
2 3 41865 41865 3212258 226 316015849 9.780316e+12 Stephenie Meyer 2005.0 Twilight ... 3866839 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m... https://images.gr-assets.com/books/1361039443s...
3 4 2657 2657 3275794 487 61120081 9.780061e+12 Harper Lee 1960.0 To Kill a Mockingbird ... 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m... https://images.gr-assets.com/books/1361975680s...
4 5 4671 4671 245494 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925.0 The Great Gatsby ... 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m... https://images.gr-assets.com/books/1490528560s...

5 rows × 23 columns

In [4]:
len(goodbooks)

Out[4]:
10000
In [5]:
goodbooks.index = goodbooks['work_id']

del(goodbooks['work_id'])

In [6]:
goodbooks.head()

Out[6]:
book_id goodreads_book_id best_book_id books_count isbn isbn13 authors original_publication_year original_title title ... ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
work_id
2792775 1 2767052 2767052 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games The Hunger Games (The Hunger Games, #1) ... 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s...
4640799 2 3 3 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone Harry Potter and the Sorcerer's Stone (Harry P... ... 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s...
3212258 3 41865 41865 226 316015849 9.780316e+12 Stephenie Meyer 2005.0 Twilight Twilight (Twilight, #1) ... 3866839 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m... https://images.gr-assets.com/books/1361039443s...
3275794 4 2657 2657 487 61120081 9.780061e+12 Harper Lee 1960.0 To Kill a Mockingbird To Kill a Mockingbird ... 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m... https://images.gr-assets.com/books/1361975680s...
245494 5 4671 4671 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925.0 The Great Gatsby The Great Gatsby ... 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m... https://images.gr-assets.com/books/1490528560s...

5 rows × 22 columns

Okay, good! We have a list of 10,000 books indexed by work_id. Let's add some genre data from the UCSD Book Graph.

In [7]:
genres = pd.read_json('/run/media/mage/INDESTRUCTIBLESLIME/Replaceable/datasets/goodreads/downloads/goodreads_book_genres_initial.json', lines=True)

In [8]:
genres.head()

Out[8]:
book_id genres
0 5333265 {'history, historical fiction, biography': 1}
1 1333909 {'fiction': 219, 'history, historical fiction,...
2 7327624 {'fantasy, paranormal': 31, 'fiction': 8, 'mys...
3 6066819 {'fiction': 555, 'romance': 23, 'mystery, thri...
4 287140 {'non-fiction': 3}
In [9]:
len(genres)

Out[9]:
2360655
In [10]:
goodbooks['genre'] = [[genres.loc[genres['book_id'] == i]][0]['genres'].values[0] if len([genres.loc[genres['book_id'] == i]][0]['genres'].values) > 0 else '' for i in goodbooks['goodreads_book_id'] ]

In [11]:
goodbooks.head()

Out[11]:
book_id goodreads_book_id best_book_id books_count isbn isbn13 authors original_publication_year original_title title ... work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url genre
work_id
2792775 1 2767052 2767052 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games The Hunger Games (The Hunger Games, #1) ... 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s... {'young-adult': 30173, 'fiction': 26304, 'fant...
4640799 2 3 3 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone Harry Potter and the Sorcerer's Stone (Harry P... ... 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s... {'fantasy, paranormal': 54156, 'young-adult': ...
3212258 3 41865 41865 226 316015849 9.780316e+12 Stephenie Meyer 2005.0 Twilight Twilight (Twilight, #1) ... 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m... https://images.gr-assets.com/books/1361039443s... {'young-adult': 19627, 'fantasy, paranormal': ...
3275794 4 2657 2657 487 61120081 9.780061e+12 Harper Lee 1960.0 To Kill a Mockingbird To Kill a Mockingbird ... 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m... https://images.gr-assets.com/books/1361975680s... {'fiction': 8870, 'history, historical fiction...
245494 5 4671 4671 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925.0 The Great Gatsby The Great Gatsby ... 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m... https://images.gr-assets.com/books/1490528560s... {'fiction': 20684, 'history, historical fictio...

5 rows × 23 columns
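
The list comprehension in cell 10 rescans all 2.3 million genre rows once per book, so it is slow. A possible alternative, sketched here on toy frames rather than the real files, builds a lookup Series keyed by `book_id` and uses `Series.map`. The toy data and the `genre_lookup` name are assumptions for illustration. Dropping duplicate ids mirrors the original's choice of the first match.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are made up for illustration).
genres = pd.DataFrame({
    'book_id': [3, 41865, 41865],
    'genres': [{'fantasy, paranormal': 5}, {'young-adult': 2}, {'romance': 1}],
})
goodbooks = pd.DataFrame(
    {'goodreads_book_id': [3, 41865, 999]},
    index=pd.Index([4640799, 3212258, 1], name='work_id'),
)

# Keep the first row per book_id, then index by book_id for fast lookup.
genre_lookup = (
    genres.drop_duplicates(subset='book_id')
          .set_index('book_id')['genres']
)

# map() looks each id up in the Series; ids with no genre record
# become '' just as in the original comprehension.
goodbooks['genre'] = goodbooks['goodreads_book_id'].map(genre_lookup).fillna('')
print(goodbooks)
```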

## Explore the goodbooks-10K dataset

I want to chart the goodbooks dataset and color the datapoints by their biggest genre attribute. I think that will give some insight into the types of books we want to acquire.

Let's make a little function to return the top genre of a given row from the dataframe.

In [12]:
def genre_from_id(work_id, df):
    genre = df.loc[work_id]['genre']
    if type(genre) is dict and len(genre) > 0:
        top = sorted(genre.items(), key=lambda item: item[1], reverse=True)[0][0]
    else:
        top = 'missing'
    return top

In [13]:
genre_from_id(2792775,goodbooks)

Out[13]:
'young-adult'
In [14]:
goodbooks['top_genre'] = [genre_from_id(work_id, goodbooks) for work_id in goodbooks.index]

In [15]:
goodbooks['top_genre'].head()

Out[15]:
work_id
2792775            young-adult
4640799    fantasy, paranormal
3212258    fantasy, paranormal
3275794                fiction
245494                 fiction
Name: top_genre, dtype: object

Okay now we've got a column that categorizes top genre. Next we have to choose some axes on which to plot the scatter points. What columns do we have to work from?

In [16]:
goodbooks.columns

Out[16]:
Index(['book_id', 'goodreads_book_id', 'best_book_id', 'books_count', 'isbn',
'isbn13', 'authors', 'original_publication_year', 'original_title',
'title', 'language_code', 'average_rating', 'ratings_count',
'work_ratings_count', 'work_text_reviews_count', 'ratings_1',
'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'image_url',
'small_image_url', 'genre', 'top_genre'],
dtype='object')

As a sanity check I'm going to start with publication year and average rating. This should make it clear pretty quick if the data is messed up.

I'm going to try the plotly library, for more interactive graphs

In [1]:
from IPython.display import HTML
import plotly.express as px

In [ ]:
import plotly.io as pio
pio.renderers.default = 'notebook_connected'

In [ ]:
fig = px.scatter(goodbooks, x="original_publication_year", y="average_rating", color="top_genre", hover_data = ['authors', 'title'], size='books_count',
title="What is a classic book? a graph by @deepfates", height=800)
HTML(fig.to_html())

In [42]:
with open('goodbooks-genre-pop-time.html', 'w') as f:
    f.write(fig.to_html())


Here I have graphed the average rating of the top 10,000 books over time. The size of the marker represents the number of distinct formats the book has been printed in.

This graph confirms the idea of classicness eroding over time: a few books, in a narrow but high range of quality, survive from the time before printing. The range widens after printing, but continues a general "cone" effect that you would expect if low-quality books were forgotten over time.

We can also see the clear effect of intellectual property laws on the canon here. Up until 1926 we have mostly big bubbles, but afterward almost all the bubbles are small. In fact, it's hard to see that

In [58]:
len(goodbooks.loc[goodbooks['original_publication_year'] > 1926])

Out[58]:
9440

of them are published after that year! Only 560 public domain books are represented. They've just been reprinted so many times that they dwarf the thousands of tiny modern books.
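
One caveat on that subtraction: in pandas a comparison against a missing value is False, so any book with no `original_publication_year` fails the `> 1926` test and lands in the 560. A small sketch on made-up years shows how to count the three groups separately. The toy Series is an assumption, not the real column.

```python
import pandas as pd

# Made-up publication years, including one missing value.
years = pd.Series([1847.0, 1925.0, 1960.0, 2008.0, float('nan')])

after = int((years > 1926).sum())          # NaN > 1926 evaluates to False
on_or_before = int((years <= 1926).sum())  # NaN <= 1926 is also False
missing = int(years.isna().sum())

print(after, on_or_before, missing)
```

On the real frame the same three expressions applied to `goodbooks['original_publication_year']` would show how many of the 560 actually have a known early date.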

This does suggest one unconventional path to canonization: if you make your book public domain, and (crucially) if your book is desirable enough to sell a bunch of copies, you could convince multiple publishers to sell different printings of it, and become better known than if you had kept the copyright! This may have its downsides, of course; please email me with results if you try it.

Another open question is the count of books per genre. Let's see if we can get any more information about that from goodbooks-10K.

In [ ]:
fig = px.histogram(goodbooks, x = 'top_genre')
HTML(fig.to_html())


There are some obvious parallels here to my own sales count data, and some insights into the life cycle of readers. There are also some major discrepancies.

First the bad news: the "romance" section is nonexistent at my store. This is not because we're biased against harlequins (though we are). It's a combination of factors: our customers don't prefer them, another store in town specializes in romances, and we have very limited shelf space, so we can't keep the bulk series that are the real sellers in romance (and mystery, for that matter).

We decided to stop having an official section for these and file any that we do have in with Fiction, and in fact some of the books listed above as Romance (Wuthering Heights, for instance) we keep in Classics.

Maybe we can repeat the top_genre step, but this time throw out any Romance data and take the next genre instead.

In [92]:
len(goodbooks[goodbooks['top_genre'] == 'romance'])

Out[92]:
669
In [87]:
goodbooks.loc[1565818]

Out[87]:
book_id                                                                     63
best_book_id                                                              6185
books_count                                                               2498
isbn                                                                 393978893
isbn13                                                        9780393978900.00
authors                                          Emily Brontë, Richard J. Dunn
original_publication_year                                              1847.00
original_title                                               Wuthering Heights
title                                                        Wuthering Heights
language_code                                                              eng
average_rating                                                            3.82
ratings_count                                                           899195
work_ratings_count                                                     1001135
work_text_reviews_count                                                  26157
ratings_1                                                                46469
ratings_2                                                                84084
ratings_3                                                               215320
ratings_4                                                               309180
ratings_5                                                               346082
image_url                    https://s.gr-assets.com/assets/nophoto/book/11...
small_image_url              https://s.gr-assets.com/assets/nophoto/book/50...
genre                        {'romance': 4001, 'fiction': 2229, 'history, h...
top_genre                                                              romance
Name: 1565818, dtype: object
In [89]:
def genre_no_romance(work_id, df):
    genre = df.loc[work_id]['genre']
    if type(genre) is dict and len(genre) > 0:
        top_n = sorted(genre.items(), key=lambda item: item[1], reverse=True)

        if top_n[0][0] == 'romance' and len(top_n) > 1:
            top = top_n[1][0]
        else:
            top = top_n[0][0]
    else:
        top = 'missing'

    return top

In [90]:
genre_no_romance(1565818,goodbooks)

Out[90]:
'fiction'
In [95]:
goodbooks['top_genre'] = [genre_no_romance(work_id, goodbooks) for work_id in goodbooks.index]
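
The same idea generalizes if more sections need to be skipped later. As a sketch, the helper below takes the genre dict directly plus an arbitrary set of labels to ignore. `top_genre` and its `exclude` parameter are hypothetical names. Like `genre_no_romance`, it falls back to the top label when nothing else is available.

```python
def top_genre(genre, exclude=frozenset()):
    """Return the highest-count label not in `exclude`, or 'missing'."""
    if not isinstance(genre, dict) or not genre:
        return 'missing'
    # Rank labels by vote count, largest first.
    ranked = sorted(genre, key=genre.get, reverse=True)
    for label in ranked:
        if label not in exclude:
            return label
    # Every label was excluded: fall back to the overall top label.
    return ranked[0]


# First two entries of the Wuthering Heights genre dict shown above;
# the rest of that dict is truncated in the output and left out here.
wuthering = {'romance': 4001, 'fiction': 2229}

print(top_genre(wuthering))
print(top_genre(wuthering, exclude={'romance'}))
print(top_genre({'romance': 7}, exclude={'romance'}))
print(top_genre(''))
```

Applied to the frame, this would look something like `goodbooks['genre'].map(lambda g: top_genre(g, {'romance'}))`, which also avoids a `df.loc` lookup per row.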

In [ ]:
fig = px.histogram(goodbooks, x = 'top_genre')
HTML(fig.to_html())


That's better! There are still some issues, like the lack of a 'drama' section, but we're getting there.

I wonder if drama is even included in this genre data. Let's check with the Bard.

In [100]:
goodbooks[goodbooks['authors'] == 'William Shakespeare']

Out[100]:
book_id goodreads_book_id best_book_id books_count isbn isbn13 authors original_publication_year original_title title ... work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url genre top_genre
work_id
1896522 154 8852 8852 1732 743477103 9.780743e+12 William Shakespeare 1606.0 The Tragedy of Macbeth Macbeth ... 7609 10551 35408 127354 183871 167642 https://images.gr-assets.com/books/1459795224m... https://images.gr-assets.com/books/1459795224s... {'fiction': 3359, 'poetry': 605, 'history, his... fiction
995103 353 12996 12996 1053 743477553 9.780743e+12 William Shakespeare 1603.0 The Tragedy of Othello, The Moor of Venice Othello ... 4334 4281 16576 64922 92076 78902 https://images.gr-assets.com/books/1459795105m... https://images.gr-assets.com/books/1459795105s... {'fiction': 1682, 'poetry': 403, 'romance': 17... fiction
2342136 714 12938 12938 1108 074348276X 9.780743e+12 William Shakespeare 1603.0 The Tragedie of King Lear King Lear ... 3079 2825 10502 36179 47682 50094 https://images.gr-assets.com/books/1331563731m... https://images.gr-assets.com/books/1331563731s... {'fiction': 1257, 'poetry': 343, 'history, his... fiction
3152341 773 47021 47021 689 074347757X 9.780743e+12 William Shakespeare 1593.0 The Taming of the Shrew The Taming of the Shrew ... 2370 2869 9611 35666 47453 38641 https://images.gr-assets.com/books/1327935253m... https://images.gr-assets.com/books/1327935253s... {'fiction': 996, 'romance': 288, 'poetry': 275... fiction
1359590 804 12985 12985 956 743482832 9.780743e+12 William Shakespeare 1623.0 The Tempest The Tempest ... 2831 2394 10084 37832 46984 38243 https://images.gr-assets.com/books/1327793692m... https://images.gr-assets.com/books/1327793692s... {'fiction': 206, 'history, historical fiction,... fiction
3267921 855 1625 1625 861 743482778 9.780743e+12 William Shakespeare 1601.0 Twelfth Night; or, What You Will Twelfth Night ... 2615 1494 6329 29985 47546 43539 https://images.gr-assets.com/books/1416628008m... https://images.gr-assets.com/books/1416628008s... {'fiction': 920, 'romance': 321, 'poetry': 298... fiction
702863 1885 42607 42607 709 074348486X 9.780743e+12 William Shakespeare 1599.0 As You Like It As You Like It ... 1312 894 3908 17213 22421 17061 https://images.gr-assets.com/books/1327935363m... https://images.gr-assets.com/books/1327935363s... {'fiction': 505, 'romance': 167, 'poetry': 182... fiction
3000541 2209 569564 569564 828 517053616 9.780517e+12 William Shakespeare 1623.0 The Complete Works The Complete Works ... 726 448 695 4279 10908 29680 https://images.gr-assets.com/books/1327884293m... https://images.gr-assets.com/books/1327884293s... {'history, historical fiction, biography': 69,... fiction
6302847 6417 44133 44133 498 521293731 9.780521e+12 William Shakespeare 1600.0 The Winter's Tale The Winter's Tale ... 810 300 1692 6488 6925 4566 https://images.gr-assets.com/books/1327893509m... https://images.gr-assets.com/books/1327893509s... {'fiction': 203, 'poetry': 82, 'romance': 65, ... fiction
275237 6530 72978 72978 376 671722921 9.780672e+12 William Shakespeare 1589.0 NaN Titus Andronicus ... 1062 579 1848 4963 5942 5035 https://images.gr-assets.com/books/1397028943m... https://images.gr-assets.com/books/1397028943s... {'fiction': 226, 'poetry': 73, 'history, histo... fiction
525707 6692 82356 82356 474 1853262439 9.781853e+12 William Shakespeare 1594.0 NaN The Comedy of Errors ... 782 252 1392 5964 6717 4514 https://images.gr-assets.com/books/1328543324m... https://images.gr-assets.com/books/1328543324s... {'fiction': 224, 'poetry': 96, 'romance': 22, ... fiction

11 rows × 24 columns

Nope! It's being collected as poetry, history, romance but not as drama. Swing and a miss. Oh well, nonetheless we have some information for our proportioning:
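
Looking at one author is suggestive, but a more direct check is to gather every label that appears anywhere in the genre dicts. A minimal sketch follows, using a few sample dicts in place of the real `goodbooks['genre']` column. The counts are the first two entries of the Macbeth and Taming of the Shrew rows above, whose remaining entries are truncated in the output.

```python
# Small sample of the genre column; '' stands for a book with no record.
genre_column = [
    {'fiction': 3359, 'poetry': 605},
    {'fiction': 996, 'romance': 288},
    '',
]

labels = set()
for genre in genre_column:
    if isinstance(genre, dict):
        labels.update(genre)  # updating from a dict adds its keys

print(sorted(labels))
print('drama' in labels)
```

Running the same loop over the full column would settle whether a drama label exists anywhere in the data, not just for Shakespeare.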

In [102]:
gs = goodbooks['top_genre'].value_counts()
gs

Out[102]:
fiction                                   3455
fantasy, paranormal                       1788
mystery, thriller, crime                  1319
non-fiction                               1279
young-adult                                671
children                                   566
history, historical fiction, biography     436
comics, graphic                            403
poetry                                      76
missing                                      5
romance                                      2
Name: top_genre, dtype: int64
In [125]:
nums = {k:v for k, v in gs.items()}
fic = nums['fiction'] + nums['fantasy, paranormal'] + nums['mystery, thriller, crime']
nf = nums['non-fiction'] + nums['poetry'] + nums['comics, graphic'] + nums['history, historical fiction, biography']
# kids combines the children and young-adult shelves (566 + 671 = 1237)
kids = nums['children'] + nums['young-adult']
misc = nums['romance'] + nums['missing']

fic, nf, kids, misc

Out[125]:
(6562, 2194, 1237, 7)

So about 12% of the books are in children and young adult categories, another 22% in non-fiction (including history, poetry and comics, which we shelve on that side of the room) and a clear majority (about 66%) in the fiction categories!
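
The shares can be computed straight from the four totals in the output above. This sketch simply hard-codes them.

```python
# Totals copied from the output above.
counts = {'fiction': 6562, 'non-fiction': 2194, 'kids': 1237, 'misc': 7}

total = sum(counts.values())
# Percentage of all books in each group, to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
print(shares)
```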

Ontology of a book:

A work with

• a title
• a creation year
• one or more:
    • authors
    • editions
    • cover images
    • genres
    • programmatic IDs
• various
    • rankings
    • reviews
    • listings
    • sales rank
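
As a rough sketch, that outline could map onto a pair of dataclasses, with each edition as a child of its work. The class and field names are my guesses at a reasonable shape, not a schema from any of the datasets.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Edition:
    """One specific printing or format of a work."""
    goodreads_book_id: int
    isbn: Optional[str] = None
    isbn13: Optional[str] = None
    asin: Optional[str] = None
    cover_image_url: Optional[str] = None


@dataclass
class Work:
    """A title, independent of any particular edition."""
    goodreads_work_id: int
    title: str
    creation_year: Optional[int] = None
    authors: List[str] = field(default_factory=list)
    genres: Dict[str, int] = field(default_factory=dict)
    editions: List[Edition] = field(default_factory=list)


# Values taken from the first row of goodbooks.head().
hunger_games = Work(
    goodreads_work_id=2792775,
    title='The Hunger Games',
    creation_year=2008,
    authors=['Suzanne Collins'],
)
hunger_games.editions.append(Edition(goodreads_book_id=2767052))
print(len(hunger_games.editions))
```

Keeping identifiers on the edition and genres on the work follows the general-to-specific ordering described earlier. Rankings, reviews and listings could hang off either level depending on which dataset supplies them.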