Notebook

The purpose of this notebook is to explore which linguistic aspects of a book influence a reader's opinion of that book.

Open Library Books API

The data for this notebook comes primarily from the Open Library Books API, which answers the question of "data by book". But we need thousands of books, which means we need 1000 or more titles that

have corresponding entries in the Open Library Books API, and
have full texts that are downloadable for analysis
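The first requirement can be checked per title. The Open Library Books API is keyed by identifiers (ISBN, OCLC, LCCN, OLID) rather than by title, so a title has to be resolved by a lookup first. The following is a minimal sketch, assuming the Open Library Search API (`/search.json` with its `title` and `limit` parameters) is an acceptable way to do that; the notebook does not say which endpoint it uses for this step. The helper names `build_search_url`, `fetch_json`, and `has_open_library_entry` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Open Library Search API endpoint (an assumption; not named in this notebook).
SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_url(title, limit=1):
    """Build a Search API URL that looks up works by title."""
    query = urllib.parse.urlencode({"title": title, "limit": limit})
    return f"{SEARCH_URL}?{query}"


def fetch_json(url):
    """Download and decode a JSON document (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def has_open_library_entry(title, fetch=fetch_json):
    """Return True if the search reports at least one match for the title.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    result = fetch(build_search_url(title))
    return result.get("numFound", 0) > 0
```

A title match is only a rough signal: different editions or unrelated works can share a title, so matching on author as well may be needed in practice.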

What we need is probably a "book title database". Project Gutenberg hosts more than 70,000 books with downloadable texts. In addition, a third-party API (Gutendex, queried in the code below) surfaces all of the titles. This will be a good starting point.
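The second requirement, a downloadable full text, can be checked on each record returned by the third-party API used in this notebook (gutendex.com). Each book record carries a `formats` mapping from MIME type to download URL. The sketch below picks a plain-text link from such a record; the helper name `plain_text_url` is hypothetical, and which text encoding to prefer is left open.

```python
def plain_text_url(book):
    """Return a plain-text download URL from a book record, or None if absent."""
    formats = book.get("formats", {})
    for mime_type, url in formats.items():
        # Keys may carry a charset suffix, e.g. "text/plain; charset=utf-8".
        if mime_type.startswith("text/plain"):
            return url
    return None


# Illustrative record with made-up URLs, shaped like one entry of data["results"].
sample_book = {
    "title": "Example Title",
    "formats": {
        "text/html": "https://example.org/book.html",
        "text/plain; charset=utf-8": "https://example.org/book.txt",
    },
}
text_url = plain_text_url(sample_book)
```

Books for which the helper returns None have no plain-text edition listed and could be skipped when building the title list.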

Getting List of Book Titles

We iterate over the third-party API to get all titles first. This is a very costly operation, so all titles have been pre-fetched and stored in a cache file. By default, we read titles from the cache. To refresh the titles, change read_titles_from_cache from True to False below:

In [2]:

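import requests

titles = []

# Set to False to re-download every title from the API (slow: many paginated requests).
read_titles_from_cache = True

if read_titles_from_cache:
    with open("titles.txt", "r", encoding="utf-8") as f:
        titles = f.read().split("\n")
else:
    url = "https://gutendex.com/books/?page=1"

    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        data = response.json()
        for book in data["results"]:
            title = book["title"]
            if title:
                # Collapse embedded line breaks so each title stays on one cache line.
                titles.append(" ".join(title.split()))

        # "next" is falsy on the last page, which ends the loop.
        url = data["next"]

    with open("titles.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(titles))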
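Because the cache is split on newlines, the resulting list can contain empty strings (for example from a trailing newline) and repeated titles. A small clean-up pass, sketched below, may help before the titles are used for lookups. The helper name `clean_titles` is hypothetical, and treating identical titles as duplicates is an assumption, since distinct books can share a title.

```python
def clean_titles(raw_titles):
    """Strip whitespace, drop blanks, and remove duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for title in raw_titles:
        title = title.strip()
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned
```

It would be applied right after loading, for example `titles = clean_titles(titles)`.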
