The purpose of this notebook is to explore which linguistic aspects of a book influence a reader's opinion of it.
The data in this notebook comes primarily from the Open Library Books API, which provides data on a per-book basis. However, we need thousands of books, which means we first need at least 1,000 titles to look up.
What we need is a database of book titles. Project Gutenberg hosts more than 70,000 books with downloadable texts, and a third-party API, Gutendex, exposes all of their titles. This is a good starting point.
We first iterate over the third-party API to collect all titles. This is a very costly operation, so all titles have been pre-fetched and stored in a cache file. By default, we read titles from the cache. To update the titles, change read_titles_from_cache from True to False below:
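import requests

titles = []
read_titles_from_cache = True

if read_titles_from_cache:
    # Load the pre-fetched titles, one per line.
    with open("titles.txt", "r", encoding="utf-8") as f:
        titles = f.read().split("\n")
else:
    # Walk the paginated API, following each page's "next" link until it is empty.
    url = "https://gutendex.com/books/?page=1"
    while True:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        data = response.json()
        for book in data["results"]:
            titles.append(book["title"])
        if data["next"]:
            url = data["next"]
        else:
            break
    # Refresh the cache so later runs can skip the download.
    with open("titles.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(titles))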
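The pagination loop above can also be written as a small generator that is testable without touching the network. The following is only a sketch under stated assumptions: iter_titles, fetch_page, max_pages, and the fake_pages data are hypothetical names introduced for illustration and are not part of the original notebook. It assumes each page is a dict with a "results" list and a "next" link, as in the cell above. It also collapses whitespace inside each title, since a title containing a line break would otherwise be split into two entries when the newline-delimited cache is read back.

```python
from typing import Callable, Iterator, Optional


def iter_titles(fetch_page: Callable[[str], dict], start_url: str,
                max_pages: Optional[int] = None) -> Iterator[str]:
    """Yield titles page by page, following each page's "next" link.

    fetch_page is any callable that maps a URL to the decoded JSON page
    (hypothetical helper; swap in a real HTTP call when going online).
    """
    url = start_url
    pages_read = 0
    while url and (max_pages is None or pages_read < max_pages):
        data = fetch_page(url)
        for book in data["results"]:
            # Collapse internal whitespace so each title fits on one cache line.
            yield " ".join(book["title"].split())
        url = data.get("next")
        pages_read += 1


# Offline stand-in for the API: two made-up pages keyed by URL.
fake_pages = {
    "page1": {"results": [{"title": "Book A"}, {"title": "Book\nB"}], "next": "page2"},
    "page2": {"results": [{"title": "Book C"}], "next": None},
}

demo_titles = list(iter_titles(fake_pages.__getitem__, "page1"))
print(demo_titles)  # ['Book A', 'Book B', 'Book C']
```

To try it against the live API, one option is to pass lambda u: requests.get(u, timeout=30).json() as fetch_page and set max_pages to a small number, so a test run fetches a couple of pages rather than the whole catalogue.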