#!/usr/bin/env python
# coding: utf-8

# __The purpose of this notebook is to explore which linguistic aspects of a book influence a reader's opinion of that book__.
#
# Open Library Books API
# ----------------------
#
# The data in this notebook comes primarily from the [Open Library Books API](https://openlibrary.org/dev/docs/api/books), which answers the question of "data by book". But we need thousands of books, which means we need at least 1000 titles that
#
# 1. have corresponding entries in the Open Library Books API, and
# 2. have full texts that are downloadable for analytics
#
# What we need is probably a "book title database". [Project Gutenberg](https://www.gutenberg.org/) hosts more than 70,000 books with downloadable texts. In addition, [a 3rd-party API](https://github.com/garethbjohnson/gutendex) surfaces all of the titles. This will be a good starting point.

# Getting List of Book Titles
# ---------------------------
#
# We iterate over [a 3rd-party API](https://github.com/garethbjohnson/gutendex) and get all titles first. This is a very costly operation, so all titles have been pre-fetched and stored in a cache file. By default, we read titles from the cache. To update the titles, change `read_titles_from_cache` from `True` to `False` below:

# In[2]:


import requests

titles = []
read_titles_from_cache = True

if read_titles_from_cache:
    # Read the pre-fetched titles, one per line.
    with open("titles.txt", "r", encoding="utf-8") as f:
        titles = f.read().split("\n")
else:
    # Walk the paginated API, following each page's "next" link.
    url = "https://gutendex.com/books/?page=1"
    while True:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        data = response.json()
        for book in data["results"]:
            titles.append(book["title"])
        if data["next"]:
            url = data["next"]
        else:
            break

    # Refresh the cache so later runs can skip the download.
    with open("titles.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(titles))


# In[ ]:
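# The fetch-then-cache step described above can also be sketched with the page fetcher passed in as a function, so the pagination logic can be tried offline. This is only a sketch under stated assumptions, not the notebook's own method: `iter_titles`, `load_or_fetch_titles`, `fake_fetch`, the `titles.json` file name, and the made-up pages are all hypothetical. It also swaps the newline-delimited `titles.txt` for a JSON cache, on the assumption that a title could contain a line break, which a newline-delimited file would split into two entries.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Optional


def iter_titles(fetch_page: Callable[[str], dict], start_url: str) -> Iterator[str]:
    """Yield titles page by page, following each page's "next" link."""
    url: Optional[str] = start_url
    while url:
        page = fetch_page(url)
        for book in page["results"]:
            yield book["title"]
        url = page.get("next")  # None on the last page ends the loop


def load_or_fetch_titles(
    cache_path: Path, fetch_page: Callable[[str], dict], start_url: str
) -> List[str]:
    """Return cached titles if the cache file exists, else fetch and cache them."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    titles = list(iter_titles(fetch_page, start_url))
    cache_path.write_text(json.dumps(titles, ensure_ascii=False), encoding="utf-8")
    return titles


# Made-up pages standing in for API responses (hypothetical data, offline only).
fake_pages = {
    "page-1": {
        "results": [{"title": "Book A"}, {"title": "A Title\nWith a Line Break"}],
        "next": "page-2",
    },
    "page-2": {"results": [{"title": "Book C"}], "next": None},
}
calls: List[str] = []


def fake_fetch(url: str) -> dict:
    """Record each requested URL and return the matching made-up page."""
    calls.append(url)
    return fake_pages[url]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "titles.json"
    first = load_or_fetch_titles(cache, fake_fetch, "page-1")   # fetches both pages
    second = load_or_fetch_titles(cache, fake_fetch, "page-1")  # served from the cache

print(first)
print(calls)
```

# Against the real API, the fetcher could be something like `lambda url: requests.get(url, timeout=30).json()`, passed together with the start URL used earlier in this notebook.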