#!/usr/bin/env python
# coding: utf-8

# __The purpose of this notebook is to explore which linguistic aspects of a book influence a reader's opinion of that book__.
# 
# Open Library Books API
# ----------------------
# 
# The data in this notebook comes primarily from the [Open Library Books API](https://openlibrary.org/dev/docs/api/books), which addresses the question of "data by book". But we need thousands of books, which means we need at least 1000 titles that
# 
# 1. have corresponding entries in the Open Library Books API, and
# 2. have full texts that are downloadable for analysis
# 
# What we need is effectively a "book title database". [Project Gutenberg](https://www.gutenberg.org/) hosts more than 70,000 books with downloadable texts. In addition, [a 3rd-party API, Gutendex](https://github.com/garethbjohnson/gutendex), exposes all of the titles. This will be a good starting point.
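# 
# As a rough illustration of requirement 1, the cell below sketches how a title could be checked against Open Library. It assumes the Open Library Search API (`https://openlibrary.org/search.json`) is used for the title lookup, since the Books API is keyed by identifiers such as ISBN or OLID rather than by title. The helper names `open_library_search_url` and `has_open_library_entry` are hypothetical and not part of the original notebook; no network request is made here.

# In[ ]:


from urllib.parse import urlencode


def open_library_search_url(title, limit=1):
    """Build a title-search URL (assumes the Open Library Search API)."""
    query = urlencode({"title": title, "limit": limit})
    return f"https://openlibrary.org/search.json?{query}"


def has_open_library_entry(search_response):
    """Return True if a parsed search response reports at least one match."""
    return search_response.get("numFound", 0) > 0


# Hypothetical usage: inspect the URL before fetching it with requests.get(...)
print(open_library_search_url("Pride and Prejudice"))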

# Getting List of Book Titles
# ---------------------------
# 
# We iterate over the [Gutendex API](https://github.com/garethbjohnson/gutendex) and fetch all titles first. This is a very costly operation, so all titles have been pre-fetched and stored in a cache file. By default, we read titles from the cache. To update the titles, change `read_titles_from_cache` from `True` to `False` below:

# In[2]:


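import requests

titles = []

read_titles_from_cache = True

if read_titles_from_cache:
    # Read one title per line from the cache, skipping blank lines
    with open("titles.txt", "r", encoding="utf-8") as f:
        titles = [line for line in f.read().split("\n") if line]
else:
    url = "https://gutendex.com/books/?page=1"

    # Follow the paginated "next" links until the last page is reached
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        data = response.json()
        for book in data["results"]:
            # Collapse any internal line breaks so each title stays on one cache line
            titles.append(" ".join(book["title"].split()))
        url = data["next"]

    with open("titles.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(titles))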
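

# The loop above keeps only titles, while requirement 2 also needs a downloadable full text. As a minimal sketch, assuming each Gutendex book record carries a `formats` mapping from MIME type to download URL, the hypothetical helper `plain_text_url` below picks a plain-text link from one record. The sample record is made up for illustration and is not real API output.

# In[ ]:


def plain_text_url(book):
    """Return the first plain-text download URL in a book record, or None."""
    for mime_type, link in book.get("formats", {}).items():
        # MIME types may carry a charset suffix, e.g. "text/plain; charset=utf-8"
        if mime_type.startswith("text/plain"):
            return link
    return None


# Illustrative record only; URLs are placeholders
sample_book = {
    "title": "Example Title",
    "formats": {
        "text/html": "https://example.org/1.html",
        "text/plain; charset=utf-8": "https://example.org/1.txt",
    },
}

print(plain_text_url(sample_book))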


# In[ ]: