Skill 3: Building a spider

So, now, we can use Python to open a page, pass an API query, and parse a page for the elements we're interested in.

One API hit is cool, but do you know what's really cool?

One million API hits.

Ok, maybe not a million.[^1] But now that you can write a request and modify search parameters, you might need to run a bunch of searches.

Scraping jobs typically fall into one of two camps:

  1. loop over URLs or some search parameters (like firm names)
  2. navigate from an initial page through subsequent pages (e.g., through search results)

Of course, both can be true: sometimes a spider might have a list of URLs (search for firms that filed an 8-K in 2000, then those that filed in 2001) and for each URL (year) click through all 8-Ks.

```{admonition} The trick they don't want you to know
:class: tip
**When your job falls into the first camp - you want to loop over a list of URLs - a good way to do that is**: Define a function to do one search, then call that for each search in a list of searches.
```

A silly spider

[^1]: I've done well over a million API hits in the name of science.

For example:

In [ ]:
import pandas as pd
import requests

def search_itunes(search_term):
    '''Run one simple iTunes search and return the results as a DataFrame'''
    base_url = ''  # the iTunes search URL goes here
    search_parameters = {'term': search_term}
    r = requests.get(base_url, params = search_parameters)
    results_df = pd.DataFrame(r.json()['results'])
    return results_df

We can run this one artist at a time:

In [ ]:
search_itunes('billie eilish')      # one search at a time
search_itunes('father john misty')  # "another one" - dj khaled

Or we can loop over them and do something with each result (saving each results_df to a file is a good idea):

In [ ]:
artists = ['billie eilish','father john misty'] # you can loop over them!

# download the results and save locally
for artist in artists:
    df = search_itunes(artist)
    # you could do anything with the results here
    # a good idea in many projects: save the webpage/search results
    # even better: add the saving function inside the "search_itunes" fcn
    # but this is just a toy illustration, so nothing happens
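
The saving step mentioned in the comments could look something like the minimal sketch below. The helper name `save_results`, the `itunes_results` folder, and the choice of CSV are assumptions made for illustration, not something the example above prescribes.

```python
from pathlib import Path

import pandas as pd


def save_results(results_df, search_term, folder='itunes_results'):
    '''Save one search's results as a CSV inside folder and return the path.

    The folder name and the CSV format are illustrative choices.
    '''
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    # turn "billie eilish" into "billie_eilish.csv"
    filename = out_dir / (search_term.replace(' ', '_') + '.csv')
    results_df.to_csv(filename, index=False)
    return filename
```

Inside the loop, a call like `save_results(df, artist)` would take the place of the "nothing happens" comments.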

LATER, you will want to analyze those files. Just loop over the files again:

In [ ]:
for artist in artists:
    # load the saved file
    # call a function you wrote to parse/analyze/reformat one file
    # do something with the output from the parser
    # but this is just a toy illustration, so nothing happens
    pass  # a loop body needs at least one statement to run
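
As a rough sketch of that second pass, the function below assumes each search was saved earlier as a CSV named after the search term (for example `itunes_results/billie_eilish.csv`). That layout and the name `load_saved_results` are hypothetical; adapt them to however you actually saved the files.

```python
from pathlib import Path

import pandas as pd


def load_saved_results(search_terms, folder='itunes_results'):
    '''Load the saved CSV for each search term and stack them into one DataFrame.'''
    frames = []
    for term in search_terms:
        filename = Path(folder) / (term.replace(' ', '_') + '.csv')
        if not filename.exists():
            print(f'no saved file for {term!r}, skipping')
            continue
        one_df = pd.read_csv(filename)   # load the saved file
        one_df['search_term'] = term     # remember which search it came from
        frames.append(one_df)
    if not frames:
        return pd.DataFrame()            # nothing was saved yet
    return pd.concat(frames, ignore_index=True)
```

Notice that this step never touches the web server: it only reads what is already on disk.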

The main web scraping problems (and workarounds)

Also, check out the table from a few pages ago on useful packages and tips.

```{dropdown} Issue 1: The jobs are slow

In many web scraping projects, a lot of data needs to get scraped, over thousands (or millions) (or billions) of pages. It's unlikely that you can do this all in one session. (What if your WiFi disconnects, or Windows decides to do an update, or the webpage freezes you out for a period of time?)


  1. Write code that only hits the server one time, and saves the results to your computer. "Step 1" of the search_itunes example above does that. Then "step 2" uses/parses those files without going to the webpages again.
  2. You want your spider to resume, not restart. Ensure that your code can resume where it left off without having to restart from scratch. My usual solution:
    # as I'm looping over webpages:
    if not os.path.exists(<filename this page would get>):
        okay_do_the_download() # whatever the function is
    # if the file already exists, skip to the next webpage
  3. Your spider WILL fail - you don't want it to stop. I typically use a try-except-else block. The try part accesses the URL or sends the API request, the except part prints the failure or writes it to a log file, and the else part runs the code that should only execute if the request succeeded. For example, I could improve the search_itunes function:
    if not os.path.exists(<filename this page would get>):
        try:
            r = requests.get(base_url, params = search_parameters)
        except Exception:
            print("hey this didn't work! prob print better info than")
            print("this string")
            # or... create strings and append them to an "error_list",
            # which you save to a text file or csv after the code finishes
            # and you can look at it then
        else:
            results_df = pd.DataFrame(r.json()['results'])
  4. Your spider WILL fail - you will want to know what failed and why. You should log failures, warnings, and errors. The prior example can be adjusted to do this well.
```
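
Putting the resume check, the try-except-else block, and the logging together might look like the sketch below. The function name `download_page`, the 30-second timeout, and the `fetch` argument (there so the function can be tried out without a network connection) are assumptions made for illustration.

```python
import logging
import os

import requests

logger = logging.getLogger('spider')


def download_page(url, filename, fetch=requests.get):
    '''Download url to filename unless it is already saved.

    Returns True if the file exists afterwards, False if the request failed.
    '''
    if os.path.exists(filename):
        return True                       # already have it: resume, don't restart
    try:
        r = fetch(url, timeout=30)
        r.raise_for_status()              # treat HTTP errors (404, 500, ...) as failures
    except Exception as e:
        logger.warning('FAILED %s: %s', url, e)   # record it and keep going
        return False
    else:
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(r.text)               # save the raw response
        return True
```

By default the warnings go to the console. Calling `logging.basicConfig(filename='spider.log', level=logging.INFO)` near the top of the script would send them to a file instead (the file name is arbitrary), so you can review the failures after the run.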

```{dropdown} Issue 2: Too much speed

Servers aren't free and can get overloaded. You've seen or heard of websites crashing due to high traffic - Fandango for Star Wars - Rogue One, Black Fridays, and the Canadian Immigration site in Nov 2016.

As such, webmasters often throttle or block computers that are sending too much traffic.


  1. Slow your code down with sleep(#). This is the main solution.
  2. Get API access with special permissions.
  3. If you can't slow down your spider (the code crawling the site), use multiple computers/IP addresses.
```
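
Here is a small sketch of the "slow down" approach. The wait bounds are arbitrary, the random jitter is an extra not mentioned above, and `download` stands in for whatever function fetches one page; the `sleep` argument exists only so the loop can be tried out without actually waiting.

```python
import random
import time


def crawl_politely(urls, download, min_wait=2.0, max_wait=5.0, sleep=time.sleep):
    '''Call download(url) for each url, pausing a few seconds between requests.'''
    for i, url in enumerate(urls):
        download(url)
        if i < len(urls) - 1:             # no need to wait after the last one
            sleep(random.uniform(min_wait, max_wait))
```

A fixed `time.sleep(3)` after each request is often enough; randomizing the pause just makes the traffic look a little less mechanical.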


````{dropdown} Issue 3: So... I'm downloading a loooot of files

You are!

It's important to save them in an organized way. There is no "one way", and the directory/storage scheme I choose depends on the job. The main thing is that you probably want two abilities after the download:

  1. If you sequentially open all files, can you tell what they are? (E.g. the firm, the year, the form type.)
  2. If you want to only open some files, can you do that without opening all files? (E.g. only open 10-Ks but not 10-Qs.)

How you achieve these is somewhat up to you, but you basically have two choices (and these can work in tandem):

Solution 1: Build the folder structure so that the path to the file tells you what you need to know.

E.g. /gvkey_10145/10-ks/2008/934573495-923875934.txt is "obviously" the 2008 10-K for firm 10145, and you know this without needing to open the file and even though the filename itself is not very clear.
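
A sketch of how such paths could be built and read back with `pathlib`. The helper names `filing_path` and `describe_path` and the `data` root folder are hypothetical, and the layout simply mirrors the example path above.

```python
from pathlib import Path


def filing_path(root, gvkey, form_folder, year, filename):
    '''Build root/gvkey_<gvkey>/<form_folder>/<year>/<filename>.'''
    return Path(root) / f'gvkey_{gvkey}' / form_folder / str(year) / filename


def describe_path(path):
    '''Recover (gvkey, form_folder, year) from a path built by filing_path.'''
    gvkey_part, form_folder, year = Path(path).parts[-4:-1]
    return gvkey_part.replace('gvkey_', ''), form_folder, int(year)


p = filing_path('data', 10145, '10-ks', 2008, '934573495-923875934.txt')
print(p.as_posix())       # data/gvkey_10145/10-ks/2008/934573495-923875934.txt
print(describe_path(p))   # ('10145', '10-ks', 2008)
```

With this layout, opening only the 10-Ks is a matter of globbing, e.g. `Path('data').glob('gvkey_*/10-ks/*/*.txt')`.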

Solution 2: Keep a master list of documents

Sometimes it's not possible or reasonable to know exactly how to build the directory in advance. For example, forms filed to the SEC in 2008 are often for fiscal year 2007. So what does the "2008" folder mean? How can you tell before running everything? So maybe you just download all the 10-Ks for that firm inside the /gvkey_10145/10-ks/ folder.

To find the 2008 10-K, you'd open up a master list of documents which contains variables with enough info to assemble the path to each file, and info about each file. Then you can query("form == '10-K' & fyear == 2008"), assemble the filename, and run your code.

This master list can be assembled before you run your spider (like in Assignment 5), as you run the spider (collect the info and save it as you go), or after the download by running some code one time (either using the file paths, a la /gvkey_10145/10-ks/2008/934573495-923875934.txt, or by opening every single file to extract the info about the document).
````
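
To make the master-list idea concrete, here is a toy sketch with pandas. The rows, the column names, and the `doc_*.txt` filenames are all made up for illustration; a real master list would come from your own download records.

```python
import pandas as pd

# a made-up master list: one row per downloaded document
master = pd.DataFrame({
    'gvkey':    [10145, 10145, 10145],
    'form':     ['10-K', '10-K', '10-Q'],
    'fyear':    [2007, 2008, 2008],
    'filename': ['doc_a.txt', 'doc_b.txt', 'doc_c.txt'],
})

# pick the documents you want without opening any files
wanted = master.query("form == '10-K' & fyear == 2008")

# assemble the path to each selected file from the columns
paths = [
    f'/gvkey_{row.gvkey}/10-ks/{row.filename}'
    for row in wanted.itertuples()
]
print(paths)   # prints ['/gvkey_10145/10-ks/doc_b.txt']
```

Saving the master list itself (for example with `master.to_csv(...)`) means it survives between sessions, just like the downloaded files.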


You can combine all this discussion into a "general structure" for spiders. For each page you want to visit, you need to know

  1. The URL (or the search term)
  2. The folder and filename you want to save it to

And then, for each page you want to visit you'll run this:

def one_search(url, filename):
    if not os.path.exists(filename):
        try:
            r = requests.get(url)
        except Exception as e:
            # log the error somehow
            print(f'FAILED: {url} ({e})')
        else:
            # save the results, I typically save the RAW source
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(r.text)
            sleep(3) # be nice to server

And that gets run within some loop.

for url in urls:
    filename_to_save = <some function of the url>
    one_search(url, filename_to_save)

This structure is pretty adaptable depending on the nature of the problem and the input data you have that yields the list of URLs to visit.
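
For the "some function of the url" step, one possible approach (an assumption, not the only reasonable choice) is to hash the URL so that every page gets a stable, filesystem-safe name. The names `url_to_filename` and `run_spider` and the `downloads` folder are hypothetical; `one_search` is any function that takes a URL and a filename.

```python
import hashlib
import os
from urllib.parse import urlparse


def url_to_filename(url, folder='downloads'):
    '''Map a URL to a stable local path: <folder>/<host>/<hash of url>.html'''
    host = urlparse(url).netloc or 'unknown_host'
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()[:16]
    return os.path.join(folder, host, digest + '.html')


def run_spider(urls, one_search, folder='downloads'):
    '''Call one_search(url, filename) for every URL; return {url: filename}.'''
    visited = {}
    for url in urls:
        filename_to_save = url_to_filename(url, folder=folder)
        os.makedirs(os.path.dirname(filename_to_save), exist_ok=True)
        one_search(url, filename_to_save)
        visited[url] = filename_to_save   # doubles as a simple master list
    return visited
```

The trade-off is that hashed filenames are not human-readable, so the returned dictionary (or a saved master list) is what tells you which file is which.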

Would you like another tutorial to try?

Again, Greg Reda has a nice walkthrough on building robust code to download a list of pages; it incorporates many of the code elements we've talked about.