Scraping Data

  1. Skills we will develop
  2. Overview of different ways to get data
  3. Overview of python packages we can use

What skills do I need to learn to be a master Hacker(wo)man?

  1. Get the data: open/read a webpage, and pass specific queries to a server to control the content the server gives you
  2. Parse a (single) page to find specific elements of interest (like tables, specific text, URLs)
  3. Do that for a large number of webpages (building a "scraper", "crawler", or "spider")
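
The three skills above can be sketched with nothing but the standard library. This is only a minimal illustration, not the approach this page ends up recommending: the function and class names (`build_url`, `fetch`, `LinkParser`, `extract_links`, `crawl`) are made up for this example, and `example.com` is a placeholder site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def build_url(base, **params):
    """Skill 1: attach a query string so the server returns specific content."""
    return f"{base}?{urlencode(params)}" if params else base


def fetch(url):
    """Skill 1: open a page and return its HTML as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Skill 2: collect the href of every <a> tag on a single page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Skill 2: return every link on the page as an absolute URL."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def crawl(urls, get_page=fetch):
    """Skill 3: repeat the open-and-parse steps over many pages."""
    return {url: extract_links(get_page(url), url) for url in urls}
```

Passing a different `get_page` function (for example, one that reads saved HTML files) lets you try the parsing and looping steps without touching a real server.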

Ways to get data from the web

```{dropdown} 1: Manually click and download.

The way you would have done it before this class.
```


```{dropdown} 2: **Let pandas download your data,** like `pd.read_csv(url)`

Did you know? Pandas can often directly read tables on webpages!
- Try `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')`
- Very easy and fast! You don't even need to save the webpage to your hard drive.
- Notes on `read_html`:
    1. It can only handle basic HTML tables encoded directly in the page (for example, nothing rendered by JavaScript), and by default it **only grabs displayed text, so embedded URLs are lost.** (pandas 1.5 and later add an `extract_links` argument that can keep them.)
    2. If the website changes the data, you'll get the newer version the next time you run the code. (Potentially unstable, but it also updates automatically.)
```

```{dropdown} 3: "Install and play" APIs, like pandas_datareader

API stands for Application Programming Interface, and it is a way for your computer to send a request (a query) to a server and get some response (hopefully useful data).

"Install and play" API packages let you interact with a website without specifying the exact API requests to send to the server.

- The `pandas_datareader` plug-in for Yahoo stock prices is one version of this.
- `datadotworld` was another.
- Kaggle and most of the data sources listed on our resources page have API packages for Python.
- I upload your peer reviews and manage assignment permissions using `PyGithub` to interact with GitHub.
```

```{tip}

If you only need a few tables (fewer than 10 to 20; the threshold depends on your coding speed), download what you need manually.

If you need more, it's time to scrape.

**Options 1-3 are BY FAR the easiest.** Once you are past that threshold, I'd abandon the manual option and go with `pandas` or a nice API package.

Never ever try \#4 or \#5 without searching for "\<website\> python api" first.
```

```{dropdown} 4: Manual API queries for websites without "install and play" APIs

Many sites have an API endpoint of some kind serving up the data they show visitors.
```

````{dropdown} 5: **Scraping the data on the website by visiting each page and downloading the data needed**

The last resort. You can't find the API serving the data, but your eyes can see it. And you want it, because websites contain a lot of data, like [GoT's IMDB page](https://www.imdb.com/title/tt0944947/?ref_=nv_sr_srsg_0).

```{warning}
This is an essential tool, but should be the last thing you try!
```

````

```{note} Wisdom from Greg Reda about scraping data

  1. You should check a site's terms and conditions before you scrape them. It's their data and they likely have some rules to govern it.
  2. Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.
  3. Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.
  4. Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.

```
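
Points 2 and 3 of that list can be built into a small helper. The sketch below is one possible shape, not a standard recipe: `fetch_politely`, `get_page`, and `delay_seconds` are names invented for this example, and the one-second default is an arbitrary assumption (check what the site asks for).

```python
import time


def fetch_politely(urls, get_page, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    `get_page` is any function that takes a URL and returns the page contents,
    e.g. a thin wrapper around requests.get or urllib.request.urlopen.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # space out requests so we don't hammer the server
        try:
            pages[url] = get_page(url)
        except Exception as err:  # scrapers break: record the failure and keep going
            pages[url] = None
            print(f"Could not fetch {url}: {err}")
    return pages
```

Storing `None` for failed pages makes it easy to see afterwards which URLs need a second look or a rewritten parser.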

Useful packages, tricks, and tips

Web scraping packages are always developing and evolving.

| Task | Thoughts |
| --- | --- |
| To "open" a page | `urllib` or `requests`. `requests` is probably the best for sending API queries.<br><br>**Warning:** lots of walkthroughs online use `urllib2`, which worked for Python 2 but not Python 3. Use `urllib` instead, and you might have to include a few tweaks. For example, if you see `from urllib2 import urlopen`, replace it with `from urllib.request import urlopen`. |
| To parse a page | `beautifulsoup`, `lxml`, or `pyquery` |
| Combining opening/parsing | `requests_html` is a relatively new package and might be excellent. Its code is simply a combination of many of the above. |
| Blocked because you look like a bot or need to accept cookies? | `selenium` is one way to "impersonate" a human, and also can help develop scraping macros, but you might not need it except on difficult scraping projects. It opens a literal browser window.<br><br>`requests_html` and `requests` can also store and use cookies. I'd recommend you try this before `selenium`. |
| Blocked because you're sending requests too fast? | `from time import sleep` allows you to `sleep(<# of seconds>)` your code. |
| Wonder what your current HTML looks like? | `from IPython.display import HTML` then `HTML(<html object>)` will show you what the HTML you have looks like. E.g. if you're using `r = requests.get(url)`, then you can use `HTML(r.text)` to render the HTML in the response. |
| How do I find a particular "piece" of a webpage? | E.g. Q: Where is that table? A: Oh, it's inside the HTML tag called "table3".<br><br>You can search for elements via attributes, CSS selectors, XPath, and text. This will make more sense soon.<br><br>To find that info: Right click on an element you're interested in and click "Inspect Element". (F12 is the Windows shortcut.) |
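
To make the "Where is that table?" question concrete, here is a rough sketch that pulls out one table by its attributes. It uses the standard library's `html.parser` as a stand-in so it runs without extra installs; `beautifulsoup`, `lxml`, and `pyquery` do the same job with much shorter selectors. The sample HTML, the `TableById` class, and the reading of "table3" as an `id` attribute are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: two tables, only one of which we want.
SAMPLE_HTML = """
<html><body>
  <table id="table1"><tr><td>ignore me</td></tr></table>
  <table id="table3">
    <tr><th>Ticker</th><th>Name</th></tr>
    <tr><td>AAA</td><td>Example Corp</td></tr>
  </table>
</body></html>
"""


class TableById(HTMLParser):
    """Collect the cell text of the <table> whose id matches `target_id`.

    Simplification: nested tables are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.target_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])  # start a new row
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_table and self.in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableById("table3")
parser.feed(SAMPLE_HTML)
print(parser.rows)  # [['Ticker', 'Name'], ['AAA', 'Example Corp']]
```

The attribute you match on (`id` here) is exactly what "Inspect Element" shows you in the browser.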

My suggestion

This is subject to change, but I think you should pick ONE opening and ONE parsing module and stick with it for now. `requests_html` is a pretty good option that opens pages and can parse them, and it allows you to use `lxml` or `pyquery` within it.

You can change and try other stuff as you go, but get as familiar with one package as you can (in a cheap/efficient way).

Now to contradict myself: Some of the packages above can't do things others can, or do them much more slowly, or lead to code that is hard to write, read, and debug. Sometimes, you're holding a hammer but you need a screwdriver. What I'm saying is, if another package can easily do the job, use it. (Just realize that learning a new package comes with a fixed cost, so be sure you need that screwdriver before grabbing it.)