So far, we've covered ways of grabbing data that were formatted to be grabbed. The term 'web scraping' refers to the messier business of pulling material from web sites that were really meant for people, not for computers. Web sites, of course, can include a variety of objects: text, images, video, Flash, etc., and your success at scraping what you want will vary. In other words, scraping involves a bit of MacGyvering.
Useful packages for scraping are requests and bs4 (BeautifulSoup). The code below imports both, installing bs4 first if it is missing.
We'll run through a few quick examples, but for more on this topic, I recommend:
# Import requests and BeautifulSoup, installing bs4 first if it is missing
import requests
try:
    from bs4 import BeautifulSoup
except ImportError:
    !pip install bs4
    from bs4 import BeautifulSoup
# Import re, a package for using regular expressions
import re
The requests package works a lot like the urllib package in that it sends a request to a server and stores the server's response in a variable, here named response.
# Send a request to a web page
response = requests.get('https://xkcd.com/869')
# The response object holds the server's reply; its .text attribute is the raw HTML of the page
print(response.text)
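A request can fail or hang, so it can help to check the result before parsing it. The sketch below wraps the call in a hypothetical helper, fetch_page (not part of requests); the 10-second timeout is an arbitrary assumption, not a recommended value.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the HTML text of `url`, raising an error if the request fails.

    `fetch_page` is a hypothetical helper for illustration; the 10-second
    default timeout is an arbitrary choice.
    """
    # timeout keeps the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

Calling fetch_page('https://xkcd.com/869') should give the same HTML as response.text above, provided the server answers successfully.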
BeautifulSoup is designed to intelligently read raw HTML code, i.e., what is stored in the text of the response variable generated above. The command below reads in the raw HTML and parses it into logical components that we can navigate and search.
The lxml in the command specifies a particular parser for deconstructing the HTML. It requires the lxml package to be installed; Python's built-in 'html.parser' is an alternative.
# Parse the raw HTML into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')
type(soup)
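Once a page is parsed, its tags and attributes can be queried directly. Here is a minimal sketch on a made-up HTML snippet (the markup and URLs are invented for illustration); it uses Python's built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML document standing in for response.text
html = """
<html><body>
  <h1>Comics</h1>
  <img src="https://example.com/a.png" alt="first">
  <img src="https://example.com/b.jpg" alt="second">
</body></html>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag; .text gives its text content
heading = soup_demo.find('h1').text

# find_all() returns every matching tag; attributes are read like dict keys
image_urls = [img['src'] for img in soup_demo.find_all('img')]

print(heading)
print(image_urls)
```

Reading the src attribute of img tags this way is often more reliable than searching the page text, since image links usually live in attributes rather than in visible text.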
Here we search the text of the web page's body for the first instance of https://....png, that is, the first URL of a PNG image that appears in the page's text. This is done using re, Python's regular expression module (see https://developers.google.com/edu/python/regular-expressions for more info on this useful module).
The match object returned by search() holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs. The group() method of the match returns the full string that was matched.
# Search the page text for the first URL of a PNG file
match = re.search(r'https://.*\.png', soup.body.text)
# What was found in the search
print(match.group())
# And here is some Jupyter code to display the picture resulting from it
from IPython.display import Image
Image(url=match.group())
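To see what a match object holds, here is a small sketch on an invented string (the text and URLs are made up). It also illustrates a caveat: .* is greedy, so when two .png URLs share a line the match runs through to the last one, whereas the non-greedy .*? stops at the first.

```python
import re

# Invented sample text with two PNG URLs on the same line
text = "see https://example.com/a.png and https://example.com/b.png here"

greedy = re.search(r'https://.*\.png', text)
lazy = re.search(r'https://.*?\.png', text)

print(greedy.group())       # runs from the first URL through the second
print(lazy.group())         # just the first URL
print(lazy.span())          # (start, end) position of the match in text
print(lazy.re.pattern)      # the regular expression that was used
print(lazy.string == text)  # the match keeps the original input string

# search() returns None when nothing matches, so check before calling group()
missing = re.search(r'https://.*?\.gif', text)
print(missing is None)
```

Because search() can return None, testing the result before calling group() avoids an AttributeError on pages that contain no matching link.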