Demo: How to scrape multiple things from multiple pages

The goal is to scrape information about the five top-grossing movies of each year, for 10 years. For each movie I want the rank, the title, and how much money it grossed at the box office. In the end I will put the scraped data into a CSV file.

In [32]:

from bs4 import BeautifulSoup
import requests

In [33]:

url = 'https://www.boxofficemojo.com/year/2018/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

Using Developer Tools, I discover the data I want is in an HTML table. I also discover that it is the only table on the page.

I store it in a variable named table.

In [34]:

table = soup.find( 'table' )

I use trial-and-error testing with print() to discover whether I can get row and cell data cleanly from the table.

In [35]:

# get all the rows from that one table
rows = table.find_all('tr')
# some more trial-and-error testing to find out which row holds the first movie
print(rows[1])
# now that I have the right row, get all the cells in that row
cells = rows[1].find_all('td')
# see whether I can print the movie title cleanly
title = cells[1].text
print(title)

<tr><td class="a-text-right mojo-header-column mojo-truncate mojo-field-type-rank mojo-sort-column">1</td><td class="a-text-left mojo-field-type-release mojo-cell-wide"><a class="a-link-normal" href="/release/rl2992866817/?ref_=bo_yld_table_1">Black Panther</a></td><td class="a-text-left mojo-field-type-genre hidden">-</td><td class="a-text-right mojo-field-type-money hidden">-</td><td class="a-text-right mojo-field-type-duration hidden">-</td><td class="a-text-right mojo-field-type-money mojo-estimatable">$700,059,566</td><td class="a-text-right mojo-field-type-positive_integer">4,084</td><td class="a-text-right mojo-field-type-money mojo-estimatable">$700,059,566</td><td class="a-text-left mojo-field-type-date a-nowrap">Feb 16</td><td class="a-text-left mojo-field-type-studio"><a class="a-link-normal" href="https://pro.imdb.com/company/co0226183/boxoffice/?view=releases&amp;ref_=mojo_yld_table_1&amp;rf=mojo_yld_table_1" rel="noopener" target="_blank">Walt Disney Studios Motion Pictures<svg class="mojo-new-window-svg" viewbox="0 0 32 32" xmlns="http://www.w3.org/2000/svg">
<path d="M24,15.57251l3,3V23.5A3.50424,3.50424,0,0,1,23.5,27H8.5A3.50424,3.50424,0,0,1,5,23.5V8.5A3.50424,3.50424,0,0,1,8.5,5h4.92755l3,3H8.5a.50641.50641,0,0,0-.5.5v15a.50641.50641,0,0,0,.5.5h15a.50641.50641,0,0,0,.5-.5ZM19.81952,8.56372,12.8844,17.75a.49989.49989,0,0,0,.04547.65479l.66534.66528a.49983.49983,0,0,0,.65479.04553l9.18628-6.93518,2.12579,2.12585a.5.5,0,0,0,.84741-.27526l1.48273-9.35108a.50006.50006,0,0,0-.57214-.57214L17.969,5.59058a.5.5,0,0,0-.27526.84741Z"></path>
</svg></a></td><td class="a-text-right mojo-field-type-boolean hidden">false</td></tr>
Black Panther

Next I try a for-loop to see if I can cleanly get the first five movies in the table.

In [36]:

# get top 5 movies on this page - I know the first row is [1]
for i in range(1, 6):
    cells = rows[i].find_all('td')
    title = cells[1].text
    print(title)

Black Panther
Avengers: Infinity War
Incredibles 2
Jurassic World: Fallen Kingdom
Deadpool 2

Next I try a similar for-loop to get the total gross for the top five movies. Developer Tools shows me this value is in the eighth cell of each row, which is index 7 because Python counts from 0.

In [37]:

# I would like to get the total gross number also
for i in range(1, 6):
    cells = rows[i].find_all('td')
    gross = cells[7].text
    print(gross)

$700,059,566
$678,815,482
$608,581,744
$417,719,760
$318,491,426

Now I test getting all the values I want from each row, and it works!

In [38]:

# next I want to get rank (1-5), title and gross all on one line
for i in range(1, 6):
    cells = rows[i].find_all('td')
    print(cells[0].text, cells[1].text, cells[7].text)

1 Black Panther $700,059,566
2 Avengers: Infinity War $678,815,482
3 Incredibles 2 $608,581,744
4 Jurassic World: Fallen Kingdom $417,719,760
5 Deadpool 2 $318,491,426
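
The loop above relies on fixed positions: the movies start at rows[1], and every row has at least eight cells. As a hedged sketch of a more defensive version, the code below checks the number of cells before indexing. The HTML string is a made-up stand-in for the real page (an assumption for illustration only), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page: a header row, one movie row with
# eight cells, and one short row that would break positional indexing.
html = """
<table>
  <tr><th>Rank</th><th>Release</th></tr>
  <tr><td>1</td><td>Black Panther</td><td>-</td><td>-</td><td>-</td>
      <td>$700,059,566</td><td>4,084</td><td>$700,059,566</td></tr>
  <tr><td>note</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table').find_all('tr')

movies = []
for row in rows:
    cells = row.find_all('td')
    # The sample header row uses <th>, so it has no <td> cells at all;
    # skip any row too short to hold rank, title and gross.
    if len(cells) < 8:
        continue
    movies.append((cells[0].text, cells[1].text, cells[7].text))

print(movies)
```

With a check like this, a changed or unexpected row is skipped instead of raising an IndexError partway through a long scrape.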

I want this same data for each of 10 years, so first I will create a list of the years I want.

In [39]:

# create a list of the 10 years I want
years = []
start = 2019
for i in range(0, 10):
    years.append(start - i)
print(years)

[2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010]
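
The same list can also be built without a loop. This is only an alternative sketch, not a correction: range() accepts a negative step, so it can count down from 2019.

```python
# Build the same ten-year list in one step by counting down from 2019.
# range() stops before its second argument, so 2009 is excluded.
years = list(range(2019, 2009, -1))
print(years)
```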

Still prepping for the 10 years, I create a base URL to use when I open each year's page.

In [40]:

# create base url
base_url = 'https://www.boxofficemojo.com/year/'
# test it
# print(base_url + years[0] + '/') -- ERROR! years[0] is an int, so it needs str()
print( base_url + str(years[0]) + '/')

https://www.boxofficemojo.com/year/2019/
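
Another way to avoid the str() conversion is an f-string, which turns the integer into text on its own. A minimal sketch, using a shortened two-year list and a variable named urls that is not part of the notebook:

```python
base_url = 'https://www.boxofficemojo.com/year/'
years = [2019, 2018]  # shortened list, just for the demonstration

# An f-string converts the integer to text automatically,
# so no explicit str() call is needed.
urls = [f'{base_url}{year}/' for year in years]
print(urls[0])
```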

Now I should have all the pieces I need. I will test the code with a print statement.

In [41]:

# collect all necessary pieces (tested above) to make a loop that gets 
# top 5 movies for each of the 10 years in my list of years

for year in years:
    url = base_url + str(year) + '/'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find( 'table' )
    rows = table.find_all('tr')
    for i in range(1, 6):
        cells = rows[i].find_all('td')
        print(cells[0].text, cells[1].text, cells[7].text)

1 Avengers: Endgame $858,373,000
2 The Lion King $543,638,043
3 Toy Story 4 $434,038,008
4 Frozen II $470,089,732
5 Captain Marvel $426,829,839
1 Black Panther $700,059,566
2 Avengers: Infinity War $678,815,482
3 Incredibles 2 $608,581,744
4 Jurassic World: Fallen Kingdom $417,719,760
5 Deadpool 2 $318,491,426
1 Star Wars: Episode VIII - The Last Jedi $620,181,382
2 Beauty and the Beast $504,014,165
3 Wonder Woman $412,563,408
4 Guardians of the Galaxy Vol. 2 $389,813,101
5 Spider-Man: Homecoming $334,201,140
1 Finding Dory $486,295,561
2 Rogue One: A Star Wars Story $532,177,324
3 Captain America: Civil War $408,084,349
4 The Secret Life of Pets $368,384,330
5 The Jungle Book $364,001,123
1 Jurassic World $652,270,625
2 Star Wars: Episode VII - The Force Awakens $936,662,225
3 Avengers: Age of Ultron $459,005,868
4 Inside Out $356,461,711
5 Furious 7 $353,007,020
1 Guardians of the Galaxy $333,176,600
2 The Hunger Games: Mockingjay - Part 1 $337,135,885
3 Captain America: The Winter Soldier $259,766,572
4 The Lego Movie $257,760,692
5 Transformers: Age of Extinction $245,439,076
1 Iron Man 3 $409,013,994
2 The Hunger Games: Catching Fire $424,668,047
3 Despicable Me 2 $368,065,385
4 Man of Steel $291,045,518
5 Monsters University $268,492,764
1 The Avengers $623,357,910
2 The Dark Knight Rises $448,139,099
3 The Hunger Games $408,010,692
4 Skyfall $304,360,277
5 The Twilight Saga: Breaking Dawn - Part 2 $292,324,737
1 Harry Potter and the Deathly Hallows: Part 2 $381,011,219
2 Transformers: Dark of the Moon $352,390,543
3 The Twilight Saga: Breaking Dawn - Part 1 $281,287,133
4 The Hangover Part II $254,464,305
5 Pirates of the Caribbean: On Stranger Tides $241,071,802
1 Avatar $749,766,139
2 Toy Story 3 $415,004,880
3 Alice in Wonderland $334,191,110
4 Iron Man 2 $312,433,331
5 The Twilight Saga: Eclipse $300,531,751

When I see the result, I realize I need to make two adjustments:

Each line needs to include the year.
The gross should be cleaned so it's a pure integer.

I can get rid of the dollar sign and the commas by combining two string methods, .strip() and .replace().

In [42]:

# testing the clean-up code

num = '$293,004,164'
print(num.strip('$').replace(',', ''))

293004164
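
The cleaned value above is still a string of digits. If a true integer is wanted, the clean-up can be wrapped in a small function. This is a sketch under assumptions: clean_gross is a hypothetical helper name, and returning None for a non-numeric cell (such as the "-" placeholders seen in the table row earlier) is one possible choice, not the notebook's method.

```python
def clean_gross(text):
    """Turn a string such as '$293,004,164' into the integer 293004164.

    Returns None when nothing numeric is left, e.g. for a '-' placeholder.
    clean_gross is a hypothetical helper name, not part of the notebook.
    """
    digits = text.strip().lstrip('$').replace(',', '')
    return int(digits) if digits.isdigit() else None

print(clean_gross('$293,004,164'))
print(clean_gross('-'))
```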

In [43]:

# testing a way to add the year to each line, using a list with only two years in it to save time

miniyears = [2017, 2014]
for year in miniyears:
    url = base_url + str(year) + '/'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find( 'table' )
    rows = table.find_all('tr')
    for i in range(1, 6):
        cells = rows[i].find_all('td')
        gross = cells[7].text.strip('$').replace(',', '')
        print(year, cells[0].text, cells[1].text, gross)

2017 1 Star Wars: Episode VIII - The Last Jedi 620181382
2017 2 Beauty and the Beast 504014165
2017 3 Wonder Woman 412563408
2017 4 Guardians of the Galaxy Vol. 2 389813101
2017 5 Spider-Man: Homecoming 334201140
2014 1 Guardians of the Galaxy 333176600
2014 2 The Hunger Games: Mockingjay - Part 1 337135885
2014 3 Captain America: The Winter Soldier 259766572
2014 4 The Lego Movie 257760692
2014 5 Transformers: Age of Extinction 245439076

Now that I know it all works, I want to save the data in a CSV file.

Python has a handy built-in module for reading and writing CSVs. We need to import it before we can use it.

In [44]:

import csv

# open new file for writing - this creates the file
csvfile = open("movies.csv", 'w', newline='', encoding='utf-8')

# make a new variable, c, to hold Python's CSV writer object
c = csv.writer(csvfile)

# write a header row to the csv
c.writerow( ['year', 'rank', 'title', 'gross'] )

# modified code from above
for year in years:
    url = base_url + str(year) + '/'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find( 'table' )
    rows = table.find_all('tr')
    for i in range(1, 6):
        cells = rows[i].find_all('td')
        gross = cells[7].text.strip('$').replace(',', '')
        # print(year, cells[0].text, cells[1].text, gross)
        # instead of printing, I need to make a LIST and write that list to the CSV as one row
        # I use the same cells that I had printed before 
        c.writerow( [year, cells[0].text, cells[1].text, gross] )

# close the file
csvfile.close()

print("The CSV is done!")

The CSV is done!

The result is a CSV file, named movies.csv, that has 51 rows: the header row plus 5 movies for each year from 2010 through 2019. It has four columns: year, rank, title, and gross.
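
One way to check a file like this is to read it back with the same csv module. The sketch below writes two sample rows (copied from the 2018 output above) to a temporary file and reads them back. The temporary path and the name movies_sample.csv are assumptions for the demonstration, so the real movies.csv is not touched. It also uses a with-statement, which closes the file without an explicit close() call.

```python
import csv
import os
import tempfile

# Two sample rows copied from the output above; the temporary path is
# only for this demonstration and is not the notebook's movies.csv.
sample = [
    [2018, '1', 'Black Panther', '700059566'],
    [2018, '2', 'Avengers: Infinity War', '678815482'],
]
path = os.path.join(tempfile.mkdtemp(), 'movies_sample.csv')

# The with-statement closes the file automatically, even after an error.
with open(path, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['year', 'rank', 'title', 'gross'])
    writer.writerows(sample)

# Read the file back to confirm the header and the row count.
with open(path, newline='', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))

print(len(rows), rows[0]['title'])
```

csv.DictReader treats the first line as the header, so each row comes back as a dictionary keyed by year, rank, title and gross, with every value read as a string.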

Note that the final cell above does all the work of creating this CSV by scraping 10 separate web pages. It depends only on the two imports at the top, the years list, and base_url. Everything else above it is instruction and demonstration, intended to show the problem-solving you need to go through to reach a desired scraping result.

You would not need to keep all the other work. Those trial-and-error cells could be deleted.