Work out how to split the problem up.
Do simple tests of each part where necessary, then assemble the parts later on...
#Load file
import requests
from bs4 import BeautifulSoup
url = 'https://www.ssepd.co.uk/Powertrack/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
Find a div of interest...
powertracksummary = soup.find_all('div', {'class': 'power-track-summary'})
#How many are there? There should be just one
len(powertracksummary)
1
powertracksummary = powertracksummary[0]
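The "there should be just one" check above can also be written so that it fails loudly if the page changes. A minimal sketch, assuming a made-up helper called find_only and a tiny made-up HTML sample standing in for the live page (neither is part of the original notebook):

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for the live page (an assumption for illustration)
sample_html = '<div class="power-track-summary"><div class="row">a</div></div>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

def find_only(node, name, attrs):
    """Hypothetical helper: return the single matching tag, or raise if the count is not 1."""
    matches = node.find_all(name, attrs)
    if len(matches) != 1:
        raise ValueError('Expected 1 match for {}, found {}'.format(attrs, len(matches)))
    return matches[0]

summary = find_only(sample_soup, 'div', {'class': 'power-track-summary'})
summary['class']
```

This does the same job as checking len() by eye and then taking item [0], but the check cannot be forgotten when the scraper is run again later.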
#Call these halfrows because another block of divs holds the rest of the data for the same data row
halfrows = powertracksummary.find_all('div', {'class': 'row'})
len(halfrows)
21
Okay - there's something wrong with that selector, because there are not 21 rows in the table. The row class does not uniquely identify the thing we are interested in, which makes it less than useful...
(I didn't do this when you were here - I should have checked!)
Let's just look at some of the row-classed divs...
#preview one of them
halfrows[0]
<div class="row"> <div class="col-xs-6 col-sm-4 col-md-2"> <div class="power-track-date-info"> <span class="glyphicon glyphicon-calendar"></span> <div class="date"> <div> 30 Jun </div> <div class="time"> 16:20 </div> </div> </div> </div> <div class="col-xs-6 col-sm-4 col-md-3"> <div class="heading"> Affected Area: </div> <div> OX10 & OX49 Areas </div> </div> <div class="col-xs-6 col-sm-4 col-md-3 pull-right"> <div class="heading reference"> Reference Number: </div> <div> EM9395 </div> </div> <div class="col-xs-6 col-sm-8 col-md-4 pull-right manage-row-height"> <div class="heading duration"> Duration since reported: </div> <div> 19 minute(s) ago </div> </div> </div>
#Let's look at the last one...
halfrows[-1]
<div class="affected-areas row"> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> UB5 4PP </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> UB5 4PW </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> UB5 4PY </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> UB5 4PZ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> UB5 4QH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> UB5 4QJ </div> </div>
#for reference...
halfrows[-1]['class']
['affected-areas', 'row']
This is not good: trying to scrape these can be a massive distraction and time waster if we assume the row-classed divs all have the same structure. In fact, the row class is grabbing us different sorts of row, so it's not very useful...
We really need to look for an alternative, because the main point of scraping is to look for repeatable patterns that we can parse information from in a regular, repeated way...
If the row class pulls back rows that are all structured the same, we can write a scraper for one row that will work on them all; if the class pulls back two or more sorts of row, we need to detect which sort each one is and scrape each sort separately, which just gets messy.
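One quick way to see whether a class pulls back more than one sort of row is to tally the full class list of every match. A minimal sketch, using a cut-down, made-up HTML sample modelled on the page structure shown above rather than the live page (sample_html, sample_rows and class_counts are names invented for this example):

```python
from collections import Counter
from bs4 import BeautifulSoup

# Cut-down, made-up sample modelled on one row of the page (an assumption)
sample_html = '''
<div class="power-track-summary">
  <div class="accordion-group">
    <div class="row"><div class="date"><div> 30 Jun </div><div class="time"> 16:20 </div></div></div>
    <div class="content clearfix"><div class="row">Fault reference</div></div>
    <div class="affected-areas row">OX10 6AB</div>
  </div>
</div>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Matching on 'row' also catches divs that have 'row' as one of several classes
sample_rows = sample_soup.find_all('div', {'class': 'row'})

# Tally the full class list of each matched div
class_counts = Counter(' '.join(row['class']) for row in sample_rows)
class_counts
```

If more than one key comes back, the selector is mixing different sorts of row. Note the limit of this check: two divs can share exactly the same class list and still be structured differently (here the header row and the detail row are both plain row), so it is still worth previewing a few matches by eye.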
The accordion-group class does identify rows properly:
accordionrows = powertracksummary.find_all('div', {'class': 'accordion-group'})
len(accordionrows)
7
So use that...
Let's try and scrape one row:
accordionrow = accordionrows[0]
accordionrow
<div class="accordion-group"> <div class="accordion-heading"> <a class="accordion-toggle" data-parent="#accordion" data-toggle="collapse" href="#collapseEM9395"> <div class="row"> <div class="col-xs-6 col-sm-4 col-md-2"> <div class="power-track-date-info"> <span class="glyphicon glyphicon-calendar"></span> <div class="date"> <div> 30 Jun </div> <div class="time"> 16:20 </div> </div> </div> </div> <div class="col-xs-6 col-sm-4 col-md-3"> <div class="heading"> Affected Area: </div> <div> OX10 & OX49 Areas </div> </div> <div class="col-xs-6 col-sm-4 col-md-3 pull-right"> <div class="heading reference"> Reference Number: </div> <div> EM9395 </div> </div> <div class="col-xs-6 col-sm-8 col-md-4 pull-right manage-row-height"> <div class="heading duration"> Duration since reported: </div> <div> 19 minute(s) ago </div> </div> </div> </a> </div> <div class="accordion-body collapse" id="collapseEM9395"> <div class="accordion-inner"> <div class="summary"> We apologise for the loss of supply. We currently have a fault affecting the areas listed. Our engineers are on site working to get the power back on as quickly as they can. 
If you need more information, please call us on 0800 072 7282 and quote reference 'EM9395' </div> <div class="sub-heading"> Status (All times are estimated) </div> <div class="content clearfix"> <div class="row"> <div class="col-xs-12 col-sm-6 col-md-6"> <p class="content-label">Fault reference</p> <p class="content-text">EM9395</p> </div> <div class="col-xs-12 col-sm-6 col-md-6"> <div> <p class="content-label">Reported at</p> <p class="content-text">30 Jun 2018 16:20</p> </div> </div> <div class="col-xs-12 col-sm-6 col-md-6"> <div id="ctl00_BodyPlaceHolder_ctl00_PowerTrackSummaryListView_ctrl0_Panel1"> <p class="content-label">Engineer expected at</p> <p class="content-text">30 Jun 2018 17:00</p> </div> </div> <div class="col-xs-12 col-sm-6 col-md-6"> <div id="ctl00_BodyPlaceHolder_ctl00_PowerTrackSummaryListView_ctrl0_Panel2"> <p class="content-label">Restoration expected at</p> <p class="content-text">30 Jun 2018 19:30</p> </div> </div> </div> </div> <div class="sub-heading"> 114 known affected area(s) </div> <div class="affected-areas row"> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> HP14 3YE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX10 6AB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX10 6JF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX10 6JG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX10 6JX </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX10 6PE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5BA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5BB </div> <div class="col-xs-12 
col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5BF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5BG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EL </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EN </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EP </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5ER </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5ES </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EU </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EW </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EX </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EY </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5EZ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon 
glyphicon-chevron-right"></span> OX49 5HB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HL </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HN </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HR </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HT </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HU </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5HW </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JD </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JL </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JR </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JS </div> <div 
class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JT </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JU </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JW </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JX </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JY </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5JZ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LD </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LP </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LQ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LR </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon 
glyphicon-chevron-right"></span> OX49 5LU </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LY </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5LZ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5ND </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NL </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NN </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NP </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NQ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NR </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NS </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NT </div> <div 
class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NU </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NW </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NX </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NY </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5NZ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PD </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5PT </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QA </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon 
glyphicon-chevron-right"></span> OX49 5QL </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QN </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QQ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5QW </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5RF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5RH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX49 5RJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> OX9 7DQ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6ER </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HB </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HD </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HE </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HF </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HG </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HH </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HJ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HQ </div> <div class="col-xs-12 col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HR </div> <div class="col-xs-12 
col-sm-3 col-md-2"> <span class="glyphicon glyphicon-chevron-right"></span> RG9 6HS </div> </div> <p class="summary-action"> <span class="glyphicon glyphicon-chevron-left"></span> <a href="#" onclick="SwitchTab(0);" title="Return to the map.">Back To Map</a> </p> </div> </div> </div>
accordionrow.find('div',{'class':'date'})
<div class="date"> <div> 30 Jun </div> <div class="time"> 16:20 </div> </div>
divs = accordionrow.find('div', {'class': 'date'}).find_all('div')
divs
[<div> 30 Jun </div>, <div class="time"> 16:20 </div>]
_date = divs[0].text.strip()
_date
'30 Jun'
_time = divs[1].text.strip()
_time
'16:20'
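The date in the row header has no year, so '30 Jun' and '16:20' on their own cannot be turned into a full timestamp. The detail part of the same row has a "Reported at" value that does include the year. A minimal sketch of parsing that string with the standard library, assuming the format stays the same for every row (reported_text and reported_at are names made up for this example):

```python
from datetime import datetime

# Value copied from the "Reported at" field in the row shown above
reported_text = '30 Jun 2018 16:20'

# %d day, %b abbreviated month name, %Y four-digit year, %H:%M 24-hour time
# Note: %b follows the current locale, so this assumes English month names
reported_at = datetime.strptime(reported_text, '%d %b %Y %H:%M')
reported_at.isoformat()
```

Having a real datetime rather than two strings makes it easier to sort or filter the records later on.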
Example of finding all the dates and times:
#Create a list to store data from each row, one row per list item
records = []
#Assemble the recipe from the ingredients we started to prepare above
for accordionrow in accordionrows:
    divs = accordionrow.find('div', {'class': 'date'}).find_all('div')
    _date = divs[0].text.strip()
    _time = divs[1].text.strip()
    record = {'time': _time, 'date': _date}
    records.append(record)
records
[{'time': '16:20', 'date': '30 Jun'}, {'time': '14:02', 'date': '30 Jun'}, {'time': '12:58', 'date': '30 Jun'}, {'time': '10:43', 'date': '30 Jun'}, {'time': '10:16', 'date': '30 Jun'}, {'time': '05:10', 'date': '30 Jun'}, {'time': '04:56', 'date': '30 Jun'}]
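The loop above assumes every row has a date block containing two divs. If one row is missing it, find() returns None and the loop stops with an error. A minimal sketch of a more defensive version, assuming a made-up helper called scrape_date_time and a made-up two-row HTML sample in which the second row deliberately has no date block:

```python
from bs4 import BeautifulSoup

# Made-up sample: the second row has no date block (an assumption for illustration)
sample_html = '''
<div class="accordion-group"><div class="date"><div> 30 Jun </div><div class="time"> 16:20 </div></div></div>
<div class="accordion-group"><div class="summary">No date block here</div></div>
'''
sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('div', {'class': 'accordion-group'})

def scrape_date_time(row):
    """Hypothetical helper: return a record dict, or None if the date block is missing."""
    date_div = row.find('div', {'class': 'date'})
    if date_div is None:
        return None
    divs = date_div.find_all('div')
    if len(divs) < 2:
        return None
    return {'date': divs[0].text.strip(), 'time': divs[1].text.strip()}

# Scrape every row, keeping only the rows that gave us a record
sample_records = []
for row in sample_rows:
    record = scrape_date_time(row)
    if record is not None:
        sample_records.append(record)
sample_records
```

Putting the per-row recipe in its own function also means it can be tested on one row before being run over all of them, which fits the "test each part, then assemble" approach used here.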