Recap of Previous Session
Overview of Selenium (Why, What and How)
A Step-by-Step Hello World with Selenium
Example 1: Scraping a JavaScript Driven WebSite (https://arifpucit.github.io/bss2/js/)
Example 2: Scraping Dynamic WebSites (https://arifpucit.github.io/bss2/login/)
find_element() and find_elements() methods
Example 3: Scraping Web Pages that Employ Infinite Scrolling: https://arifpucit.github.io/bss2/scrolling/
Example 4: Scraping Web Pages that Employ Pagination: https://arifpucit.github.io/bss2/pagination/
Example 5: Scraping Web Pages that use Pop-ups: https://arifpucit.github.io/bss2/popup/
Bonus:
import sys
!{sys.executable} -m pip install --upgrade pip -q
!{sys.executable} -m pip install --upgrade selenium -q
import selenium
selenium.__version__ , selenium.__path__
Create an instance of Browser:
- The `Service('path/to/chromedriver')` constructor creates a Service object that is passed to `Chrome()`.
- The `Chrome(service, options)` constructor creates a new instance of the Chrome driver, starts the service, and then opens a new Chrome browser window.
- `ChromeOptions`, introduced in Selenium WebDriver 3.6.0, is used to customize the ChromeDriver session.
- The `Options()` constructor changes the default settings of the Chrome driver. The resulting object is then passed to `Chrome()`.
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.quit()
from selenium.webdriver.chrome.options import Options
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
myoptions.headless = True  # headless mode; newer Selenium releases drop this attribute in favor of myoptions.add_argument('--headless=new')
driver = Chrome(service=s, options=myoptions)
driver.quit()
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
Load a Web page in the browser window:
- The `driver.get('URL')` method loads a web page in the current browser session, after which you can access the browser and its HTML through the driver object.
- This is similar to `resp = requests.get('URL')`, except that there you simply get back a response object rather than a live browser.
driver.get('https://google.com')
Access browser information:
- There is a bunch of information about the browser you can request, including window handles, browser size/position, cookies, and alerts.
print(dir(driver))
driver.title
driver.current_url
driver.current_window_handle
driver.session_id
driver.page_source
Perform Different operations on the browser:
- The `driver.refresh()` method refreshes the page contents.
- The `driver.set_window_position(x, y)` method sets the position of the top-left corner of the browser window.
- The `driver.set_window_size(x, y)` method sets the width and height of the current window.
- The `driver.maximize_window()` method maximizes the window.
- The `driver.minimize_window()` method minimizes the browser to the taskbar.
driver.refresh()
driver.set_window_position(0,0)
driver.maximize_window()
driver.minimize_window()
Create new tab in the browser window and shift between tabs:
- Clicking a link may open a page in a new browser tab.
- You can also create a new browser tab programmatically using `driver.switch_to.new_window('tab')`.
- All subsequent calls to the driver are then directed to the newly opened tab.
- WebDriver supports moving between windows using:
- `driver.switch_to.window("windowname")`
- `driver.switch_to.frame('framename')`
- `driver.switch_to.default_content()`
- All subsequent calls to the driver are then directed to that particular window or frame.
google_tab = driver.current_window_handle
driver.switch_to.new_window('tab')
driver.get('https://www.yahoo.com')
driver.switch_to.window(google_tab)
driver.close()
driver.quit()
Close browser tab or close the entire session:
- The `driver.close()` method simply closes the current browser tab and does not terminate the browser process.
- The `driver.quit()` method closes all browser tabs and shuts down the background driver process.
import requests
from bs4 import BeautifulSoup
import pandas as pd
titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]
resp = requests.get("https://arifpucit.github.io/bss2/js")
soup = BeautifulSoup(resp.text, 'lxml')  # resp.text does NOT contain the HTML for the books data (it is rendered by JavaScript)
sp_titles = soup.find_all('p', class_="book_name")
sp_prices = soup.find_all('p', class_="price green")
sp_availability = soup.find_all('p', class_='stock')
sp_reviews = soup.find_all('p', class_='review')
data = soup.find_all('p', class_="book_name")
sp_links=[]
for val in data:
    sp_links.append(val.find('a').get('href'))
books = soup.find_all('div', class_='book_container')
for book in books:
    stars.append(5 - len(book.find_all('span', class_='not_filled')))
for i in range(len(sp_titles)):
    titles.append(sp_titles[i].text)
    prices.append(sp_prices[i].text)
    availability.append(sp_availability[i].text)
    reviews.append(sp_reviews[i].text)
    links.append(sp_links[i])
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books1.csv', index=False)
df = pd.read_csv('books1.csv')
df
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.get('https://arifpucit.github.io/bss2/js/')
titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]
soup = BeautifulSoup(driver.page_source, 'lxml')  # driver.page_source DOES contain the HTML for the books data
sp_titles = soup.find_all('p', class_="book_name")
sp_prices = soup.find_all('p', class_="price green")
sp_availability = soup.find_all('p', class_='stock')
sp_reviews = soup.find_all('p', class_='review')
data = soup.find_all('p', class_="book_name")
sp_links=[]
for val in data:
    sp_links.append(val.find('a').get('href'))
books = soup.find_all('div', class_='book_container')
for book in books:
    stars.append(5 - len(book.find_all('span', class_='not_filled')))
for i in range(len(sp_titles)):
    titles.append(sp_titles[i].text)
    prices.append(sp_prices[i].text)
    availability.append(sp_availability[i].text)
    reviews.append(sp_reviews[i].text)
    links.append(sp_links[i])
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books1.csv', index=False)
df = pd.read_csv('books1.csv')
df
driver.quit()
Once we have the web page loaded inside our browser, the next task is to locate the web element(s) of interest and then perform actions on them.
The two most commonly used methods for locating elements are:
- The `driver.find_element(By.LOCATOR, "value")` method locates a single element.
- The `driver.find_elements(By.LOCATOR, "value")` method locates multiple elements.
The first argument to these methods is the locator strategy, and the second argument is the value for that locator.
In Selenium, there are eight different locator strategies with which we can locate a web element:
Locating Web Elements: https://selenium-python.readthedocs.io/locating-elements.html
Interacting with Web Elements: https://www.selenium.dev/documentation/webdriver/elements/interactions/
Read about CSS_SELECTOR: https://www.w3schools.com/cssref/css_selectors.asp
Read about XPATH: https://www.guru99.com/xpath-selenium.html, https://www.browserstack.com/guide/find-element-by-xpath-in-selenium
Install Chrome Extension (Selector Hub): https://selectorshub.com/
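For reference, the eight locator strategies map to plain strings under the hood; the values below match `selenium.webdriver.common.by.By` in the Selenium source, while the example selectors in the comments are hypothetical ones for the bss2 pages:

```python
# The eight By locator strategies. Each By constant is a plain string;
# the example selectors in the comments are hypothetical bss2 selectors.
locator_strategies = {
    "ID": "id",                                # e.g. driver.find_element(By.ID, 'name')
    "NAME": "name",                            # e.g. (By.NAME, 'password')
    "XPATH": "xpath",                          # e.g. (By.XPATH, '//*[@id="submit_button"]')
    "LINK_TEXT": "link text",                  # e.g. (By.LINK_TEXT, 'Ask Google for Password')
    "PARTIAL_LINK_TEXT": "partial link text",  # e.g. (By.PARTIAL_LINK_TEXT, 'Ask Google')
    "TAG_NAME": "tag name",                    # e.g. (By.TAG_NAME, 'input')
    "CLASS_NAME": "class name",                # e.g. (By.CLASS_NAME, 'book_name')
    "CSS_SELECTOR": "css selector",            # e.g. (By.CSS_SELECTOR, '#password')
}

print(len(locator_strategies))  # 8
```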
Load the Login version of Book Scraping Site: https://arifpucit.github.io/bss2/login/
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.get('https://arifpucit.github.io/bss2/login/')
driver.maximize_window()
driver.quit()
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.get('https://arifpucit.github.io/bss2/login/')
driver.maximize_window()
from selenium.webdriver.common.by import By
tbox = driver.find_element(By.ID, 'name')
type(tbox)
tbox.send_keys("arif")
tbox.clear()
mylink = driver.find_element(By.LINK_TEXT, 'Ask Google for Password')
mylink.click()
driver.back()
tbox2 = driver.find_element(By.CSS_SELECTOR, '#password')
tbox2.send_keys('datascience')
btn = driver.find_element(By.XPATH, '//*[@id="submit_button"]')
btn.click()
for i in range(1, 10):
    price = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[1]').text
    print(price)
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/login')
driver.maximize_window()
driver.find_element(By.ID, "name").send_keys("arif")
driver.find_element(By.ID, "password").send_keys("datascience")
btn = driver.find_element(By.ID, "submit_button")
time.sleep(2)
btn.click()
time.sleep(2)
titles = []
prices = []
availability=[]
reviews=[]
links=[]
for i in range(1, 10):
    title = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/p').text
    titles.append(title)
    price = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/p/a').get_attribute('href')
    links.append(link)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links'])
df.to_csv('books2.csv', index=False)
df = pd.read_csv('books2.csv')
df
driver.quit()
- The above bot scrapes the data of the nine OS books only.
- As an exercise, try extending the above crawler to scrape the books data for SP and CA as well.
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/scroll')
driver.maximize_window()
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
books_count = len(books)
for i in range(1, books_count + 1):
    title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
    titles.append(title)
    price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
    links.append(link)
    star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
    star_rates.append(star_rate)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df
- The `driver.execute_script(JS)` method synchronously executes JavaScript in the current window/frame, e.g. `driver.execute_script('alert("Hello JavaScript")')`.
- The JavaScript `window.scrollTo(x, y)` method performs scrolling; the pixels to scroll horizontally (x-axis) and vertically (y-axis) are passed as parameters.
driver.execute_script('return document.body.scrollHeight')
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By  # needed for the find_elements() call below
import time
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/scroll')
driver.maximize_window()
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
# Count of books in the entire page
books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
len(books)
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/scroll')
driver.maximize_window()
# Scroll to the bottom of the entire page, then start scraping
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
books_count = len(books)
for i in range(1, books_count + 1):
    title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
    titles.append(title)
    price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
    links.append(link)
    star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
    star_rate = round(float(star_rate), 2)
    star_rates.append(star_rate)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/pagination'
driver.get(url)
driver.maximize_window()
books()
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books4.csv', index=False)
df = pd.read_csv('books4.csv')
df
driver.quit()
- One option is to pause the script with the `time.sleep(ARBITRARY_TIME)` method; a better option is the `WebDriverWait()` method.
- With `time.sleep()` you will probably use an arbitrary value. The problem is, you are either waiting too long or not long enough. Also, a website that loads slowly over your local Wi-Fi connection may load ten times faster on a cloud server.
- `WebDriverWait()` is used together with expected conditions such as `presence_of_element_located`, `element_to_be_clickable`, and `text_to_be_present_in_element`.
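The idea behind `WebDriverWait().until()` can be sketched in plain Python as a polling loop. This is a simplified illustration, not Selenium's actual implementation; `wait_until` and `fake_presence_of_element` are hypothetical names, with the latter standing in for an expected condition:

```python
import time

def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Sketch of WebDriverWait(driver, timeout).until(condition): call the
    condition repeatedly until it returns a truthy value, or raise once the
    timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll_frequency)

# Simulate an element that "appears" on the third poll.
calls = {"n": 0}
def fake_presence_of_element():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

print(wait_until(fake_presence_of_element, timeout=5, poll_frequency=0.01))  # element
```

This is why an explicit wait returns as soon as the condition holds, instead of always sleeping for the full arbitrary duration.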
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/pagination'
driver.get(url)
driver.maximize_window()
while True:
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # present only once the last page is reached
            break
        except Exception:
            next_btn.click()  # next button is still enabled, so go to the next page
    except Exception:
        break
print("Successfully reached the last page")
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/pagination'
driver.get(url)
driver.maximize_window()
# Let us call the books() function and click the next button and repeat till it disappears/disabled
while True:
    books()
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # present only once the last page is reached
            break
        except Exception:
            next_btn.click()  # next button is still enabled, so go to the next page
    except Exception:
        break
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books4.csv', index=False)
df = pd.read_csv('books4.csv')
df
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/popup'
driver.get(url)
driver.maximize_window()
# Let us call the books() function and click the next button and repeat till it disappears/disabled
while True:
    books()
    # The explicit wait is essential: on a slow connection the pagination bar may not be loaded yet
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # the disabled class appears only on the last page
            break
        except Exception:
            next_btn.click()  # not on the last page yet, so click the next-page link
    except Exception:
        break
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books5.csv', index=False)
df = pd.read_csv('books5.csv')
df
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/popup'
driver.get(url)
driver.maximize_window()
# Let us close the pop-up
time.sleep(5)
driver.switch_to.frame(driver.find_element(By.ID,'frame'))
clos_button = driver.find_element(By.XPATH,'//*[@id="staticBackdrop"]/div/div/div[1]/button')
clos_button.click()
driver.switch_to.default_content()
# Let us call the books() function and click the next button and repeat till it disappears/disabled
while True:
    books()
    # The explicit wait is essential: on a slow connection the pagination bar may not be loaded yet
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # the disabled class appears only on the last page
            break
        except Exception:
            next_btn.click()  # not on the last page yet, so click the next-page link
    except Exception:
        break
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books5.csv', index=False)
df = pd.read_csv('books5.csv')
df
How to Generate App Passwords in Gmail
smtplib is a built-in Python library for sending emails using the Simple Mail Transfer Protocol (SMTP); there is nothing to install, and it abstracts away all the complexities of SMTP.
MIMEBase is just a base class. As the specification says: "Ordinarily you won't create instances specifically of MIMEBase."
MIMEText is for text (e.g. text/plain or text/html), when the whole message, or a part of it, is in text format.
MIMEMultipart says "I have more than one part" and then lists the parts. You use it when you have attachments, and also to provide alternative versions of the same content (e.g. a plain-text version plus an HTML version).
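The relationship between these classes can be seen with the standard library alone. The sketch below builds the "alternative versions" case described above; the subject and body strings are placeholders:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# "multipart/alternative" tells the mail client that the parts are alternative
# renderings of the same content; it picks the richest one it can display.
msg = MIMEMultipart('alternative')
msg['Subject'] = 'Demo'
msg.attach(MIMEText('plain-text version', 'plain'))
msg.attach(MIMEText('<b>HTML version</b>', 'html'))

print(msg.get_content_type())                             # multipart/alternative
print([p.get_content_type() for p in msg.get_payload()])  # ['text/plain', 'text/html']
```

With attachments you would instead build a plain `MIMEMultipart()` container and attach a base64-encoded `MIMEBase` part, as the code below does.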
-- Use the app password generated in the previous step below (keep it secret; never publish a real app password).
import smtplib, ssl
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email import encoders
sender = 'arifbuttscraper@gmail.com'
passwd = 'xxxxxxxxxxxxx'
receiver = 'xxxxx@gmail.com'
# create MIMEMultipart object, fill its header information and attach the body to it
msg = MIMEMultipart()
msg['From'] = sender
msg['To'] = receiver
msg['Subject'] = 'Books Data'
msg.attach(MIMEText(
'''
AoA,
Please see the attached file containing information about books.
Best
''',
'plain'))
# create MIMEBase object for creating an attachment and attach it with the MIMEMultipart object
part = MIMEBase('application', 'octet-stream')
fd = open('books4.csv', 'rb')
file_contents = fd.read()
part.set_payload(file_contents)
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment; filename="books4.csv"')
msg.attach(part)
# Send message object as email using smptplib by first creating an smtp session
s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)
s.login(user = sender, password = passwd)
s.sendmail(sender, receiver, msg.as_string())
s.quit()
print('Done..!!')
import sys
!{sys.executable} -m pip install schedule -q
import smtplib, ssl
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email import encoders
import os
sender = 'xxxxx@gmail.com'
receiver = 'yyyyy@gmail.com'
passwd = 'xxxxxxxxxxxxx'
msg = MIMEMultipart()
msg['From'] = sender
msg['To'] = receiver
msg['Subject'] = 'Scheduled Email'
msg.attach(MIMEText('This is an email scheduled to be sent at a specific date and time', 'plain'))
part = MIMEBase('application', 'octet-stream')
fd = open('books4.csv', 'rb')
file_contents = fd.read()
part.set_payload(file_contents)
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment; filename="books.csv"')
msg.attach(part)
def mail():
    s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)
    s.login(user=sender, password=passwd)
    s.sendmail(sender, receiver, msg.as_string())
!date
import schedule
import time
schedule.every().day.at("16:27").do(mail)
while True:
    schedule.run_pending()
    time.sleep(1)