
Lesson 8—Web Automation with Selenium¶

Version 1.1. Prepared by Makzan. Updated January 2021.

In this series, we spend three lectures learning how to fetch data online. The topics are:

Finding patterns in URL
Open web URL
Downloading files in Python
Fetch data with API
Web scraping with Requests and BeautifulSoup
Web automation with Selenium
Converting Wikipedia tabular data into CSV

We use Selenium when:

Requests and BeautifulSoup do not work.
The page requires JavaScript to render the data.

Pros:

It launches a real browser and automates it.
Better compatibility with sites that rely on JavaScript.

Cons:

It is slow because it launches a real browser.

Downloading browser driver¶

We need a web browser driver to use Selenium. The webdriver-manager package installed below downloads a matching driver for us.

In [1]:

pip install --user selenium

Requirement already satisfied: selenium in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (3.141.0)
Requirement already satisfied: urllib3 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from selenium) (1.24.1)
Note: you may need to restart the kernel to use updated packages.

In [2]:

pip install --user webdriver-manager

Requirement already satisfied: webdriver-manager in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (3.5.4)
Requirement already satisfied: requests in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from webdriver-manager) (2.21.0)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (2019.3.9)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (1.24.1)
Note: you may need to restart the kernel to use updated packages.

In [3]:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from webdriver_manager.chrome import ChromeDriverManager

In [4]:

# Common library to import
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

In [5]:

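options = Options()
# Uncomment to run Chrome without a visible window.
# options.add_argument('--headless')

# Selenium 3 style: the driver path is the first positional argument.
# Newer Selenium 4 releases expect it wrapped in a Service object instead.
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.quit()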


====== WebDriver manager ======
Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache

Selenium Cheat Sheet¶

https://codoid.com/selenium-webdriver-python-cheat-sheet/

Here are some essential commands for controlling a web browser through Selenium:

In [6]:

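browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)  # launch Chrome
browser.maximize_window()                    # maximize the browser window
browser.get('https://example.com')           # open a URL
browser.find_element(By.CSS_SELECTOR, 'a')   # first element matching the CSS selector
browser.find_elements(By.CSS_SELECTOR, 'a')  # list of all matching elements
browser.quit()                               # close the browser and end the session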


====== WebDriver manager ======
Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache

Taking screenshot¶

In [7]:

'''Capture the screenshot of a website via Headless Browser.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()
browser.get('http://macaodaily.com')
browser.save_screenshot('MacaoDaily.png')
browser.quit()


====== WebDriver manager ======
Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache

Example: Fetching stock data from AASTOCKS¶

Let's fetch a stock quote from AASTOCKS (aastocks.com). If we open the stock page directly, the data may not load. Instead, we load any page on the site, type the stock number into the search box, and press Enter. This automation imitates how a person browses with a normal web browser.

In [8]:

'''Fetch current stock from aastock.'''

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time

stock_number = '0011'

options = Options()
# options.add_argument('-headless')

browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()

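# Open any AASTOCKS page that contains the quote search box.
browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')

# Type the stock number into the search box and press Enter.
element = browser.find_element(By.CSS_SELECTOR, '#sb-txtSymbol-aa')
element.send_keys(stock_number)
element.send_keys(Keys.RETURN)

# Give the quote page time to load.
time.sleep(3)

# Read the text of the last-price box.
element = browser.find_element(By.CSS_SELECTOR, '.lastBox')
print(element.text)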


browser.quit()


====== WebDriver manager ======
Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache

收市價(港元)
(指數|行業)
波幅
121.800 - 123.300
123.000
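
The fixed time.sleep(3) in the example above can be too short on a slow connection and wasteful on a fast one. A common alternative is to poll until the element appears. Selenium provides WebDriverWait for this purpose; the sketch below shows the same idea in plain Python so that it runs without a browser. The name wait_until and its parameters are hypothetical and not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With the browser from the example above, this could be called as wait_until(lambda: browser.find_elements(By.CSS_SELECTOR, '.lastBox')), because find_elements returns an empty list until a matching element exists.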

Example: Fetch dicj data with Selenium¶

We previously used an API to fetch DICJ data. This example shows an alternative way to fetch the same data with Selenium.

In [9]:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--headless')

browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)

browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html')

time.sleep(5)

element = browser.find_element(By.CSS_SELECTOR, "#report #table1")

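rows = element.find_elements(By.CSS_SELECTOR, "tr")
# The first row holds the table title.
print(rows[0].text)
# Skip rows 1 and 2 (presumably column headers) and print the monthly data.
for row in rows[3:]:
    print(row.text)

# Close the headless browser so it does not keep running in the background.
browser.quit()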


====== WebDriver manager ======
Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache

2020年及2019年每月幸運博彩毛收入
一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%
二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%
三月份 5,257 25,840 -79.7% 30,486 76,152 -60.0%
四月份 754 23,588 -96.8% 31,240 99,739 -68.7%
五月份 1,764 25,952 -93.2% 33,004 125,691 -73.7%
六月份 716 23,812 -97.0% 33,720 149,503 -77.4%
七月份 1,344 24,453 -94.5% 35,064 173,956 -79.8%
八月份 1,330 24,262 -94.5% 36,394 198,218 -81.6%
九月份 2,211 22,079 -90.0% 38,605 220,297 -82.5%
十月份 7,270 26,443 -72.5% 45,875 246,740 -81.4%
十一月份 6,748 22,877 -70.5% 52,623 269,617 -80.5%
十二月份 7,818 22,838 -65.8% 60,441 292,455 -79.3%
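
The printed rows are plain text, so they need one more step before they can be analyzed or saved. The following is a minimal sketch that turns such rows into CSV. The column names are assumptions inferred from the numbers (monthly figures followed by accumulated figures); they are not taken from the page, and parse_row and rows_to_csv are hypothetical helper names.

```python
import csv
import io

# Assumed column names; the page's own header rows are not printed above.
COLUMNS = ["month", "monthly_2020", "monthly_2019", "monthly_change_pct",
           "accumulated_2020", "accumulated_2019", "accumulated_change_pct"]


def parse_row(text):
    """Split one printed row into a month label and six numbers."""
    month, *values = text.split()
    # Drop thousands separators and the trailing percent sign.
    numbers = [float(v.replace(",", "").rstrip("%")) for v in values]
    return [month] + numbers


def rows_to_csv(lines):
    """Return CSV text for an iterable of printed rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for line in lines:
        writer.writerow(parse_row(line))
    return buffer.getvalue()


# Two rows copied from the output above.
sample = ["一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%",
          "二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%"]
print(rows_to_csv(sample))
```

In the Selenium example, the same function could be fed with row.text for each row in rows[3:].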

Example: Fetch flight price from ctrip¶

In this example, we fetch flight results by querying flights.ctrip.com with four parameters: departure date, return date, departure airport, and arrival airport.
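
To make the role of the four parameters explicit, the URL can be built by a small function. This is only a sketch: the URL pattern is the one used in this notebook, and whether the site still accepts it is not guaranteed. The function name build_search_url is hypothetical.

```python
import datetime


def build_search_url(from_city, to_city, depart, return_date):
    """Build a round-trip search URL in the format used by this example."""
    return (
        "https://flights.ctrip.com/international/search/"
        f"round-{from_city}-{to_city}"
        f"?depdate={depart.isoformat()}_{return_date.isoformat()}"
        "&cabin=y_s&adult=1&child=0&infant=0"
    )


# Fixed dates are used here so the result is reproducible.
depart = datetime.date(2022, 8, 31)
print(build_search_url("hkg", "hel", depart, depart + datetime.timedelta(days=5)))
```

Keeping the URL logic in one function makes it easy to loop over several routes or dates later.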

In [10]:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import datetime
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

In [11]:

today = datetime.date.today()
five_days_later = today + datetime.timedelta(days=5)

print(today.isoformat())
print(five_days_later.isoformat())

2022-08-31
2022-09-05

In [12]:

options = Options()
#options.add_argument('-headless')

from_city = "hkg"
to_city = "hel"

url = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate={today}_{five_days_later}&cabin=y_s&adult=1&child=0&infant=0"

print(url)

browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()
browser.get(url)

time.sleep(3)

elements = browser.find_elements(By.CSS_SELECTOR, ".flight-item")

print(f"Found {len(elements)} results.")

print(from_city.upper())
print(to_city.upper())
for row in elements:
    airline = row.find_element(By.CSS_SELECTOR, ".airline-name")
    print(airline.text)
    price = row.find_element(By.CSS_SELECTOR, ".price")
    print(price.text)
    
    
browser.quit()


====== WebDriver manager ======

https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2022-08-31_2022-09-05&cabin=y_s&adult=1&child=0&infant=0

Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache

Found 7 results.
HKG
HEL
土耳其航空
¥17846起
国泰航空
¥18400起
汉莎航空
¥18896起
国泰航空
¥18400起
汉莎航空
¥18896起
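
The prices come back as text such as ¥17846起, where 起 means "from". To compare them we need numbers. The sketch below extracts the integer amount with a regular expression; parse_price is a hypothetical helper, and the sample pairs are copied from the output above.

```python
import re


def parse_price(text):
    """Extract the integer amount from a string such as '¥17846起'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return int(match.group().replace(",", ""))


# (airline, price text) pairs: Turkish Airlines, Cathay Pacific, Lufthansa.
results = [("土耳其航空", "¥17846起"), ("国泰航空", "¥18400起"), ("汉莎航空", "¥18896起")]
cheapest = min(results, key=lambda item: parse_price(item[1]))
print(cheapest[0], parse_price(cheapest[1]))
```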

Example: Use MailGun to send result to yourself¶

In [13]:

# Fill in your own Mailgun domain and API key before sending.
DOMAIN = None
API_KEY = None
FROM = "mak@makzan.net"
TO = ["mak@makzan.net"]
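
One way to avoid typing the API key into the notebook is to read it from environment variables. The following is a minimal sketch of that approach; the variable names MAILGUN_DOMAIN and MAILGUN_API_KEY and the function load_mailgun_settings are assumptions made for illustration, not something Mailgun requires.

```python
import os


def load_mailgun_settings(env=None):
    """Read the Mailgun domain and API key from environment variables.

    MAILGUN_DOMAIN and MAILGUN_API_KEY are assumed variable names.
    """
    env = os.environ if env is None else env
    names = ("MAILGUN_DOMAIN", "MAILGUN_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["MAILGUN_DOMAIN"], env["MAILGUN_API_KEY"]
```

If both variables are set, DOMAIN, API_KEY = load_mailgun_settings() fills in the two values used by the sending code.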

In [14]:

from bs4 import BeautifulSoup
import requests
import datetime

def send_simple_message(content, subject="Yeah"):
    """Send a plain-text email through the Mailgun HTTP API."""
    return requests.post(
        f"https://api.mailgun.net/v3/{DOMAIN}/messages",
        auth=("api", API_KEY),
        data={
            "from": FROM,
            "to": TO,
            "subject": subject,
            "text": content,
        },
    )

# Keywords to look for in headlines ("entrepreneurship" and "technology").
keywords = ["創業", "科技"]

# Build the zero-padded date parts used in today's URL.
today = datetime.datetime.today()
year = str(today.year)
month = str(today.month).zfill(2)
day = str(today.day).zfill(2)

res = requests.get(f"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm")

res.encoding = "utf-8"

soup = BeautifulSoup(res.text, "html5lib")

results = []

links = soup.select("#all_article_list a")
for link in links:
    news_title = link.getText()

    # Record each headline once, even if it matches several keywords.
    if any(keyword in news_title for keyword in keywords):
        results.append(f"{year}-{month}-{day}: {news_title}")

content = "\n".join(results)
# The subject reads: "Today there are N news articles you may be interested in".
subject = f"今日有{len(results)}篇新聞您可能感興趣"
# Uncomment to send the email once DOMAIN and API_KEY are set.
# send_simple_message(content, subject=subject)
print(subject)
print(content)

今日有0篇新聞您可能感興趣