Version 1.1. Prepared by Makzan. Updated in January 2021.
In this series, we spend three lectures learning how to fetch data online. This includes:
We use Selenium when:
Pros:
Cons:
We need a web browser driver to use Selenium.
pip install --user selenium
Requirement already satisfied: selenium in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (3.141.0)
Requirement already satisfied: urllib3 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from selenium) (1.24.1)
Note: you may need to restart the kernel to use updated packages.
pip install --user webdriver-manager
Requirement already satisfied: webdriver-manager in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (3.5.4)
Requirement already satisfied: requests in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from webdriver-manager) (2.21.0)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (2019.3.9)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\thomas\appdata\local\continuum\anaconda3\lib\site-packages (from requests->webdriver-manager) (1.24.1)
Note: you may need to restart the kernel to use updated packages.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
# Commonly used imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
options = Options()
# options.add_argument('-headless')
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.quit()
====== WebDriver manager ======
Current google-chrome version is 104.0.5112
Get LATEST chromedriver version for 104.0.5112 google-chrome
Driver [C:\Users\thomas\.wdm\drivers\chromedriver\win32\104.0.5112.79\chromedriver.exe] found in cache
Here are some essential commands for controlling a web browser through Selenium:
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()
browser.get('https://example.com')
browser.find_element(By.CSS_SELECTOR, 'a')
browser.find_elements(By.CSS_SELECTOR, 'a')
browser.quit()
'''Capture the screenshot of a website via Headless Browser.'''
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.add_argument('-headless')
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()
browser.get('http://macaodaily.com')
browser.save_screenshot('MacaoDaily.png')
browser.quit()
Let's fetch a stock quote from aastocks.com. If we open the stock page directly, the data may not load. Instead, we load any page on aastocks.com, type the stock number into the search box, and press Enter. Automating these steps makes the script behave like a person using a normal web browser.
'''Fetch the current stock quote from aastocks.com.'''
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
stock_number = '0011'
options = Options()
# options.add_argument('-headless')
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()
browser.get('http://www.aastocks.com/tc/stocks/aboutus/companyinfo.aspx')
element = browser.find_element(By.CSS_SELECTOR, '#sb-txtSymbol-aa')
element.send_keys(stock_number)
element.send_keys(Keys.RETURN)
time.sleep(3)
element = browser.find_element(By.CSS_SELECTOR, '.lastBox')
print(element.text)
browser.quit()
收市價(港元) (指數|行業) 波幅 121.800 - 123.300 123.000
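The fixed `time.sleep(3)` above either wastes time when the page is fast or fails when the page is slow. A polling wait is a common alternative: retry a check until it succeeds or a timeout passes. The sketch below is plain Python so it runs without a browser; `wait_until`, its parameters, and `fake_lookup` are hypothetical names for illustration and are not part of Selenium. Selenium offers the same idea through `WebDriverWait` in `selenium.webdriver.support.ui`.

```python
import time


def wait_until(check, timeout=10.0, poll=0.5):
    """Call check() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Exceptions raised by check() count as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = check()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page whose data appears on the third look.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return "123.000" if attempts["count"] >= 3 else None


print(wait_until(fake_lookup, timeout=2.0, poll=0.01))
```

With a real browser, the check could be `lambda: browser.find_element(By.CSS_SELECTOR, '.lastBox')`, which raises an exception until the element exists.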
We previously used an API to fetch DICJ data. This example shows an alternative way to fetch the same data with Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
options = Options()
options.add_argument('-headless')
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.get('http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html')
time.sleep(5)
element = browser.find_element(By.CSS_SELECTOR, "#report #table1")
rows = element.find_elements(By.CSS_SELECTOR, "tr")
print(rows[0].text)
for row in rows[3:]:
    print(row.text)
# Close the headless browser so the process does not linger.
browser.quit()
2020年及2019年每月幸運博彩毛收入
一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%
二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%
三月份 5,257 25,840 -79.7% 30,486 76,152 -60.0%
四月份 754 23,588 -96.8% 31,240 99,739 -68.7%
五月份 1,764 25,952 -93.2% 33,004 125,691 -73.7%
六月份 716 23,812 -97.0% 33,720 149,503 -77.4%
七月份 1,344 24,453 -94.5% 35,064 173,956 -79.8%
八月份 1,330 24,262 -94.5% 36,394 198,218 -81.6%
九月份 2,211 22,079 -90.0% 38,605 220,297 -82.5%
十月份 7,270 26,443 -72.5% 45,875 246,740 -81.4%
十一月份 6,748 22,877 -70.5% 52,623 269,617 -80.5%
十二月份 7,818 22,838 -65.8% 60,441 292,455 -79.3%
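The printed rows are plain strings, so they need one more step before any calculation. The sketch below splits a row into a month label and six numbers. The column names are an assumption inferred from the figures (monthly 2020, monthly 2019, change, then the cumulative equivalents), not labels taken from the page, and `parse_row` is a hypothetical helper.

```python
def parse_row(text):
    """Turn one printed table row into a dict of a month label and numbers."""
    month, *cells = text.split()
    # Drop thousands separators and the percent sign before converting.
    numbers = [float(cell.replace(",", "").rstrip("%")) for cell in cells]
    keys = [
        "monthly_2020", "monthly_2019", "monthly_change_pct",
        "cumulative_2020", "cumulative_2019", "cumulative_change_pct",
    ]  # assumed column meanings
    return {"month": month, **dict(zip(keys, numbers))}


# Two rows copied from the output above.
sample_rows = [
    "一月份 22,126 24,942 -11.3% 22,126 24,942 -11.3%",
    "二月份 3,104 25,370 -87.8% 25,229 50,312 -49.9%",
]
records = [parse_row(row) for row in sample_rows]
print(records[1]["monthly_2020"])
```

In the scraping script, `row.text` for each element in `rows[3:]` could be passed to the same function.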
In this example, we fetch flight search results from flights.ctrip.com by passing four parameters in the URL: departure date, return date, departure airport, and arrival airport.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import datetime
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
today = datetime.date.today()
five_days_later = today + datetime.timedelta(days=5)
print(today.isoformat())
print(five_days_later.isoformat())
2022-08-31
2022-09-05
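The search URL used in the next step embeds both dates and both airport codes. A small function makes that pattern reusable for other routes. This is a sketch: `build_search_url` is a hypothetical name, and the URL pattern is copied from this tutorial's own code, so it may stop working if the site changes.

```python
import datetime


def build_search_url(from_city, to_city, depart, ret):
    """Build a round-trip search URL from airport codes and two dates."""
    dates = f"{depart.isoformat()}_{ret.isoformat()}"
    return (
        "https://flights.ctrip.com/international/search/"
        f"round-{from_city}-{to_city}?depdate={dates}"
        "&cabin=y_s&adult=1&child=0&infant=0"
    )


depart = datetime.date(2022, 8, 31)
ret = depart + datetime.timedelta(days=5)
print(build_search_url("hkg", "hel", depart, ret))
```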
options = Options()
#options.add_argument('-headless')
from_city = "hkg"
to_city = "hel"
url = f"https://flights.ctrip.com/international/search/round-{from_city}-{to_city}?depdate={today}_{five_days_later}&cabin=y_s&adult=1&child=0&infant=0"
print(url)
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
browser.maximize_window()
browser.get(url)
time.sleep(3)
elements = browser.find_elements(By.CSS_SELECTOR, ".flight-item")
print(f"Found {len(elements)} results.")
print(from_city.upper())
print(to_city.upper())
for row in elements:
    airline = row.find_element(By.CSS_SELECTOR, ".airline-name")
    print(airline.text)
    price = row.find_element(By.CSS_SELECTOR, ".price")
    print(price.text)
browser.quit()
https://flights.ctrip.com/international/search/round-hkg-hel?depdate=2022-08-31_2022-09-05&cabin=y_s&adult=1&child=0&infant=0
Found 7 results.
HKG
HEL
土耳其航空
¥17846起
国泰航空
¥18400起
汉莎航空
¥18896起
国泰航空
¥18400起
汉莎航空
¥18896起
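The prices come back as display strings such as `¥17846起`, which cannot be compared as numbers. The sketch below extracts the digits and picks the cheapest result. `parse_price` is a hypothetical helper, and the sample pairs are copied from the output above.

```python
import re


def parse_price(text):
    """Return the first run of digits in a price label as an int, or None."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


# (airline, price label) pairs taken from the printed results.
results = [
    ("土耳其航空", "¥17846起"),
    ("国泰航空", "¥18400起"),
    ("汉莎航空", "¥18896起"),
]
cheapest = min(results, key=lambda pair: parse_price(pair[1]))
print(cheapest[0], parse_price(cheapest[1]))
```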
# Mailgun settings: fill in both values before sending any email.
DOMAIN = None
API_KEY = None
FROM = "mak@makzan.net"
TO = ["mak@makzan.net"]
from bs4 import BeautifulSoup
import requests
import datetime
def send_simple_message(content, subject="Yeah"):
    # Send a plain-text email through the Mailgun HTTP API.
    return requests.post(
        f"https://api.mailgun.net/v3/{DOMAIN}/messages",
        auth=("api", API_KEY),
        data={"from": FROM,
              "to": TO,
              "subject": subject,
              "text": content})
# keywords: "entrepreneurship" and "technology"
keywords = ["創業", "科技"]
# today
today = datetime.datetime.today()
year = str(today.year).zfill(2)
month = str(today.month).zfill(2)
day = str(today.day).zfill(2)
res = requests.get(f"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm")
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html5lib")
results = []
links = soup.select("#all_article_list a")
for link in links:
    news_title = link.getText()
    # any() records each title once, even when several keywords match.
    if any(keyword in news_title for keyword in keywords):
        results.append(f"{year}-{month}-{day}: {news_title}")
content = "\n".join(results)
# Subject reads: "Today there are N news articles that may interest you"
subject = f"今日有{len(results)}篇新聞您可能感興趣"
# send_simple_message(content, subject=subject)
print(subject)
print(content)
今日有0篇新聞您可能感興趣
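The script above builds the dated page URL from three separately zero-padded strings. Date format codes inside an f-string can express the same pattern in one step. This is a sketch: `dated_page_url` is a hypothetical name, and the URL pattern is taken from the script above.

```python
import datetime


def dated_page_url(day):
    """Build the dated page URL, zero-padding the month and day."""
    return f"http://www.macaodaily.com/html/{day:%Y-%m}/{day:%d}/node_1.htm"


print(dated_page_url(datetime.date(2021, 1, 5)))
```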