Lecture 6—Fetching data online¶

Version 1.0, by Makzan. Last updated 2021 March.

In this series, we will use 3 lectures to learn how to fetch data online. This includes:

Finding patterns in URLs
Opening web URLs
Downloading files in Python
Fetching data with APIs
Web scraping with Requests and BeautifulSoup
Web automation with Selenium
Converting Wikipedia tabular data into CSV

Finding patterns in URL¶

To download files or scrape a web page, we first need to know the URL. Usually this means finding the variable parts of the URL. For example, from a search URL we can find the pattern of the search query.

Let’s take a closer look at the DSAT.gov.mo bus route page. If we switch between bus routes, we can observe that the page URL doesn’t change. There are two possible reasons:

The page changes are generated via JavaScript rendering.
The page is inside an iframe so that page changes do not change the top-level URL.

If it is the first reason, we will need a more advanced browser driver technique. If it is the second reason, we can get the URL by opening the link in a new tab, or simply copying the link location via right-click.

Now we can observe the URL for each route has the following pattern.

https://bis.dsat.gov.mo:37812/macauweb/routeLine.html?routeName=3&direction=0&language=zh-tw&ver=3.5.12
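One way to see which parts of this URL vary is to split it with the standard library's urllib.parse. The sketch below lists the query parameters of the URL above and rebuilds the URL for another route. The helper name route_url and the route "33" are illustrative assumptions; every other parameter is copied unchanged from the URL above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

SAMPLE_URL = (
    "https://bis.dsat.gov.mo:37812/macauweb/routeLine.html"
    "?routeName=3&direction=0&language=zh-tw&ver=3.5.12"
)


def route_url(route_name, direction=0, template=SAMPLE_URL):
    """Return the sample URL with routeName and direction swapped in."""
    parts = urlsplit(template)
    # parse_qs maps each parameter name to a list of values; keep the first one.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params["routeName"] = str(route_name)
    params["direction"] = str(direction)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Show the variable parts, then build the URL for a hypothetical route "33".
print(parse_qs(urlsplit(SAMPLE_URL).query))
print(route_url("33"))
```

Only routeName and direction are changed here; whether the server accepts a given route name still has to be checked in the browser.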

Take DICJ.gov.mo as another example. The page URL is:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html

If we inspect the network requests, we can find the behind-the-scenes XML URL:

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/report_cn.xml?id=10
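Once the pattern is known, the year segment can be treated as a variable. This is a minimal sketch that assumes other years follow the same layout as the 2020 URL above, which is not verified here; the function name dicj_report_url is hypothetical.

```python
BASE = "http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal"


def dicj_report_url(year, report_id=10):
    """Build the XML report URL for a year, following the 2020 pattern."""
    return f"{BASE}/{year}/report_cn.xml?id={report_id}"


# Print the candidate URLs for a few years.
for year in range(2018, 2021):
    print(dicj_report_url(year))
```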

Example: Web Search Python Documentation Site¶

Sometimes we can speed up our daily work just by automatically opening the URL that we need. We can use the webbrowser module to do so.

In [2]:

import webbrowser

query = "webbrowser"

url = f"https://docs.python.org/3/search.html?q={query}&check_keywords=yes&area=default"

webbrowser.open(url)

Out[2]:

True

✏️ Exercise Time

Try replacing the hard-coded query with an input() prompt that asks for the search query:

In [133]:

import webbrowser

### Start writing your code here
None
### End writing your code

webbrowser.open(url)

Out[133]:

True

Expected question to ask
Please input search query to search Python doc:

Example: Searching DuckDuckGo search engine¶

The DuckDuckGo search engine can jump straight to the first search result when we add an exclamation mark (!) to the query string. We will use this feature to create a Python script.

In [130]:

import webbrowser

query = "Python history"

url = f"https://duckduckgo.com?q=!+{query}"

webbrowser.open(url)

Out[130]:

True
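The query above contains a space, which browsers usually tolerate, but characters such as & or # would change the meaning of the URL. A minimal sketch of encoding the query with urllib.parse.urlencode before opening it; the helper name ducky_url is an assumption for illustration.

```python
from urllib.parse import urlencode


def ducky_url(query):
    """Build a DuckDuckGo URL whose query starts with "! "."""
    # urlencode turns spaces into "+" and escapes characters such as "&" or "#".
    return "https://duckduckgo.com/?" + urlencode({"q": "! " + query})


print(ducky_url("Python history"))
print(ducky_url("C# & .NET"))
```

The resulting string can be passed to webbrowser.open in the same way as before.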

✏️ Exercise Time

Try replacing the hard-coded query with an input() prompt that asks for the search query:

In [129]:

import webbrowser

### Start writing your code here
None
### End writing your code

webbrowser.open(url)

Out[129]:

True

Expected question to ask
Please input search query :

Example: Google Maps search near Macao¶

In [5]:

import webbrowser

query = "Book store"

# A map search in Macao.
url = f"https://www.google.com/maps/search/{query}/@22.1612464,113.5303786,13z"

webbrowser.open(url)

Out[5]:

True

✏️ Exercise Time

Try changing the map location to Shanghai.

In [ ]:

import webbrowser

query = "Book store"

# Start writing your code here
latitude = None
longitude = None
zoom_level = 13
url = f"https://www.google.com/maps/search/{query}/@{latitude},{longitude},{zoom_level}z"

webbrowser.open(url)

URL for iOS apps¶

On iOS, we can use x-callback-url to interact with other apps from Python by using Pythonista.

There is a website that collects x-callback-url schemes for iOS apps:

http://x-callback-url.com/apps/

For example, Things, a task manager, provides an x-callback-url API:

https://culturedcode.com/things/support/articles/2803573/

Another example is Bear, a note-taking iOS app, which also provides an x-callback-url API:

https://bear.app/faq/X-callback-url%20Scheme%20documentation/
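An x-callback-url is built like any other URL with a query string. The sketch below assembles a URL that asks Bear to create a note. The create action and the title and text parameter names are assumptions here and should be checked against the Bear documentation linked above; the helper name bear_create_url is hypothetical.

```python
from urllib.parse import urlencode, quote


def bear_create_url(title, text):
    """Build a Bear 'create note' URL; action and parameter names are assumed."""
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode({"title": title, "text": text}, quote_via=quote)
    return f"bear://x-callback-url/create?{query}"


url = bear_create_url("Lecture 6", "Fetching data online")
print(url)
# On iOS, Pythonista could then hand this URL to webbrowser.open(url).
```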


Downloading files¶

We can use urlretrieve from the urllib.request module to download files.

For example, we can download stock charts from the AAStocks server with the following code.

In [7]:

'''Download charts from the AAStocks server for the given stock numbers.'''

from urllib.request import urlretrieve

stock_numbers = ['0001','0005','0011','0700','3333','0002','0012']

for stock_number in stock_numbers:
    url = "http://charts.aastocks.com/servlet/Charts?fontsize=12&15MinDelay=T&lang=1&titlestyle=1&vol=1&Indicator=1&indpara1=10&indpara2=20&indpara3=50&indpara4=100&indpara5=150&subChart1=2&ref1para1=14&ref1para2=0&ref1para3=0&subChart2=3&ref2para1=12&ref2para2=26&ref2para3=9&subChart3=12&ref3para1=0&ref3para2=0&ref3para3=0&scheme=3&com=100&chartwidth=660&chartheight=855&stockid=00{}.HK&period=6&type=1&logoStyle=1".format(stock_number)
    urlretrieve(url, '{}-chart.gif'.format(stock_number))
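When the same script is run repeatedly, it can help to skip files that were already downloaded. This is a minimal sketch of such a wrapper around urlretrieve; the function name download and the overwrite flag are assumptions for illustration. The demo uses a local file:// URL as a stand-in for a real chart so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download(url, filename, overwrite=False):
    """Save url to filename and return the path; skip files that already exist."""
    target = Path(filename)
    if target.exists() and not overwrite:
        return target
    # Create the destination folder if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target


# Offline demo: "download" a local stand-in file through a file:// URL.
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "source.gif"
    source.write_bytes(b"GIF89a")  # placeholder bytes, not a real chart
    saved = download(source.as_uri(), Path(folder) / "charts" / "0001-chart.gif")
    print(saved.name, saved.read_bytes())
```

In the stock chart loop, the call urlretrieve(url, ...) could be swapped for download(url, ...) with the same two arguments.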


Fetching XML¶

In [10]:

pip install untangle

Collecting untangle
  Downloading untangle-1.1.1.tar.gz (3.1 kB)
Building wheels for collected packages: untangle
  Building wheel for untangle (setup.py) ... done
  Created wheel for untangle: filename=untangle-1.1.1-py3-none-any.whl size=3410 sha256=678ed047367a6d024ab37d3d424ef606a5d3de48f1d2aa254c5acdb9da946713
  Stored in directory: /Users/makzan/Library/Caches/pip/wheels/b9/a9/9c/45580c8b7a00e3e79b889e8e78a4f3427fff5a4d48f1cfea0a
Successfully built untangle
Installing collected packages: untangle
Successfully installed untangle-1.1.1
Note: you may need to restart the kernel to use updated packages.

Example: SMG.gov.mo¶

xml.smg.gov.mo

In [134]:

import untangle
import datetime

obj = untangle.parse('https://xml.smg.gov.mo/c_actual_brief.xml')

temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata
humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata

print("現時澳門氣溫 " + temperature + " 度，濕度 " + humidity + "%。")

現時澳門氣溫 30 度，濕度 81%。

The code above may raise an error, depending on how many "Temperature" entries SMG.gov.mo returns.

If there is only one Temperature entry, we can access it directly. If there is more than one, untangle returns a list instead. We can tell the two cases apart by checking type(target) == list.

In [138]:

type([]) == list

Out[138]:

True

In [139]:

import untangle
import datetime

obj = untangle.parse('https://xml.smg.gov.mo/c_actual_brief.xml')

humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata

if type(obj.ActualWeatherBrief.Custom.Temperature) == list:
    temperature = obj.ActualWeatherBrief.Custom.Temperature[0].Value.cdata
else:
    temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata


print("現時澳門氣溫 " + temperature + " 度，濕度 " + humidity + "%。")

現時澳門氣溫 30 度，濕度 81%。
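The list check can be wrapped in a small helper so that it does not have to be repeated for every element. A minimal sketch, using isinstance instead of comparing types; the helper name first_of is an assumption, and the objects below are stand-ins for parsed elements rather than real SMG data.

```python
from types import SimpleNamespace


def first_of(node):
    """Return the first item when node is a list, otherwise node itself."""
    return node[0] if isinstance(node, list) else node


# Stand-ins for parsed XML elements (hypothetical values, not real SMG data).
single = SimpleNamespace(cdata="30")
repeated = [SimpleNamespace(cdata="30"), SimpleNamespace(cdata="29")]

print(first_of(single).cdata)
print(first_of(repeated).cdata)
```

With the weather data, this would read first_of(obj.ActualWeatherBrief.Custom.Temperature).Value.cdata.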

Example: Monthly gross gaming revenue¶

http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/index.html

In [81]:

import untangle
import datetime

year = datetime.date.today().year

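# The list index begins at 0, and we look for the previous month.
month = datetime.date.today().month - 1 - 1

# In January the previous month is December of the previous year.
if month < 0:
    year = year - 1
    month = 11  # December, because the list index begins at 0.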

url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"

data = untangle.parse(url)

month_data = data.STATISTICS.REPORT.DATA.RECORD[month]

net_income = month_data.DATA[1].cdata
last_net_income = month_data.DATA[2].cdata
change_rate = month_data.DATA[3].cdata
acc_net_income = month_data.DATA[4].cdata
acc_last_net_income = month_data.DATA[5].cdata
acc_change_rate = month_data.DATA[6].cdata

print(f"{year} 年 {month+1} 月份 毛收入 {net_income} ({year-1}:{last_net_income}), {change_rate}")
print(f"{year} 年 {month+1} 月份 累計毛收入 {acc_net_income} ({year-1}:{acc_last_net_income}), {acc_change_rate}")

2020 年 5 月份 毛收入 1,764 (2019:25,952), -93.2%
2020 年 5 月份 累計毛收入 33,004 (2019:125,691), -73.7%

Monthly gross gaming revenue for the past 12 months¶

In [82]:

def fetch_and_print_dicj_year_month(year, month):
    url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"

    data = untangle.parse(url)

    month_data = data.STATISTICS.REPORT.DATA.RECORD[month]

    net_income = month_data.DATA[1].cdata
    last_net_income = month_data.DATA[2].cdata
    change_rate = month_data.DATA[3].cdata
    acc_net_income = month_data.DATA[4].cdata
    acc_last_net_income = month_data.DATA[5].cdata
    acc_change_rate = month_data.DATA[6].cdata

    print(f"{year} 年 {month+1}  月份 毛收入\t {net_income} \t ({year-1}:{last_net_income}), {change_rate}")
#     print(f"{year} 年 {month+1} 累計毛收入\t {acc_net_income}\t ({year-1}:{acc_last_net_income}), {acc_change_rate}")

In [83]:

import untangle
import datetime

for i in range(-12,0):    
    date = datetime.date.today() + datetime.timedelta(days=i*30)    
    fetch_and_print_dicj_year_month(date.year, date.month-1)

2019 年 6  月份 毛收入	 23,812 	 (2018:22,490), 5.9%
2019 年 7  月份 毛收入	 24,453 	 (2018:25,327), -3.5%
2019 年 8  月份 毛收入	 24,262 	 (2018:26,559), -8.6%
2019 年 9  月份 毛收入	 22,079 	 (2018:21,952), 0.6%
2019 年 10  月份 毛收入	 26,443 	 (2018:27,328), -3.2%
2019 年 11  月份 毛收入	 22,877 	 (2018:24,995), -8.5%
2019 年 12  月份 毛收入	 22,838 	 (2018:26,468), -13.7%
2020 年 1  月份 毛收入	 22,126 	 (2019:24,942), -11.3%
2020 年 2  月份 毛收入	 3,104 	 (2019:25,370), -87.8%
2020 年 3  月份 毛收入	 5,257 	 (2019:25,840), -79.7%
2020 年 4  月份 毛收入	 754 	 (2019:23,588), -96.8%
2020 年 5  月份 毛收入	 1,764 	 (2019:25,952), -93.2%
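Steps of 30 days only approximate a month, so near a month boundary the loop above may repeat or skip a month. A minimal sketch of computing the previous months exactly with integer arithmetic; the function name previous_months is an assumption. It returns a 0-based month index to match the month argument of fetch_and_print_dicj_year_month.

```python
import datetime


def previous_months(count, today=None):
    """Return (year, month_index) pairs for the count months before the
    current month, oldest first. month_index is 0-based (0 = January)."""
    today = today or datetime.date.today()
    # Number months consecutively so crossing a year boundary is plain arithmetic.
    current = today.year * 12 + (today.month - 1)
    return [divmod(index, 12) for index in range(current - count, current)]


for year, month_index in previous_months(12, datetime.date(2020, 6, 17)):
    print(year, month_index + 1)
```

Each pair could then be passed on as fetch_and_print_dicj_year_month(year, month_index).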

Example: Exchange Rate API¶

https://exchangeratesapi.io

In [84]:

import requests

url = "https://api.exchangeratesapi.io/latest?symbols=HKD&base=CNY"

response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error status
data = response.json()       # parse the JSON body into a dict
print(data)

print(data['rates']['HKD'])

{'rates': {'HKD': 1.0935529258}, 'base': 'CNY', 'date': '2020-06-17'}
1.0935529258
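Once the response is parsed, the rate can be used like any other number. The sketch below works offline on the response shown above, re-encoded as JSON text, and converts an amount from the base currency; the function name convert is an assumption for illustration.

```python
import json

# The response shown above, re-encoded as JSON text for an offline example.
sample = '{"rates": {"HKD": 1.0935529258}, "base": "CNY", "date": "2020-06-17"}'


def convert(amount, payload, symbol):
    """Convert amount from the payload's base currency into symbol."""
    return amount * payload["rates"][symbol]


data = json.loads(sample)
print(f"100 {data['base']} = {convert(100, data, 'HKD'):.2f} HKD on {data['date']}")
```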
