Version 1.0, by Makzan. Last updated 2021 March.
In this series, we will spend 3 lectures learning how to fetch data online. This includes:
We need to know the URL in order to download files or scrape a web page. Usually this means finding the variable patterns in the URL. For example, from the following URL, we can find the pattern of the search query.
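To make the idea of a variable pattern concrete, here is a minimal sketch that splits a URL into its fixed part and its query parameters with the standard urllib.parse module. The URL is the Python documentation search URL used later in this lecture; the variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://docs.python.org/3/search.html?q=webbrowser&check_keywords=yes&area=default"

parts = urlsplit(url)
# The fixed part: scheme, host and path stay the same for every search.
base = f"{parts.scheme}://{parts.netloc}{parts.path}"
# The variable part: each query parameter maps to a list of values.
params = parse_qs(parts.query)

print(base)
print(params)
```

Here q is the part that changes with each search, so it is the piece we would replace when generating URLs ourselves.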
Let’s take a closer look at the DSAT.gov.mo bus route page. If we click the bus routes, we can observe that the page URL doesn’t change. There may be 2 reasons:
If it is the first reason, we will need a more advanced browser driver technique. If it is the second reason, we can get the URL by opening the link in a new tab, or simply copying the link location via right-click.
Now we can observe the URL for each route has the following pattern.
Take DICJ.gov.mo as an example. The page URL is:
http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/index.html
If we inspect the network requests, we can find the behind-the-scenes XML URL:
http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/2020/report_cn.xml?id=10
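Once a pattern like this is known, generating URLs becomes a small function. The sketch below assumes that the year and the id are the only variable parts, as the two URLs above suggest; the function name dicj_report_url is hypothetical, and the sketch does not check whether every year and id combination exists on the server.

```python
def dicj_report_url(year, report_id):
    """Build the XML report URL for a given year and report id."""
    return (
        "http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/"
        f"{year}/report_cn.xml?id={report_id}"
    )

# Print the URL for two different years with the same report id.
for year in (2019, 2020):
    print(dicj_report_url(year, 10))
```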
Sometimes, we can speed up our daily operation just by automatically opening the URL that we need. We can use the webbrowser module to do so.
import webbrowser
query = "webbrowser"
url = f"https://docs.python.org/3/search.html?q={query}&check_keywords=yes&area=default"
webbrowser.open(url)
True
✏️ Exercise Time
Please try to replace the hard-coded query with an input() prompt that asks for the search query:
import webbrowser
### Start writing your code here
None
### End writing your code
webbrowser.open(url)
True
Expected question to ask |
---|
Please input search query to search Python doc: |
The DuckDuckGo search engine jumps straight to the first search result when we add an exclamation mark (!) to the query string. We will use this feature to create a Python script.
import webbrowser
query = "Python history"
url = f"https://duckduckgo.com?q=!+{query}"
webbrowser.open(url)
True
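The f-string above works for simple queries, but characters such as & or # have a special meaning inside a URL. A minimal sketch, assuming we want to keep the same URL pattern, is to escape the query with quote_plus from urllib.parse. The helper name duckduckgo_first_result_url is hypothetical.

```python
from urllib.parse import quote_plus

def duckduckgo_first_result_url(query):
    """Return a DuckDuckGo URL that jumps to the first result for `query`."""
    # quote_plus turns spaces into "+" and escapes characters such as "&" or "#".
    return f"https://duckduckgo.com?q=!+{quote_plus(query)}"

print(duckduckgo_first_result_url("Python history"))
print(duckduckgo_first_result_url("C# & .NET"))
```

The returned string can be passed to webbrowser.open in the same way as before.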
✏️ Exercise Time
Please try to replace the hard-coded query with an input() prompt that asks for the search query:
import webbrowser
### Start writing your code here
None
### End writing your code
webbrowser.open(url)
True
Expected question to ask |
---|
Please input search query : |
import webbrowser
query = "Book store"
# A map search in Macao.
url = f"https://www.google.com/maps/search/{query}/@22.1612464,113.5303786,13z"
webbrowser.open(url)
True
✏️ Exercise Time
Try to change the map location to Shanghai.
import webbrowser
query = "Book store"
# Start writing your code here
latitude = None
longitude = None
zoom_level = 13
url = f"https://www.google.com/maps/search/{query}/@{latitude},{longitude},{zoom_level}z"
webbrowser.open(url)
On iOS, we can use x-callback-url to interact with other apps from Python by using Pythonista.
There is a website that collects x-callback-url schemes for iOS apps:
http://x-callback-url.com/apps/
For example, Things, a task manager, provides an x-callback-url API:
https://culturedcode.com/things/support/articles/2803573/
Another example is Bear, a note-taking iOS app, which also provides an x-callback-url API:
https://bear.app/faq/X-callback-url%20Scheme%20documentation/
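As a rough sketch of what such a URL looks like, the code below builds one for Bear. It assumes the /create action with title and text parameters as described in the Bear documentation linked above; please check that page for the exact parameters. The function name bear_create_note_url is hypothetical.

```python
from urllib.parse import urlencode, quote

def bear_create_note_url(title, text):
    """Build an x-callback-url that asks Bear to create a note."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    params = urlencode({"title": title, "text": text}, quote_via=quote)
    return f"bear://x-callback-url/create?{params}"

url = bear_create_note_url("Shopping list", "Milk & eggs")
print(url)
```

In Pythonista, passing this URL to webbrowser.open(url) is expected to hand it over to the target app.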
We can use urlretrieve from the urllib.request module to download a file.
For example, the following code downloads stock charts from the AAStocks server for a list of stock numbers. The same function can download other files, such as the geckodriver.zip file from its GitHub repository.
'''Download chart from AAStock server with given stock numbers.'''
from urllib.request import urlretrieve
stock_numbers = ['0001','0005','0011','0700','3333','0002','0012']
for stock_number in stock_numbers:
    # The stock number is the only variable part of the chart URL.
    url = "http://charts.aastocks.com/servlet/Charts?fontsize=12&15MinDelay=T&lang=1&titlestyle=1&vol=1&Indicator=1&indpara1=10&indpara2=20&indpara3=50&indpara4=100&indpara5=150&subChart1=2&ref1para1=14&ref1para2=0&ref1para3=0&subChart2=3&ref2para1=12&ref2para2=26&ref2para3=9&subChart3=12&ref3para1=0&ref3para2=0&ref3para3=0&scheme=3&com=100&chartwidth=660&chartheight=855&stockid=00{}.HK&period=6&type=1&logoStyle=1".format(stock_number)
    # Save each chart as <stock number>-chart.gif in the current folder.
    urlretrieve(url, '{}-chart.gif'.format(stock_number))
urlretrieve returns a tuple of the local file name and the response headers. For example, downloading a file saved as chromedriver.zip returns:
('chromedriver.zip', <http.client.HTTPMessage object at 0x1091cd350>)
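Downloads can fail for a single file without the rest being affected. The following is a minimal sketch, not the original approach, that wraps urlretrieve in a helper which saves several files into a folder and skips the ones that fail. The name download_all and its parameters are hypothetical.

```python
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlretrieve

def download_all(urls_by_name, folder):
    """Download each URL into `folder`; return the file names that were saved."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in urls_by_name.items():
        target = folder / name
        try:
            # urlretrieve returns (local file name, response headers).
            filename, headers = urlretrieve(url, str(target))
        except URLError as error:
            # HTTPError is a subclass of URLError, so HTTP failures land here too.
            print(f"Skipped {name}: {error}")
            continue
        saved.append(Path(filename).name)
    return saved
```

It could be called with a dictionary such as {"0001-chart.gif": chart_url}, where chart_url is built as in the loop above.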
pip install untangle
Collecting untangle
  Downloading untangle-1.1.1.tar.gz (3.1 kB)
Building wheels for collected packages: untangle
  Building wheel for untangle (setup.py) ... done
  Created wheel for untangle: filename=untangle-1.1.1-py3-none-any.whl size=3410 sha256=678ed047367a6d024ab37d3d424ef606a5d3de48f1d2aa254c5acdb9da946713
  Stored in directory: /Users/makzan/Library/Caches/pip/wheels/b9/a9/9c/45580c8b7a00e3e79b889e8e78a4f3427fff5a4d48f1cfea0a
Successfully built untangle
Installing collected packages: untangle
Successfully installed untangle-1.1.1
Note: you may need to restart the kernel to use updated packages.
xml.smg.gov.mo
import untangle
import datetime
obj = untangle.parse('https://xml.smg.gov.mo/c_actual_brief.xml')
temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata
humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata
print("現時澳門氣溫 " + temperature + " 度,濕度 " + humidity + "%。")
現時澳門氣溫 30 度,濕度 81%。
The code above may raise an error, depending on how many Temperature entries SMG.gov.mo returns. If there is only one Temperature entry, we can access it directly. If there is more than one, it becomes a list. We can determine whether it is a list by checking type(target) == list.
type([]) == list
True
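As an alternative sketch, isinstance is the more common way to write this check, and wrapping a single value in a list lets later code always use index [0]. The helper name as_list is hypothetical and not part of untangle.

```python
def as_list(value):
    """Return `value` unchanged if it is a list, else wrap it in a one-item list."""
    if isinstance(value, list):
        return value
    return [value]

print(as_list("30"))          # a single item becomes a one-item list
print(as_list(["30", "29"]))  # a list is returned as-is
```

With such a helper, as_list(obj.ActualWeatherBrief.Custom.Temperature)[0] should work in both cases.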
import untangle
import datetime
obj = untangle.parse('https://xml.smg.gov.mo/c_actual_brief.xml')
humidity = obj.ActualWeatherBrief.Custom.Humidity.Value.cdata
if type(obj.ActualWeatherBrief.Custom.Temperature) == list:
    temperature = obj.ActualWeatherBrief.Custom.Temperature[0].Value.cdata
else:
    temperature = obj.ActualWeatherBrief.Custom.Temperature.Value.cdata
print("現時澳門氣溫 " + temperature + " 度,濕度 " + humidity + "%。")
現時澳門氣溫 30 度,濕度 81%。
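To see the XML structure that the attribute path walks through, here is a sketch using xml.etree.ElementTree from the standard library in place of untangle. The sample XML is made up for illustration: only the element names follow the code above, and the real feed contains more fields.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the element names follow the code above.
sample = """
<ActualWeatherBrief>
  <Custom>
    <Temperature><Value>30</Value></Temperature>
    <Temperature><Value>29</Value></Temperature>
    <Humidity><Value>81</Value></Humidity>
  </Custom>
</ActualWeatherBrief>
"""

root = ET.fromstring(sample)
# findall always returns a list, whether there are one or many matches.
temperatures = [node.text for node in root.findall("./Custom/Temperature/Value")]
humidity = root.findtext("./Custom/Humidity/Value")

print(temperatures[0], humidity)
```

Because findall returns a list even for a single match, no type check is needed with this approach.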
import untangle
import datetime
year = datetime.date.today().year
# List indexes begin at 0, and we look for the previous month.
month = datetime.date.today().month - 1 - 1
if month < 0:
    # In January, the previous month is December of last year.
    year = year - 1
    month = 11  # December; list indexes begin at 0.
url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"
data = untangle.parse(url)
month_data = data.STATISTICS.REPORT.DATA.RECORD[month]
net_income = month_data.DATA[1].cdata
last_net_income = month_data.DATA[2].cdata
change_rate = month_data.DATA[3].cdata
acc_net_income = month_data.DATA[4].cdata
acc_last_net_income = month_data.DATA[5].cdata
acc_change_rate = month_data.DATA[6].cdata
print(f"{year} 年 {month+1} 月份 毛收入 {net_income} ({year-1}:{last_net_income}), {change_rate}")
print(f"{year} 年 {month+1} 月份 累計毛收入 {acc_net_income} ({year-1}:{acc_last_net_income}), {acc_change_rate}")
2020 年 5 月份 毛收入 1,764 (2019:25,952), -93.2%
2020 年 5 月份 累計毛收入 33,004 (2019:125,691), -73.7%
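The year rollover in the date arithmetic above is easy to get wrong, so here is a small sketch that isolates it in a function that can be checked without the network. The function name previous_month is hypothetical.

```python
import datetime

def previous_month(today):
    """Return (year, month_index) of the month before `today`; January is index 0."""
    year = today.year
    month_index = today.month - 1 - 1  # 0-based index, then one month back
    if month_index < 0:
        year -= 1
        month_index = 11  # December of the previous year
    return year, month_index

print(previous_month(datetime.date(2020, 6, 17)))
print(previous_month(datetime.date(2021, 1, 5)))
```

The first call gives the index for May 2020 and the second gives December 2020, which covers both the normal case and the January case.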
def fetch_and_print_dicj_year_month(year, month):
    url = f"http://www.dicj.gov.mo/web/cn/information/DadosEstat_mensal/{year}/report_cn.xml?id=8"
    data = untangle.parse(url)
    month_data = data.STATISTICS.REPORT.DATA.RECORD[month]
    net_income = month_data.DATA[1].cdata
    last_net_income = month_data.DATA[2].cdata
    change_rate = month_data.DATA[3].cdata
    acc_net_income = month_data.DATA[4].cdata
    acc_last_net_income = month_data.DATA[5].cdata
    acc_change_rate = month_data.DATA[6].cdata
    print(f"{year} 年 {month+1} 月份 毛收入\t {net_income} \t ({year-1}:{last_net_income}), {change_rate}")
    # print(f"{year} 年 {month+1} 累計毛收入\t {acc_net_income}\t ({year-1}:{acc_last_net_income}), {acc_change_rate}")
import untangle
import datetime
for i in range(-12,0):
    date = datetime.date.today() + datetime.timedelta(days=i*30)
    fetch_and_print_dicj_year_month(date.year, date.month-1)
2019 年 6 月份 毛收入 23,812 (2018:22,490), 5.9%
2019 年 7 月份 毛收入 24,453 (2018:25,327), -3.5%
2019 年 8 月份 毛收入 24,262 (2018:26,559), -8.6%
2019 年 9 月份 毛收入 22,079 (2018:21,952), 0.6%
2019 年 10 月份 毛收入 26,443 (2018:27,328), -3.2%
2019 年 11 月份 毛收入 22,877 (2018:24,995), -8.5%
2019 年 12 月份 毛收入 22,838 (2018:26,468), -13.7%
2020 年 1 月份 毛收入 22,126 (2019:24,942), -11.3%
2020 年 2 月份 毛收入 3,104 (2019:25,370), -87.8%
2020 年 3 月份 毛收入 5,257 (2019:25,840), -79.7%
2020 年 4 月份 毛收入 754 (2019:23,588), -96.8%
2020 年 5 月份 毛收入 1,764 (2019:25,952), -93.2%
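Stepping back in multiples of 30 days only approximates calendar months: 30 days before 31 March is 1 March, so a month can be repeated or skipped depending on the day the script runs. A sketch of an alternative, assuming we only need the (year, month index) pairs, counts months directly. The helper name last_months is hypothetical.

```python
import datetime

def last_months(today, count):
    """Return (year, month_index) pairs for the `count` months before `today`, oldest first."""
    # Number of whole months since year 0, with January as index 0.
    total = today.year * 12 + (today.month - 1)
    result = []
    for offset in range(count, 0, -1):
        # divmod splits the month count back into a year and a 0-based month index.
        year, month_index = divmod(total - offset, 12)
        result.append((year, month_index))
    return result

for year, month_index in last_months(datetime.date(2020, 6, 17), 12):
    print(year, month_index + 1)
```

Each pair could then be passed to fetch_and_print_dicj_year_month(year, month_index).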
import json
import requests
url = "https://api.exchangeratesapi.io/latest?symbols=HKD&base=CNY"
response = requests.get(url)
data = json.loads(response.text)
print(data)
print(data['rates']['HKD'])
{'rates': {'HKD': 1.0935529258}, 'base': 'CNY', 'date': '2020-06-17'}
1.0935529258
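Live APIs can change or become unavailable, so it can help to separate downloading from reading the values. The sketch below is an alternative under stated assumptions, not the original approach: fetch_json uses urlopen from the standard library in place of requests, and rate_for reads a dictionary shaped like the output above. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Download `url` and decode the JSON body (standard library only)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)

def rate_for(data, symbol):
    """Look up one currency in a response shaped like the output above."""
    rates = data.get("rates", {})
    if symbol not in rates:
        raise KeyError(f"No rate for {symbol} in response")
    return rates[symbol]

# Sample copied from the output above, so no network call is needed here.
sample = {'rates': {'HKD': 1.0935529258}, 'base': 'CNY', 'date': '2020-06-17'}
print(rate_for(sample, "HKD"))
```

With requests, response.json() gives the same dictionary as json.loads(response.text).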