#!/usr/bin/env python
# coding: utf-8

# # Collecting information for machine learning purposes. Parsing and Grabbing

# When studying machine learning we mainly concentrate on algorithms for processing data rather than on collecting that data. This is natural, because so many databases are available for downloading: of any type, any size, for any ML algorithm. But in real life we are given particular goals, and of course any data science work starts with collecting the information.

# Today our life is closely connected with the internet and web sites: almost any text information we could need is available online. So in this tutorial we'll consider how to collect particular information from web sites. First of all we'll look a little inside HTML code to understand better how to extract information from it.

# HTML "tells" the web browser when, where and what element to show on the page. We can imagine it as a map that specifies the route for a driver: where to start, where to turn left or right and where to stop. That's why the HTML structure of web pages is so convenient for grabbing information. Here is a simple piece of HTML code:

# In[ ]:

"""
<h1>Hello! This is the first text paragraph</h1>
<p>and below how this code is interpreted by web browser</p>
"""

# Hello! This is the first text paragraph
#
# and below how this code is interpreted by web browser
# Two markers, 'h1' and 'p', tell the browser what to show on the page and how, and so these markers are the keys that help us get exactly the information we need. There is plenty of material about the HTML language and its main tags ('h1', 'p', 'html', etc. are all tags), so you can study it more deeply on your own; here we will focus on the parsing process. For this purpose we will use the BeautifulSoup Python library:

# In[ ]:

from bs4 import BeautifulSoup
# Before learning how to grab information directly online, let's load a little HTML page as a text file from our folder (right-click on this link and download the file to the folder with the Jupyter notebook), since sometimes we work with already downloaded files.
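# In case you cannot download the file, the next cell writes a hypothetical stand-in for toy.html. Its content is my assumption, built only from the tags and attributes used in the cells below (classes "main 1" and "main 2", an id="third"); the real toy.html may differ.

# In[ ]:

import os

# hypothetical toy.html; written only if the original file is not in the folder
if not os.path.exists("toy.html"):
    toy_html = """<html>
<body>
<h1>A toy page</h1>
<p class="main 1">This is the first paragraph</p>
<p class="main 2" id="third">This is the second paragraph - the information we need</p>
</body>
</html>
"""
    with open("toy.html", "w") as f:
        f.write(toy_html)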
# In[ ]:

# read the local file into one string
with open("toy.html", "r") as source_text:
    html = source_text.read()
print(html)

# Our file consists of several tags, and the information we need is contained in the second 'p' tag. To get it, we have to "feed" the whole HTML code to BeautifulSoup so that it can parse through it and find what we need.

# In[ ]:

# "feed" the whole page
soup = BeautifulSoup(html, "lxml")
# find all <p> blocks in the page
all_p = soup.find_all("p")
print(all_p)
# The '.find_all' function collects all the 'p' blocks in our file. We simply choose the necessary p-element from the list by its index and keep only the text inside the tag:
# In[ ]:

# extract the tag's text
all_p[1].text

# But when there are many uniform tags (like 'p') in the code, or when the index of the needed paragraph changes from page to page, such an approach will not do the right thing. Today almost every tag carries special attributes like 'id', 'class', 'title', etc. (to know more about them, look up the CSS style sheet language). For us these attributes are additional anchors that let us pull exactly the right paragraph from the page source. With the '.find' function we get not a list but a single element (make sure such an element is the only one on the page, otherwise you may miss some information).

# In[ ]:

# We know the needed paragraph has the attribute id='third'. Let's grab it
p_third = soup.find("p", id="third").text
print(p_third)

# In case of searching by the class attribute, we have to use another way of coding,
# because "class" is a reserved word in Python
p_first = soup.find("p", {"class": "main 1"}).text
print(p_first)

# In the times of dynamic pages, which have different CSS styles for different types of devices, you will very often face the problem of changing tag attribute names. Or they may change a little from page to page depending on the other content. When the names of the needed tag blocks are totally different, we have to set up a more complex grabbing "architecture". But usually there are common words in those differing names. In our toy case we have two paragraphs with the word "main" in the 'class' attribute.

# In[ ]:

# find all paragraphs whose class attribute contains the word "main"
p_main = soup.find_all("p", {"class": "main"})
p_main = [p.text for p in p_main]
print(p_main)

# find all paragraphs whose class attribute contains "2"
p_second = soup.find_all("p", {"class": "2"})
p_second = [p.text for p in p_second]
print(p_second)
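# As a side note (not from the original tutorial), the same anchors can be written as CSS selectors with BeautifulSoup's .select() / .select_one() methods, which is handy when the selectors get longer:

# In[ ]:

# the same lookups as above, expressed as CSS selectors
p_third_css = soup.select_one("p#third").text           # by id
p_main_css = [p.text for p in soup.select("p.main")]    # by one of the classes
print(p_third_css)
print(p_main_css)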
# You can also take the code between two tags that contains smaller tag blocks inside, and clean those out, leaving only the text.
# In[ ]:

# get all the HTML code inside the "html" tag
html_tag = soup.find("html")
print(html_tag)

# clear it of the inner tags, leaving only text
html_tag = html_tag.text
print(html_tag)
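# A small aside of my own: BeautifulSoup's get_text() method does the same flattening and lets you control the separator and strip extra whitespace:

# In[ ]:

# equivalent flattening of all nested tags into plain text
clean_text = soup.find("html").get_text(separator=" ", strip=True)
print(clean_text)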
# # Let's do things more complicated. The first grabbing task

# Imagine we have the task to analyze whether there is a correlation between the titles of major news and the price of Bitcoin. On the one hand we need to collect news about Bitcoin for a defined period, and on the other - the price. To do this we need to connect a few extra libraries, including "selenium". Then download chromedriver and put it in the folder with the Jupyter notebook. Selenium connects the Python script with the Chrome browser and lets us send commands to it and receive the HTML code of the loaded pages.

# In[ ]:

import random
import time
from datetime import datetime, timedelta

from selenium import webdriver

# One way to get the necessary news is to use Google Search. First of all because it grabs news headlines from many sites, so we don't need to tune our script for every news portal. Secondly, we can browse news by dates. What we have to do is understand how the link of the Google News section works:

# https://www.google.com/search?q=bitcoin&num=100&biw=1920&bih=938&source=lnt&tbs=cdr%3A1%2Ccd_min%3A12%2F11%2F2018%2Ccd_max%3A12%2F11%2F2018&tbm=nws
#
# "search?q=bitcoin" - what we are searching for
#
# "num=100" - number of headlines
#
# "cd_min%3A12%2F11%2F2018" - start date (cd_min%3A [12] %2F [11] %2F [2018] %2C - 12/11/2018 - MM/DD/YYYY)
#
# "cd_max%3A12%2F11%2F2018" - end date
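# As a sketch of my own (not part of the original tutorial), the same link can be assembled programmatically, which makes the date handling less error-prone. The parameter names are taken from the URL above; urlencode takes care of the percent-encoding of the 'tbs' value.

# In[ ]:

from datetime import date
from urllib.parse import urlencode


def google_news_url(query, day, num=100):
    # build the Google News search URL for a single day (datetime.date)
    day_str = day.strftime("%m/%d/%Y")
    params = {
        "q": query,
        "num": num,
        "source": "lnt",
        # cdr:1 switches on the custom date range; cd_min/cd_max bound it (MM/DD/YYYY)
        "tbs": "cdr:1,cd_min:{0},cd_max:{0}".format(day_str),
        "tbm": "nws",
    }
    return "https://www.google.com/search?" + urlencode(params)


print(google_news_url("bitcoin", date(2018, 1, 1)))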
# Let's try to load news for the word "bitcoin" for 01/01/2018.
# In[ ]:

# start the Chrome web browser
driver = webdriver.Chrome()

# set the settings
cur_year = 2018
cur_month = 1
cur_day = 1
news_word = "bitcoin"

# set up the url
cur_url = (
    "https://www.google.com/search?q="
    + str(news_word)
    + "&num=100&biw=1920&bih=929&source=lnt&tbs=cdr:1,cd_min:"
    + str(cur_month) + "/" + str(cur_day) + "/" + str(cur_year)
    + ",cd_max:"
    + str(cur_month) + "/" + str(cur_day) + "/" + str(cur_year)
    + "&tbm=nws"
)

# load the url in chromedriver
driver.get(cur_url)
# wait a little to let the page load fully
time.sleep(random.uniform(5, 10))
# read the html code of the page loaded in Chrome
html = driver.page_source
soup = BeautifulSoup(html, "lxml")

# We are lucky: correct page, correct search word and correct date. To move on, we need to examine the HTML code and find the tags (anchors) that let us grab the necessary information. The most convenient way is to use the "Inspect" item in the right-click menu of the Google Chrome web browser (or a similar one in other browsers). See the screenshot.

# As we can see, the 'h3' tag is responsible for the blocks with news titles. This tag has the attribute class="r dO0Ag", but in this case we can use just the 'h3' tag as the anchor, because it is used only to highlight titles.

# In[ ]:

# collect all h3 tags in the code
titles = soup.find_all("h3")
print(titles)
# There are a lot of additional tags inside the 'h3' blocks; that's why we use a loop to clear them out and leave only the text.
# In[ ]:

titles = [title.text for title in titles]

# In[ ]:

print(titles)
print(len(titles))
# That's all. We get 21 news titles dated 1 January 2018. We can also grab a few opening sentences from those news items and use them in future analysis.
# In[ ]:

# an alternative way to specify the class attribute is to use "class_"
news = soup.find_all("div", class_="st")
news = [new.text for new in news]
print(news)
print(len(news))

# But one day of history is nothing for correlation detection. That's why we'll create a list of 10 dates (for educational purposes) and set up a grabbing loop to get the news for all of them. Tip: if you want to change the language of the news, change it manually in the settings when the first page is loaded during script execution, or in a few minutes you'll learn how to do this programmatically.

# In[ ]:

# create the list of dates
dates = [
    datetime.strptime("01/25/2018", "%m/%d/%Y") + timedelta(days=n) for n in range(10)
]
print(dates)

# In[ ]:

# create lists to save the information grabbed by dates
news = []
titles = []

# start the loop
for date in dates:
    cur_year = date.year
    cur_month = date.month
    cur_day = date.day
    news_word = "bitcoin"
    cur_url = (
        "https://www.google.com/search?q="
        + str(news_word)
        + "&num=100&biw=1920&bih=929&source=lnt&tbs=cdr:1,cd_min:"
        + str(cur_month) + "/" + str(cur_day) + "/" + str(cur_year)
        + ",cd_max:"
        + str(cur_month) + "/" + str(cur_day) + "/" + str(cur_year)
        + "&tbm=nws"
    )
    driver.get(cur_url)
    # we have to increase the pause between page loads to avoid our activity
    # being detected as a robot's. So you have time for a cup of coffee while waiting.
    time.sleep(random.uniform(60, 120))
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    cur_titles = soup.find_all("h3")
    cur_titles = [title.text for title in cur_titles]
    titles.append(cur_titles)
    # grab the news snippets as well (the "st" divs, as above)
    cur_news = soup.find_all("div", class_="st")
    cur_news = [new.text for new in cur_news]
    news.append(cur_news)

print(len(dates))
print(len(titles))
print(len(news))

# check if the script works properly
print(dates[5])
print(titles[5][:5])
print(news[5][:5])

driver.quit()
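# Since the stated goal is to correlate these headlines with the Bitcoin price later, it is convenient to store them in tabular form right away. Below is a minimal sketch of my own (not part of the original tutorial): it flattens the per-date lists into a pandas DataFrame and saves them to CSV; the file name "bitcoin_news.csv" is arbitrary.

# In[ ]:

import pandas as pd

# one row per headline; zip() simply pairs each title with its snippet per day
rows = []
for day, day_titles, day_news in zip(dates, titles, news):
    for title, snippet in zip(day_titles, day_news):
        rows.append({"date": day.date(), "title": title, "snippet": snippet})

news_table = pd.DataFrame(rows)
news_table.to_csv("bitcoin_news.csv", index=False)
print(news_table.head())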
# # But if it were so simple, it wouldn't be so interesting

# First of all, such grabbing can be detected by a web site's algorithms as robot activity and can get banned. Secondly, some web sites hide the content of their pages and show it only when you scroll down the page. Thirdly, very often we need to enter values into input boxes, click links to open the next/previous page or click a download button. To solve these problems we can use special methods to control the browser.

# Let's open the example page at Yahoo! Finance: https://finance.yahoo.com/quote/SNA/history?p=SNA. If you scroll down the page you'll see that the content loads up periodically, until it finally reaches the last row, "Dec 13, 2017". But when the page has just opened and we view the page source (Ctrl+U in Google Chrome), we will not find "Dec 13, 2017" there. So to get the data for all dates for this symbol, we first have to scroll down to the end and only after that parse the page. The following code solves this problem (to learn different ways of scrolling look here: https://goo.gl/JdSvR4):

# In[ ]:

import pandas as pd

driver = webdriver.Chrome()
cur_url = "https://finance.yahoo.com/quote/SNA/history?p=SNA"
driver.get(cur_url)

SCROLL_PAUSE_TIME = 1
equal = 0
html_len_list = []

while True:
    window_size = driver.get_window_size()
    # get the size of the loaded page
    html_len = len(driver.page_source.encode("utf-8"))
    html_len_list.append(html_len)
    # scroll down by one window height
    driver.execute_script(
        "window.scrollTo(0, window.scrollY +" + str(window_size["height"]) + ")"
    )
    # give the new content time to load
    time.sleep(SCROLL_PAUSE_TIME)
    # get the size of the page with the newly loaded content
    new_html_len = len(driver.page_source.encode("utf-8"))
    # compare the size of the content before and after scrolling:
    # if they are equal, add 1 to "equal" and scroll down again;
    # if they have been equal for more than 4 scrolls in a row, break;
    # if they are not equal, reset "equal" to 0 and scroll down again
    if html_len == new_html_len:
        equal += 1
        if equal > 4:
            break
    else:
        equal = 0

print(html_len_list)

html = driver.page_source
soup = BeautifulSoup(html, "lxml")
table = soup.find("table", {"data-test": "historical-prices"})
table = pd.read_html(str(table))[0]
print(table.head())
print(table.tail())
driver.quit()

# Many web sites prefer to divide one article into two or more parts, so you have to click the 'next' or 'previous' buttons. Our task is to open all those pages and grab them. The same task arises with multi-page catalogues. Here is an example: we will open several pages of the Stack Overflow tags catalogue and collect the top tag words with their occurrence counts across the portal. To do this we will use the find_element_by_css_selector() method to locate a certain element on the page and click on it with the click() method.
# To read more about locating elements, open this: https://goo.gl/PyzbBN

# In[ ]:

import pandas as pd
from selenium.webdriver.common.keys import Keys


def page_parse(html):
    # parse one catalogue page: return the tag names and their occurrence counts
    soup = BeautifulSoup(html, "lxml")
    tags = soup.find_all("div", class_="grid-layout--cell tag-cell")
    tag_text_list = []
    tag_count_list = []
    for tag in tags:
        tag_text = tag.find("a").text
        tag_text_list.append(tag_text)
        tag_count = tag.find("span", class_="item-multiplier-count").text
        tag_count_list.append(tag_count)
    return tag_text_list, tag_count_list


driver = webdriver.Chrome()
cur_url = "https://stackoverflow.com/tags"
driver.get(cur_url)

tag_names, tag_counts = [], []
for i in range(3):
    if i == 0:
        html = driver.page_source
        cur_tag_names, cur_tag_counts = page_parse(html)
        tag_names = tag_names + cur_tag_names
        tag_counts = tag_counts + cur_tag_counts
    else:
        # find the element we need to click
        next_button = driver.find_element_by_css_selector(".page-numbers.next")
        # in some cases it would be enough to run next_button.click(), but sometimes
        # it doesn't work; for more information about possible troubles with click()
        # read here: https://goo.gl/kUGvsC
        driver.execute_script("arguments[0].click();", next_button)
        time.sleep(2)
        html = driver.page_source
        cur_tag_names, cur_tag_counts = page_parse(html)
        tag_names = tag_names + cur_tag_names
        tag_counts = tag_counts + cur_tag_counts

tag_table = pd.DataFrame({"tag": tag_names, "count": tag_counts})
print(tag_table.head())
print(tag_table.tail())
driver.quit()

# Or here is another example: the medium.com site hides part of the comments below articles. But if we need to analyze the "reasons" for a page's popularity, comments can play a great role in the analysis, so it is better to grab all of them. Open this page and scroll to the bottom - you'll find that there is a "Show all responses" button there, implemented as a "div" element. Let's click on it and open all the comments.

# In[ ]:

driver = webdriver.Chrome()
cur_url = "https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60"
driver.get(cur_url)
# locate the div container that holds the button
find_div = driver.find_element_by_css_selector(".container.js-showOtherResponses")
# locate the button inside the container
button = find_div.find_element_by_tag_name("button")
driver.execute_script("arguments[0].click();", button)
# check the page before and after running the script - in the second case all
# the comments are opened

# # Authorization and input boxes

# A lot of information is available only after authorization, so let's learn how to log in to Facebook. The algorithm is the same: find the input boxes for the login and password, insert text into them and then submit it. To send text to the inputs we will use the .send_keys() method, and to submit, the .submit() method.

# In[ ]:

driver = webdriver.Chrome()
cur_url = "https://www.facebook.com/"
driver.get(cur_url)
username_field = driver.find_element_by_name("email")  # get the username field by name
password_field = driver.find_element_by_name("pass")  # get the password field by name
username_field.send_keys("your_email")  # insert your email
password_field.send_keys("your_password")  # insert your password
password_field.submit()

# But these methods are also very useful when we need to change dates or insert values into input boxes to get certain information. For example, here is a "one page tool" for receiving ETF fund flows information: Etf Fund Flows. There are no special pages for each ETF (as Yahoo! has) where you can view or download the desired values. All you can do is enter the ETF symbol, the start and end dates, and click the "Submit" button.
# But if your boss sets a task to obtain historical data for 500 ETFs and the last 10 years (120 months), you'll have to click the "Submit" button 60000 times. What a dull amusement... So let's make an algorithm that can collect this information while you are raving somewhere at an Ibiza party.

# In[ ]:

# a function to get the year, month and day from the start and end date inputs
def convert_date(date):
    year = date.split("/")[2]
    month = date.split("/")[0]
    day = date.split("/")[1]
    # here we have to add a zero before the month if it is less than 10,
    # because the input form requires dates in the format 01/01/2018
    if len(month) < 2:
        month = str("0") + str(month)
    return day, month, year


driver = webdriver.Chrome()

# set some dates
dates = ["6/30/2018", "3/31/2018", "12/31/2017", "9/30/2017"]
# set a few etfs
etfs = ["SPY", "QQQ"]
# create an empty dataframe to store the values
export_table = pd.DataFrame({})

# start the ticker loop
for ticker in etfs:
    loginUrl = "http://www.etf.com/etfanalytics/etf-fund-flows-tool"
    # loop over the dates without the last one, because we need a period
    # between a 'start' and an 'end' date
    for i in range(0, len(dates) - 1):
        # create the current pair of dates
        start_day, start_month, start_year = convert_date(dates[i + 1])
        end_day, end_month, end_year = convert_date(dates[i])
        date_1 = (
            str(start_year) + str("-") + str(start_month) + str("-") + str(start_day)
        )
        date_2 = str(end_year) + str("-") + str(end_month) + str("-") + str(end_day)
        driver.get(loginUrl)
        # locate the field to input the etf symbol
        ticker_field = driver.find_element_by_id("edit-tickers")
        ticker_field.send_keys(str(ticker))
        # locate the field to input the start date
        start_date_field = driver.find_element_by_id(
            "edit-startdate-datepicker-popup-0"
        )
        start_date_field.send_keys(str(date_1))
        # locate the field to input the end date
        end_date_field = driver.find_element_by_id("edit-enddate-datepicker-popup-0")
        end_date_field.send_keys(str(date_2))
        # submit the form
        end_date_field.submit()
        # read the page source
        html = driver.page_source
        soup = BeautifulSoup(html, "lxml")
        # find the table with the etf flows information
        table = soup.find_all("table", id="topTenTable")
        # some transformations to get html code readable by the pd.read_html() method
        table = table[0].find_all("tbody")
        table = str(table[0]).split("<tbody>")[1]
        table = table.split("</tbody>")[0]
        data = "<table>" + str(table) + "</table>"
        soup = BeautifulSoup(data, "lxml")
        # convert the html code to a pandas dataframe
        df = pd.read_html(str(soup))
        current_table = df[0]
        current_table.columns = ["Ticker", "Fund Name", "Net Flows", "Details"]
        current_table["Start Date"] = [date_1]
        current_table["End Date"] = [date_2]
        # concatenate the current flows table with the main dataframe
        export_table = pd.concat([export_table, current_table], ignore_index=True)
        # let the algorithm rest for a while
        time.sleep(random.uniform(5, 10))

# some magic, and we get the information assigned by the task
print(export_table)
driver.quit()

# There are an enormous number of sites, and each of them has its own design, access to information, protection against robots, etc. That's why this tutorial could grow into a little book. But let's look at at least one more approach to grabbing information. It is connected with parsing dynamic graphs like the ones Google Trends uses. Interestingly, Google's programmers don't let you parse the code of the trend graphs (the div tag which contains the graph code is hidden), but they do let you download a CSV file with the information (so we can use one of the above algorithms to find this button, click it and download the file).
# Let's take another site where we can parse similar graphs: Portfolio Visualizer. Scroll down this page and you'll find a graph like the one in the screenshot. The value of this graph is that historical prices for US Treasury Notes are not freely available - you have to buy them. But here we can grab them either manually (copying the dates and values out by hand) or by writing code which "rewrites" the values for us - and not only from this page...

# In[ ]:

loginUrl = "https://www.portfoliovisualizer.com/backtest-asset-class-allocation?s=y&mode=2&startYear=1972&endYear=2018&initialAmount=10000&annualOperation=0&annualAdjustment=0&inflationAdjusted=true&annualPercentage=0.0&frequency=4&rebalanceType=1&portfolio1=Custom&portfolio2=Custom&portfolio3=Custom&TreasuryNotes1=100"

driver = webdriver.Chrome()
driver.get(loginUrl)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")

# find the div with the chart values and pull out the data table inside it
chart = soup.find_all("div", id="chartDiv2")
table = str(chart[0]).split("<tbody>")[1]
table = table.split("</tbody>")[0]
data = "<table>" + str(table) + "</table>"
soup = BeautifulSoup(data, "lxml")
df = pd.read_html(str(soup))[0]
print(df.head())
print(df.tail())
driver.quit()

# # In conclusion

# It's important to admit that parsing activity can easily be identified as robot activity, and you will be asked to pass an "anti-robot" captcha. On the one hand you can look for solutions that give the right answers to it, but on the other (I think more natural) hand, you can set up your algorithms to behave like a human using the web site. You are lucky when the website has no protection against parsing, but in the case of Google News, after 10 or 20 page loads you'll meet Google's captcha. So try to make your algorithm more humanlike: scroll up and down, click links or buttons, stay on the page for at least 10-15 seconds or more (especially when you need to download several thousand pages), take breaks for an hour and for the night, etc.

# # And good luck!