#!/usr/bin/env python
# coding: utf-8

# ---
# # Department of Data Science
# ## Course: Tools and Techniques for Data Science
# ## Instructor: Muhammad Arif Butt, Ph.D.
# ---
# # Lecture 5.4 (Web Scraping using Selenium - II)



# ## Learning agenda of this notebook
# **Recap of Previous Session**
#
# - **Best Practices for Web Scraping & Some Points to Ponder**
# - **Example 1:** Searching and downloading images for ML classification: https://google.com
# - **Example 2:** Scraping comments from a YouTube video for NLP: https://www.youtube.com/watch?v=mHONNcZbwDY
# - **Example 3:** Scraping jobs from a job website: https://pk.indeed.com
# - **Example 4:** Scraping tweets of a celebrity: https://twitter.com/login
# - **Example 5:** Scraping news articles from a news website: https://www.thenews.com.pk/today
# - **Exercise:** Scraping houses data from https://zameen.com

# ## Best Practices for Web Scraping & Some Points to Ponder

# ### a. Check if the Website is Changing Layouts and Use Robust Locators
# - Locating the correct web element is a prerequisite of web scraping.
# - We can use ID, Name, Class, Tag, LinkText and PartialLinkText to locate web elements in Selenium.
# - In dynamic environments, web elements often do not have consistent attribute values, so finding a unique static attribute is tricky. Hence, the six Selenium locators mentioned above may fail to uniquely identify a web element.
# - In such situations, CSS Selector and XPath should be preferred (see the sketch after this section).

# #### CSS SELECTOR
# - Basic syntax: `tag[attribute='value']`
# - Using ID: `input[id='username']` or `input#username`
# - Using Class: `input[class='form-control']` or `input.form-control`
# - Using any attribute: `input[any-attr='attr-value']`
# - Combining attributes: `input.form-control[attr='value']`
# - Using parent/child hierarchy:
#     - Basic syntax: `parent-locator > child-locator`
#     - Direct parent/child: `div > input[attr='value']`

# #### XPATH SELECTOR
# - Basic examples:
#     - Absolute XPath: `/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input`
#     - Relative XPath: `//input[@title='Search']`
# - For dynamic websites, a simple XPath may return multiple elements. To write effective XPaths, one can use XPath functions to identify elements uniquely:
#     - Using contains(): `//input[contains(@id, 'userN')]`
#     - Using starts-with(): `//tagname[starts-with(@attribute, 'initial partial value of attribute')]`
#     - Using text(): `//tagname[text()='text of the element']`
# - You can use the AND and OR operators to identify an element by combining two different conditions or attributes:
#     - Using and: `//tagname[@name='value' and @id='value']`
#     - Using or: `//tagname[@name='value' or @id='value']`
# - You can use XPath axes, which use the relationship between various nodes to locate a web element in the DOM structure:
#     - `ancestor`: Locates ancestors of the current node, including the parents up to the root node.
#     - `descendant`: Locates descendants of the current node, including the children down to the leaf nodes.
#     - `child`: Locates the children of the current node.
#     - `parent`: Locates the parent of the current node.
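
# A minimal sketch (not part of the original lecture code) illustrating the CSS Selector and XPath
# locators described above. The page URL, the input/button attributes and the chromedriver path are
# placeholder assumptions; substitute the values of the page you actually want to scrape.
# `find_elements` (plural) is used so the sketch simply reports zero matches instead of raising.

# In[ ]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

s = Service('/Users/arif/Documents/chromedriver')      # path assumed, as in the examples below
driver = Chrome(service=s)
driver.get('https://example.com')                      # placeholder page

# CSS Selector: tag[attribute='value']
box_css = driver.find_elements(By.CSS_SELECTOR, "input[name='q']")

# Relative XPath using contains() and starts-with()
box_xpath = driver.find_elements(By.XPATH, "//input[contains(@name, 'q')]")
btn_xpath = driver.find_elements(By.XPATH, "//button[starts-with(@class, 'btn')]")

print(len(box_css), len(box_xpath), len(btn_xpath))    # how many elements each locator matched
driver.quit()
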
# ### b. Wait for the Web Element to be Displayed Before You Start Scraping
# - Most web apps these days use AJAX techniques. When a page is loaded by the browser, the elements within that page may load at different time intervals. This makes locating elements difficult: if an element is not yet present in the DOM, a locate function will raise an ElementNotVisibleException.
# - One can use `time.sleep(10)` to make the script wait for exactly 10 seconds before proceeding. Such static wait statements should be avoided in favour of the dynamic waits provided by Selenium WebDriver.
# - Many times web elements are not interactable, not clickable, or not visible; that is where you have to put a wait, so that the page gets loaded and your script can find the particular web element and proceed further.
# - **Implicit wait:**
#     - An implicit wait applies to all web elements in the script.
#     - With an implicit wait you specify a timeout; every element lookup keeps retrying until the element appears, or raises an exception when the time expires.
#     - Example: `driver.implicitly_wait(30)` will wait for a maximum of 30 seconds before throwing a timeout exception. If the web element is available before 30 seconds, control moves to the next line of code.
# - **Explicit wait:**
#     - An explicit wait is used to wait for a specific web element.
#     - With an explicit wait, other than specifying the timeout, you also specify a condition to be checked, such as whether the element is visible, clickable, and so on.
#     - Example: `element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, 'xpath')))` will wait for a maximum of 30 seconds before throwing a timeout exception. If the specific web element becomes available within 30 seconds, control moves to the next line of code.
# - **Fluent wait** is quite similar to an explicit wait, except that you can also specify the polling frequency. See the Selenium documentation for details: https://www.selenium.dev/documentation/webdriver/waits/

# ### c. Robots Exclusion Protocol
# - The robots exclusion protocol, or simply `robots.txt`, is a standard used by websites to communicate with web crawlers/bots, informing them about which areas of the website may be scanned or scraped.
# - The `robots.txt` file is usually placed in a website's top-level directory and is publicly available. A sub-domain on a root domain can also have its own `robots.txt` file.
# - The `robots.txt` file provides instructions for bots; however, it cannot actually enforce them.
# - A good bot follows those instructions, while a bad bot ignores them.

# ### d. Do not Hammer the Web Server
# - Web scraping bots fetch data very fast, so it is easy for a website to detect your scraper.
# - To make sure your bot does not hammer the web server by sending too many requests in a very short span of time, put some random programmatic sleep calls, e.g. `time.sleep(2)`, in between requests (a small sketch combining this with a `robots.txt` check appears after point f below).

# ### e. Avoid Scraping Data Behind a Login
# - If a page is protected by a login, the scraper has to send some information or cookies along with each request to view the page.
# - So be watchful: if you get caught, your account might get blocked.

# ### f. Do not Follow the Same Crawling Pattern
# - When humans browse a website, they have different view times, they are slow, and they perform random clicks. Bots, on the contrary, are very fast and follow the same fixed browsing pattern.
# - Some websites have intelligent anti-crawling mechanisms to detect spiders and may block your IP, after which you can no longer visit that website.
# - A simple solution is to incorporate some random clicks, mouse movements and random delays that make your bot look more like a human.
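
# A minimal sketch (not from the original lecture code) tying together points c and d above: it checks
# the site's robots.txt with the standard-library robotparser before fetching, and sleeps for a random
# interval between requests. The URLs and the wildcard user-agent are placeholder assumptions.

# In[ ]:

import random
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://pk.indeed.com/robots.txt')    # robots.txt lives in the site's top-level directory
rp.read()

urls = ['https://pk.indeed.com/jobs?q=data+scientist',
        'https://pk.indeed.com/jobs?q=web+developer']

for url in urls:
    if not rp.can_fetch('*', url):                # '*' = rules that apply to any user-agent
        print('robots.txt disallows', url)
        continue
    # ... fetch/scrape the page here with Selenium or requests ...
    print('allowed to fetch', url)
    time.sleep(random.uniform(2, 6))              # random delay so we do not hammer the server
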
# ### g. Beware of Honeypots
# - Honeypots are systems used to lure hackers and detect any scraping attempts that try to gain information.
# - Some websites install honeypots: links that are invisible to normal users because their colour is disguised to blend in with the page's background, but that can still be seen (and followed) by bots, which is one of the ways scrapers get caught.
# - So make sure your bot only follows links that are properly visible and that do not carry a `nofollow` tag.

# ### h. Rotate User-Agents
# - Every request made from a web browser contains a User-Agent header, and if the user agent is not set, many websites won't let you view their content.
# - Using the same user agent consistently leads to the detection of a bot.
# - To make your User-Agent appear more real and bypass detection, fake (and rotate) the user-agent string.
# > You can see your own User-Agent by typing `what is my user agent` in Google's search bar.

# ### i. Make Requests through Proxies and Rotate Them as Needed
# - When scraping carelessly, multiple requests coming from the same IP will get you blocked.
# - It is better to scrape from behind a proxy server, so the target website does not know where the original IP is from, which makes detection harder.
# - There are several options for changing your outgoing IP:
#     - TOR
#     - VPNs
#     - Free Proxies
#     - Shared Proxies
#     - Private Proxies
#     - Data Center Proxies
#     - Residential Proxies
# > You can see your IP by typing `what is my ip` in Google's search bar.
# > A small sketch showing how to set a custom user agent and a proxy on the Chrome driver follows point j below.

# ### j. Use CAPTCHA Solving Services
# - Many websites use CAPTCHAs to keep bots out.
# - If you want to scrape websites that use CAPTCHAs, you can use CAPTCHA solving services to get past these restrictions:
#     - https://2captcha.com/
#     - https://anti-captcha.com/
#     - https://pypi.org/project/pytesseract/0.1/
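
# A minimal sketch (not from the original lecture code) of points h and i above: passing a custom
# user-agent string and a proxy server to Chrome through command-line options. The user-agent string,
# the proxy address and the chromedriver path are placeholder assumptions.

# In[ ]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36')

myoptions = Options()
myoptions.add_argument(f'--user-agent={ua}')                       # spoof the User-Agent header
myoptions.add_argument('--proxy-server=http://12.34.56.78:3128')   # placeholder proxy address

s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s, options=myoptions)
driver.get('https://www.whatismybrowser.com/detect/what-is-my-user-agent/')  # verify the spoofed UA
driver.quit()
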
# ## Example 1: Searching and Downloading Images for ML Classification: https://google.com

# ### a. Search and Load the Images of Cats

# In[3]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Create an instance of webdriver and load the Google webpage
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
myoptions.headless = False  # default setting
driver = Chrome(service=s, options=myoptions)
driver.maximize_window()
driver.get('https://google.com')

# Locate the search textbox, enter the search string and press the Enter key
driver.implicitly_wait(30)
#tbox = driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')
#tbox = driver.find_element(By.XPATH, "//input[@title='Search']")
tbox = driver.find_element(By.CSS_SELECTOR, "input[title='Search']")
tbox.send_keys("Cat")

# Instead of locating and clicking the search button, you can simply press Enter
time.sleep(2)
tbox.send_keys(Keys.ENTER)

# Locate the Images tab and click it to visit the images page
driver.implicitly_wait(30)
#menu_img_link = driver.find_element(By.XPATH, '/html/body/div[7]/div/div[4]/div/div[1]/div/div[1]/div/div[2]/a')
menu_img_link = driver.find_element(By.XPATH, '//*[@id="hdtb-msb"]/div[1]/div/div[2]/a')
menu_img_link.click()


# ### b. Self-Scroll to the Bottom of the Webpage
# - Create an instance of WebDriver (already done above).
# - The `driver.execute_script(JS)` method is used to synchronously execute JavaScript in the current window/frame, e.g.
# ```
# driver.execute_script('alert("Hello JavaScript")')
# ```
# - The `window.scrollTo()` method is used to perform a scrolling operation. The number of pixels to scroll horizontally along the x-axis and vertically along the y-axis are passed as parameters to the method.

# In[4]:

# Self-scroll the entire page until you reach the bottom
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(4)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
print("Done... Reached the bottom of the page")


# ### c. Save the Images by using the `screenshot()` method
# - Two ways to take a screenshot:
#     - `driver.save_screenshot(filename)` saves a screenshot of the current window to a PNG image file and returns a bool value (a self-contained sketch of this method appears after cell In[8] below).
#     - `element.screenshot(filename)` saves a screenshot of the current element to a PNG image file and returns a bool value.

# In[5]:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Download and save 40 images of cats
for i in range(1, 41):
    try:
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, '//*[@id="islrg"]/div[1]/div[' + str(i) + ']/a[1]/div[1]/img')))
        cat_img = driver.find_element(By.XPATH, '//*[@id="islrg"]/div[1]/div[' + str(i) + ']/a[1]/div[1]/img')
        cat_img.screenshot('/Users/arif/Downloads/cat_images/cat_img' + str(i) + '.png')
    except:
        continue
print("Done... Check the folder for images of cats")


# In[6]:

driver.quit()


# In[7]:

get_ipython().system('ls /Users/arif/Downloads/cat_images')


# In[8]:

from matplotlib import image
from matplotlib import pyplot as plt

img1 = image.imread("/Users/arif/Downloads/cat_images/cat_img20.png")
plt.imshow(img1);
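
# The cells above use `element.screenshot()`; section c also mentions `driver.save_screenshot()`,
# which captures the whole visible browser window. A small, self-contained sketch of that method
# (not part of the original lecture code; the URL and the output path are placeholder assumptions):

# In[ ]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service

s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://google.com')
ok = driver.save_screenshot('/Users/arif/Downloads/full_window.png')  # True if the PNG was written
print(ok)
driver.quit()
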
# ## Example 2: Scraping Comments from a YouTube Video
# - https://www.youtube.com/watch?v=mHONNcZbwDY&t=80s

# In[10]:

driver.quit()


# In[11]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Create an instance of webdriver and load/run the YouTube video page
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
myoptions.headless = False  # default setting
driver = Chrome(service=s, options=myoptions)
driver.maximize_window()
time.sleep(1)
driver.get('https://www.youtube.com/watch?v=mHONNcZbwDY&t=80s')
driver.implicitly_wait(30)
play = driver.find_element(By.XPATH, '//*[@id="movie_player"]/div[5]/button')
play.click()

# Perform three scrolls to get around 60 comments
for scroll in range(1, 4):
    body = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
    body.send_keys(Keys.END)
    time.sleep(12)

# Scrape the comments
comments = []
comments_list = driver.find_elements(By.CSS_SELECTOR, "#content-text")
for comment in comments_list:
    text = comment.text.strip()
    comments.append(text)

# Scrape the authors who made the comments
authors = []
authors_list = driver.find_elements(By.ID, "author-text")
for author in authors_list:
    text = author.text.strip()
    authors.append(text)

# Save the comments in a csv file
data = {'Authors': authors, 'Comments': comments}
df = pd.DataFrame(data, columns=['Authors', 'Comments'])
df.to_csv('hello.csv', index=False)


# In[12]:

driver.quit()


# In[2]:

import pandas as pd
df = pd.read_csv('hello.csv')
pd.set_option('max_colwidth', 150)
df


# ## Example 3: Scraping Jobs
# - https://pk.indeed.com

# In[13]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Create an instance of webdriver and go to the appropriate job page
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
myoptions.headless = False
driver = Chrome(service=s, options=myoptions)
driver.maximize_window()
driver.get('https://pk.indeed.com')
time.sleep(5)

# Enter your search parameters and click the Find Jobs button
what_box = driver.find_element(By.XPATH, '//*[@id="text-input-what"]')
where_box = driver.find_element(By.XPATH, '//*[@id="text-input-where"]')
button = driver.find_element(By.XPATH, '//*[@id="jobsearch"]/button')
what_box.send_keys('Full Stack Web Developer')
where_box.send_keys('Lahore')
button.click()
time.sleep(5)

# Function that scrapes the three pieces of information of each job; it is called on every results page
jobtitles = []
companies = []
salaries = []

def jobs():
    time.sleep(2)
    postings = driver.find_elements(By.CSS_SELECTOR, '.resultContent')
    for posting in postings:
        try:
            job_title = posting.find_element(By.CSS_SELECTOR, 'h2 a').text
        except:
            job_title = 'No Job title'
        try:
            company = posting.find_element(By.CSS_SELECTOR, '.companyName').text
        except:
            company = "No company name"
        try:
            salary = posting.find_element(By.CSS_SELECTOR, '.salary-snippet-container').text
        except:
            salary = "No Salary"
        companies.append(company)
        salaries.append(salary)
        jobtitles.append(job_title)

# Scrape each page, then click the next page button in the pagination bar
while True:
    time.sleep(4)
    try:
        pop_up = driver.find_element(By.CSS_SELECTOR, '.popover-x-button-close.icl-CloseButton')
        pop_up.click()
    except:
        pass
    jobs()
    try:
        pagination = driver.find_element(By.CLASS_NAME, 'pagination-list')
        driver.execute_script("arguments[0].scrollIntoView();", pagination)
        try:
            driver.find_element(By.XPATH, '//*[@aria-label="Next"]').click()
        except:
            break
    except:
        break

# Write the scraped data to a csv file
data = {'Company': companies, 'Job Title': jobtitles, 'Salary': salaries}
df = pd.DataFrame(data, columns=['Company', 'Job Title', 'Salary'])
df.to_csv('jobs.csv', index=False)
df = pd.read_csv('jobs.csv')
df


# In[14]:

driver.quit()


# ## Example 4: Scraping Tweets: https://twitter.com/login

# In[15]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
import os

# Create an instance of webdriver and get the Twitter login page
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.maximize_window()
driver.get('https://twitter.com/login')
driver.implicitly_wait(30)

# Enter username and password
time.sleep(5)
username = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//input[@name="text"]')))
username.send_keys('username')   # replace with your Twitter username
username.send_keys(Keys.ENTER)
time.sleep(2)
passwd = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, '//input[@name="password"]')))
passwd.send_keys(os.environ['yourtwitterpassword'])  # actual password is saved in an environment variable :)
passwd.send_keys(Keys.ENTER)

# Enter the celebrity name (Imran Khan) in the search textbox
time.sleep(2)
search_input = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//input[@aria-label="Search query"]')))
search_input.send_keys("Imran Khan")
time.sleep(2)
search_input.send_keys(Keys.ENTER)

# Click on the People tab for people profiles, using the LINK_TEXT locator
time.sleep(2)
people = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.LINK_TEXT, 'People')))
people.click()

# Click on the Twitter profile link of Imran Khan
time.sleep(2)
click_imran = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.LINK_TEXT, 'Imran Khan')))
click_imran.click()


# In[16]:

# Scrape the username
user_name = driver.find_element(By.XPATH, '((//*[@data-testid="UserName"])//span)[last()]').text

# Scrape 25 tweets with their dates
articles = []
tweets = []
times = []
while True:
    time.sleep(1)
    article = driver.find_elements(By.TAG_NAME, 'article')
    for a in article:
        if a not in articles:
            tweet = a.find_element(By.XPATH, './/*[@data-testid="tweetText"]')
            articles.append(a)
            t = a.find_element(By.XPATH, './/time')
            times.append(t.text)
            tweets.append(tweet.text)
    if len(tweets) >= 25:
        break
    driver.execute_script("window.scrollBy(0,500);")

# Write the scraped data to a csv file
data = {'User': user_name, 'Times': times, 'Tweets': tweets}
df = pd.DataFrame(data, columns=['User', 'Times', 'Tweets'])
df.to_csv('tweets.csv', index=False)
df = pd.read_csv('tweets.csv')
df


# In[17]:

driver.quit()


# ## Example 5: Scraping News: https://www.thenews.com.pk/today

# In[18]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Create an instance of webdriver and load the newspaper
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.maximize_window()
driver.get('https://www.thenews.com.pk/today')
time.sleep(2)


# In[19]:

# Create a list of all the URLs of interest
urls = []
try:
    s = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "Imran")))
    search_urls = driver.find_elements(By.PARTIAL_LINK_TEXT, "Imran")
    for i in search_urls:
        urls.append(i.get_attribute("href"))
except:
    print("I did not find it")

for url in urls:
    print(url)


# In[20]:

original_window = driver.current_window_handle
news_articles = []
authors = []
headings = []
for url in urls:
    driver.switch_to.new_window('tab')
    driver.get(url)
    try:
        heading = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".detail-heading h1")))
        author = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".category-source")))
        article = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".story-detail")))
        headings.append(heading.text)
        news_articles.append(article.text)
        authors.append(author.text)
    except:
        pass
    driver.switch_to.window(original_window)

for heading in headings:
    print(heading)


# In[21]:

data = {'Headings': headings, 'Authors': authors, 'News Articles': news_articles}
df = pd.DataFrame(data, columns=['Headings', 'Authors', 'News Articles'])
df.to_csv('news.csv', index=False)
df = pd.read_csv('news.csv')
df


# In[22]:

df['News Articles'][0]


# ## Practice Problem: Scraping Houses Data: https://zameen.com
# - For machine learning tasks, we need the following fields for a hundred thousand **houses** in Lahore and, within the city, for different locations/societies:
#     - City
#     - Location/Address
#     - Covered Area
#     - Number of Bedrooms
#     - Number of Bathrooms
#     - Price
# - A starter skeleton (with placeholder locators) is sketched below.
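
# A possible starting skeleton for the practice problem (not a solution): it follows the same
# driver-setup pattern used in the examples above and only prepares the page and the output
# DataFrame. The listing-card selector and field locators are deliberate placeholders, not real
# zameen.com selectors; discover the actual ones with the browser's developer tools.

# In[ ]:

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
import time

s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.maximize_window()
driver.get('https://zameen.com')        # then navigate/search to the Lahore houses listing
time.sleep(5)

rows = []
listings = driver.find_elements(By.CSS_SELECTOR, 'placeholder-listing-card')   # fill in the real selector
for card in listings:
    rows.append({
        'City': 'Lahore',
        'Location/Address': None,       # e.g. card.find_element(By.CSS_SELECTOR, '...').text
        'Covered Area': None,
        'Number of Bedrooms': None,
        'Number of Bathrooms': None,
        'Price': None,
    })
# TODO: add pagination / scrolling and random delays to reach more listings

df = pd.DataFrame(rows, columns=['City', 'Location/Address', 'Covered Area',
                                 'Number of Bedrooms', 'Number of Bathrooms', 'Price'])
df.to_csv('houses.csv', index=False)
df

driver.quit()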