Recap of Previous Session
Overview of Selenium (Why, What and How)
A Step-by-Step Hello World with Selenium
Example 1: Scraping a JavaScript Driven WebSite (https://arifpucit.github.io/bss2/js/)
Example 2: Scraping Dynamic WebSites (https://arifpucit.github.io/bss2/login/)
find_element() and find_elements() methods
Example 3: Scraping Web Pages that Employ Infinite Scrolling: https://arifpucit.github.io/bss2/scrolling/
Example 4: Scraping Web Pages that Employ Pagination: https://arifpucit.github.io/bss2/pagination/
Example 5: Scraping Web Pages that use Pop-ups: https://arifpucit.github.io/bss2/popup/
Bonus:
import sys
!{sys.executable} -m pip install --upgrade pip -q
!{sys.executable} -m pip install --upgrade selenium -q
import selenium
selenium.__version__ , selenium.__path__
Create an instance of Browser:
- The `Service('path/to/chromedriver')` constructor creates a Service object that is passed to `Chrome()`.
- The `Chrome(service, options)` constructor creates a new instance of the Chrome driver, starts the service, and then opens a new Chrome browser window.
- `ChromeOptions`, introduced in Selenium WebDriver 3.6.0, is used to customize the ChromeDriver session.
- The `Options()` constructor changes the default settings of the Chrome driver. The resulting object is then passed to `Chrome()`.
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.quit()
from selenium.webdriver.chrome.options import Options
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
myoptions.headless = True  # headless mode; newer Selenium releases drop this attribute in favor of myoptions.add_argument('--headless=new')
driver = Chrome(service=s, options=myoptions)
driver.quit()
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
Load a Web page in the browser window:
- The `driver.get('URL')` method loads a web page in the current browser session, after which you can access the browser and its HTML through the driver object.
- This is similar to `resp = requests.get('URL')`, except that there you simply get back a response object rather than a live browser.
driver.get('https://google.com')
Access browser information:
- There is a bunch of information about the browser you can request, including window handles, browser size/position, cookies, and alerts.
print(dir(driver))
driver.title
driver.current_url
driver.current_window_handle
driver.session_id
driver.page_source
Perform Different operations on the browser:
- The `driver.refresh()` method refreshes the page contents.
- The `driver.set_window_position(x, y)` method sets the position of the top-left corner of the browser window.
- The `driver.set_window_size(x, y)` method sets the width and height of the current window.
- The `driver.maximize_window()` method maximizes the window.
- The `driver.minimize_window()` method minimizes the browser to the taskbar.
driver.refresh()
driver.set_window_position(0,0)
driver.maximize_window()
driver.minimize_window()
Create new tab in the browser window and shift between tabs:
- Clicking a link may open a page in a new browser tab.
- You can also create a new browser tab programmatically using `driver.switch_to.new_window('tab')`.
- All subsequent calls to the driver are then directed to the newly opened tab.
- WebDriver supports moving between windows using:
- `driver.switch_to.window("windowname")`
- `driver.switch_to.frame('framename')`
- `driver.switch_to.default_content()`
- All subsequent calls to the driver are then directed to that particular window or frame.
google_tab = driver.current_window_handle
driver.switch_to.new_window('tab')
driver.get('https://www.yahoo.com')
driver.switch_to.window(google_tab)
driver.close()
driver.quit()
Close browser tab or close the entire session:
- The `driver.close()` method simply closes the current browser tab and does not terminate the browser process.
- The `driver.quit()` method closes all browser tabs and shuts down the background driver process.
import requests
from bs4 import BeautifulSoup
import pandas as pd
titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]
resp = requests.get("https://arifpucit.github.io/bss2/js")
soup = BeautifulSoup(resp.text, 'lxml')  # resp.text does NOT contain the HTML for the books data (it is rendered by JavaScript)
sp_titles = soup.find_all('p', class_="book_name")
sp_prices = soup.find_all('p', class_="price green")
sp_availability = soup.find_all('p', class_='stock')
sp_reviews = soup.find_all('p', class_='review')
data = soup.find_all('p', class_="book_name")
sp_links=[]
for val in data:
    sp_links.append(val.find('a').get('href'))
books = soup.find_all('div', class_='book_container')
for book in books:
    stars.append(5 - len(book.find_all('span', class_='not_filled')))
for i in range(len(sp_titles)):
    titles.append(sp_titles[i].text)
    prices.append(sp_prices[i].text)
    availability.append(sp_availability[i].text)
    reviews.append(sp_reviews[i].text)
    links.append(sp_links[i])
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books1.csv', index=False)
df = pd.read_csv('books1.csv')
df
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.get('https://arifpucit.github.io/bss2/js/')
titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]
soup = BeautifulSoup(driver.page_source, 'lxml')  # driver.page_source DOES contain the HTML for the books data
sp_titles = soup.find_all('p', class_="book_name")
sp_prices = soup.find_all('p', class_="price green")
sp_availability = soup.find_all('p', class_='stock')
sp_reviews = soup.find_all('p', class_='review')
data = soup.find_all('p', class_="book_name")
sp_links=[]
for val in data:
    sp_links.append(val.find('a').get('href'))
books = soup.find_all('div', class_='book_container')
for book in books:
    stars.append(5 - len(book.find_all('span', class_='not_filled')))
for i in range(len(sp_titles)):
    titles.append(sp_titles[i].text)
    prices.append(sp_prices[i].text)
    availability.append(sp_availability[i].text)
    reviews.append(sp_reviews[i].text)
    links.append(sp_links[i])
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books1.csv', index=False)
df = pd.read_csv('books1.csv')
df
driver.quit()
Once we have the web page loaded inside our browser, the next task is to locate the web element(s) of interest and then perform actions on them.
The two most commonly used methods for locating elements are:
- The `driver.find_element(By.LOCATOR, "value")` method locates a single element.
- The `driver.find_elements(By.LOCATOR, "value")` method locates multiple elements.
The first argument to these methods is the locator strategy, and the second argument is the value for that locator.
In Selenium, there are eight different locator strategies with which we can locate a web element:
Locating Web Elements: https://selenium-python.readthedocs.io/locating-elements.html
Interacting with Web Elements: https://www.selenium.dev/documentation/webdriver/elements/interactions/
Read about CSS_SELECTOR: https://www.w3schools.com/cssref/css_selectors.asp
Read about XPATH: https://www.guru99.com/xpath-selenium.html, https://www.browserstack.com/guide/find-element-by-xpath-in-selenium
Install Chrome Extension (Selector Hub): https://selectorshub.com/
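For reference, the eight locator strategies map to plain strings under the hood; the values below match `selenium.webdriver.common.by.By` in the Selenium source, while the example selectors in the comments are hypothetical ones for the bss2 pages:

```python
# The eight By locator strategies. Each By constant is a plain string;
# the example selectors in the comments are hypothetical bss2 selectors.
locator_strategies = {
    "ID": "id",                                # e.g. driver.find_element(By.ID, 'name')
    "NAME": "name",                            # e.g. (By.NAME, 'password')
    "XPATH": "xpath",                          # e.g. (By.XPATH, '//*[@id="submit_button"]')
    "LINK_TEXT": "link text",                  # e.g. (By.LINK_TEXT, 'Ask Google for Password')
    "PARTIAL_LINK_TEXT": "partial link text",  # e.g. (By.PARTIAL_LINK_TEXT, 'Ask Google')
    "TAG_NAME": "tag name",                    # e.g. (By.TAG_NAME, 'input')
    "CLASS_NAME": "class name",                # e.g. (By.CLASS_NAME, 'book_name')
    "CSS_SELECTOR": "css selector",            # e.g. (By.CSS_SELECTOR, '#password')
}

print(len(locator_strategies))  # 8
```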
Load the Login version of Book Scraping Site: https://arifpucit.github.io/bss2/login/
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.get('https://arifpucit.github.io/bss2/login/')
driver.maximize_window()
driver.quit()
s = Service('/Users/arif/Documents/chromedriver')
myoptions = Options()
driver = Chrome(service=s, options=myoptions)
driver.get('https://arifpucit.github.io/bss2/login/')
driver.maximize_window()
from selenium.webdriver.common.by import By
tbox = driver.find_element(By.ID, 'name')
type(tbox)
tbox.send_keys("arif")
tbox.clear()
mylink = driver.find_element(By.LINK_TEXT, 'Ask Google for Password')
mylink.click()
driver.back()
tbox2 = driver.find_element(By.CSS_SELECTOR, '#password')
tbox2.send_keys('datascience')
btn = driver.find_element(By.XPATH, '//*[@id="submit_button"]')
btn.click()
for i in range(1, 10):
    price = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[1]').text
    print(price)
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/login')
driver.maximize_window()
driver.find_element(By.ID, "name").send_keys("arif")
driver.find_element(By.ID, "password").send_keys("datascience")
btn = driver.find_element(By.ID, "submit_button")
time.sleep(2)
btn.click()
time.sleep(2)
titles = []
prices = []
availability=[]
reviews=[]
links=[]
for i in range(1, 10):
    title = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/p').text
    titles.append(title)
    price = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH, '/html/body/section/div/div[2]/div[2]/div[' + str(i) + ']/p/a').get_attribute('href')
    links.append(link)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links'])
df.to_csv('books2.csv', index=False)
df = pd.read_csv('books2.csv')
df
driver.quit()
- The above bot scrapes the data of the nine OS books only.
- As an exercise, try extending the above crawler to scrape the books data for SP and CA as well.
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/scroll')
driver.maximize_window()
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
books_count = len(books)
for i in range(1, books_count + 1):
    title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
    titles.append(title)
    price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
    links.append(link)
    star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
    star_rates.append(star_rate)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df
- The `driver.execute_script(JS)` method synchronously executes JavaScript in the current window/frame, e.g. `driver.execute_script('alert("Hello JavaScript")')`.
- The JavaScript `window.scrollTo(x, y)` method performs scrolling; the pixels to scroll horizontally (x-axis) and vertically (y-axis) are passed as parameters.
driver.execute_script('return document.body.scrollHeight')
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By  # needed for the find_elements() call below
import time
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/scroll')
driver.maximize_window()
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
# Count of books in the entire page
books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
len(books)
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
driver.get('https://arifpucit.github.io/bss2/scroll')
driver.maximize_window()
# Scroll to the bottom of the entire page, then start scraping
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
books_count = len(books)
for i in range(1, books_count + 1):
    title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
    titles.append(title)
    price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
    links.append(link)
    star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
    star_rate = round(float(star_rate), 2)
    star_rates.append(star_rate)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/pagination'
driver.get(url)
driver.maximize_window()
books()
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books4.csv', index=False)
df = pd.read_csv('books4.csv')
df
driver.quit()
- One option is to pause the script with the `time.sleep(ARBITRARY_TIME)` method; a better option is the `WebDriverWait()` method.
- With `time.sleep()` you will probably use an arbitrary value. The problem is, you are either waiting too long or not long enough. Also, a website that loads slowly over your local Wi-Fi connection may load ten times faster on a cloud server.
- `WebDriverWait()` is used together with expected conditions such as `presence_of_element_located`, `element_to_be_clickable`, and `text_to_be_present_in_element`.
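The idea behind `WebDriverWait().until()` can be sketched in plain Python as a polling loop. This is a simplified illustration, not Selenium's actual implementation; `wait_until` and `fake_presence_of_element` are hypothetical names, with the latter standing in for an expected condition:

```python
import time

def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Sketch of WebDriverWait(driver, timeout).until(condition): call the
    condition repeatedly until it returns a truthy value, or raise once the
    timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll_frequency)

# Simulate an element that "appears" on the third poll.
calls = {"n": 0}
def fake_presence_of_element():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

print(wait_until(fake_presence_of_element, timeout=5, poll_frequency=0.01))  # element
```

This is why an explicit wait returns as soon as the condition holds, instead of always sleeping for the full arbitrary duration.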
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/pagination'
driver.get(url)
driver.maximize_window()
while True:
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # present only once the last page is reached
            break
        except Exception:
            next_btn.click()  # next button is still enabled, so go to the next page
    except Exception:
        break
print("Successfully reached the last page")
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/pagination'
driver.get(url)
driver.maximize_window()
# Let us call the books() function and click the next button and repeat till it disappears/disabled
while True:
    books()
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # present only once the last page is reached
            break
        except Exception:
            next_btn.click()  # next button is still enabled, so go to the next page
    except Exception:
        break
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books4.csv', index=False)
df = pd.read_csv('books4.csv')
df
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/popup'
driver.get(url)
driver.maximize_window()
# Let us call the books() function and click the next button and repeat till it disappears/disabled
while True:
    books()
    # The explicit wait is essential: on a slow connection the pagination bar may not be loaded yet
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # the disabled class appears only on the last page
            break
        except Exception:
            next_btn.click()  # not on the last page yet, so click the next-page link
    except Exception:
        break
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books5.csv', index=False)
df = pd.read_csv('books5.csv')
df
driver.quit()
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
# new header files
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]
def books():
    time.sleep(2)
    books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
    books_count = len(books)
    for i in range(1, books_count + 1):
        title = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]').text
        titles.append(title)
        price = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[1]').text
        prices.append(price)
        avail = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[2]').text
        availability.append(avail)
        review = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/p[3]').text
        reviews.append(review)
        link = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/p[1]/a').get_attribute('href')
        links.append(link)
        star_rate = driver.find_element(By.XPATH, '//*[@id="container"]/div[' + str(i) + ']/div/div').get_attribute('data-rate-star')
        star_rate = round(float(star_rate), 2)  # round the rating to 2 decimal places
        star_rates.append(star_rate)
s = Service('/Users/arif/Documents/chromedriver')
driver = Chrome(service=s)
url = 'https://arifpucit.github.io/bss2/popup'
driver.get(url)
driver.maximize_window()
# Let us close the pop-up
time.sleep(5)
driver.switch_to.frame(driver.find_element(By.ID,'frame'))
clos_button = driver.find_element(By.XPATH,'//*[@id="staticBackdrop"]/div/div/div[1]/button')
clos_button.click()
driver.switch_to.default_content()
# Let us call the books() function and click the next button and repeat till it disappears/disabled
while True:
    books()
    # The explicit wait is essential: on a slow connection the pagination bar may not be loaded yet
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.page-item')))
    try:
        next_btn = driver.find_element(By.XPATH, '//*[@id="page_8"]')
        driver.execute_script("arguments[0].scrollIntoView();", next_btn)
        try:
            time.sleep(2)
            driver.find_element(By.CSS_SELECTOR, '.page-item.disabled')  # the disabled class appears only on the last page
            break
        except Exception:
            next_btn.click()  # not on the last page yet, so click the next-page link
    except Exception:
        break
# Writing in the file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books5.csv', index=False)
df = pd.read_csv('books5.csv')
df
How to Generate App Passwords in Gmail
smtplib is a built-in Python library for sending emails using the Simple Mail Transfer Protocol (SMTP); there is nothing to install, and it abstracts away all the complexities of SMTP.
MIMEBase is just a base class. As the specification says: "Ordinarily you won't create instances specifically of MIMEBase."
MIMEText is for text (e.g. text/plain or text/html), when the whole message, or a part of it, is in text format.
MIMEMultipart says "I have more than one part" and then lists the parts. You use it when you have attachments, and also to provide alternative versions of the same content (e.g. a plain-text version plus an HTML version).
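The relationship between these classes can be seen with the standard library alone. The sketch below builds the "alternative versions" case described above; the subject and body strings are placeholders:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# "multipart/alternative" tells the mail client that the parts are alternative
# renderings of the same content; it picks the richest one it can display.
msg = MIMEMultipart('alternative')
msg['Subject'] = 'Demo'
msg.attach(MIMEText('plain-text version', 'plain'))
msg.attach(MIMEText('<b>HTML version</b>', 'html'))

print(msg.get_content_type())                             # multipart/alternative
print([p.get_content_type() for p in msg.get_payload()])  # ['text/plain', 'text/html']
```

With attachments you would instead build a plain `MIMEMultipart()` container and attach a base64-encoded `MIMEBase` part, as the code below does.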
-- Use the app password generated in the previous step below (keep it secret; never publish a real app password).
import smtplib, ssl
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email import encoders
sender = 'arifbuttscraper@gmail.com'
passwd = 'xxxxxxxxxxxxx'
receiver = 'xxxxx@gmail.com'
# create MIMEMultipart object, fill its header information and attach the body to it
msg = MIMEMultipart()
msg['From'] = sender
msg['To'] = receiver
msg['Subject'] = 'Books Data'
msg.attach(MIMEText(
'''
AoA,
Please see the attached file containing information about books.
Best
''',
'plain'))
# create MIMEBase object for creating an attachment and attach it with the MIMEMultipart object
part = MIMEBase('application', 'octet-stream')
fd = open('books4.csv', 'rb')
file_contents = fd.read()
part.set_payload(file_contents)
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment; filename="books4.csv"')
msg.attach(part)
# Send message object as email using smptplib by first creating an smtp session
s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)
s.login(user = sender, password = passwd)
s.sendmail(sender, receiver, msg.as_string())
s.quit()
print('Done..!!')
import sys
!{sys.executable} -m pip install schedule -q
import smtplib, ssl
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email import encoders
import os
sender = 'xxxxx@gmail.com'
receiver = 'yyyyy@gmail.com'
passwd = 'xxxxxxxxxxxxx'
msg = MIMEMultipart()
msg['From'] = sender
msg['To'] = receiver
msg['Subject'] = 'Scheduled Email'
msg.attach(MIMEText('This is an email scheduled to be sent at a specific date and time', 'plain'))
part = MIMEBase('application', 'octet-stream')
fd = open('books4.csv', 'rb')
file_contents = fd.read()
part.set_payload(file_contents)
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment; filename="books.csv"')
msg.attach(part)
def mail():
    s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)
    s.login(user=sender, password=passwd)
    s.sendmail(sender, receiver, msg.as_string())
!date
import schedule
import time
schedule.every().day.at("16:27").do(mail)
while True:
    schedule.run_pending()
    time.sleep(1)