Overview of BeautifulSoup
Playing with BeautifulSoup
requests Library
BeautifulSoup Library
Soup Object
soup.find() Method
soup.find_all() Method
Example 1: Scraping Information from a Single Web Page https://arifpucit.github.io/bss2/
Example 1 (cont): Scraping Information from Multiple Web Pages https://arifpucit.github.io/bss2/
Example 2: Scraping Information from Multiple Web Pages (Pagination) http://www.arifbutt.me/category/sp-with-linux/
Limitations of BeautifulSoup
Some Coding Exercises
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
BeautifulSoup cannot fetch HTML content from a web site by itself. To pull the HTML we will use the requests library, and then pass the HTML to the BeautifulSoup constructor.
The three main features of BeautifulSoup are:
1. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree.
2. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
3. It sits on top of popular Python parsers like lxml and html5lib, letting you try out different parsing strategies or trade speed for flexibility.
Different parsers may create different parse trees and could return different results depending on the HTML you are trying to parse. If you are parsing perfectly formed HTML, the different parsers will give almost the same output; but if there are mistakes in the HTML, each parser will try to fill in the missing information differently, as the sketch below illustrates.
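A minimal sketch of this (assuming the lxml and html5lib parsers are installed, which the next cell takes care of): feed the same malformed snippet to all three parsers and compare the trees they build.
from bs4 import BeautifulSoup
broken = "<a></p>"  # malformed: a </p> that was never opened, inside an unclosed <a>
for parser in ('lxml', 'html5lib', 'html.parser'):
    print(parser, '->', BeautifulSoup(broken, parser))
Each parser repairs the snippet differently, which is exactly the divergence described above.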
import sys
!{sys.executable} -m pip install --upgrade pip -q
!{sys.executable} -m pip install requests -q
!{sys.executable} -m pip install beautifulsoup4 -q
!{sys.executable} -m pip install --upgrade lxml -q
!{sys.executable} -m pip install html5lib -q
import requests
import bs4 # the importable module is bs4; the separate PyPI package named 'bs4' is a dummy maintained by the Beautiful Soup developer to prevent name squatting
from bs4 import BeautifulSoup
import lxml
import html5lib
requests.__version__, bs4.__version__ , lxml.__version__
requests Library
import requests
print(dir(requests))
resp = requests.get("https://arifpucit.github.io/bss2")
resp.status_code
print(dir(resp))
resp.url
resp.headers
resp.content
print(resp.text)
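A status code of 200 means the request succeeded. As a defensive habit (my addition, not part of the original cells), you can have requests raise an exception on HTTP errors and give up after a timeout:
resp = requests.get("https://arifpucit.github.io/bss2", timeout=10)
resp.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
html = resp.text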
BeautifulSoup Library
The BeautifulSoup() constructor is used to create a BeautifulSoup object:
BeautifulSoup(markup, "lxml")
The first argument to the BeautifulSoup constructor is a string or an open filehandle containing the markup you want to be parsed.
The second argument is how you'd like the markup parsed. If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
The constructor returns a BeautifulSoup object, which represents the parsed document and knows how to navigate through the DOM.
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.text, 'lxml')
print(type(soup))
print(dir(soup))
print(soup)
print(soup.prettify())
Soup Object
Tag Objects
soup.header
soup.p
Name Objects
soup.header.name
soup.img.name
print(type(soup.header.name))
print(type(soup.img.name))
Attribute Objects
soup.p.attrs
soup.img.attrs
NavigableString Objects
soup.title.string
You can Navigate the Entire Tree of Soup Object
soup.body.a
soup.body.a.parent
soup.body.a.parent.parent
soup.body.ul
soup.body.ul.children
for tag in soup.body.ul.children:
    print(tag)
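Note that .children yields only the direct children of a tag. To walk the entire subtree instead, bs4 also offers .descendants (a small extension of the cell above):
for item in soup.body.ul.descendants:
    print(item)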
soup.find() Method
The soup.find() method returns the first tag that matches the search criteria:
soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)
name is the tag name to search for.
attrs={} is a dictionary of filters on attribute values.
recursive=True: if True, find() performs a recursive search of this PageElement's children; otherwise, only the direct children are considered.
text=None is used if you want to search for a text string rather than a tag.
The find() method can be called on the entire soup object, or you can call find() on a specific tag within a soup object.
soup.find('div', {'class':'navbar'})
soup.find('div', class_='navbar')
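Both calls above are equivalent. Keep in mind that find() returns None when nothing matches, so chained attribute access can fail; a defensive sketch (not in the original cell):
nav = soup.find('div', class_='navbar')
if nav is None:
    print('no navbar on this page')  # avoids AttributeError on nav.text etc.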
soup.find_all() Method
The soup.find_all() method returns a list of all the tags or strings that match the given criteria:
soup.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
name is the name of the tag to return.
attrs={} is a dictionary of filters on attribute values.
recursive=True: if True, find_all() performs a recursive search of all the descendants; otherwise, only the direct children are considered.
string=None is used if you want to search for a text string rather than a tag name.
limit is the number of elements to return; it defaults to all matches. find() is equivalent to calling find_all() with limit=1.
Note: A class attribute containing a space-separated string means multiple classes, while an id attribute containing a space-separated string means a single id whose name has spaces in it.
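When matching a multi-class attribute, class_='price green' (used in the next cell) matches the exact string in that order. An alternative worth knowing (not used in this notebook) is a CSS selector via select(), where p.price.green requires both classes regardless of their order in the attribute:
soup.select('p.price.green')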
prices = soup.find_all('p', class_='price green')
prices
# Since soup.find_all() returns a list of tags, we can iterate over it
for price in prices:
    print(price.text)
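.text concatenates every string inside the tag, surrounding whitespace included. If the output looks ragged, get_text(strip=True) trims it (an optional refinement):
for price in prices:
    print(price.get_text(strip=True))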
Example 1: Scraping Information from a Single Web Page
import requests
from bs4 import BeautifulSoup
import lxml
resp = requests.get("https://arifpucit.github.io/bss2")
soup = BeautifulSoup(resp.text, 'lxml')
sp_titles = soup.find_all('p', class_="book_name")
sp_titles
titles = []
for title in sp_titles:
    titles.append(title.text)
print(titles)
sp_titles = soup.find_all('p', class_="book_name")
sp_titles
for item in sp_titles:
    print(item.find('a'))
for item in sp_titles:
    print(item.find('a').get('href')) # print(item.find('a')['href'])
links=[]
for item in sp_titles:
    links.append(item.find('a').get('href'))
links
sp_prices = soup.find_all('p', class_="price green")
sp_prices
prices = []
for price in sp_prices:
    prices.append(price.text)
print(prices)
sp_availability = soup.find_all('p', class_='stock')
sp_availability
availability=[]
for aval in sp_availability:
    availability.append(aval.text)
print(availability)
sp_reviews = soup.find_all('p', class_='review')
sp_reviews
reviews = []
for review in sp_reviews:
    reviews.append(review.get('data-rating'))  # get() is passed an attribute name and returns its value
print(reviews)
book = soup.find('div', class_ = 'book_container')
print(book.prettify())
book.find_all('span', class_ = 'not_filled')
len(book.find_all('span', class_ = 'not_filled'))
5 - len(book.find_all('span', class_='not_filled'))
stars = list()
books = soup.find_all('div', class_='book_container')
for book in books:
    stars.append(5 - len(book.find_all('span', class_='not_filled')))
print(stars)
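The same computation reads well as a single list comprehension (an equivalent alternative, not in the original cell):
stars = [5 - len(b.find_all('span', class_='not_filled'))
         for b in soup.find_all('div', class_='book_container')]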
for i in range(len(titles)):
    print("", titles[i])
    print(" Link: ", links[i])
    print(" Price: ", prices[i])
    print(" Stock: ", availability[i])
    print(" Reviews: ", reviews[i])
    print(" Stars: ", stars[i])
import pandas as pd
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books1.csv', index=False)
df = pd.read_csv('books1.csv')
df
import csv
help(csv)
import csv
import pandas as pd
fd = open('books2.csv', 'wt', newline='')  # newline='' avoids blank lines in the CSV on Windows
csv_writer = csv.writer(fd)
csv_writer.writerow(['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
for i in range(len(titles)):
    csv_writer.writerow([titles[i], prices[i], availability[i], reviews[i], links[i], stars[i]])
fd.close()
df = pd.read_csv('books2.csv')
df
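An equivalent pattern (a refactoring suggestion, not from the original notebook) uses a with-block so the file is closed even if a write fails, and zip() to avoid manual indexing:
import csv
with open('books2.csv', 'w', newline='') as fd:
    writer = csv.writer(fd)
    writer.writerow(['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
    writer.writerows(zip(titles, prices, availability, reviews, links, stars))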
import requests
from bs4 import BeautifulSoup
import pandas as pd
titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]
def books(soup):
    sp_titles = soup.find_all('p', class_='book_name')
    sp_prices = soup.find_all('p', class_='price green')
    sp_availability = soup.find_all('p', class_='stock')
    sp_reviews = soup.find_all('p', class_='review')
    sp_links = []
    for val in sp_titles:
        sp_links.append(val.find('a').get('href'))
    for book in soup.find_all('div', class_='book_container'):
        stars.append(5 - len(book.find_all('span', class_='not_filled')))
    for i in range(len(sp_titles)):
        titles.append(sp_titles[i].text)
        prices.append(sp_prices[i].text)
        availability.append(sp_availability[i].text)
        reviews.append(sp_reviews[i].text)
        links.append(sp_links[i])
resp = requests.get("https://arifpucit.github.io/bss2")
soup = BeautifulSoup(resp.text, 'lxml')
books(soup)
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df
Example 1 (cont): Scraping Information from Multiple Web Pages
import requests
from bs4 import BeautifulSoup
import pandas as pd
titles = []
prices = []
availability=[]
reviews=[]
links=[]
stars=[]
def books(soup):
    sp_titles = soup.find_all('p', class_='book_name')
    sp_prices = soup.find_all('p', class_='price green')
    sp_availability = soup.find_all('p', class_='stock')
    sp_reviews = soup.find_all('p', class_='review')
    # for links
    sp_links = []
    for val in sp_titles:
        sp_links.append(val.find('a').get('href'))
    for book in soup.find_all('div', class_='book_container'):
        stars.append(5 - len(book.find_all('span', class_='not_filled')))
    for i in range(len(sp_titles)):
        titles.append(sp_titles[i].text)
        prices.append(sp_prices[i].text)
        availability.append(sp_availability[i].text)
        reviews.append(sp_reviews[i].text)
        links.append(sp_links[i])
urls = ['https://arifpucit.github.io/bss2/index.html',
'https://arifpucit.github.io/bss2/SP.html',
'https://arifpucit.github.io/bss2/CA.html']
for url in urls:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')
    books(soup)
# Creating a dataframe and saving data in a csv file
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability,
'Reviews':reviews, 'Links':links, 'Stars':stars}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'Stars'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df
Example 2: Scraping Information from Multiple Web Pages (Pagination)
url = 'http://www.arifbutt.me/category/sp-with-linux/'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
articles = soup.find_all('div', class_='media-body')
articles
article = soup.find('div', class_='media-body')
article
article.find('h4', class_='media-heading1').text
article.find('p', align="justify").text
article.find('iframe').get('src')
article.find('iframe').get('src').split('/')
article.find('iframe').get('src').split('/')[4]
article.find('iframe').get('src').split('/')[4].split('?')
video_id = article.find('iframe').get('src').split('/')[4].split('?')[0]
video_id
f'https://youtube.com/watch?v={video_id}'
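String splitting works, but it depends on the exact shape of the embed URL. A slightly more robust variant (an alternative sketch, not from the original cell) parses the URL first, so the query string is discarded for free:
from urllib.parse import urlparse
src = article.find('iframe').get('src')
video_id = urlparse(src).path.split('/')[-1]  # last path segment; urlparse keeps the ?query out of .path
f'https://youtube.com/watch?v={video_id}'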
import requests
from bs4 import BeautifulSoup
import pandas as pd
def videos(soup):
    articles = soup.find_all('div', class_='media-body')
    for article in articles:
        title = article.find('h4', class_='media-heading1').text
        titles.append(title)
        descr = article.find('p', align='justify').text
        descriptions.append(descr)
        video_id = article.find('iframe')['src'].split('/')[4].split('?')[0]
        youtube_link = f'https://youtube.com/watch?v={video_id}'
        links.append(youtube_link)
titles = []
descriptions = []
links=[]
first_page = requests.get("http://www.arifbutt.me/category/sp-with-linux/")
soup = BeautifulSoup(first_page.text,'lxml')
videos(soup)
titles
pagination_code = soup.find('div', class_='navigation_pegination')  # note: 'pegination' is the spelling used in the site's own CSS class
pagination_code
pagination_code = soup.find('div', class_='navigation_pegination')
all_links = pagination_code.find_all('li')
last_link = all_links[-1]  # the "Next Page" link is the last <li> in the pagination bar
next_url = last_link.find('a').get('href')
resp = requests.get(next_url)
soup = BeautifulSoup(resp.text, 'lxml')
videos(soup)
print(next_url)
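Fetching page after page in a tight loop can hammer a small server. A simple courtesy (my addition, not in the original notebook; the helper name polite_get is made up) is to pause briefly between requests:
import time
import requests

def polite_get(url, delay=1.0):
    # sleep before each request so consecutive page fetches are spaced out
    time.sleep(delay)
    return requests.get(url, timeout=10)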
import requests
from bs4 import BeautifulSoup
import pandas as pd
titles = []
descriptions = []
links=[]
first_page = requests.get("http://www.arifbutt.me/category/sp-with-linux/")
soup = BeautifulSoup(first_page.text,'lxml')
videos(soup)
while True:
    pagination_code = soup.find('div', class_='navigation_pegination')
    all_links = pagination_code.find_all('li')
    last_link = all_links[-1]
    if last_link.find('a').text == "Next Page »":
        next_url = last_link.find('a').get('href')
        resp = requests.get(next_url)
        soup = BeautifulSoup(resp.text, 'lxml')
        videos(soup)
    else:
        break
# Creating a dataframe and saving data in a csv file
data = {'Title':titles, 'YouTube Link':links, 'Description':descriptions}
df = pd.DataFrame(data, columns=['Title', 'YouTube Link', 'Description'])
df.to_csv('spvideos.csv', index=False)
df = pd.read_csv('spvideos.csv')
df
 | Title | YouTube Link | Description |
---|---|---|---|
0 | Lec01 Introduction to System Programming (Arif... | https://youtube.com/watch?v=qThI-U34KYs | This is the first session on the subject of Sy... |
1 | Lec02 C Compilation: A System Programmer Persp... | https://youtube.com/watch?v=a7GhFL0Gh6Y | This session starts with the C-Compilation pro... |
2 | Lec03 Working of Linkers: Creating your own Li... | https://youtube.com/watch?v=A67t7X2LUsA | Linking and loading a process (Behind the curt... |
3 | Lec04 UNIX make utility (Arif Butt @ PUCIT) | https://youtube.com/watch?v=8hG0MTyyxMI | This session deals with the famous UNIX make u... |
4 | Lec05 GNU autotools and cmake (Arif Butt @ PUCIT) | https://youtube.com/watch?v=Ncb_xzjGAwM | This session starts with a brief comparison be... |
5 | Lec06 Versioning Systems git-I (Arif Butt @ PU... | https://youtube.com/watch?v=TBqLJg6PmWQ | This session gives an overview of different mo... |
6 | Lec07 Versioning Systems git-II (Arif Butt @ P... | https://youtube.com/watch?v=3akXFcBDYc0 | This is a continuity of previous session and s... |
7 | Lec08 Exit Handlers and Resource Limits (Arif ... | https://youtube.com/watch?v=ujzom1OyPMY | This session describes as to how a C program s... |
8 | Lec09 Stack Behind the Curtain (Arif Butt @ PU... | https://youtube.com/watch?v=1XbTmmWxHzo | This session describes how a process is laid o... |
9 | Lec10 Heap Behind the Curtain (Arif Butt @ PUCIT) | https://youtube.com/watch?v=zpcPS27ZQr0 | This session start with a discussion on types ... |
10 | Lec11 Design and Code of UNIX more utility (Ar... | https://youtube.com/watch?v=epefPagPgvk | This session deals with the design and develop... |
11 | Lec12 UNIX File System Architecture (Arif Butt... | https://youtube.com/watch?v=x_bu6De71KY | In this session will start with a quick recap ... |
12 | Lec13 UNIX File Management (Arif Butt @ PUCIT) | https://youtube.com/watch?v=DZQkyoXgkMs | This session will deal with various file relat... |
13 | Lec14 Design and Code of UNIX ls Utility (Arif... | https://youtube.com/watch?v=24WNjxn4asY | This session deals with the designing the ls p... |
14 | Lec15 Design and Code Of UNIX who Utility (Ari... | https://youtube.com/watch?v=96EcaPZo90U | This session deals with the different categori... |
15 | Lec16 Programming Terminal Devices (Arif Butt ... | https://youtube.com/watch?v=t5sC6G73oo4 | This session starts with character and block s... |
16 | Lec17 Process Management-I (Arif Butt @ PUCIT) | https://youtube.com/watch?v=R_01xGLp0ZQ | This session starts with a quick recap of proc... |
17 | Lec18 Process Management-II (Arif Butt @ PUCIT) | https://youtube.com/watch?v=91qzstPN1p8 | This session starts with a comparison between ... |
18 | Lec19 Process Management-III (Arif Butt @ PUCIT) | https://youtube.com/watch?v=QWfeh1bFvs0 | This is a continuation of previous two session... |
19 | Lec20 Design and Code Of Daemon Processes (Ari... | https://youtube.com/watch?v=p0ccoTM7v8I | This session gives an overview of daemon proce... |
20 | Lec21 Process Scheduling Algorithms (Arif Butt... | https://youtube.com/watch?v=Y86pa2nrT_k | This session gives an overview of process sch... |
21 | Lec22 Design And Code Of UNIX Shell Utility (A... | https://youtube.com/watch?v=F7oAWvh5J_o | This session gives an overview of working of U... |
22 | Lec23 Multi Threaded Programming (Arif Butt @ ... | https://youtube.com/watch?v=OgnLaXwLC8Y | This session gives an overview of concurrent p... |
23 | Lec24 Overview Of UNIX IPC And Signals On The ... | https://youtube.com/watch?v=EX7EWSX8-qM | This session gives an overview of taxonomy of ... |
24 | Lec25 Design and Code Of Signal Handlers (Arif... | https://youtube.com/watch?v=YBg9sWw4qbU | This session is a continuation of previous ses... |
25 | Lec26 Programming UNIX Pipes (Arif Butt @ PUCIT) | https://youtube.com/watch?v=VA8FEgahi1Y | This session deals with the concept and use of... |
26 | Lec27 Programming UNIX Named Pipes (Arif Butt ... | https://youtube.com/watch?v=jowB4nuf55c | This session deals with the concept and use of... |
27 | Lec28 Message Queues (Arif Butt @ PUCIT) | https://youtube.com/watch?v=UAbMS3kYV5s | This session deals with the concept and use of... |
28 | Lec29 Programming With Shared Memory (Arif But... | https://youtube.com/watch?v=IzhnAW8u1iQ | This session deals with the concept and use of... |
29 | Lec30 Memory Mapped Files (Arif Butt @ PUCIT) | https://youtube.com/watch?v=z0I1TlqDi50 | This session deals with the concept and use of... |
30 | Lec31 Synchronization among Threads (Arif Butt... | https://youtube.com/watch?v=SvFr7rPWI3g | This session starts with a quick recap of POSI... |
31 | Lec32 Programming with POSIX Semaphores (Arif ... | https://youtube.com/watch?v=KupTFYvxRnE | This session starts with introduction to POSIX... |
32 | Lec33 Overview Of TCPIP Architecture and Servi... | https://youtube.com/watch?v=p5SrRob-bWg | This session starts with introduction to TCP/I... |
33 | Lec34 Socket Programming Part-I (Arif Butt @ P... | https://youtube.com/watch?v=tk_RpIVbOMQ | This session starts with introduction to Clien... |
34 | Lec35 Socket Programming Part-II (Arif Butt @ ... | https://youtube.com/watch?v=yNUFQaSclmM | This session starts with introduction to Datag... |
35 | Lec36 Socket Programming Part-III (Arif Butt @... | https://youtube.com/watch?v=TDRIweWXHe4 | This session starts with introduction to UNIX ... |
36 | Lec37 Socket Programming Part-IV (Arif Butt @ ... | https://youtube.com/watch?v=irRkNrruwxc | This session starts with a discussion on concu... |
37 | Lec39 Exploiting Buffer Overflow Vulnerability... | https://youtube.com/watch?v=eAzCm0Ncnhg | This is a continuation of Video Session 38. In... |
38 | Lec38 Exploiting Buffer Overflow Vulnerability... | https://youtube.com/watch?v=3hDNvlIZFQ8 | This is a series of three videos, which gives ... |
39 | Lec40 Exploiting Buffer Overflow Vulnerability... | https://youtube.com/watch?v=DayRrBYZRRk | This is a continuation of Video Session 39. In... |
Limitations of BeautifulSoup
BeautifulSoup parses whatever HTML it is handed, so it works fine on a static page like this one:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://arifpucit.github.io/bss2/")
soup = BeautifulSoup(resp.text,'lxml')
prices = soup.find_all('p', class_='price green')
prices
The same selectors come back empty on a page whose content is injected by JavaScript (which the /js page here demonstrates): requests downloads only the raw HTML, and BeautifulSoup cannot execute scripts.
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://arifpucit.github.io/bss2/js")
soup = BeautifulSoup(resp.text,'lxml')
prices = soup.find_all('p', class_='price green')
prices
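To scrape such pages you need something that executes the JavaScript first, for example a real browser driven by Selenium. A minimal sketch (assuming the selenium package and a matching Chrome driver are installed; this is not part of the original notebook):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://arifpucit.github.io/bss2/js')
soup = BeautifulSoup(driver.page_source, 'lxml')  # page_source holds the browser-rendered DOM
driver.quit()
soup.find_all('p', class_='price green')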
Similarly, content behind a login form is invisible to a plain GET request; all BeautifulSoup ever sees is the login page itself.
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://arifpucit.github.io/bss2/login/")
soup = BeautifulSoup(resp.text,'lxml')
prices = soup.find_all('p', class_='green')
prices
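Logging in usually means posting the form with a requests.Session so that cookies persist across requests. A hedged sketch only; the URL and form field names below are hypothetical placeholders, not taken from this site:
import requests

session = requests.Session()
# 'username' and 'password' are hypothetical field names; inspect the real form to find them
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})
resp = session.get('https://example.com/protected-page')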