When it comes to retrieving data from a website, one hears several different terms: parsing, (web) scraping and crawling. Let's first understand whether there is any difference between these notions and why everybody uses different terms.
After a bit of googling, one can come to the conclusion that:
Parsing is just getting information from basically any data source (logs, tables or files).
(Web) scraping is essentially getting data from a web page.
Crawling is the process of moving around a website.
So, when somebody speaks about retrieving data from a webpage or about recursively extracting data from a website, they will probably use one of the words listed above. In reality, these notions are independent, yet interrelated and sequential: first you scrape one page, crawl to another, scrape that page, and so on.
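The scrape-then-crawl cycle described above can be sketched as a short loop. This is a conceptual sketch only: `fetch`, `extract_items` and `next_page` are hypothetical stand-ins for real HTTP and parsing code, and the toy "site" is just an in-memory dictionary.

```python
def crawl(start_url, fetch, extract_items, next_page):
    """Conceptual scrape-then-crawl loop: scrape a page, follow the
    link to the next page, and repeat until there are no more pages."""
    url = start_url
    while url is not None:
        page = fetch(url)               # crawling: retrieve the page
        yield from extract_items(page)  # scraping: pull data out of it
        url = next_page(page)           # move on to the next location

# Toy run against an in-memory "site" instead of real HTTP
site = {
    "/page-1": {"items": ["phone A", "phone B"], "next": "/page-2"},
    "/page-2": {"items": ["phone C"], "next": None},
}
items = list(crawl("/page-1", site.get,
                   lambda p: p["items"], lambda p: p["next"]))
print(items)  # ['phone A', 'phone B', 'phone C']
```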
Now that the terms are defined and we understand them, it's time to dive into the details.
Scrapy is formally more than a library: it is a framework, a powerful tool for extracting data from websites and automating the process with little code written by the programmer.
The strong point of Scrapy is its template spiders (programs that go around the targeted locations and search for the needed data), which can be adjusted in the blink of an eye to a user-specific need with a few lines of Python and bash.
Well, I was strolling around the internet prior to Black Friday looking for a tempting offer, and I thought it would be interesting to get the phone assortment at Svyaznoy, one of the biggest online retailers in Russia.
So the ultimate goal of this tutorial is to get each phone's name, price, discount (if there is one) and photo, and store them neatly.
So I went to the Svyaznoy website and looked at the phone assortment.
I saw that there are 109 pages and around 2.5K phone listings, which would be tough and tiring even to look through.
P.S. We also note the link to the 1st page (which looks odd), because we will need it further on.
First, install the library if you don't have it.
conda install -c conda-forge scrapy
or pip install scrapy
pwd
Scrapy works in terms of projects. So you create a default project with a bunch of scripts that Scrapy runs to get the data from defined locations.
Let's create a default project.
!scrapy startproject svyaznoy
ls
Look in the project folder that we have just created.
cd svyaznoy
So there is some config file and a folder with Scrapy-specific scripts.
ls
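For reference, `scrapy startproject` generates a layout along these lines (the exact file list may differ slightly between Scrapy versions):

```
svyaznoy/
    scrapy.cfg            # deploy configuration file
    svyaznoy/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (e.g. image downloads)
        settings.py       # project settings
        spiders/          # our spiders will live here
            __init__.py
```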
First we have to create a spider, the key program that defines which locations to crawl and which data to collect.
For simplicity, we shall call it svz.py.
We will also pass the webpage URL, so that Scrapy can identify the structure.
P.S. You cannot give the spider the same name as the project (I guess it just confuses everybody).
!scrapy genspider svz https://www.svyaznoy.ru/catalog/phone/224
We can see that our spider has been created at svyaznoy.spiders.svz.
Proceed to svyaznoy to see what is inside.
cd svyaznoy/
ls
Look inside the spiders folder and see our script.
cd spiders/
ls
Look inside the spider script.
cat svz.py
We all know that everything in Python can be regarded as an object. The same applies to a website.
Every phone is embedded in some kind of card that holds all its characteristics, such as the name, price, discounts, rating and others, and each of these can be regarded as a separate object. The card itself can be regarded as an object as well, so the set of goods is a set of cards.
Here is what I mean by the card of a good:
Everybody knows about the developer tools: we can right-click an element to inspect it and understand where the needed objects are located in terms of the HTML/XML markup.
Here we can see the phone price block:
The phone name block:
The photo link block:
Now we have to understand what HTML and XML are, how they differ, and how to efficiently retrieve information from a website's markup.
The key (and, for our purposes, the only) difference between the two is that XML is used for storing and transporting data, while HTML is used for formatting and displaying the same data.
What is XML?
Although XML looks a lot like HTML, it has an absolutely different purpose and guts. XML stands for eXtensible Markup Language, which actually explains itself.
Surprisingly, XML doesn't really do anything; it just structures, stores and transports data upon request.
One of the reasons it is called eXtensible is that you can invent your own tags, which helps you navigate the data the way you like, while HTML has predefined tags and all HTML documents are built from standardised tags like <body>, <p>, <li> etc.
This lets the developer invent their own tags and structure the data in whatever way fits the nature of the document. However, XML is not a replacement for HTML but an extension of it (seriously, man?).
So, in most web solutions they work in synergy: XML transports the data, while HTML formats and displays it nicely.
All this makes XML a vital tool for the internet, utilized everywhere one has to transport data between all kinds of applications.
This is how XML code looks:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Or it can be represented in tree form, which can be easier to grasp.
What is XPath?
XPath is a special language to identify parts of XML documents, search and select information.
It uses path expressions to navigate, which look a lot like queries.
It also has a list of functions (logical and numerical) to test the data.
And this is how the query structures look. With their help we can drill through the XML notation with XPath down to the data elements, using hierarchical selectors:
XPath expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is the child of the bookstore element |
/bookstore/book[last()] | Selects the last book element that is the child of the bookstore element |
/bookstore/book[last()-1] | Selects the last but one book element that is the child of the bookstore element |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element |
//title[@lang] | Selects all the title elements that have an attribute named lang |
//title[@lang='en'] | Selects all the title elements that have a "lang" attribute with a value of "en" |
/bookstore/book[price>35.00] | Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00 |
/bookstore/book[price>35.00]/title | Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00 |
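A few of these expressions can be tried directly in Python. The standard-library `xml.etree.ElementTree` supports only a subset of XPath (no `last()-1` arithmetic and no value comparisons like `price>35.00`), but it is enough to illustrate the basics on a shortened version of the bookstore document above:

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring("""
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title><price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title><price>39.95</price>
  </book>
</bookstore>""")

# book[1] -- the first book child (paths here are relative to the root)
first = bookstore.find("book[1]")
print(first.get("category"))    # COOKING

# .//title[@lang='en'] -- all title elements with lang="en"
titles = [t.text for t in bookstore.findall(".//title[@lang='en']")]
print(titles)                   # ['Everyday Italian', 'Learning XML']

# book[last()] -- the last book child
last = bookstore.find("book[last()]")
print(last.find("title").text)  # Learning XML
```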
For now, it is sufficient to know that textual data on a website is stored in XML-like markup, and that we can efficiently retrieve it by querying with XPath.
XPath, in its turn, is a query language based on the hierarchical tag structure, which forms a tree that can be easily decomposed, selected and manipulated for our purposes.
Scrapy has an interactive shell where you can debug your scraping code very quickly and try out selecting data without running a spider every time.
Try it yourself!
scrapy shell # this will start the shell
fetch("https://www.svyaznoy.ru/catalog/phone/224") # get the structure of the web page
print(response.text) # bring back the bare html and css, like in developer tools
Since you cannot execute everything in Jupyter notebooks (unfortunately), we test/debug via the scrapy shell in a terminal, Command Prompt or other console app, and use the XPath notation to drill into the tag tree and get the needed data.
Titles can be found like so:
response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()
Write them to an object.
titles = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()
Our photo links are also located in the b-product-block__image class, and we can extract them the following way:
imgs = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@data-original").extract()
Prices are located a bit further down the tree, in the b-product-block__price block, inside a span of class b-product-block__visible-price.
response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()
However, the raw data is dirty, and we will have to clean it up using some time, magic and regular expressions (which are actually equivalent to magic).
This is how we can clean it up:
prices = [price.replace("\n", "") for price in response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()]
prices = [price.replace("\xa0", "") for price in prices] # cleaning from non-breaking space in Latin1(ISO 8859-1)
prices = [price.strip() for price in prices] # cleaning from unwanted spaces
prices = [int(price) for price in prices if price] # turning string objects to integers
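The effect of this cleaning chain is easy to check on a couple of sample strings (the dirty values below are invented for illustration, mimicking what the site markup returns):

```python
# Made-up raw price strings: newlines, non-breaking spaces (\xa0)
# and padding around the digits, plus one empty entry.
raw = ["\n  29\xa0990  ", "\n  12\xa0490  ", ""]

prices = [p.replace("\n", "") for p in raw]       # drop newlines
prices = [p.replace("\xa0", "") for p in prices]  # drop non-breaking spaces
prices = [p.strip() for p in prices]              # drop surrounding spaces
prices = [int(p) for p in prices if p]            # skip empties, cast to int
print(prices)  # [29990, 12490]
```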
It is pretty much the same as with prices, but there are cases when there is no sale offer for an item, so we will have to be a bit clever here.
Below is a list comprehension that gets the sale offer: if an item has a discount block, we extract it, otherwise we fill our object with the string zero.
[response.xpath(".//div[@class='b-product-block__gain']").extract_first() if \
'b-product-block__gain' in i else '0' \
for i in response.xpath(".//div[@class='b-product-block__price']").extract()]
import re
sales = [response.xpath(".//div[@class='b-product-block__gain']").extract_first() if 'b-product-block__gain' in i else '0' for i in response.xpath(".//div[@class='b-product-block__price']").extract()]
sales = [sale.replace("\xa0", "") for sale in sales] # cleaning from non-breaking space in Latin1(ISO 8859-1)
sales = [sale.strip() for sale in sales] # cleaning from unwanted spaces
sales = [re.findall(r"\d+", sale) for sale in sales] # extracting all digit runs from each object in our list
sales = [item for sublist in sales for item in sublist] # flatten the list of lists
sales = [int(sale) for sale in sales] # turning string objects to integers
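Here is the regex-and-flatten part of the chain on invented sample data: one item whose discount block contains 5 000 and one item with the '0' placeholder.

```python
import re

# Sale blocks as extracted: an HTML fragment with the discount amount
# inside it, or the placeholder '0' for items with no sale offer.
sales = ['<div class="b-product-block__gain">-\xa05\xa0000</div>', '0']

sales = [s.replace("\xa0", "").strip() for s in sales]
sales = [re.findall(r"\d+", s) for s in sales]  # digit runs per item
print(sales)  # [['5000'], ['0']]

sales = [x for sub in sales for x in sub]       # flatten the list of lists
sales = [int(x) for x in sales]
print(sales)  # [5000, 0]
```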
Now we put it all together.
# Retrieving objects
imgs = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@data-original").extract()
titles = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()
prices = response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()
sales = [response.xpath(".//div[@class='b-product-block__gain']").extract_first() if 'b-product-block__gain' in i else '0' for i in response.xpath(".//div[@class='b-product-block__price']").extract()]
# Process the prices
prices = [price.replace("\n", "") for price in response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()]
prices = [price.replace("\xa0", "") for price in prices]
prices = [price.strip() for price in prices]
prices = [int(price) for price in prices if price]
# Process the discounts
sales = [sale.replace("\xa0", "") for sale in sales]
sales = [sale.strip() for sale in sales]
sales = [re.findall(r"\d+", sale) for sale in sales]
sales = [item for sublist in sales for item in sublist]
sales = [int(sale) for sale in sales]
Then we want to get the data into a usual format: first the data is stored in dictionaries, and then we pass it to the internal Scrapy scripts, so that we yield a table in .csv format.
We will use a for loop together with the zip construction.
We will also use yield, turning parse into a generator, so that the spider forms a dictionary for each item.
All together it looks like this:
for item in zip(titles, prices, sales, imgs):
    scraped_info = {
        'title': item[0],
        'price': item[1],
        'sale_offer': item[2],
        'image_urls': [item[3]]}
    yield scraped_info
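The same zip-and-yield pattern can be tried standalone with made-up data (the helper function here is just for illustration, it is not part of the actual spider):

```python
def parse_items(titles, prices, sales, imgs):
    """Yield one dictionary per scraped item, the way the spider does."""
    for title, price, sale, img in zip(titles, prices, sales, imgs):
        yield {
            'title': title,
            'price': price,
            'sale_offer': sale,
            'image_urls': [img],
        }

rows = list(parse_items(["Phone A", "Phone B"], [29990, 12490],
                        [5000, 0], ["a.jpg", "b.jpg"]))
print(rows[0])
# {'title': 'Phone A', 'price': 29990, 'sale_offer': 5000, 'image_urls': ['a.jpg']}
```

Note that zip stops at the shortest input list, so if the sales list ends up shorter than the titles list, some items would be silently dropped.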
Scrapy has built-in structures for extracting page links and defining crawl rules, but I decided to keep it very simple: build the list of links with a generator expression and merge the two lists.
# Pagination (109 pages in total).
allowed_domains = ['www.svyaznoy.ru']
first_page = ['http://www.svyaznoy.ru/catalog/phone/224/']
all_others = ['http://www.svyaznoy.ru/catalog/phone/224/page-'+str(x) for x in range(2, 110)]
# Put the 1st page first.
start_urls = first_page + all_others
Apart from that, we will also have our spider request the next page from each page it parses.
next_page = response.xpath(".//li[@class='next']//a/@data-page").extract()
if next_page:
    next_page = response.urljoin(str(int(next_page[0]) + 1))
    yield scrapy.Request(next_page, callback=self.parse)
In addition, we will have to include several adjustments in our scripts.
Firstly, we have to specify where we would like to store our results, in our main script svz.py.
custom_settings = {'FEED_URI' : 'results/svyaznoy.csv'}
Secondly, we have to open the settings.py script in our project folder svyaznoy and include the parameters listed below.
BOT_NAME = 'svyaznoy'
SPIDER_MODULES = ['svyaznoy.spiders']
NEWSPIDER_MODULE = 'svyaznoy.spiders'
FEED_FORMAT = "csv"
FEED_URI = "svyaznoy.csv"
And lastly, we need to specify the pipeline that downloads the phone photos from the extracted links (also in settings.py):
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = 'results/images/'
Done!
The other scripts in the spiders folder do not need any adjustments, at least for our purposes.
Locate yourself in the project folder (svyaznoy), make some popcorn and run the spider with the command scrapy crawl svyaznoy. Enjoy.
After a couple of minutes we get a report with stats about the job done, something like this:
We can see that the majority of the data was scraped, while some pages returned a 301 code, which means redirection.
Of course, there are tips and tricks for dealing with that as well, but I will leave it to the reader to find them in the documentation.
You will also find the small resized photos and a csv file in the results folder.
The script below can simply be copied (Ctrl+C/Ctrl+V) into the key spider/crawler script, svz.py. Don't forget the additional adjustments in settings.py.
P.S. Bear in mind the indentation problem (4 spaces or Tab) when writing/debugging code in text editors/IDEs. In my case, I chose Tabs.
import re
import scrapy

class SvzSpider(scrapy.Spider):
    name = 'svyaznoy'
    custom_settings = {'FEED_URI': 'results/svyaznoy.csv'}

    # Making a proper list of pages (109 pages in total).
    allowed_domains = ['www.svyaznoy.ru']
    first_page = ['http://www.svyaznoy.ru/catalog/phone/224/']
    all_others = ['http://www.svyaznoy.ru/catalog/phone/224/page-' + str(x) for x in range(2, 110)]
    # Inserting the 1st page.
    start_urls = first_page + all_others

    def parse(self, response):
        # Retrieving objects
        imgs = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@data-original").extract()
        titles = response.xpath("//div[@class='b-product-block__image']//img[@class='lazy']/@title").extract()
        prices = response.xpath(".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()").extract()
        sales = [response.xpath(".//div[@class='b-product-block__gain']").extract_first() if 'b-product-block__gain' in i else '0'
                 for i in response.xpath(".//div[@class='b-product-block__price']").extract()]

        # Processing prices
        prices = [price.replace("\n", "") for price in prices]
        prices = [price.replace("\xa0", "") for price in prices]
        prices = [price.strip() for price in prices]
        prices = [int(price) for price in prices if price]

        # Processing sale offers
        sales = [sale.replace("\xa0", "") for sale in sales]
        sales = [sale.strip() for sale in sales]
        sales = [re.findall(r"\d+", sale) for sale in sales]
        sales = [item for sublist in sales for item in sublist]
        sales = [int(sale) for sale in sales]

        # Yielding objects
        for item in zip(titles, prices, sales, imgs):
            scraped_info = {
                'title': item[0],
                'price': item[1],
                'sale_offer': item[2],
                'image_urls': [item[3]]}
            yield scraped_info

        # Pagination loop
        next_page = response.xpath(".//li[@class='next']//a/@data-page").extract()
        if next_page:
            next_page = response.urljoin(str(int(next_page[0]) + 1))
            yield scrapy.Request(next_page, callback=self.parse)