If you do SEO, a big part of your job is tracking your rankings on SERPs, as well as your competitors'.
I'm going to share a way to get SERP data into a DataFrame (table / CSV / Excel sheet) for analysis, at scale and in an automated way.
I will be using the programming language Python, so there will be some coding involved. If you don't know any programming, you can simply ignore the code snippets below, as you don't need to understand them to follow along.
So how exactly are we going to get the data, and what are we going to do with it?
Google's Custom Search Engine is a service that allows you to create your own customized search engine, where you can specify certain sites to crawl and set your own relevancy rules. You can also pull the data programmatically through its API, which is what we are going to do. If you don't specify any rules, your custom search engine will basically be searching the whole web. In addition, you can specify many parameters for your search queries: the location of the user, the language of the site, image search, and many more.
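To make the mechanics concrete, here is a minimal sketch of calling the Custom Search JSON API directly with the standard library (advertools, used below, wraps all of this for you; the endpoint and parameter names are Google's, but the helper functions are my own):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = 'https://www.googleapis.com/customsearch/v1'

def build_url(query, cx, key, **extra):
    """Build the request URL; extra keyword arguments (gl, num,
    start, searchType, ...) become query-string parameters."""
    params = {'q': query, 'cx': cx, 'key': key, **extra}
    return API_URL + '?' + urlencode(params)

def serp_request(query, cx, key, **extra):
    """Fetch one page of results as a parsed JSON dict."""
    with urlopen(build_url(query, cx, key, **extra)) as response:
        return json.load(response)

# resp = serp_request('flights to hong kong', cx=cx, key=key, gl='us')
# [item['link'] for item in resp['items']]  # the result URLs
```

advertools goes further and flattens the JSON into a DataFrame row per result, which is the format we work with below.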
Here are the steps to set up your account to import data (skip if you don't want to run the code yourself).
We will be using three Python packages for our work:
- advertools: to send the requests to Google's API and get the results in a DataFrame
- pandas: for data manipulation and analysis
- matplotlib: for data visualization
To give you a quick idea, here is a sample from the SERP response that we will be working with:
import pandas as pd
serp_flights = pd.read_csv('serp_flights.csv')
serp_flights.head()
searchTerms | rank | title | snippet | displayLink | link | queryTime | totalResults | cacheId | formattedUrl | ... | count | startIndex | inputEncoding | outputEncoding | safe | cx | gl | searchTime | formattedSearchTime | formattedTotalResults | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | flights to hong kong | 1 | Cheap Flights to Hong Kong (HKG) from $401 - K... | Find flights to Hong Kong on XiamenAir, Hong K... | www.kayak.com | https://www.kayak.com/flight-routes/United-Sta... | 2018-11-21 03:03:21.596540+00:00 | 12700000 | W42baDpas_gJ | https://www.kayak.com/flight-routes/United.../... | ... | 10 | 1 | utf8 | utf8 | off | 012859022920491477448:pubdbfjmmec | us | 0.428278 | 0.43 | 12,700,000 |
1 | flights to hong kong | 2 | $458 Flights to Hong Kong, China (HKG) - TripA... | Cheap Flights to Hong Kong: Enter your dates o... | www.tripadvisor.com | https://www.tripadvisor.com/Flights-g294217-Ho... | 2018-11-21 03:03:21.596540+00:00 | 12700000 | 5jsjhmxgxvkJ | https://www.tripadvisor.com/Flights-g294217-Ho... | ... | 10 | 1 | utf8 | utf8 | off | 012859022920491477448:pubdbfjmmec | us | 0.428278 | 0.43 | 12,700,000 |
2 | flights to hong kong | 3 | Cheap Flights to Hong Kong SAR, Asia $307.69 i... | Expedia Add-On Advantage: Book a flight & unlo... | www.expedia.com | https://www.expedia.com/Cheap-Flights-To-Hong-... | 2018-11-21 03:03:21.596540+00:00 | 12700000 | iEmRbX0P8CAJ | https://www.expedia.com/Cheap-Flights-To-Hong-... | ... | 10 | 1 | utf8 | utf8 | off | 012859022920491477448:pubdbfjmmec | us | 0.428278 | 0.43 | 12,700,000 |
3 | flights to hong kong | 4 | Cheap Flights to Hong Kong International from ... | The Magnificent City of Hong Kong. Check for c... | www.skyscanner.com | https://www.skyscanner.com/flights-to/hkg/chea... | 2018-11-21 03:03:21.596540+00:00 | 12700000 | EI78TI4Q1qYJ | https://www.skyscanner.com/flights-to/hkg/chea... | ... | 10 | 1 | utf8 | utf8 | off | 012859022920491477448:pubdbfjmmec | us | 0.428278 | 0.43 | 12,700,000 |
4 | flights to hong kong | 5 | Flights to Hong Kong (HKG) | Hong Kong Flight ... | Find & book great deals on Hong Kong (HKG) Fli... | flights.cathaypacific.com | https://flights.cathaypacific.com/en-us/flight... | 2018-11-21 03:03:21.596540+00:00 | 12700000 | q6kszaD7mggJ | https://flights.cathaypacific.com/en-us/flight... | ... | 10 | 1 | utf8 | utf8 | off | 012859022920491477448:pubdbfjmmec | us | 0.428278 | 0.43 | 12,700,000 |
5 rows × 26 columns
A few notes on the different columns available:
"queryTime" is the time that the query was run (when I made the request). This is different from "searchTime" which is the amount of time it took Google to run the query (usually less than one second). Most of the main columns will always be there, but if you pass different parameters you will have more or less columns. For example, you would have columns describing the images, in case you specify the type of search to be "image".
We are going to take a look at the airline tickets industry, and here are the details:
- Destinations: the top 100 cities by international visitors, imported from Wikipedia
- Keywords: two variations per city, "flights to <city>" and "tickets to <city>"
- Countries: the United States and the United Kingdom (the gl parameter)
- Results: the top ten for each query
As a result we have 100 destinations x 2 variations x 2 countries x 10 results = 4,000 rows of data.
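That arithmetic is worth keeping around as a sanity check on the imported data:

```python
# Every combination of destination, keyword variation, country,
# and result position should yield one row.
destinations, variations, countries, results_per_serp = 100, 2, 2, 10
expected_rows = destinations * variations * countries * results_per_serp
print(expected_rows)  # 4000
# later, after importing: assert len(serp_flights) == expected_rows
```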
We begin by importing the packages that we will use, and defining our Google CSE ID and key:
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import advertools as adv
import pandas as pd
pd.set_option('display.max_columns', None)
cx = 'YOUR_GOOGLE_CUSTOM_SEARCH_ENGINE_ID'
key = 'YOUR_GOOGLE_DEVELOPER_KEY'
Now we can import the Wikipedia table, and here are the top destinations with some additional data:
# top_dest = pd.read_html('https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors',
# header=0)[0]
# top_dest.to_csv('top_destinations.csv', index=False)
top_dest = pd.read_csv('top_destinations.csv')
top_dest.head().style.format({'Arrivals 2016Euromonitor': '{:,}'})
RankEuromonitor | RankMastercard | City | Country | Arrivals 2016Euromonitor | Arrivals 2016Mastercard | Growthin arrivalsEuromonitor | Income(billions $)Mastercard | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 11 | Hong Kong | Hong Kong | 25,695,800.0 | 8.37e+06 | −3.1 % | 6.84 |
1 | 2 | 1 | Bangkok | Thailand | 23,270,600.0 | 2.147e+07 | 9.5 % | 14.84 |
2 | 3 | 2 | London | United Kingdom | 19,842,800.0 | 1.988e+07 | 3.4 % | 19.76 |
3 | 4 | 6 | Singapore | Singapore | 17,681,800.0 | 1.211e+07 | 6.1 % | 12.54 |
4 | 5 | nan | Macau | Macau | 16,299,100.0 | nan | 5.9 % | nan |
Create the keywords by concatenating the two variations mentioned above:
cities = top_dest['City'].tolist()
queries = ['flights to ' + c.lower() for c in cities] + ['tickets to ' + c.lower() for c in cities]
queries[:3] + queries[-3:] + ['etc...']
['flights to hong kong', 'flights to bangkok', 'flights to london', 'tickets to qingdao', 'tickets to philadelphia', 'tickets to lagos', 'etc...']
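The same list can be produced for any number of variations with `itertools.product`; this generic helper is my own sketch, not part of the original code:

```python
from itertools import product

def generate_queries(variations, cities):
    """Every 'variation city' combination, lowercased."""
    return [f'{v} {c}'.lower() for v, c in product(variations, cities)]

queries_demo = generate_queries(['flights to', 'tickets to'],
                                ['Hong Kong', 'Bangkok', 'London'])
print(queries_demo[:3])  # ['flights to hong kong', 'flights to bangkok', 'flights to london']
```

This scales naturally if you later want three or four variations ("cheap flights to", "book flights to", etc.).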
With the main parameters defined, we can now send the requests to Google as follows:
# serp_flights = adv.serp_goog(cx=cx, key=key, q=queries, gl=['us', 'uk']) # imports data
serp_flights = pd.read_csv('serp_flights.csv', parse_dates=['queryTime'])
serp_us = serp_flights[serp_flights['gl'] == 'us'].copy() # create a subset for US
serp_uk = serp_flights[serp_flights['gl'] == 'uk'].copy() # create a subset for UK
Let's now take a quick look at the top domains.
print('Domain Summary - Overall')
(serp_flights
.pivot_table('rank', 'displayLink',
aggfunc=['count', 'mean'])
.sort_values([('count', 'rank'), ('mean', 'rank')],
ascending=[False, True])
.assign(coverage=lambda df: df[('count', 'rank')] / len(serp_flights)*10)
.head(10).style.format({("coverage", ''): "{:.1%}",
('mean', 'rank'): '{:.2f}'}))
Domain Summary - Overall
count | mean | coverage | |
---|---|---|---|
rank | rank | ||
displayLink | |||
www.skyscanner.net | 364 | 3.32 | 91.0% |
www.tripadvisor.com | 273 | 4.01 | 68.3% |
www.expedia.com | 243 | 3.06 | 60.7% |
www.kayak.com | 196 | 2.57 | 49.0% |
www.cheapflights.com | 180 | 5.68 | 45.0% |
www.kayak.co.uk | 175 | 4.91 | 43.8% |
www.cheapflights.co.uk | 174 | 4.51 | 43.5% |
www.skyscanner.com | 167 | 4.87 | 41.8% |
www.expedia.co.uk | 165 | 5.25 | 41.2% |
www.tripadvisor.co.uk | 135 | 6.07 | 33.8% |
Since we are mainly interested in the ranking of domains, the data is summarized by three main metrics:
- count: the number of times the domain appeared in the results
- mean: the average rank of those appearances
- coverage: the number of appearances divided by the number of SERPs
The pivot table above covers all the results and gives a quick overview. I think it's more meaningful to split the data into two pivot tables, one for each country:
print('Domain Summary - US')
(serp_flights[serp_flights['gl']=='us']
.pivot_table('rank', 'displayLink',
aggfunc=['count', 'mean'])
.sort_values([('count', 'rank'), ('mean', 'rank')],
ascending=[False, True])
.assign(coverage=lambda df: df[('count', 'rank')] / len(serp_flights)*10 * 2)
.head(10).style.format({("coverage", ''): "{:.1%}",
('mean', 'rank'): '{:.2f}'}))
Domain Summary - US
count | mean | coverage | |
---|---|---|---|
rank | rank | ||
displayLink | |||
www.expedia.com | 195 | 2.28 | 97.5% |
www.tripadvisor.com | 195 | 3.14 | 97.5% |
www.kayak.com | 176 | 2.07 | 88.0% |
www.skyscanner.com | 158 | 4.73 | 79.0% |
www.skyscanner.net | 148 | 6.05 | 74.0% |
www.cheapflights.com | 147 | 5.68 | 73.5% |
www.cheapoair.com | 74 | 6.84 | 37.0% |
www.orbitz.com | 60 | 6.70 | 30.0% |
www.makemytrip.com | 45 | 5.80 | 22.5% |
www.lufthansa.com | 37 | 6.92 | 18.5% |
For coverage, I divided by 400 in the first table, but for each country I'm dividing by 200, because we are only interested in that country's queries. An interesting point here is that kayak.com has lower coverage than tripadvisor.com, but a better mean rank (2.07 vs 3.14). At the top of the page the difference in value between positions two and three is quite high, so depending on your case you might value one metric over the other.
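In other words, coverage is simply appearances divided by the number of SERPs in scope; a tiny helper makes the arithmetic explicit (my own sketch, using counts from the tables above):

```python
def coverage(appearances, num_serps):
    """Share of SERPs a domain appeared on. A domain that shows up
    more than once on the same SERP can exceed 100%."""
    return appearances / num_serps

print(f'{coverage(176, 200):.1%}')  # kayak.com, US: 88.0%
print(f'{coverage(216, 200):.1%}')  # skyscanner.net, UK: 108.0%
```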
print('Domain Summary - UK')
(serp_flights[serp_flights['gl']=='uk']
.pivot_table('rank', 'displayLink',
aggfunc=['count', 'mean'])
.sort_values([('count', 'rank'), ('mean', 'rank')],
ascending=[False, True])
.assign(coverage=lambda df: df[('count', 'rank')] / len(serp_flights)*10*2)
.head(10).style.format({("coverage", ''): "{:.1%}",
('mean', 'rank'): '{:.2f}'}))
Domain Summary - UK
count | mean | coverage | |
---|---|---|---|
rank | rank | ||
displayLink | |||
www.skyscanner.net | 216 | 1.45 | 108.0% |
www.cheapflights.co.uk | 161 | 4.25 | 80.5% |
www.kayak.co.uk | 159 | 4.75 | 79.5% |
www.expedia.co.uk | 150 | 5.14 | 75.0% |
www.tripadvisor.co.uk | 128 | 6.10 | 64.0% |
www.lastminute.com | 112 | 5.97 | 56.0% |
www.opodo.co.uk | 109 | 5.96 | 54.5% |
www.travelsupermarket.com | 83 | 6.16 | 41.5% |
www.tripadvisor.com | 78 | 6.18 | 39.0% |
www.britishairways.com | 69 | 6.22 | 34.5% |
Having a coverage of 108% means that skyscanner.net appeared on every SERP, and in some cases more than once on the same SERP.
Note also that their mean rank is 1.45, far better than any other domain's.
No joking with Skyscanner!
Now that we have an idea about the number of times they appeared and the average ranks they have, it might also be good to visualize the data, so we can see how it is distributed.
We first get the top 10 domains for each country, and define two new DataFrames (tables) containing only the filtered data and then visualize:
top10_domains = serp_flights.displayLink.value_counts()[:10].index
top10_df = serp_flights[serp_flights['displayLink'].isin(top10_domains)]
top10_domains_us = serp_us.displayLink.value_counts()[:10].index
top10_df_us = serp_us[serp_us['displayLink'].isin(top10_domains_us)]  # filter within the US subset
top10_domains_uk = serp_uk.displayLink.value_counts()[:10].index
top10_df_uk = serp_uk[serp_uk['displayLink'].isin(top10_domains_uk)]  # filter within the UK subset
fig, ax = plt.subplots(facecolor='#ebebeb')
fig.set_size_inches(15, 9)
ax.set_frame_on(False)
ax.scatter(top10_df['displayLink'].str.replace('www.', ''),
top10_df['rank'], s=850, alpha=0.02, edgecolor='k', lw=2)
ax.grid(alpha=0.25)
ax.invert_yaxis()
ax.yaxis.set_ticks(range(1, 11))
ax.tick_params(labelsize=15, rotation=9, labeltop=True,
labelbottom=False)
ax.set_ylabel('Search engine results page rank', fontsize=16)
ax.set_title('Top 10 Tickets and Flights Domains', pad=75, fontsize=24)
ax.text(4.5, -0.5, 'Organic Search Rankings for 200 Keywords in US & UK',
ha='center', fontsize=15)
fig.savefig(ax.get_title() + '.png',
facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()
For each appearance on a SERP we plot a very light circle at the position where that domain appeared (from one to ten). The more frequently a domain appears at a position, the darker the circle.
For example, kayak.com, expedia.com, and skyscanner.net have solid blue circles at position one, as well as lighter ones at other positions.
A minor issue in this analysis so far is that it treats all keywords equally. The number of tourists across the top one hundred list varies between two and twenty-six million, so the keywords are clearly not equal. Also, for your specific case, you might have your own set of "top 100" based on the website you are working on. But since we are exploring the industry and trying to understand the positions of the different players, I don't think it's a bad assumption. Just keep this in mind when doing a similar analysis for a specific case.
As above, this was for the overall data, and below is the same visualization split by country:
top10_dfs = [top10_df_us, top10_df_uk]
colors = ['darkred', 'olive']
suffixes = [' - US', ' - UK']
fig, ax = plt.subplots(2, 1, facecolor='#ebebeb')
fig.set_size_inches(15, 18)
for i in range(2):
    ax[i].set_frame_on(False)
    ax[i].scatter(top10_dfs[i]['displayLink'].str.replace('www.', ''),
                  top10_dfs[i]['rank'], s=850, alpha=0.02,
                  edgecolor='k', lw=2, color=colors[i])
    ax[i].grid(alpha=0.25)
    ax[i].invert_yaxis()
    ax[i].yaxis.set_ticks(range(1, 11))
    ax[i].tick_params(labelsize=15, rotation=12, labeltop=True,
                      labelbottom=False)
    ax[i].set_ylabel('Search engine results page rank', fontsize=16)
    ax[i].set_title('Top 10 Tickets and Flights Domains' + suffixes[i],
                    pad=75, fontsize=24)
    ax[i].text(4.5, -0.5, 'Organic Search Rankings for 200 Keywords',
               ha='center', fontsize=15)
plt.tight_layout()
fig.savefig(ax[i].get_title() + '.png',
facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()
Another important metric you might be interested in is how many pages each domain has for the different cities.
Assuming the content is genuine and of a minimum level of quality, the more content you have, the more likely you are to appear on SERPs, especially for keyword variations and the different combinations users can think of.
One of the request parameters Google allows is specifying the site you want to search, with the option to include or exclude that site. So if we search for "tickets to hong kong" and specify siteSearch=www.tripadvisor.com with siteSearchFilter=i (for "include"), we get the search results restricted to that site only.
An important column that comes with every response is "totalResults", which shows how many pages Google has for that query. Since the query was restricted to a certain domain and a specific keyword, we can figure out how many pages that domain has that are eligible to appear for that keyword.
I ran the queries for the top five destinations, for both countries.
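Given a parsed response dict, the page count sits under `searchInformation` in the Custom Search JSON response; here is a small sketch (the field name is Google's, the helper and the sample dict are mine):

```python
def total_results(response):
    """Pages Google reports for a (site-restricted) query.
    totalResults arrives as a string, so convert it."""
    return int(response['searchInformation']['totalResults'])

# e.g. "flights to hong kong" restricted to www.tripadvisor.com:
sample_response = {'searchInformation': {'totalResults': '301000'}}
print(total_results(sample_response))  # 301000
```

advertools surfaces the same number as the "totalResults" column, which is what the code below relies on.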
# pagesperdomain_us = adv.serp_goog(cx=cx, key=key, q=queries[:5],
# siteSearch=top10_domains_us.tolist(),
# siteSearchFilter='i', num=1)
# pagesperdomain_uk = adv.serp_goog(cx=cx, key=key, q=queries[:5],
# siteSearch=top10_domains_uk.tolist() ,
# siteSearchFilter='i', num=1)
# pagesperdomain_us.to_csv('pagesperdomain_us.csv', index=False)
# pagesperdomain_uk.to_csv('pagesperdomain_uk.csv', index=False)
pagesperdomain_us = pd.read_csv('pagesperdomain_us.csv')
pagesperdomain_uk = pd.read_csv('pagesperdomain_uk.csv')
Here are the first ten results from the US for "flights to hong kong" and below a visualization for each of the keywords and the countries:
(pagesperdomain_us
[['searchTerms', 'displayLink', 'totalResults']]
.head(10)
.style.format({'totalResults': '{:,}'}))
searchTerms | displayLink | totalResults | |
---|---|---|---|
0 | flights to hong kong | www.tripadvisor.com | 301,000 |
1 | flights to hong kong | www.expedia.com | 616,000 |
2 | flights to hong kong | www.kayak.com | 75,500 |
3 | flights to hong kong | www.skyscanner.com | 5,420 |
4 | flights to hong kong | www.skyscanner.net | 5,660 |
5 | flights to hong kong | www.cheapflights.com | 37,500 |
6 | flights to hong kong | www.cheapoair.com | 4,020 |
7 | flights to hong kong | www.orbitz.com | 27,100 |
8 | flights to hong kong | www.makemytrip.com | 208,000 |
9 | flights to hong kong | www.lufthansa.com | 5,690 |
from matplotlib.cm import tab10
from matplotlib.ticker import EngFormatter
fig, ax = plt.subplots(5, 2, facecolor='#eeeeee')
fig.set_size_inches(17, 20)
countries = [' - US', ' - UK']
pages_df = [pagesperdomain_us, pagesperdomain_uk]
for i in range(5):
    for j in range(2):
        ax[i, j].set_frame_on(False)
        ax[i, j].barh((pages_df[j][pages_df[j]['searchTerms'] == queries[i]]
                       .sort_values('totalResults')['displayLink']
                       .str.replace('www.', '')),
                      (pages_df[j][pages_df[j]['searchTerms'] == queries[i]]
                       .sort_values('totalResults')['totalResults']),
                      color=tab10.colors[i+5*j])
        ax[i, j].grid(axis='x')
        ax[i, j].set_title('Pages per domain. Keyword: "' + queries[i] + '"' + countries[j],
                           fontsize=15)
        ax[i, j].tick_params(labelsize=12)
        ax[i, j].xaxis.set_major_formatter(EngFormatter())
plt.tight_layout()
fig.savefig('Pages per domain' + '.png',
facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()
As you can see, the difference can be dramatic in some cases, and it doesn't always correlate with top positions. Feel free to analyze further, or try other keywords if you are interested.
There are many ways to analyze titles (and snippets), but in this case, one particular thing caught my attention, and I think it's very important in this industry.
Many sites have the price of the tickets in the title of the page, which is visible in SERPs and is one of the most important factors encouraging / discouraging people to click. For example:
serp_flights[serp_flights['searchTerms'] == 'flights to paris'][['searchTerms', 'title']].head(10)
searchTerms | title | |
---|---|---|
120 | flights to paris | Cheap Flights to Paris (PAR) from $269 - KAYAK |
121 | flights to paris | $309 Flights to Paris, France (PAR) - TripAdvisor |
122 | flights to paris | Cheap Flights to Paris, Ile-de-France (CDG Air... |
123 | flights to paris | Cheap Flights to Paris, France - Search Deals ... |
124 | flights to paris | Cheap Flights to Paris | CheapOair |
125 | flights to paris | $309.98 + Flights to Paris (PAR) on Orbitz.com |
126 | flights to paris | Cheap Flights to Paris from $259 in 2019 | Sky... |
127 | flights to paris | Flight to Paris - Cheap flights to France | Ai... |
128 | flights to paris | Cheap Flights to Paris from $15 | JETCOST |
129 | flights to paris | Cheap Paris Return Flights, 2018 PAR Airfare @... |
Let's now extract the prices and currencies, so we can do further analysis.
serp_flights['price'] = (serp_flights['title']
                         .str.extract(r'[$£](\d+,?\d+\.?\d+)')[0]
                         .str.replace(',', '').astype(float))
serp_flights['currency'] = serp_flights['title'].str.extract(r'([$£])')
serp_flights[['searchTerms', 'title', 'price', 'currency', 'displayLink']].head(15)
searchTerms | title | price | currency | displayLink | |
---|---|---|---|---|---|
0 | flights to hong kong | Cheap Flights to Hong Kong (HKG) from $401 - K... | 401.00 | $ | www.kayak.com |
1 | flights to hong kong | $458 Flights to Hong Kong, China (HKG) - TripA... | 458.00 | $ | www.tripadvisor.com |
2 | flights to hong kong | Cheap Flights to Hong Kong SAR, Asia $307.69 i... | 307.69 | $ | www.expedia.com |
3 | flights to hong kong | Cheap Flights to Hong Kong International from ... | 385.00 | $ | www.skyscanner.com |
4 | flights to hong kong | Flights to Hong Kong (HKG) | Hong Kong Flight ... | NaN | NaN | flights.cathaypacific.com |
5 | flights to hong kong | Deal Alert: Nonstop Flights to Hong Kong From ... | 511.00 | $ | thepointsguy.com |
6 | flights to hong kong | Cheap Flights to Hong Kong - Search Deals on A... | NaN | NaN | www.cheapflights.com |
7 | flights to hong kong | Flights to Hong Kong SAR (HKG) on Orbitz.com | NaN | NaN | www.orbitz.com |
8 | flights to hong kong | Fly Hong Kong Airlines | Flights To & From Hon... | NaN | NaN | www.hongkongairlines.com |
9 | flights to hong kong | Passenger Arrivals, Flights - Hong Kong Intern... | NaN | NaN | www.hongkongairport.com |
10 | flights to hong kong | Cheap Flights to Hong Kong International from ... | 354.00 | £ | www.skyscanner.net |
11 | flights to hong kong | Cheap Flights to Hong Kong from £359 - Cheapfl... | 359.00 | £ | www.cheapflights.co.uk |
12 | flights to hong kong | Flights to Hong Kong SAR (HKG Airport) | Expedia | NaN | NaN | www.expedia.co.uk |
13 | flights to hong kong | Cheap Flights to Hong Kong (HKG) from £361 - K... | 361.00 | £ | www.kayak.co.uk |
14 | flights to hong kong | Cheap flights to Hong Kong from £115 - Book tr... | 115.00 | £ | www.opodo.co.uk |
Now we have two new columns, "price" and "currency".
Where a title contains no price, the value is "NaN" (not a number), and the currencies present are dollar and pound signs.
Some sites show prices in other currencies, but they are very few, and comparing across currencies doesn't make sense given the big differences in their values, so we will only deal with dollars and pounds.
For the top five queries, we can plot the different prices (where available), and get a quick overview of how the prices compare.
Here is a quick price comparison engine for you :)
fig, ax = plt.subplots(5, 2, facecolor='#eeeeee')
fig.set_size_inches(17, 20)
countries = [' - US ($)', ' - UK (£)']
country_codes = ['us', 'uk']
currency = ['$', '£']
top10dfs = [top10_domains_us, top10_domains_uk]
for i in range(5):
    for j in range(2):
        ax[i, j].grid()
        ax[i, j].set_frame_on(False)
        df = serp_flights[(serp_flights['gl'] == country_codes[j]) &
                          (serp_flights['searchTerms'] == queries[i]) &
                          (serp_flights['currency'] == currency[j])]
        # one scatter call plots every domain's prices for this query
        ax[i, j].scatter(df.sort_values('totalResults')['displayLink'].str.replace('www.', ''),
                         df.sort_values('totalResults')['price'],
                         color=tab10.colors[i+5*j], s=300)
        ax[i, j].set_title('Price per domain. Keyword: "' + queries[i] + '"' + countries[j],
                           fontsize=15)
        ax[i, j].tick_params(labelsize=12, rotation=9, axis='x')
plt.tight_layout()
fig.savefig('Prices per domain' + '.png',
facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()
In order to get a general overview of pricing for the top domains, we can also plot all instances where a price appears in a SERP, so we can see how prices compare overall by domain.
fig, ax = plt.subplots(1, 2, facecolor='#eeeeee')
fig.set_size_inches(17, 8)
countries = [' - US ($)', ' - UK (£)']
country_codes = ['us', 'uk']
currency = ['$', '£']
top10dfs = [top10_domains_us, top10_domains_uk]
for j in range(2):
    ax[j].grid()
    ax[j].set_frame_on(False)
    df = serp_flights[(serp_flights['gl'] == country_codes[j]) &
                      (serp_flights['currency'] == currency[j]) &
                      (serp_flights['displayLink'].isin(top10dfs[j]))]
    ax[j].scatter(df.sort_values('totalResults')['displayLink'].str.replace('www.', ''),
                  df.sort_values('totalResults')['price'],
                  color=tab10.colors[j],
                  s=300, alpha=0.1)
    ax[j].set_title('Prices per domain' + countries[j],
                    fontsize=21)
    ax[j].tick_params(labelsize=18, rotation=18, axis='x')
    ax[j].tick_params(labelsize=18, axis='y')
plt.tight_layout()
fig.savefig('Prices per country' + '.png',
facecolor='#eeeeee', dpi=150, bbox_inches='tight')
plt.show()
In the US, expedia.com clearly has lower prices on average, and a good portion of them are below 200 USD. tripadvisor.com seems to be the highest on average, but its prices also fluctuate over a wider range than the others'.
opodo.co.uk is clearly the cheapest for the UK, with pretty much all its prices below 200 GBP.
Keep in mind that the two charts have different scales and different Y axes, and that they show prices in different currencies. At the time of writing, one GBP is around 1.30 USD.
This does not necessarily mean that expedia.com has lower actual prices, as it could be "starting from" or with conditions, etc. But these are their advertised prices on SERPs.
As with titles, we can do a similar analysis for snippets, but one site looked very interesting to me when it came to its snippet text, and that is kayak.com.
Below is a sample of their snippets. Note that they mention airlines' names, prices, and source cities, even though the queries don't indicate where the user is flying from.
Note also that the snippets differ per query. For Hong Kong they say "from San Francisco" and "from New York", while for Dubai it's New York, Chicago, and Orlando.
It seems that the snippet text is dynamically generated based on the most frequent places people buy tickets from, and the airlines they use for those destinations.
This could be an interesting insight on the market, or at least on Kayak's view of the market and how they position themselves.
You might want to export the Kayak snippets, and generate a mapping between source and destination cities, as well as the airlines that they are most frequently associated with.
with pd.option_context('display.max_colwidth', 200):
    print(*serp_flights[serp_flights['displayLink'] == 'www.kayak.com']['snippet'][:10].tolist(), sep='\n')
Find flights to Hong Kong on XiamenAir, Hong Kong Airlines, China Eastern and more. Fly round-trip from San Francisco from $402, from New York from $443, ...
Find flights to Bangkok on China Southern, China Eastern, Air China and more. Fly round-trip from San Francisco from $400, from New York from $439, from San ...
Find flights to London on Icelandair, Norwegian Air UK, WOW air and more. Fly round-trip from New York from $274, from Baltimore from $312, from Chicago ...
Find flights to Singapore on XiamenAir, Air China, China Eastern and more. Fly round-trip from San Francisco from $492, from New York from $460, from ...
Find flights to Macau on XiamenAir, China Eastern, Air China and more. Fly round-trip from San Francisco from $589, from New York from $499, from Honolulu ...
Find flights to Macau on XiamenAir, China Eastern, Air China and more. Fly round-trip from San Francisco from $589, from New York from $499, from Honolulu ...
Find flights to Dubai on Saudia, Norwegian, SWISS and more. Fly round-trip from New York from $600, from Chicago from $587, from Orlando from $597, from ...
Find flights to Paris on LEVEL, SWISS, WOW air and more. Fly round-trip from New York from $269, from Detroit from $320, from Baltimore from $320, from ...
Find flights to New York on Spirit Airlines, Frontier and more. Fly round-trip from Myrtle Beach from $58, from Chicago from $48, from Orlando from $101, from ...
Find flights to Shenzhen on Hainan Airlines, Sichuan Airlines, China Eastern and more. Fly round-trip from San Francisco from $494, from New York from $439, ...
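As a starting point for that mapping, the recurring "from &lt;city&gt; from $&lt;price&gt;" pattern can be pulled out with a regular expression; here is a sketch based on the snippet format above (the pattern will need hardening for real-world edge cases):

```python
import re

# Matches fragments like "from San Francisco from $402".
PATTERN = re.compile(r'from ([A-Z][A-Za-z ]+?) from \$(\d+)')

def origin_prices(snippet):
    """Return (origin city, advertised price) pairs from one snippet."""
    return [(city, int(price)) for city, price in PATTERN.findall(snippet)]

snippet = ('Find flights to Hong Kong on XiamenAir, Hong Kong Airlines, '
           'China Eastern and more. Fly round-trip from San Francisco '
           'from $402, from New York from $443, ...')
print(origin_prices(snippet))  # [('San Francisco', 402), ('New York', 443)]
```

Applied across all the Kayak snippets, this would give you the source-to-destination mapping mentioned above.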
So, that was a quick overview of how Google's CSE can be used to automate a large number of reports, together with a few ideas on what can be analyzed.
Other things you might consider as well:
Good luck!