To access our browsing history, we go to ~/Library/Safari and look for the database History.db. We make a copy of it in a folder in our workspace, e.g. to ~/Workspace/web_browsing/hs.db, and work on the copy so that we don't interfere with the live database Safari is using.
%%bash
cp ~/Library/Safari/History.db ~/Workspace/web_browsing/hs.db
Now let us fire up sqlite3 and see what tables are inside the database.
%%script sqlite3 hs.db
.tables
history_items history_tombstones history_visits metadata
Among the four tables above, we'll be using 'history_items' and 'history_visits'. Let's take a look at their schemata.
%%script sqlite3 hs.db
.schema history_items
CREATE TABLE history_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL UNIQUE,
    domain_expansion TEXT NULL,
    visit_count INTEGER NOT NULL,
    daily_visit_counts BLOB NOT NULL,
    weekly_visit_counts BLOB NULL,
    autocomplete_triggers BLOB NULL,
    should_recompute_derived_visit_counts INTEGER NOT NULL
);
CREATE INDEX history_items__domain_expansion ON history_items (domain_expansion);
%%script sqlite3 hs.db
.schema history_visits
CREATE TABLE history_visits (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    history_item INTEGER NOT NULL REFERENCES history_items(id) ON DELETE CASCADE,
    visit_time REAL NOT NULL,
    title TEXT NULL,
    load_successful BOOLEAN NOT NULL DEFAULT 1,
    http_non_get BOOLEAN NOT NULL DEFAULT 0,
    synthesized BOOLEAN NOT NULL DEFAULT 0,
    redirect_source INTEGER NULL UNIQUE REFERENCES history_visits(id) ON DELETE CASCADE,
    redirect_destination INTEGER NULL UNIQUE REFERENCES history_visits(id) ON DELETE CASCADE,
    origin INTEGER NOT NULL DEFAULT 0,
    generation INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX history_visits__last_visit ON history_visits (history_item, visit_time DESC, synthesized ASC);
CREATE INDEX history_visits__origin ON history_visits (origin, generation);
The first thing to note is that each row of the 'history_visits' table is associated with one history item, as we can tell from the 'history_item' column of 'history_visits'.
history_item INTEGER NOT NULL REFERENCES history_items(id) ON DELETE CASCADE
Moreover, as we can tell from the columns of the two tables, 'history_visits' contains information about individual visits to websites, like the time of the visit and the title of the page, while 'history_items' contains general and aggregate information about the websites we visit, such as their url and total number of visits.
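To see the relationship in action, the two tables can also be queried from Python with the standard sqlite3 module. A minimal sketch, assuming the hs.db copy from above is in the working directory (the function name recent_visits is mine):

```python
import sqlite3

def recent_visits(db_path, limit=5):
    """Return (url, visit_time) pairs for the most recent visits,
    joining history_visits to history_items on history_item = id."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT I.url, V.visit_time
           FROM history_items AS I
           JOIN history_visits AS V ON V.history_item = I.id
           ORDER BY V.visit_time DESC
           LIMIT ?""", (limit,)).fetchall()
    conn.close()
    return rows

# e.g.: recent_visits('hs.db')
```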
Let's take a look now at the addresses that we visit most often (and despair later)...
%%script sqlite3 hs.db
SELECT id, url, visit_count FROM history_items ORDER BY visit_count DESC LIMIT 5;
491|https://twitter.com/|4048
9252|https://www.facebook.com/|2380
17706|http://international.nytimes.com/|1225
18|http://www.theverge.com/|1211
23141|https://mail.google.com/mail/u/0/#inbox|1134
...As well as our most recent visits.
%%script sqlite3 hs.db
SELECT V.title, I.url, V.visit_time
FROM history_items as I,
(SELECT history_item, title, visit_time FROM history_visits ORDER BY visit_time DESC LIMIT 5) as V
WHERE I.id = V.history_item;
Sanja Scepanovic and 2 others - Messages|https://www.facebook.com/messages/conversation-1680195825528863|462218817.510182
Facebook|https://www.facebook.com/login.php?login_attempt=1&lwv=111|462218817.412889
Espoo Cine 2015|http://boxoffice.espoocine.fi/ewt/?lang=fin&tctpage=cart|462218724.79521
Espoo Cine 2015: bikes vs cars|http://www.espoocine.fi/2015/fi/ohjelmisto/elokuvat/bikes-vs-cars|462218707.2001
Espoo Cine 2015: bikes vs cars|http://www.espoocine.fi/2015/en/programme/movies/bikes-vs-cars|462218641.31604
Note that, from what I can tell at least, time is expressed as the number of seconds since midnight of 2001-01-01 (Apple's "absolute time" reference date).
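We can sanity-check this by converting one of the visit_time values above into a calendar date with a small helper (the name mac_time_to_datetime is mine):

```python
import datetime

# Reference date assumed here: midnight of 2001-01-01,
# i.e. Apple's "absolute time" epoch.
MAC_EPOCH = datetime.datetime(2001, 1, 1)

def mac_time_to_datetime(seconds):
    """Convert a Safari visit_time value to a datetime."""
    return MAC_EPOCH + datetime.timedelta(seconds=seconds)

# The most recent visit above lands in August 2015,
# which matches the Espoo Cine 2015 visits around it.
print(mac_time_to_datetime(462218817.510182))
```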
Finally, let's dump to a text file ('visits.txt') only the information that interests us: for now, that's just the web address visited (url) and the associated timestamp.
%%bash --out visits
sqlite3 hs.db \
"SELECT I.url, V.visit_time \
FROM history_items as I, \
(SELECT history_item, title, visit_time FROM history_visits) as V \
WHERE I.id = V.history_item;"
f = open('visits.txt', 'w')
f.write(visits)
f.close()
That's it: our browsing history is now in the file 'visits.txt'.
%%bash
head -5 visits.txt
echo "... ... ..."
tail -5 visits.txt
http://www.apple.com/iphone/|431973567.3
http://live.arstechnica.com/apples-september-9-2014-event/|431975282.8
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CDAQFjAA&url=http%3A%2F%2Fwww.stevens.edu%2F&ei=NofoU56VCMeP0AW6rYCIBg&usg=AFQjCNFN7sE7yPBglg-N1stqD034Ifa_pQ&bvm=bv.72676100,d.d2k|429440704.7
http://www.theverge.com/2014/9/9/6122731/iphone-6-apple-4-7-inch-display-announced|431982335.9
http://www.apple.com/live/2014-sept-event/|431919219.233333
... ... ...
http://www.espoocine.fi/2015/en/programme/movies/bikes-vs-cars|462218641.31604
http://www.espoocine.fi/2015/fi/ohjelmisto/elokuvat/bikes-vs-cars|462218707.2001
http://boxoffice.espoocine.fi/ewt/?lang=fin&tctpage=cart|462218724.79521
https://www.facebook.com/login.php?login_attempt=1&lwv=111|462218817.412889
https://www.facebook.com/messages/conversation-1680195825528863|462218817.510182
%matplotlib inline
import sys
import matplotlib.pyplot as plt
import time
import datetime
Let us parse the text file we created and extract the timestamps of our visits.
f = open('visits.txt')
tt = [] # stores timestamps of our visits
for line in f:
try:
tokens = line.rstrip().split('|')
time_sec = float(tokens[-1])
tt.append(time_sec)
except:
sys.stderr.write(line)
sys.stderr.write('\n')
f.close()
# sort the timestamps
tt.sort()
# turn into standard POSIX timestamps by adding
# the offset of 2001-01-01 00:00:00 (note: mktime uses local time)
zero_sec = time.mktime(datetime.datetime(2001, 1, 1).timetuple())
for i in range(len(tt)): tt[i] += zero_sec
Let us plot the daily number of pages we've visited for that period of time.
bin_size = 24 * 3600 # have one bin per 24 hours
num_of_bins = int((tt[-1] - tt[0]) / bin_size)
# figure out intervals to accommodate the number of labels on the x-axis
num_of_xlabels = 10
label_bin = (tt[-1] - tt[0]) / (num_of_xlabels - 1)
locs = [tt[0] + label_bin * (i) for i in range(num_of_xlabels)]
labels = map(lambda x: str(datetime.datetime.fromtimestamp(x)).split()[0], locs)
# plot the histogram
plt.figure(figsize=(15, 5))
plt.hist(tt, num_of_bins, range = (tt[0], tt[-1]))
plt.xticks(locs, labels)
plt.ylabel("Daily Visits")
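For reference, the per-day counts that plt.hist draws can also be computed by hand; a minimal sketch (the function name is mine):

```python
def daily_counts(timestamps):
    """Count visits in consecutive 24-hour bins,
    measured from the earliest timestamp."""
    if not timestamps:
        return []
    start = min(timestamps)
    bin_size = 24 * 3600
    counts = {}
    for t in timestamps:
        day = int((t - start) // bin_size)
        counts[day] = counts.get(day, 0) + 1
    return [counts.get(d, 0) for d in range(max(counts) + 1)]

# Three visits: two in the first day, one in the second
print(daily_counts([0.0, 10.0, 90000.0]))  # -> [2, 1]
```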
Which websites have we visited most often? We already have an idea from the query above, but now let us do some more elaborate processing.
import re
import numpy as np
website_pattern = re.compile(r'http(s)?://(\w+\.)?(\w+?)\.\w+?/')
verbose = False
f = open('visits.txt')
website_counts = {} # stores counts for each website
addresses = {} # stores the different addresses (base-urls) for each website
for line in f:
try:
# keep only the url, ignore the timestamp
tokens = line.rstrip().split('|')
m = website_pattern.search("".join(tokens[:-1]))
try:
# parse the url to extract the website name
# for example, if the url is http://www.facebook.com/ab238ub
# the website name is 'facebook'
website = m.group(3)
# update count
website_counts[website] = website_counts.get(website, 0) + 1
# keep track of the different base urls that match this website
# for example, if the url is http://www.facebook.com/ab238ub
# the base url is http://www.facebook.com/
if website not in addresses:
addresses[website] = {}
addresses[website][m.group(0)] = addresses[website].get(m.group(0), 0) + 1
except Exception, e:
if verbose:
sys.stderr.write(str(e))
sys.stderr.write('\n')
except:
sys.stderr.write(line)
sys.stderr.write('\n')
f.close()
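Before trusting the pattern, it's worth checking it on a couple of example URLs (the URLs below are made up for illustration, not taken from the history file):

```python
import re

website_pattern = re.compile(r'http(s)?://(\w+\.)?(\w+?)\.\w+?/')

for url in ['http://www.facebook.com/ab238ub',
            'https://mail.google.com/mail/u/0/']:
    m = website_pattern.search(url)
    # group(3) is the website name, group(0) the matched base url
    print('%s -> %s' % (m.group(3), m.group(0)))
```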
Let's plot the number of visits to the websites we visit most frequently.
k = 5 # The number of top websites to present
k = min(k, len(website_counts))
# Sort the websites by number of visits and keep the top-k
sorted_websites = sorted(website_counts.items(),
                         key = lambda x: x[1], reverse = True)
top_websites = sorted_websites[:k]
# The number of visits to websites that are not among the top-k
other_visits = sum(x[1] for x in sorted_websites[k:])
# Let's make a histogram for the top-k websites
plt.figure(figsize = (10, 5)) # create new figure
bar_width = 0.5
xpos = np.arange(len(top_websites)) + bar_width # bar positions on the x-axis
plt.bar(xpos, [w[1] for w in top_websites], bar_width) # places the bars
plt.xticks(xpos+bar_width/2, [w[0] for w in top_websites]) # ticks on x-axis
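As an aside, the same top-k selection can be written more compactly with collections.Counter. The counts below are made-up stand-ins for the website_counts dictionary built above:

```python
from collections import Counter

# Hypothetical counts, standing in for website_counts
website_counts = {'twitter': 4048, 'facebook': 2380, 'nytimes': 1225,
                  'theverge': 1211, 'google': 1134, 'github': 900}
k = 5
# most_common(k) returns the k (website, count) pairs with highest counts
top_websites = Counter(website_counts).most_common(k)
other_visits = sum(website_counts.values()) - sum(c for _, c in top_websites)
print(top_websites[0])   # -> ('twitter', 4048)
print(other_visits)      # -> 900
```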
## For each of the top-k websites, show the most popular
## addresses (base urls)
m = 3 # Show at most m addresses...
pct = 0.90 # ... or stop at 90% of visits
# For each of the top-k websites...
for website_name, website_visits in top_websites:
# ... store the most frequently visited base urls ...
website_addresses = sorted(addresses[website_name].items(),
key = lambda x: x[1], reverse = True)
top_addresses = []; visits = 0.
for address, address_count in website_addresses:
visits += address_count
top_addresses.append((address, address_count))
        if len(top_addresses) >= m or visits / website_visits >= pct:
break
# ... and the number of visits to addresses that are not among the top
other_num = website_visits - visits
# make a pie-chart for this website
# if it is associated with many addresses
if len(top_addresses) >= m: # TODO change this condition if you want
plt.figure(figsize = (7,7))
labels = [x[0] for x in top_addresses] + ["other"]
sizes = [x[1] for x in top_addresses] + [other_num]
# arbitrary choice of colors -- I like blue
n = float(len(sizes))
colors = [( p / (n + 1.), 0., (1. - p / (n + 1.)))\
for p in xrange(len(sizes))]
colors[-1] = 'grey'
plt.pie(sizes, labels=labels, colors=colors,
labeldistance = 1.1, autopct='%1.1f%%', startangle = 90)
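The blue-to-red gradient used for the slice colors can be factored into a small helper (a sketch; the function name is mine). The last entry is meant to be overridden by the caller, e.g. with 'grey' for the "other" slice:

```python
def blue_red_gradient(n):
    """Return n RGB tuples fading from pure blue toward red;
    mirrors the color computation in the pie-chart cell."""
    denom = float(n + 1)
    return [(p / denom, 0.0, 1.0 - p / denom) for p in range(n)]

print(blue_red_gradient(3))  # -> [(0.0, 0.0, 1.0), (0.25, 0.0, 0.75), (0.5, 0.0, 0.5)]
```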
For more information on the tools we used in this session, check the following websites.