Start by using wget -r
to create a local mirror of the pages we care about:
cd /Users/simonw/Dropbox/Development/24ways-search
wget --recursive --wait 2 --no-clobber \
-I /2005,/2006,/2007,/2008,/2009,/2010,/2011,/2012,/2013,/2014,/2015,/2016,/2017 \
-X "/*/*/comments" \
https://24ways.org/archives/
And to get the rest:
wget --recursive --wait 2 --no-clobber \
-I /2018 \
-X "/*/*/comments" \
https://24ways.org/
from pathlib import Path
The pathlib module lets us easily work with recursive directory structures. We can use glob
to find deeply nested files mating a specified pattern.
base = Path("/Users/simonw/Dropbox/Development/24ways-search")
articles = list(base.glob("*/*/*/*.html"))
len(articles)
329
articles[:5]
[PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/why-bother-with-accessibility/index.html'), PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/levelling-up/index.html'), PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/project-hubs/index.html'), PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/credits-and-recognition/index.html'), PosixPath('/Users/simonw/Dropbox/Development/24ways-search/24ways.org/2013/managing-a-mind/index.html')]
We need a way to derive the URL on 24 ways based on the filepath:
str(articles[37].relative_to(base).parent)
'24ways.org/2014/websites-of-christmas-past-present-and-future'
Let's grab a random article and experiment with using the BeautifulSoup Python HTML scraping library to extract the data we need from it.
We needed to pip install beautifulsoup4 html5lib
first to install the library.
path = articles[37]
html = path.open().read()
len(html)
40295
from bs4 import BeautifulSoup as Soup
soup = Soup(html, "html5lib")
soup.find("title").text
'Websites of Christmas Past, Present and Future ◆ 24 ways'
BeautifulSoup lets you extract data from HTML using CSS selectors
div = soup.select_one(".e-content")
Using div.text
extracts just the text contents of a div, stripping any HTML tags
print(div.text.strip()[:200])
The websites of Christmas past The first website was created at CERN. It was launched on 20 December 1990 (just in time for Christmas!), and it still works today, after twenty-four years. Isn’t that
soup.select(".c-meta time")
[<time class="dt-published" datetime="2014-12-08T00:00:00+00:00">8 Dec<span>ember</span> 2014</time>]
We can use ["attribute"]
syntax to access HTML attributes.
soup.select_one(".c-meta time")["datetime"]
'2014-12-08T00:00:00+00:00'
The easiest way to extract the topic of the article (code, process, design etc) is from a link:
soup.select_one('.c-meta a[href^="/topics/"]')["href"].split("/topics/")[1].split("/")[0]
'code'
The author's name can be found in the sentence "More information about Josh Emerson"
soup.select_one(".c-continue")["title"].split("More information about")[1].strip()
'Josh Emerson'
soup.select_one(".c-continue")["href"].split("/authors/")[1].split("/")[0]
'joshemerson'
year = str(path.relative_to(base)).split("/")[1]; year
'2014'
Now that we've figured out how to extract data from a single article, we can extract the data from all of the articles inside a loop:
docs = []
for path in articles:
year = str(path.relative_to(base)).split("/")[1]; year
url = 'https://' + str(path.relative_to(base).parent) + '/'
soup = Soup(path.open().read(), "html5lib")
author = soup.select_one(".c-continue")["title"].split("More information about")[1].strip()
author_slug = soup.select_one(".c-continue")["href"].split("/authors/")[1].split("/")[0]
published = soup.select_one(".c-meta time")["datetime"]
contents = soup.select_one(".e-content").text.strip()
title = soup.find("title").text.split(" ◆")[0]
try:
topic = soup.select_one('.c-meta a[href^="/topics/"]')["href"].split("/topics/")[1].split("/")[0]
except TypeError:
# Some articles don't have topics
topic = None
docs.append({
"title": title,
"contents": contents,
"year": year,
"author": author,
"author_slug": author_slug,
"published": published,
"url": url,
"topic": topic,
})
len(docs)
329
docs[0]
{'title': 'Why Bother with Accessibility?', 'contents': 'Web accessibility (known in other fields as inclusive design or universal design) is the degree to which a website is available to as many people as possible. Accessibility is most often used to describe how people with disabilities can access the web.\n\nHow we approach accessibility\n\nIn the web community, there’s a surprisingly inconsistent approach to accessibility. There are some who are endlessly dedicated to accessible web design, and there are some who believe it so intrinsic to the web that it shouldn’t be considered a separate topic. Still, of those who are familiar with accessibility, there’s an overwhelming number of designers, developers, clients and bosses who just aren’t that bothered.\n\nOver the last few months I’ve spoken to a lot of people about accessibility, and I’ve heard the same reasons to ignore it over and over again. Let’s take a look at the most common excuses.\n\nExcuse 1: “People with disabilities don’t really use the web”\n\nAccessibility will make your site available to more people — the inclusion case\n\nIn the same way that the accessibility of a building isn’t just about access for wheelchair users, web accessibility isn’t just about blind users and screen readers. We can affect positively the lives of many people by making their access to the web easier.\n\nThere are four main types of disability that affect use of the web:\n\n\n\tVisual\n\tBlindness, low vision and colour-blindness\n\tAuditory\n\tProfoundly deaf and hard of hearing\n\tMotor\n\tThe inability to use a mouse, slow response time, limited fine motor control\n\tCognitive\n\tLearning difficulties, distractibility, the inability to focus on large amounts of information\n\n\nNone of these disabilities are completely black and white\n\nExamining deafness, it’s clear from the medical scale that there are many grey areas between full hearing and total deafness:\n\n\n\tmild\n\tmoderate\n\tmoderately severe\n\tsevere\n\tprofound\n\ttotally deaf\n\n\nFor eyesight, and brain conditions that affect what users see, there is a huge range of conditions and challenges:\n\n\n\tastigmatism\n\tcolour blindness\n\takinetopsia (motion blindness)\n\tscotopic visual sensitivity (visual stress related to light)\n\tvisual agnosia (impaired recognition or identification of objects)\n\n\nWhile we might have medical and government-recognised definitions that tell us what makes a disability, day-to-day life is not so straightforward. People experience varying degrees of different conditions, and often one or more conditions at a time, creating a false divide when you view disability in terms of us and them.\n\nImpairments aren’t always permanent\n\nAs we age, we’re more likely to experience different levels of visual, auditory, motor and cognitive impairments. We might have an accident or illness that affects us temporarily. We might struggle more earlier or later in the day. There are so many little physiological factors that affect the way people interact with the web that we can’t afford to make any assumptions based on our own limited experiences.\n\nImpairments might be somewhere between the user and the website\n\nThere are also impairments that aren’t directly related to the user. Environmental factors have a huge effect on the way people interact with the web. These could be:\n\n\n\tLow bandwidth, or intermittent internet connection\n\tBright light, rain, or other weather-based conditions\n\tNoisy environments, or a location where the user doesn’t want to disturb their neighbours with sound\n\tBrowsing with mobile devices, games consoles and other non-desktop devices\n\tBrowsing with legacy browsers or operating systems\n\n\nSuch environmental factors show that it’s not just those with physical impairments who benefit from more accessible websites. We started designing responsive websites so we could be more future-friendly, and with a shared goal of better optimised experiences, accessibility should be at the core of responsive web design.\n\nExcuse 2: “We don’t want to affect the experience for the majority of our users”\n\nAccessibility will improve your site for all your users — the usability case\n\nOn a basic level, the different disability groups, as shown in the inclusion case, equate to simple usability goals:\n\n\n\tVisual – make it easy to read\n\tAuditory – make it easy to hear\n\tMotor – make it easy to interact\n\tCognitive – make it easy to understand and focus\n\n\nTaking care to ensure good usability in these areas will also have an impact on accessibility. Unless your site is catering specifically to a particular disability, where extreme optimisation is most beneficial, taking care to design with accessibility in mind will rarely negatively affect the experience of your wider audience.\n\nExcuse 3: “We don’t have the budget for accessibility”\n\nAccessibility will make you money — the business case\n\nBy reducing your audience through ignoring accessibility, you’re potentially excluding the income from those users. Designing with accessibility in mind from the beginning of a project makes it easier to make small inexpensive optimisations as part of the design and development process, rather than bolting on costly updates to increase your potential audience later on.\n\nThe following are excerpts from a white paper about companies that increased the accessibility of their websites to comply with government regulation.\n\n\n\tImprovements in accessibility doubled Legal and General’s life insurance sales online.\n\n\n\n\tImprovements in accessibility increased Tesco’s grocery home delivery sales by £13 million in 2005… To their surprise they found that many normal visitors preferred the ease of navigation and improved simplicity of the [parallel] accessible site and switched to use it. Tesco have replaced their ‘normal’ site with their accessible version and expect a further increase in revenues.\n\n\n\n\tImprovements in accessibility increased Virgin.net sales by 68%.\n\n\nStatistics all from WSI white paper: Improve your website’s usability and accessibility to increase sales (PDF).\n\nExcuse 4: “Accessible websites are ugly”\n\nAccessibility won’t stop your site from being beautiful — the beauty case\n\nMany people use ugly accessible websites as proof that all accessible websites are ugly. This just isn’t the case. I’ve compiled some examples of beautiful and accessible websites with screenshots of how they look through the Color Oracle simulator and how they perform when run through Webaim’s Wave accessibility checker tool.\n\nWhile automated tools are no substitute for real users, they can help you learn more about good practices, and give you guidance on where your site needs improvements to make it more accessible.\n\nAmazon.co.uk\n\nIt may not be a decorated beauty, but Amazon is often first in functional design. It’s a huge website with a lot of interactive content, but it generates just five errors on the Wave test, and is easy to read under a Color Oracle filter.\n\n Screenshot of Amazon website\n Screenshot of Amazon’s Wave results – five errors\n Screenshot of Amazon through a Color Oracle filter\n\n24 ways\n\nWhen Tim Van Damme redesigned 24 ways back in 2007, it was a striking and unusual design that showed what could be achieved with CSS and some imagination. Despite the complexity of the design, it gets an outstanding zero errors on the Wave test, and is still readable under a Color Oracle filter.\n\n Screenshot of pre-2013 24 ways website design\n Screenshot of 24 ways Wave results – zero errors\n Screenshot of 24ways through a Color Oracle filter\n\nOpera’s Shiny Demos\n\nDemos and prototypes are notorious for ignoring accessibility, but Opera’s Shiny Demos site shows how exploring new technologies doesn’t have to exclude anyone. It only gets one error on the Wave test, and looks fine under a Color Oracle filter.\n\n Screenshot of Opera’s Shiny Demos website\n Screenshot of Opera’s Shiny Demos Wave results – 1 error\n Screenshot of Opera’s Shiny Demos through a Color Oracle filter\n\nSoundCloud\n\nWhen a site is more app-like, relying on more interaction from the user, accessibility can be more challenging. However, SoundCloud only gets one error on the Wave test, and the colour contrast holds up well under a Color Oracle filter.\n\n Screenshot of SoundCloud website\n Screenshot of SoundCloud’s Wave results – one error\n Screenshot of SoundCloud through a Color Oracle filter\n\nEducation and balance\n\nAs with most web design, doing accessibility well is about combining your knowledge of accessibility with your project’s context to create a balance that serves your users’ needs. Your types of content and interactions will dictate one set of constraints. Your users’ needs and goals will dictate another. In broad terms, web design as a practice is finding the equilibrium between these constraints.\n\nAnd then there’s just caring. The web as a platform is open, affordable and available to many. Accessibility is our way to ensure that nobody gets shut out.', 'year': '2013', 'author': 'Laura Kalbag', 'author_slug': 'laurakalbag', 'published': '2013-12-10T00:00:00+00:00', 'url': 'https://24ways.org/2013/why-bother-with-accessibility/', 'topic': 'design'}
Finally, we can use the sqlite-utils library to create a brand new SQLite database and then automatically create an articles
database table with the correct columns to store all of our collected data.
import sqlite_utils
db = sqlite_utils.Database("/tmp/24ways.db")
db["articles"].insert_all(docs)
<Table articles>
We're going to want to be able to run full-text-search queries against the contents of the title
, author
and contents
columns:
db["articles"].enable_fts(["title", "author", "contents"])
<Table articles>
And here's our final SQLite database:
!ls -lah /tmp/24ways.db
-rw-r--r-- 1 simonw wheel 5.2M Dec 16 22:13 /tmp/24ways.db
You can play with it in Datasette at https://search-24ways.herokuapp.com/