Sebastian Bank (sebastian.bank@uni-leipzig.de) http://www.uni-leipzig.de/~sbank/
The latest version of this IPython Notebook is available at http://gist.github.com/xflr6/9050337.
Glottolog provides its comprehensive catalog of the world's languages, language families and dialects for download in linked data format.
In this notebook, I will process this data set using the following tools:
If you are new to scientific Python, the Anaconda Python Distribution is probably the fastest way to get Python installed with all the commonly used scientific packages. It supports all platforms (Linux, Mac, and Windows).
If you are on Windows, there are Unofficial Windows Binaries for a lot of Python extension packages used in scientific computing.
import urllib
URL = 'http://glottolog.org/static/download/2.7/glottolog-language.n3.gz'
filename, headers = urllib.urlretrieve(URL, URL.rpartition('/')[2])
The file contains RDF in Notation3 compressed with gzip.
print filename
glottolog-language.n3.gz
Display the size in megabytes.
size = int(headers['Content-Length'])
print size / 1024.0 ** 2
2.63014793396
Read the first few bytes from the file with gzip
(docs) so we can get an impression of the format.
import gzip
with gzip.open(filename) as fd:
sample = fd.read(4000)
Split the sample into the namespaces definitions and the actual RDF triples. They are separated by a blank line.
head, _, body = sample.partition('\n\n')
len(head), len(body)
(932, 3066)
Inspect the start of the namespaces.
print head[:600] + '...'
@prefix bibo: <http://purl.org/ontology/bibo/> . @prefix dc: <http://purl.org/dc/elements/1.1/> . @prefix dcterms: <http://purl.org/dc/terms/> . @prefix dctype: <http://purl.org/dc/dcmitype/> . @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix frbr: <http://purl.org/vocab/frbr/core#> . @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> . @prefix gold: <http://purl.org/linguistics/gold/> . @prefix isbd: <http://iflastandards.info/ns/isbd/elements/> . @prefix lexvo: <http://lexvo.org/ontology#> . @prefix owl: <http://www.w3.org/2002/07/owl#> . @prefix rdf: <http://www.w3.org/1999/02/22-r...
Display the first RDF triples.
print body[:600] + '...'
<http://glottolog.org/resource/languoid/id/muni1258> a dcterms:LinguisticSystem, gold:Language ; rdfs:label "Muniche"@en ; lexvo:iso639P3PCode "myr"^^xsd:string ; dcterms:description <http://glottolog.org/resource/reference/id/10167>, <http://glottolog.org/resource/reference/id/132589>, <http://glottolog.org/resource/reference/id/135495>, <http://glottolog.org/resource/reference/id/300702>, <http://glottolog.org/resource/reference/id/303200>, <http://glottolog.org/resource/reference/id/34227>, <http://glottolog.org/resource/re...
The entry starts with the full URI of the languoid, followed by its types, label, ISO 639-3 code and description.
Let's try to extract some meaningful information from this string just using Pythons regular expressions.
Load the whole file uncompressed into memory.
with gzip.open(filename) as fd:
data = fd.read()
Display the size in megabytes.
print len(data) / 1024.0 ** 2
32.698595047
Extract the glottocode from the start of all dcterms:LinguisticSystem
entries with the re
module (docs) and count them.
import re
GLOTTOCODE = '<http://glottolog.org/resource/languoid/id/(\w+)> a dcterms:LinguisticSystem'
gcodes = re.findall(GLOTTOCODE, data)
len(gcodes)
24393
Display the glottocodes of the first five entries.
gcodes[:5]
['muni1258', 'west1503', 'port1278', 'west2205', 'nilo1247']
Looks unordered, sort them alphabetically and display the first and last five entries.
gcodes.sort()
print gcodes[:5]
print gcodes[-5:]
['aala1237', 'aant1238', 'aari1238', 'aari1239', 'aari1240'] ['zuti1239', 'zuwa1238', 'zwal1238', 'zyph1238', 'zyud1238']
Extract everything that looks like an ISO code. Count the results.
ISO_CODE = 'iso639P3PCode "(\w+)"'
icodes = re.findall(ISO_CODE, data)
len(icodes)
7822
Display the first ten ISO codes.
icodes[:10]
['myr', 'pko', 'oki', 'mwy', 'kqh', 'mwx', 'aam', 'spy', 'tec', 'kpz']
Sort them as well and display the start and end.
icodes.sort()
print icodes[:10]
print icodes[-10:]
['aaa', 'aab', 'aac', 'aad', 'aae', 'aaf', 'aag', 'aah', 'aai', 'aak'] ['zun', 'zuy', 'zwa', 'zyb', 'zyg', 'zyj', 'zyn', 'zyp', 'zza', 'zzj']
Glottocodes consist of four letters and some apparently recurring digit combinations.
Display the five most common of those digits and their frequency with collections.Counter
(docs).
import collections
collections.Counter(g[4:] for g in gcodes).most_common(5)
[('1238', 3022), ('1239', 1192), ('1242', 1039), ('1241', 997), ('1237', 903)]
Show the most common inital parts.
collections.Counter(g[:4] for g in gcodes).most_common(5)
[('nort', 563), ('sout', 560), ('nucl', 508), ('west', 461), ('east', 425)]
Use rdflib
(docs) to load the whole graph into memory.
This will take a while and fill a couple hundred megabytes of RAM.
import rdflib
graph = rdflib.Graph()
with gzip.open(filename) as fd:
graph.parse(fd, format='n3')
graph
<Graph identifier=N5f0224c79a154d14bd437619ecf4e397 (<class 'rdflib.graph.Graph'>)>
Count the number of triples.
len(graph)
670194
Display some of the triples (subject, predicate, object).
import itertools
for s, p, o in itertools.islice(graph, 15):
print s[42:], graph.qname(p), o
pwon1235 rdf:type http://purl.org/dc/terms/LinguisticSystem guin1260 dcterms:spatial http://www.geonames.org/countries/GW/ mogu1251 dcterms:description http://glottolog.org/resource/reference/id/156942 tibe1272 dcterms:description http://glottolog.org/resource/reference/id/26288 barr1251 dcterms:isReferencedBy http://glottolog.org/valuesets/vitality-barr1251 nang1261 skos:altLabel nang1261 choc1278 dcterms:spatial North America song1308 rdf:type http://purl.org/linguistics/gold/Dialect pato1242 void:inDataset http://glottolog.org/ nort2855 rdf:type http://purl.org/dc/terms/LinguisticSystem chil1280 skos:broader http://glottolog.org/resource/languoid/id/nort2940 sate1242 dcterms:title Saterfriesisch stan1290 dcterms:description http://glottolog.org/resource/reference/id/37004 bord1246 skos:broader http://glottolog.org/resource/languoid/id/komb1273 marg1251 dcterms:description http://glottolog.org/resource/reference/id/54615
Show all available predicates.
for p in sorted(set(graph.predicates())):
print graph.qname(p)
lexvo:iso639P3PCode dcterms:description dcterms:isReferencedBy dcterms:isReplacedBy dcterms:spatial dcterms:title void:inDataset rdf:type rdfs:label owl:sameAs geo:lat geo:long skos:altLabel skos:broader skos:broaderTransitive skos:changeNote skos:editorialNote skos:narrower skos:prefLabel skos:scopeNote
Create shortcuts for querying glottocodes and ISO codes. Translate glottocodes into ISO codes.
glottocode = rdflib.Namespace('http://glottolog.org/resource/languoid/id/')
lexvo = rdflib.Namespace('http://lexvo.org/ontology#')
iso639 = lexvo.iso639P3PCode
for g in gcodes[:5]:
i = graph.value(glottocode[g], iso639, default='___')
print '%s -> %s,' % (g, i),
aala1237 -> ___, aant1238 -> ___, aari1238 -> ___, aari1239 -> aiw, aari1240 -> aay,
Translate ISO codes into glottocodes
string = rdflib.namespace.XSD.string
for i in icodes[:5]:
g = graph.value(None, iso639, rdflib.Literal(i, datatype=string))
print '%s -> %s,' % (i, g[42:]),
aaa -> ghot1243, aab -> alum1246, aac -> arii1243, aad -> amal1242, aae -> arbe1236,
Retrieve the preferred label of languoids.
label = rdflib.namespace.RDFS.label
for g in gcodes[:5]:
l = graph.value(glottocode[g], label)
print '%s -> %s,' % (g, l),
aala1237 -> Aalawa, aant1238 -> Aantantara, aari1238 -> Aari-Gayil, aari1239 -> Aari, aari1240 -> Aariya,
Lookup an arbitrary languoid with a given label.
print graph.value(None, label, rdflib.Literal('Aalawa', lang='en'))
http://glottolog.org/resource/languoid/id/aala1237
Show the predicates and objects of an individual languoid.
for p, o in graph[glottocode['aala1237']]:
print graph.qname(p), o
skos:prefLabel Aalawa rdfs:label Aalawa skos:broaderTransitive http://glottolog.org/resource/languoid/id/aust1307 dcterms:isReferencedBy http://glottolog.org/valuesets/fc42061 dcterms:title Aalawa dcterms:isReferencedBy http://glottolog.org/valuesets/sc42061 skos:altLabel aala1237 void:inDataset http://glottolog.org/ skos:scopeNote language skos:broader http://glottolog.org/resource/languoid/id/ramo1244 rdf:type http://purl.org/dc/terms/LinguisticSystem dcterms:spatial Papunesia rdf:type http://purl.org/linguistics/gold/Dialect
Display the nodes along a languoid's path up the tree.
broader = rdflib.namespace.SKOS.broader
aalawa = graph.resource(glottocode['aala1237'])
print ' -> '.join(b.label() for b in aalawa.transitive_objects(broader))
Aalawa -> Ramoaaina -> Kandas-Duke of York -> Label-Bilur -> St George linkage -> New Ireland-Northwest Solomonic linkage -> Meso Melanesian linkage -> Western Oceanic linkage -> Oceanic -> Eastern Malayo-Polynesian -> Central-Eastern Malayo-Polynesian -> Malayo-Polynesian -> Nuclear Austronesian -> Austronesian
Display the nodes immediately below a languoid.
narrower = rdflib.namespace.SKOS.narrower
atlaco = graph.resource(glottocode['atla1278'])
print '%s <- %s' % (atlaco.label(), ', '.join(n.label() for n in atlaco.objects(narrower)))
Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel
Count all nodes below a languoid.
len(list(atlaco.transitive_objects(narrower)))
4608
Retrieve rows of glottocode, ISO code, and label with RDFs query language SPARQL. Also display the annotated language of the label.
GIL = """
SELECT
(substr(str(?s), 43) AS ?glottocode) ?iso ?label
WHERE
{ ?s a dcterms:LinguisticSystem ; skos:prefLabel ?label
OPTIONAL { ?s lexvo:iso639P3PCode ?iso } }
ORDER BY ?s LIMIT 10"""
for g, i, l in graph.query(GIL):
print '%s | %-4s | %-10s | %s' % (g, i, l, l.language)
aala1237 | None | Aalawa | en aant1238 | None | Aantantara | en aari1238 | None | Aari-Gayil | en aari1239 | aiw | Aari | en aari1240 | aay | Aariya | en aari1244 | aiz | Aari | en aasa1238 | aas | Aasax | en aata1238 | None | Aatasaara | en abaa1238 | None | Aba | en abab1239 | None | Ababda | en
Display the result as CSV (json
and xml
format are also supported).
print graph.query(GIL).serialize(format='csv')
glottocode,iso,label aala1237,,Aalawa aant1238,,Aantantara aari1238,,Aari-Gayil aari1239,aiw,Aari aari1240,aay,Aariya aari1244,aiz,Aari aasa1238,aas,Aasax aata1238,,Aatasaara abaa1238,,Aba abab1239,,Ababda
Determine the language families with the most child languages.
FAMILIES = """
SELECT
?label (count(*) as ?n)
WHERE
{ ?s a gold:LanguageFamily ; rdfs:label ?label ; skos:narrower+/a gold:Language }
GROUP BY ?s
ORDER BY desc(?n) LIMIT 10"""
for f, n in graph.query(FAMILIES):
print '%s\t%s' % (f, n)
Atlantic-Congo 1430 Austronesian 1274 Indo-European 583 Sino-Tibetan 475 Bookkeeping 391 Afro-Asiatic 372 Nuclear Trans New Guinea 315 Pama-Nyungan 241 Otomanguean 179 Sign Language 168
Display the immediate children for some families.
CHILDREN = """
SELECT
?label (group_concat(?o; separator=", ") as ?children)
WHERE
{ ?s a gold:LanguageFamily ; rdfs:label ?label ; skos:narrower/rdfs:label ?o }
GROUP BY ?s
ORDER BY ?label OFFSET 10 LIMIT 5"""
for f, c in graph.query(CHILDREN):
print '%s <- %s' % (f, c)
Arawan <- Suruahá, Paumarí, Aruá (Amazonas State), Madi-Madiha Artificial Language <- Neo, Efate group based (Artificial Language), Kotava, Esperanto, Lingua Franca Nova, Talossan, Interlingua (International Auxiliary Language Association), Rennellese Sign Language, Ladakhi Sign Athapaskan-Eyak-Tlingit <- Athapaskan-Eyak, Tlingit Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel Austroasiatic <- Nicobaric, Monic, Khmuic, Vietic, Mangic, Pearic, Bahnaric, Khasi-Palaung, Katuic, Mundaic, Aslian, Khmeric
Do the same for a specific languoid.
for l, c in graph.query("""BASE <http://glottolog.org/resource/languoid/id/>
SELECT
?label (group_concat(?o; separator=", ") as ?children)
WHERE
{ <atla1278> rdfs:label ?label ; skos:narrower/rdfs:label ?o }"""):
print '%s <- %s' % (l, c)
Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel
Here's a SPARQL query that retrieves most of the functional properties of the languoids.
LANGUOIDS = """
SELECT
(substr(str(?s), 43) AS ?id) ?label
(substr(str(?type), 34) AS ?level)
(substr(str(?broader), 43) AS ?parent)
(if(bound(?change_note), 1, 0) AS ?obsolete)
?status ?iso639 ?latitude ?longitude
WHERE
{ ?s a dcterms:LinguisticSystem ; skos:prefLabel ?label .
?s a ?type FILTER (strstarts(str(?type), "http://purl.org/linguistics/gold/"))
OPTIONAL { ?s skos:broader ?broader }
OPTIONAL { ?s skos:changeNote ?change_note FILTER (?change_note = "obsolete") }
OPTIONAL { ?s skos:editorialNote ?status }
OPTIONAL { ?s lexvo:iso639P3PCode ?iso639 }
OPTIONAL { ?s geo:lat ?latitude; geo:long ?longitude } }"""
Display some results.
for row in itertools.islice(graph.query(LANGUOIDS), 20):
print '%s %-20s %-17s %-8s %s %-11s %-4s %-8s %s' % row
pwon1235 Pwo Northern Karen Language nort2704 0 established pww 18.016 98.2709 kwes1244 Kwese Language phee1234 0 established kws -5.60445 18.5759 abaw1238 Abawa Dialect gupa1248 0 None None None None roto1247 Rotorua-Taupo Dialect maor1246 0 None None None None nort2855 North Coast Mengen Dialect meng1267 0 None None None None maca1260 Maca Language mata1290 0 established mca -25.0119 -57.3694 nyon1241 Nyong Language pere1234 0 established muo 7.27419 11.0615 fars1254 Farsic-Caucasian Tat LanguageSubfamily sout3157 0 established None None None yeng1243 Yengi Hissar Dialect uigh1240 0 None None None None thui1238 Thui Phum Dialect ngal1291 0 None None None None west2339 Western Asturian Dialect astu1245 0 None None None None kele1254 Kele (C.60) LanguageFamily None 1 established None None None zumu1241 Zumu Dialect bata1314 0 None None None None nort2742 Northern Isan Dialect nort2741 0 None None None None tezo1238 Tezoatlán Mixtec Language mixt1427 0 established mxb 17.6155 -97.9002 sund1254 Sundi-Kamba LanguageSubfamily laad1234 0 established None None None long1404 Long Bento' Dialect moda1244 0 None None None None pouy1238 Pouye Language ramm1241 0 established bye -3.72704 141.864 gola1255 Gola Language mela1257 0 established gol 7.06193 -10.8138 supp1238 Suppire-Mamara LanguageFamily None 1 established None None None
Write the results into a CSV file. Show the beginning of the file.
CSV = 'glottolog.csv'
graph.query(LANGUOIDS).serialize(CSV, format='csv')
with open(CSV) as fd:
sample = fd.read(500)
print sample + '...'
id,label,level,parent,obsolete,status,iso639,latitude,longitude pwon1235,Pwo Northern Karen,Language,nort2704,0,established,pww,18.016,98.2709 kwes1244,Kwese,Language,phee1234,0,established,kws,-5.60445,18.5759 abaw1238,Abawa,Dialect,gupa1248,0,,,, roto1247,Rotorua-Taupo,Dialect,maor1246,0,,,, nort2855,North Coast Mengen,Dialect,meng1267,0,,,, maca1260,Maca,Language,mata1290,0,established,mca,-25.0119,-57.3694 nyon1241,Nyong,Language,pere1234,0,established,muo,7.27419,11.0615 fars1254,Farsic-Cau...
Let's put that into a relational database so we can reuse it later.
Create an SQLite database file connecting with sqlite3
(docs). Activate foreign key checks so we notice if something is inconsistent.
import sqlite3
DB = 'glottolog.sqlite3'
conn = sqlite3.connect(DB)
conn.execute('PRAGMA foreign_keys = ON')
conn.execute('PRAGMA synchronous = OFF')
conn.execute('PRAGMA journal_mode = MEMORY')
conn
<sqlite3.Connection at 0x28c60858>
Create a table for the results of the languoids query with some additional sanity checks. Insert the query rows. Count them.
conn.execute("""
CREATE TABLE languoid (
id TEXT NOT NULL PRIMARY KEY,
label TEXT NOT NULL,
level TEXT NOT NULL,
parent TEXT,
obsolete BOOLEAN NOT NULL,
status TEXT,
iso TEXT UNIQUE,
latitude REAL,
longitude REAL,
FOREIGN KEY(parent) REFERENCES languoid(id) DEFERRABLE INITIALLY DEFERRED,
CHECK (level IN ('LanguageFamily', 'LanguageSubfamily', 'Language', 'Dialect')),
CHECK (obsolete IN (0, 1)),
CHECK (status IN ('established', 'spurious', 'spurious retired', 'unattested',
'provisional', 'retired'))
)""")
conn.executemany('INSERT INTO languoid VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
graph.query(LANGUOIDS))
conn.commit()
conn.execute('SELECT count(*) FROM languoid').fetchone()
(24393,)
Languoids may have n alternative labels.
Create a table for the labels and their language. Retrieve them with SPARQL. Insert the query results into the table. Count rows.
conn.execute("""
CREATE TABLE label (
id TEXT NOT NULL,
lang TEXT NOT NULL,
label TEXT NOT NULL,
PRIMARY KEY (id, lang, label),
FOREIGN KEY(id) REFERENCES languoid(id)
)""")
LABELS = """
SELECT
(substr(str(?s), 43) AS ?id) (lang(?label) AS ?lang) ?label
WHERE
{ ?s a dcterms:LinguisticSystem ; skos:altLabel ?label }"""
conn.executemany('INSERT INTO label VALUES (?, ?, ?)',
graph.query(LABELS))
conn.commit()
conn.execute('SELECT count(*) FROM label').fetchone()
(86463,)
Languoids may have n references.
Create a table for the references. Retrieve them with SPARQL. Insert the query results into the table. Count.
conn.execute("""
CREATE TABLE reference (
id TEXT NOT NULL,
reference INTEGER NOT NULL,
PRIMARY KEY (id, reference),
FOREIGN KEY(id) REFERENCES languoid(id)
)""")
REFERENCES = """
SELECT
(substr(str(?s), 43) AS ?id) (substr(str(?o), 44) AS ?reference)
WHERE
{ ?s a dcterms:LinguisticSystem ; dcterms:description ?o
FILTER (strstarts(str(?o), "http://glottolog.org/resource/reference/id/")) }"""
conn.executemany('INSERT INTO reference VALUES (?, ?)',
graph.query(REFERENCES))
conn.commit()
conn.execute('SELECT count(*) FROM reference').fetchone()
(212614,)
Display the number of languoids. Break it down by type and check the proportion of superseded entries. Most of the family entries are obsolete.
print conn.execute('SELECT count(*) FROM languoid').fetchone()
conn.execute('SELECT level, count(*), sum(obsolete) FROM languoid GROUP BY level').fetchall()
(24393,)
[(u'Dialect', 10599, 185), (u'Language', 8418, 21), (u'LanguageFamily', 1505, 1263), (u'LanguageSubfamily', 3871, 0)]
Check the distribution of status values by type. Only language entries distinguish it.
conn.execute("""SELECT level, status, count(*) AS n
FROM languoid GROUP BY level, status ORDER BY level, n DESC""").fetchall()
[(u'Dialect', None, 10599), (u'Language', u'established', 7945), (u'Language', u'spurious', 199), (u'Language', u'spurious retired', 192), (u'Language', u'unattested', 61), (u'Language', u'retired', 19), (u'Language', u'provisional', 2), (u'LanguageFamily', u'established', 1505), (u'LanguageSubfamily', u'established', 3871)]
Display the number ISO codes. Break the proportions down by languoid type. ISO 639-3 also contains macrolanguages.
print conn.execute('SELECT count(*), count(iso) FROM languoid').fetchone()
conn.execute('SELECT level, count(*), count(iso) FROM languoid GROUP BY level').fetchall()
(24393, 7822)
[(u'Dialect', 10599, 5), (u'Language', 8418, 7789), (u'LanguageFamily', 1505, 1), (u'LanguageSubfamily', 3871, 27)]
Check how many entries specify location. Only language entries do so.
conn.execute('SELECT level, count(latitude) FROM languoid GROUP BY level').fetchall()
[(u'Dialect', 0), (u'Language', 7634), (u'LanguageFamily', 0), (u'LanguageSubfamily', 1)]
Display the first and last glottocodes and ISO codes.
GLOTTOCODES = 'SELECT id FROM languoid ORDER BY id %s LIMIT 10'
print ', '.join(g for g, in conn.execute(GLOTTOCODES % 'ASC'))
print ', '.join(g for g, in conn.execute(GLOTTOCODES % 'DESC'))
ISO_CODES = 'SELECT iso FROM languoid WHERE iso NOT NULL ORDER BY iso %s LIMIT 20'
print ', '.join(i for i, in conn.execute(ISO_CODES % 'ASC'))
print ', '.join(i for i, in conn.execute(ISO_CODES % 'DESC'))
aala1237, aant1238, aari1238, aari1239, aari1240, aari1244, aasa1238, aata1238, abaa1238, abab1239 zyud1238, zyph1238, zwal1238, zuwa1238, zuti1239, zurr1238, zuri1238, zura1238, zuoj1238, zuni1245 aaa, aab, aac, aad, aae, aaf, aag, aah, aai, aak, aal, aam, aan, aao, aap, aaq, aar, aas, aat, aau zzj, zza, zyp, zyn, zyj, zyg, zyb, zwa, zuy, zun, zum, zul, zuh, zua, zty, ztx, ztu, ztt, zts, ztq
Display the number of labels. Break them down by language and entry type.
print conn.execute('SELECT count(*) FROM label').fetchone()
print conn.execute("""SELECT lang, count(*) AS n
FROM label GROUP BY lang ORDER BY n DESC LIMIT 5""").fetchall()
conn.execute("""SELECT languoid.level, count(*) AS n
FROM label JOIN languoid ON languoid.id=label.id
GROUP BY languoid.level""").fetchall()
(86463,) [(u'en', 45862), (u'x-clld', 24393), (u'fr', 899), (u'br', 671), (u'ru', 589)]
[(u'Dialect', 14827), (u'Language', 65685), (u'LanguageFamily', 1723), (u'LanguageSubfamily', 4228)]
Show the minimal, mean, and maximal number of labels per entry. Check the languoids with the most labels.
print conn.execute("""SELECT min(n), avg(n), max(n) FROM
(SELECT count(*) AS n FROM label GROUP BY id)""").fetchone()
conn.execute("""SELECT languoid.label, count(*) AS n
FROM label JOIN languoid ON languoid.id=label.id
GROUP BY label.id ORDER BY n DESC LIMIT 3""").fetchall()
(1, 3.5445824621817734, 174)
[(u'Standard French', 174), (u'Standard Spanish', 154), (u'Russian', 144)]
Show the minimal, mean, and maximal label length. Check the frequencies of the most common lengths.
print conn.execute("""SELECT min(s), avg(s), max(s) FROM
(SELECT length(label) AS s FROM label)""").fetchall()
conn.execute("""SELECT length(label) AS l, count(*) AS n
FROM label GROUP BY l HAVING n > 3200 ORDER BY l""").fetchall()
[(1, 9.443553890103281, 65)]
[(4, 3882), (5, 6174), (6, 7481), (7, 6887), (8, 29717), (9, 4054), (10, 3336)]
Display the number of references. Break them down by entry type. There are much less references for non-languages.
print conn.execute('SELECT count(*) FROM reference').fetchone()
conn.execute("""SELECT l.level, count(*) AS n
FROM reference AS r JOIN languoid AS l ON l.id=r.id GROUP BY l.level""").fetchall()
(212614,)
[(u'Dialect', 43), (u'Language', 210178), (u'LanguageFamily', 1663), (u'LanguageSubfamily', 730)]
Show the minimal, mean, and maximal number of references per entry. Check the most referenced languoids.
print conn.execute("""SELECT min(n), avg(n), max(n) FROM
(SELECT count(*) AS n FROM reference GROUP BY id)""").fetchone()
conn.execute("""SELECT l.label, count(*) AS n FROM reference AS r
JOIN languoid AS l ON l.id=r.id GROUP BY r.id ORDER BY n DESC LIMIT 3""").fetchall()
(1, 25.640858658948385, 2728)
[(u'Luxembourgish', 2728), (u'Standard French', 2160), (u'Swahili', 1840)]
The languoids table only specifies the direct parent of each entry. However, we want to be able to traverse the tree and query the whole path.
As SQLite supports hierarchical queries only with version 3.8.3+, we will use a more general approach and generate a table with all tree paths.
In other words, we will compute the transitive closure of the parent relation, a.k.a. tree closure table.
Since we won't use recursion inside the database, we will simply put together a bunch of SQL queries and feed the results back into a new table of our database.
PATH = """SELECT
i0 AS child, %(depth)d AS steps, i%(depth)d AS parent, i%(next)d IS NULL AS terminal
FROM (
SELECT %(select)s
FROM languoid AS l0
%(joins)s
) WHERE parent IS NOT NULL"""
def path_query(depth):
select = ', '.join('l%(step)d.id AS i%(step)d' % {'step': i} for i in range(depth + 2))
joins = ' '.join('LEFT JOIN languoid AS l%(next)d ON l%(step)d.parent = l%(next)d.id'
% {'step': i, 'next': i + 1} for i in range(depth + 1))
return PATH % {'depth': depth, 'next': depth + 1, 'select': select, 'joins': joins}
The path_query
function generates a query for a tree walk of the length given by depth
. Note that we will omit zero step (reflexive) walks.
print path_query(1)
SELECT i0 AS child, 1 AS steps, i1 AS parent, i2 IS NULL AS terminal FROM ( SELECT l0.id AS i0, l1.id AS i1, l2.id AS i2 FROM languoid AS l0 LEFT JOIN languoid AS l1 ON l0.parent = l1.id LEFT JOIN languoid AS l2 ON l1.parent = l2.id ) WHERE parent IS NOT NULL
Each query returns the start glottocode, number of steps, end glottocode and a boolean indicating if there is no grandparent.
print conn.execute('%s ORDER BY i0 LIMIT 3' % path_query(1)).fetchall()
print conn.execute('%s ORDER BY i0 LIMIT 3' % path_query(2)).fetchall()
[(u'aala1237', 1, u'ramo1244', 0), (u'aant1238', 1, u'nort2920', 0), (u'aari1238', 1, u'ahkk1235', 0)] [(u'aala1237', 2, u'kand1307', 0), (u'aant1238', 2, u'tair1260', 0), (u'aari1238', 2, u'sout2845', 1)]
When all paths in the query are terminal, we have arrived at the maximal depth.
conn.execute(path_query(18)).fetchall()
[(u'patw1249', 18, u'indo1319', 1), (u'yeri1239', 18, u'atla1278', 1), (u'cher1272', 18, u'atla1278', 1), (u'wile1238', 18, u'atla1278', 1), (u'biri1258', 18, u'atla1278', 1), (u'doli1238', 18, u'atla1278', 1), (u'fufu1238', 18, u'atla1278', 1), (u'bule1242', 18, u'atla1278', 1), (u'pato1243', 18, u'indo1319', 1)]
Create a table for the results. Insert path walks of increasing depth until all walks have ended. Count the walks.
conn.execute("""
CREATE TABLE tree (
child TEXT NOT NULL,
steps INTEGER NOT NULL,
parent TEXT NOT NULL,
terminal BOOLEAN NOT NULL,
PRIMARY KEY (child, steps),
UNIQUE (child, parent),
UNIQUE (parent, child),
FOREIGN KEY (child) REFERENCES languoid (id),
FOREIGN KEY (parent) REFERENCES languoid (id),
CHECK (terminal IN (0, 1))
)""")
depth = 1
while True:
rows = conn.execute(path_query(depth)).fetchall()
if not rows:
break
conn.executemany('INSERT INTO tree VALUES (?, ?, ?, ?)', rows)
depth += 1
conn.commit()
conn.execute('SELECT count(*) FROM tree').fetchone()
(145822,)
If the SQlite we use from Python is version 3.8.3 or later, we can also get the rows for the tree closure table with a single query:
WITH RECURSIVE tree(child, steps, parent, terminal) AS (
SELECT l.id, 1, l.parent, 0
FROM languoid AS l
WHERE l.parent IS NOT NULL
UNION ALL
SELECT t.child, t.steps + 1, p.parent, gp.parent IS NULL
FROM languoid AS p
JOIN tree AS t ON p.id=t.parent
LEFT JOIN languoid AS gp ON gp.id=p.parent
WHERE p.parent IS NOT NULL
)
SELECT * FROM tree```
Show the minimal, mean, and maximal number of languages per family. Display the language familes with the most child languages.
print conn.execute("""SELECT min(n), avg(n), max(n) FROM
(SELECT count(*) AS n FROM languoid AS p
JOIN tree AS w ON w.parent=p.id AND w.terminal
JOIN languoid AS c ON w.child=c.id AND c.level='Language'
WHERE p.level='LanguageFamily' GROUP BY p.id)""").fetchone()
conn.execute("""SELECT p.label, count(*) AS n FROM languoid AS p
JOIN tree AS w ON w.parent=p.id AND w.terminal
JOIN languoid AS c ON w.child=c.id AND c.level='Language'
WHERE p.level='LanguageFamily' GROUP BY p.id ORDER BY n DESC LIMIT 3""").fetchall()
(1, 33.781893004115226, 1430)
[(u'Atlantic-Congo', 1430), (u'Austronesian', 1274), (u'Indo-European', 583)]
Determine the languages with the most dialects.
conn.execute("""SELECT p.label, count(*) AS n FROM languoid AS p
JOIN tree AS w ON w.parent=p.id AND w.terminal
JOIN languoid AS c ON w.child=c.id AND c.level='Dialect'
WHERE p.level='Language' GROUP BY p.id ORDER BY n DESC LIMIT 4""").fetchall()
[(u'Gumuz', 19), (u'Basque', 11), (u'Kunama', 9), (u'Berta', 7)]
Display some of the longest paths.
for child, path in conn.execute("""SELECT c.label, (SELECT group_concat(parent, ' <- ')
FROM (SELECT g.child AS child , p.label AS parent
FROM tree AS g JOIN languoid AS p ON g.parent=p.id
WHERE child=c.id ORDER BY g.steps DESC)
GROUP BY child)
FROM languoid AS c JOIN tree AS w ON w.child=c.id AND w.terminal
ORDER BY w.steps DESC, c.id LIMIT 3"""):
print '%s <= %s\n' % (path, child)
Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- North-West Dagaric <- Birifor <- Malba Birifor <= Birifor Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- Central-South Dagaric <- South Dagaric <- Wali (Ghana) <= 'Bulengee Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- Central-South Dagaric <- South Dagaric <- Wali (Ghana) <= Cherii
Note that with SPARQL the number of steps is not available, so it might be difficult to get the path in the right order like this.
Activate inline plotting in this notebook.
%matplotlib inline
Load the language labels into a pandas
(docs) DataFrame
. Display the result.
import pandas as pd
pd.set_option('max_rows', 15)
labels = pd.read_sql_query("""SELECT label.*
FROM label JOIN languoid ON label.id=languoid.id
WHERE languoid.level='Language' ORDER BY label.id""", conn, index_col='id')
labels
lang | label | |
---|---|---|
id | ||
aari1239 | an | Luenga aari |
aari1239 | ar | لغة آري |
aari1239 | en | Aari language |
aari1239 | en | Ara |
aari1239 | en | Ari |
aari1239 | en | Ari-Galila |
aari1239 | en | Aro |
... | ... | ... |
zuoj1238 | en | Zuojiang |
zuoj1238 | x-clld | zuoj1238 |
zyph1238 | br | Zac'hringeg |
zyph1238 | en | Zophei |
zyph1238 | en | Zoptei |
zyph1238 | en | Zyphe language |
zyph1238 | x-clld | zyph1238 |
65685 rows × 2 columns
Break the number of labels down by language.
labels_lang = labels.groupby('lang').size().sort_values(ascending=False)
labels_lang[labels_lang > 400].plot.bar();
Show summary statistics on the number of labels per languoid. Plot the more common label count frequencies.
nlabels = labels.groupby(level='id').size()
nlabels_hist = nlabels.value_counts().sort_index()
print nlabels.describe()
nlabels_hist[nlabels_hist > 30].plot.area();
count 8418.000000 mean 7.802922 std 10.729925 min 1.000000 25% 2.000000 50% 5.000000 75% 9.000000 max 174.000000 dtype: float64
Do statistics on the string length of the labels.
slabel = labels['label'].str.len()
slabel_hist = slabel.value_counts().sort_index()
print slabel.describe()
slabel_hist[slabel_hist > 30].plot.area();
count 65685.000000 mean 9.834315 std 5.168938 min 1.000000 25% 6.000000 50% 8.000000 75% 13.000000 max 65.000000 Name: label, dtype: float64
Load the languages and the full paths into data frames. Join them into one data frame and show the result.
languages = pd.read_sql_query("""SELECT * FROM languoid
WHERE level='Language' AND NOT obsolete ORDER BY id""", conn, index_col='id')
tree = pd.read_sql_query('SELECT * FROM tree WHERE terminal', conn, index_col='child')
langs = languages.join(tree, how='left', rsuffix='_tree')
langs
label | level | parent | obsolete | status | iso | latitude | longitude | steps | parent_tree | terminal | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||
aari1239 | Aari | Language | aari1238 | 0 | established | aiw | 5.95034 | 36.5721 | 3 | sout2845 | 1 |
aari1240 | Aariya | Language | book1242 | 0 | spurious | aay | NaN | NaN | 1 | book1242 | 1 |
aari1244 | Aari | Language | book1242 | 0 | spurious retired | aiz | NaN | NaN | 1 | book1242 | 1 |
aasa1238 | Aasax | Language | uncl1457 | 0 | established | aas | -4.00679 | 36.8648 | 4 | afro1255 | 1 |
abad1241 | Abadi | Language | west2850 | 0 | established | kbt | -9.03389 | 146.9920 | 11 | aust1307 | 1 |
abag1245 | Abaga | Language | kama1374 | 0 | established | abg | -6.12028 | 145.6650 | 6 | nucl1709 | 1 |
abai1240 | Abai Sungai | Language | pait1248 | 0 | established | abf | 5.55394 | 118.3060 | 7 | aust1307 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
zulg1242 | Zulgo-Gemzek | Language | meri1245 | 0 | established | gnd | 10.82700 | 14.0578 | 7 | afro1255 | 1 |
zulu1248 | Zulu | Language | zulu1251 | 0 | established | zul | -25.33050 | 31.3512 | 12 | atla1278 | 1 |
zuma1239 | Zumaya | Language | masa1324 | 0 | established | zuy | 10.55800 | 14.4445 | 5 | afro1255 | 1 |
zumb1240 | Zumbun | Language | west2712 | 0 | established | jmb | 10.82700 | 9.9683 | 5 | afro1255 | 1 |
zuni1245 | Zuni | Language | None | 0 | established | zun | 35.00560 | -108.7820 | NaN | NaN | NaN |
zuoj1238 | Zuojiang Zhuang | Language | nort3180 | 0 | established | zzj | 21.83750 | 107.3620 | 5 | taik1256 | 1 |
zyph1238 | Zyphe | Language | nucl1757 | 0 | established | zyp | 22.52400 | 93.2640 | 5 | sino1245 | 1 |
8397 rows × 11 columns
Analyze the number of languages per top-level family.
famsizes = langs.groupby('parent_tree').size().sort_values(ascending=False)
print famsizes.describe()
famsizes[famsizes > 100].plot.bar();
count 243.000000 mean 33.781893 std 137.796044 min 1.000000 25% 2.000000 50% 5.000000 75% 12.000000 max 1430.000000 dtype: float64
Analyze the number of steps from languages to their top-level family.
langs['steps'] = langs['steps'].fillna(0)
print langs['steps'].describe()
langs['steps'].value_counts().sort_index().plot.area();
count 8397.000000 mean 5.691080 std 3.476092 min 0.000000 25% 3.000000 50% 5.000000 75% 8.000000 max 17.000000 Name: steps, dtype: float64
Inspect the geographical distribution of languages.
langs['latitude'].hist(bins=100);
langs['longitude'].hist(bins=100);
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.axis([-180, 180, -90, 90])
plt.xticks(range(-180, 181, 60))
plt.yticks(range(-90, 91, 30))
plt.scatter(langs['longitude'], langs['latitude'], 1);