Now that we have a SQLite database with indices, page titles, and coordinate strings, let's extract all the metadata out of those coordinate strings into a table so it's queryable.
This notebook should be run after the other notebook that extracts the coordinate strings.
import csv
import json
from wikiparse import indexer, syntax_parser as sp
import time
import os
import sqlite3
import random
dumps = indexer.load_dumps(build_index=False, scratch_folder='py3')
english = dumps['en']
opening E:/enwiki-20190101-pages-articles-multistream.xml/scratch\py3\index.db current mapping 19.1 m pages __init__ complete
# english.db.close()
c = english.cursor
Before we create the table, let's get a complete list of the entries we're going to want. That is, let's look at all the coordinate strings we've extracted from each page and build the list of keywords they contain.
result = c.execute('''SELECT page_num,coords,title FROM indices WHERE coords != ""
''').fetchall()
coordStrings = {item[0]:item[1] for item in result}
idx_to_title = {item[0]:item[2] for item in result}
list(coordStrings.items())[:10]
[(66, 'Coord|32.7|-86.7|type:adm1st_region:US_dim:1000000_source:USGS|display=title'), (86, 'Coord|36|42|N|3|13|E|type:city'), (105, 'coord|42|30|N|1|31|E|display=inline,title'), (114, 'Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000|display=title|notes=<ref>{{Cite gnis|1785533|State of Alaska'), (139, 'Coord|13|19|N|169|9|W|type:event|name=Apollo 11 splashdown||Coord|10|36|N|172|24|E|display=inline||Coord|13|19|N|169|9|W|display=inline'), (140, 'Coord|8|8|N|165|1|W|type:event|name=Apollo 8 landing||Coord|30|12|N|74|7|W|name=Apollo 8 S-IC impact||Coord|31|50|N|37|17|W|name=Apollo 8 S-II impact||Coord|8|8|N|165|1|W|name=Apollo 8 estimated splashdown'), (161, 'Coord|12|30|40|N|69|58|27|W|type:isle|display=title||Coord|12|31|07|N|70|02|09|W||Coord|12|31|01|N|70|02|04|W|'), (166, 'coord|0|N|25|W|region:ZZ_type:waterbody|display=inline,title'), (168, 'Coord|12|30|S|18|30|E|display=title||Coord|8|50|S|13|20|E|type:city'), (177, 'Coord|55|N|115|W|type:adm1st_scale:10000000_region:CA-AB|display=title')]
len(coordStrings), type(coordStrings)
(1152376, dict)
If a page has more than one coordinate string, choose the one that's displayed at the top (display=title) or, failing that, the first one.
for page_num in coordStrings:
    if '||' in coordStrings[page_num]:
        pageCoordStrings = coordStrings[page_num].split('||')
        coordStrings[page_num] = pageCoordStrings[0]
        for s in pageCoordStrings:
            if "display=title" in s:
                coordStrings[page_num] = s
list(coordStrings.items())[:10]
[(66, 'Coord|32.7|-86.7|type:adm1st_region:US_dim:1000000_source:USGS|display=title'), (86, 'Coord|36|42|N|3|13|E|type:city'), (105, 'coord|42|30|N|1|31|E|display=inline,title'), (114, 'Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000|display=title|notes=<ref>{{Cite gnis|1785533|State of Alaska'), (139, 'Coord|13|19|N|169|9|W|type:event|name=Apollo 11 splashdown'), (140, 'Coord|8|8|N|165|1|W|type:event|name=Apollo 8 landing'), (161, 'Coord|12|30|40|N|69|58|27|W|type:isle|display=title'), (166, 'coord|0|N|25|W|region:ZZ_type:waterbody|display=inline,title'), (168, 'Coord|12|30|S|18|30|E|display=title'), (177, 'Coord|55|N|115|W|type:adm1st_scale:10000000_region:CA-AB|display=title')]
Some Coord templates have a notes parameter (https://en.wikipedia.org/wiki/Template:Coord#Examples) whose value can itself contain pipes, which cut off the rest of the template when we split on them. Since that tag seems to come after the other important tags, let's ignore this problem.
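To see concretely why this is safe to ignore, here's a quick sketch using the Alaska string from the earlier output: the pipes inside the nested {{Cite gnis|...}} template are indistinguishable from the Coord template's own field separators, so a naive split turns the tail of the note into spurious extra fields.

```python
# Coordinate string for the Alaska page (page_num 114 in the output above).
s = ('Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000'
     '|display=title|notes=<ref>{{Cite gnis|1785533|State of Alaska')

fields = s.split('|')
# The notes= value is chopped apart, but every field before it
# (coordinates, region, display) has already been parsed correctly.
print(fields[-3:])  # ['notes=<ref>{{Cite gnis', '1785533', 'State of Alaska']
```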
def getKeywords(coordString, verbose=False):
    if verbose:
        print(coordString)
    keywords = {}
    rest = []
    items = coordString.split('|')
    for item in items:
        if '=' in item:
            keywords[item.split('=')[0]] = item.split('=')[1]
        elif ':' in item:
            keywords[item.split(':')[0]] = item.split(':')[1]
        else:
            rest.append(item)
    # return keywords, '|'.join(rest)
    return keywords
print(getKeywords('Coord|8|8|N|165|1|W|type:event|name=Apollo 8 landing'))
print(getKeywords('Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000|display=title|notes=<ref>{{Cite gnis|1785533|State of Alaska'))
print(getKeywords('Coord|36|42|N|3|13|E|type:city'))
{'type': 'event', 'name': 'Apollo 8 landing'}
{'region': 'US-AK_type', 'display': 'title', 'notes': '<ref>{{Cite gnis'}
{'type': 'city'}
We only care about a subset of keywords, so let's make a whitelist.
from collections import defaultdict
keywords = defaultdict(int)
for coordString in coordStrings.values():
    for kw in getKeywords(coordString).keys():
        keywords[kw.strip()] += 1
keywords
defaultdict(int, {'type': 287434, 'display': 1099820, 'region': 432293, 'notes': 1748, 'name': 20998, 'dim': 4491, 'format': 133330, 'source': 67662, 'globe': 2834, 'scale': 8045, 'id': 217, 'accessdate': 305, 'entrydate': 42, 'title': 270, 'journal': 9, 'volume': 9, 'number': 1, 'pages': 12, 'author1': 1, 'date': 62, 'doi': 8, 'bibcode': 2, 'display-authors': 1, 'label': 1, '3': 143, 'url': 151, 'DK_type': 1, 'trans-title': 2, 'language': 7, 'publisher': 99, 'type_landmark_region': 2, 'region:US': 1, '2': 5, "The ''globe'' [[File": 25, 'upright': 25, '': 5, 'first': 13, 'last': 19, 'work': 70, 'deadurl': 6, 'archiveurl': 5, 'archivedate': 5, 'df': 4, '234503_622630_region': 1, 'הערה': 1, 'landmark_region': 4, 'USGS': 3, '4': 8, '6': 3, 'Register of Historic Parks and Gardens]].<ref name': 1, 'num': 1, 'desc': 1, 'access-date': 29, 'mode': 1, 'year': 14, 'page': 9, '44region': 1, '7': 3, 'length_km': 2, '<br>Crow Lane Roundabout': 1, 'nosave': 8, 'website': 12, 'source:https://archnet.org/print/preview/sites': 1, 'For example': 1, '5': 7, 'elevation': 4, 'GNS]] coordinates adjusted using [[Google Maps]] and [http': 42, 'type;landmark_region': 2, 'author': 16, 'location': 9, 'GNS]] coordinates adjusted using [[Google Maps]], and [http': 23, '600700_212618_region': 1, '{{#expr': 457, '<!-- 10 Guittard Rd.<ref name': 1, 'reason': 3, 'adm2nd_region': 3, 'area_km2': 2, 'GE_scale': 1, '526917_188241_region': 1, '405800_439900_region': 1, 'Type': 6, 'authorlink': 3, 'isbn': 3, '{{#invoke': 2, '[[MediaWiki': 1, 'egion': 1, 'last1': 9, 'first1': 9, 'archive-url': 1, 'dead-url': 1, 'archive-date': 1, '_region': 8, '9': 4, 'http': 5, 'GNS]] coordinates adjusted using an Instituto Geográfico Militar [http': 7, 'typetype': 1, 'city': 6, 'ref': 1, 'FR_type': 2, '_source': 13, 'region_BD_type': 1, 'p': 55, 'isle_region': 1, 'GBtype': 1, 'first2': 7, 'last2': 7, 'first3': 7, 'last3': 7, 'issue': 6, 'NZ244651_region': 2, '{{#statements': 1, 'tpye': 1, 'coauthors': 1, "coordinates 
map template]] and one of the options for either degree, minutes and seconds or decimal latitude and longitude.\n\n*There has been a '''[[Module": 1, 'Simpson]] and [[Template': 1, 'Location map options and instructions]] for an example of its use.\n\n\n6. Update census demographic information to the [http://www12.statcan.ca/english/census06/data/profiles/community/Index.cfm?Lang': 1, 'community]] pages and remove <nowiki>{{</nowiki>Template': 1, '_type': 5, 'type_city_region': 1, 'regionCA-ON_type': 1, 'adm3rd_region': 1, 'issn': 5, 'display{{': 2, "Coord]] - a cool template that displays the coordinates of the county. Notes about this one:\n\n''View [[Template:Infobox U.S. county/Sample]] to see how this infobox is rendered.''": 1, 'Region': 4, 'editprotected request]] is fulfilled. —[[User': 1, 'Para]] 15': 1, 'Michael de la Pole]], seated at Wingfield Castle, who in 1385 was created [[Earl of Suffolk]]. His descendant [[Edmund de la Pole, 3rd Duke of Suffolk]] (1472–1513) was forced to surrender his dukedom in 1493. It was resurrected by King Henry VIII in 1514 for his [[favourite]] [[Charles Brandon, 1st Duke of Suffolk]] (1484–1545), who although he had no close connection with Wingfield Castle and the county of Suffolk, was a great-grandson of Sir [[Robert Wingfield]] (died 1454), of [[Letheringham]] in [[Suffolk]], about 12 miles south of Wingfield.': 1, 'heading': 5, 'region_US-CA_type': 1, 'Coord]], [[Template': 1, 'Coor at dms]], etc. 
To simplify things, I switch the URL to the one using [[Template': 1, 'Erik Baas]] 15': 1, 'Bryan]] (<small>[[User talk': 1, '[[': 1, 'commons]]</small>) 18': 1, 'Dispenser]] 19': 1, 'Cush]] 10:22, 28 October 2007 (UTC)': 1, 'Rtphokie]] 22': 1, '<font color': 2, 'Gene Nygaard]] 23': 1, 'Anomie]] 01': 1, 'Rtphokie]] 12:32, 28 October 2007 (UTC)': 1, 'NOTE_TEXT': 2, 'NOTE_CAT': 2, 'contribs]]; 1/1) [[': 1, 'Monitored link]] - www.fourinfo.com/newsletter/feb05.htm#greetings%20from%20rural%20ramble - rule': 1, '.com - reason: Automonitor: reported to [[:fr:MediaWiki:Spam-blacklist]] ([http://fr.wikipedia.org/w/index.php?title': 1, 'Rural Ramble]] - [http://en.wikipedia.org/w/index.php?title': 1, 'zoom': 4, 'minority]] when trying to oppose the removal of another more organised linkspamming. I cannot fathom how people can even think that promoting some specific external service over all its alternatives would be following [[WP': 1, 'Para]] ([[User talk': 1, 'talk]]) 12': 2, 'Orderinchaos]] 18': 1, '<b><font color': 1, 'resolution]]; please use them. I am marking this resolved since there is nothing that any admin can or will do. [[User': 1, 'LessHeard vanU]] ([[User talk': 1, 'talk]]) 13': 2, 'Wikidemo]] ([[User talk': 1, 'talk]]) 16': 1, 'talk]]) 17': 2, 'consensus editing]]. Instead, if Brownhaired Girl felt this was highly inappropriate, that is why we have [[WP': 1, 'dispute resolution]]. An RfC, Mediation, etc would have handled it quite properly.[[User': 1, 'Wjhonson]] ([[User talk': 1, "talk]]) 00:05, 3 February 2008 (UTC)\n\n*(resolved label removed - an important point has yet to be made). I've reviewed the [http://en.wikipedia.org/w/index.php?title": 1, 'Carcharoth]] ([[User talk': 1, 'talk]]) 01': 1, 'here]] for a discussion on how best to phrase all this in the documentation. 
I\'ve left various other notes around, so hopefully people will realise the sense in using custom edit summaries for non-vandalism reverts (regardless of what tool is used to do the reverts, be it rollback, undo, or some combination using scripts). Ideally, people on a dedicated run of rollbacking vandalism (eg. RC patrolling) would also have a custom edit summary enabled (eg. "rollback of vandalism reviewed during recent changes patrolling"). People spotting the ocassional instances of vandalism during normal editing will still find it quicker to hit rollback, and I am aware (before someone says this) that there is no requirement to use edit summaries at all, but it makes a great deal of sense to use informative edit summaries where possible. [[User': 1, 'talk]]) 01:43, 3 February 2008 (UTC)': 1, '{{subst': 1, 'lead coordinator resigned]], the project [[Wikipedia talk': 1, 'modified its A-class review process]] to include a source review, [[Wikipedia talk': 1, "altered its internal guidelines]] for drafting biographies, and questions about the German war effort case are being asked by a couple of our members to the candidates for this year's ArbCom election. In the past 24 hours, it [[Wikipedia talk": 1, 'has resurfaced]] [[Wikipedia talk': 1, 'in a dispute]] about an award the project bestows annually. Might be worth a special report. I for one know it will effect how I will be voting for Arbcom this year. -[[User': 1, 'Indy beetle]] ([[User talk': 1, 'talk]]) 02:46, 3 December 2018 (UTC)': 1, 'unsigned]] comment added by [[User': 1, '108.24.2.34]] ([[User talk': 1, 'talk]] • [[Special': 1, 'contribs]]) 23': 1, 'last4': 4, 'first4': 4, 'coordinate template]], as well as related templates listed in documentation section. This template needs to be placed in the article before it can appear on google maps. 
[[User': 1, 'Someguy1221]] ([[User talk': 1, 'talk]]) 22:22, 2 June 2008 (UTC)': 1, 'Deproduction]] ([[User talk': 1, '<font STYLE': 1, 'Talk]] or [[Special': 1, 'C]]</sup></b> 19': 1, '70.59.16.211]] ([[User talk': 1, 'talk]]) 16:36, 3 June 2008 (UTC)': 1, 'Mark J Richards]] ([[User talk': 1, 'talk]]) 09': 3, "'''<font color": 1, '<small><sub><font color': 1, '<small><sup><font color': 1, 'Aozukum]] ([[User talk': 1, 'this guide]]. It has a lot of tips for getting started. You may also want to look at this [[WP': 1, 'notability guideline]] for biographies. I have also left a message on [[User talk': 1, 'your talk page]] with some useful links. Good luck and feel free to ask questions! <b><font color': 1, 'TN]]</font>‑<font color': 1, 'X]]</font>-<font color': 1, 'Man]]</font></b> 13:22, 3 June 2008 (UTC)': 1, 'Gaurdians1]] ([[User talk': 1, 'talk]]) 20': 1, "internal links]] created by putting double brackets - [[ ]] - around the name of the article you would like to link to. For example, if I want to link to an article about mice, I would type <nowiki>[[Mice]]</nowiki>, and it would show up as [[Mice]]. If you try to link to an article that doesn't exist, it will show up as a [[Wikipedia": 1, 'red link]], [[like this one]]. -- [[User': 1, 'Nataly<font color': 1, 'reference desks]]. [[User': 1, 'ukexpat]] ([[User talk': 1, 'talk]]) 13:08, 4 June 2008 (UTC)': 1, 'Sibyllenoras]] ([[User talk': 1, "to this article]]? It looks like it's available. If you have any other questions, let me know. Cheers! 
<b><font color": 1, 'Man]]</font></b> 13': 1, 'Confusing Manifestation]]<small>([[User talk': 1, 'Say hi!]])</small> 23:24, 4 June 2008 (UTC)': 1, '202.67.4.3]] ([[User talk': 1, 'talk]]) 06:44, 4 June 2008 (UTC)': 1, 'Coordinator]]\n* Advertisements: <span id': 1, 'Questions]] <span id': 1, 'Style (articles)]] <span id': 1, 'Quality of articles]]\n**Assessment department': 1, 'Questions]]) <span id': 1, 'Templates]] (in general) and specific topics as well (articles, talk pages, maintenance, etc.)\n* Barnstars': 1, 'Awards]]\n* Biographies: <span id': 1, 'Series boxes]]) \n***[[Wikipedia': 1, 'Resources]]) <span id': 1, 'User scripts]]) <span id': 1, 'numbers': 1, 'sections': 1, 'quote': 4, '\u200bdisplay': 2, 'landmark_scale': 1, 'line': 3, 'other': 1, 'platform': 1, 'tracks': 1, 'parking': 1, 'bicycle': 1, 'passengers': 1, 'pass_year': 1, 'pass_percent': 1, 'opened': 1, 'closed': 1, 'rebuilt': 1, 'ADA': 1, 'code': 1, 'owned': 1, 'zone': 1, 'other_services_header': 1, 'other_services': 1, 'locale': 2, 'Cross-reference 1909 Ordnance Survey map<ref name': 1, 'type:event_region': 1, "'' See here for further help'']]\n* For administrative districts, use the location of the head office, or if that cannot be found, the approximate centre of the district.": 1, 'overly precise]], but do ensure that the single point that the chosen pair of coordinates represents does fall on or within the feature.': 1, 'x,type': 1, 'improvedby': 2, 'landmaark_region': 1, 'mountain_region': 2, 'NS': 4, 'EW': 4, 'deg': 1, 'min': 1, 'sec': 1, 'hem': 1, 'rnd': 1, "the few pages that don't already have it]] and on which it belongs. --[[User talk": 1, 'NE2]] 18': 1, 'Thomas Paine1776]] ([[User talk': 1, 'talk]]) 00': 1, 'nogoogle': 1, 'noosm': 1, 'Geographic coordinates]] and [[Template': 1, "biographical data]]. The framework is already there, but the problem with those is that there's no way to get the data efficiently. 
<span style": 1, 'Mr.]][[User talk': 2, "'''''Z-'''man'']]</span> 18": 1, 'type;Landmark_region': 1, 'GNS]] coordinates adjusted using a Ministério dos Transportes [http': 2, 'width': 1, 'Year\n! style': 1, 'Winner\n! style': 1, '}': 1, 'GNS]] coordinates adjusted using [[Google Maps]], [http': 3, 'bot': 1, 'fix-attempted': 1, 'services': 1, 'membership': 1, 'type:mountain_elevation:437_region': 1, 'region:US},}<ref>Topographical Map of Mahaska County, Iowa, [http://digital.lib.uiowa.edu/u?/atlases,48 Huebinger\'s Atlas of the State of Iowa], Iowa Publishing Co., 1904</ref> and it quickly developed as one of the most prosperous and largest coal camps in Iowa.<ref>"Iowa\'s Pioneer Coal Operators," [https://books.google.com/books?id': 1, "passenger car]]. It left [[Staunton, Virginia]] on May 12 and traveled via Chicago and [[Marshalltown, Iowa]], arriving in Muchakinock on May 15. Rail fare from Virginia to Iowa was $12, which the company paid and took as an advance against each miner's monthly wages.<ref name": 1, 'adj': 1, 'pmc': 1, 'pmid': 1, 'E\n[[Template': 1, 'lat_deg': 1, 'lat_min': 1, 'lon_deg': 1, 'lon_min': 1, 'type:landmark_region': 1, 'coord]] [[Wikipedia': 1, 'geotagging links]] already provide this kind of information is a more systematic and useful way, which also [[Template': 1, 'does not privilege]] one company over others.': 1, 'HaeB]] ([[User talk': 1, 'talk]]) 21': 1, 'WhatamIdoing]] ([[User talk': 1, 'talk]]) 22': 1, 'Richard McCoy]] ([[User talk': 1, 'talk]]) 12:41, 25 February 2010 (UTC)': 1, 'External links]]" section with a handful of links (four? five?) 
to the official websites of the most prominent nontrinitarian groups, or directly to the groups\' own explanation of the matter.<br />--[[User': 1, 'AuthorityTam]] ([[User talk': 1, 'talk]]) 23': 3, 'talk]]) 00:26, 27 February 2010 (UTC)': 1, '{{safesubst': 3, '{{<includeonly>safesubst': 1, 'diff': 1, 'box': 1, 'added': 1, 'changed': 1, 'deleted': 1, 'urls': 1, 'isIP': 1, 'function': 1, 'whitelisted': 1, 'timestamp': 1, 'region_IN_type': 1, "locations]], etc, and we currently don't expect perfection from new editors. Article development is iterative": 1, '<b style': 1, 'Rd232]] <sup>[[user talk': 1, 'talk]]</sup> 19': 1, 'Hans]] [[User talk': 1, 'Adler]] 14': 1, '<joke>(also; we were first': 1, 'NOTE_4_IMAGE': 1, 'NOTE_4_FORMAT': 1, 'NOTE_4_CAT': 1, 'alt': 2, 'here]]. <span style': 1, "'''''Z-'''man'']]</span> 05:55, 26 April 2011 (UTC)": 1, 'Mahitgar]] ([[User talk': 1, 'talk]]) 15': 3, 'Genicoord]]<br/>\n[[User': 1, 'Genidealingwithalbumcovers]]<br/>\n[[User': 1, 'Genidealingwithfairuse]]<br/>\n[[User': 1, 'Geniice]]<br/>\n[[User': 1, 'Genisock]]<br/>\n[[User': 1, 'Genisock2]]<br/>\n[[User': 1, 'Genisockrating]]<br/>\n[[User': 1, 'Genisocky]]<br/>\n[[User': 1, "It's Character Forming]]<br/>\n[[User": 1, 'Liveware problem]]<br/>\n[[User': 1, 'UK voteing account]]<br/>\n[[User': 1, '[[Wikipedia': 3, 'December 2004]]<br/>\n<s>[[Wikipedia': 1, 'March 2008]]</s><br/>\n<s>[[Wikipedia': 1, 'August 2008]]</s><br/>\n[[Wikipedia': 1, 'sourtype': 1, 'country': 2, 'operator': 2, 'cause': 2, 'trains': 2, 'pax': 2, 'deaths': 2, 'region_': 1, 'oclc': 1, '1': 2, '8': 2, 'Coordinates Template]]).': 1, 'Wikipedia': 1, 'label 1': 1, '.': 1, 'Template': 1, 'RU_type': 1, 'It is proposed]] that they will each serve as the primary lead coordinator for a four month period, with AustralianRupert taking the first four months followed by Nick and Dank. 
The other coordinators for the year commencing 1 October are': 1, 'Anotherclown]], [[User': 1, 'Cplakidas]], [[User': 1, 'Grandiose]], [[User': 1, 'Hawkeye7]], [[User': 1, 'HJ Mitchell]], [[User': 1, 'Ian Rose]], [[User': 1, 'MarcusBritish]], [[User': 1, 'Nikkimaria]], [[User': 1, 'The ed17]] and [[User': 1, 'TomStar81]]. In addition, [[User': 1, 'Arius1998]], [[User': 1, 'Johnsc12]], [[User': 1, 'Knight of Gloucestershire]] and [[User': 1, 'RoslynSKP]] also stood for election, but were not successful. The next coordinator election will be held in September 2013. \n\nComments on the election are very welcome, and should be made [[Wikipedia talk': 1, 'here]].\n\nYour editors, [[User': 1, 'Nick-D]] ([[User talk': 1, 'talk]]) and [[User': 1, 'Ed]] <sup>[[User talk': 1, '[talk]]] [[WP': 1, "N''? I haven't added that but I can add it this weekend. Or Friday, hopefully. I'll update the function overview at that time. --[[User": 1, 'Alan]]<sup>[[User talk': 1, '(T)]][[Special': 1, '(E)]]</sup> 03:34, 4 April 2013 (UTC)\n:::Yes; thank you. <span class': 1, 'Andy Mabbett]]</span> (<span class': 2, 'Talk to Andy]]; [[Special': 2, "Andy's edits]]</span> 09": 1, 'type{landmark_region': 1, 'typerailway_station_region': 7, '" does not show up in the included templates? --[[User': 1, 'ThurnerRupert]] ([[User talk': 1, "talk]]) 10:31, 25 August 2013 (UTC)\n:::Sorry, I'm not sure what you mean. <span class": 1, "Andy's edits]]</span> 20": 1, 'talk]]) 01:54, 26 August 2013 (UTC)': 1, 'talk]]) 07': 1, '<span style': 2, '♪ talk ♪]]</sup> 09': 1, 'talk]]) 10': 1, 'table]] to the article instead. 
If you look at [[Special': 1, 'region:CA-ON_source:<ref name': 1, '<!-- location of Bangui -->display': 1, '{{#property': 5, 'Wtype': 1, 'ype': 3, 'region_US-WA_type': 1, 'island': 1, 'regio': 2, 'Genicoord]], [[User': 1, 'YetanotherGenisock]], [[User': 1, 'Genidealingwithfairuse]], [[User': 1, 'Liveware problem]], [[User': 1, "It's Character Forming]], [[User": 1, 'UK voteing account]], [[User': 1, 'Geniice]]<br/><small>several punctuation accounts to push abusive names off the first page of [[Special': 1, 'December 2004 (A)]]<br/>\n<s>[[Wikipedia': 1, 'September 2006 (B)]]</s><br/>\n<s>[[Wikipedia': 1, 'March 2008 (A)]]</s><br/>\n<s>[[Wikipedia': 1, 'August 2008 (A)]]</s><br/>\n[[Wikipedia': 1, 'region:US-WV_scale:10000_source:placeopedia:display': 1, 'region-iso': 1, 'conference': 1, 'conference-url': 1, 'reg': 4, 'tye': 1, 'range_coordinates': 1, 'nopp': 1, 'last5': 3, 'first5': 3, 'last6': 3, 'first6': 3, 'last7': 3, 'first7': 3, 'scales': 1, 'soutype': 1, 'Avadi': 1, 'E<!--- FIXME': 1, 'region-ISO': 2, 'coordinates link]] (look for a globe icon at the top right corner of an article). Here is the link produced for the article [[United States]]: [https://tools.wmflabs.org/geohack/geohack.php?pagename': 1, 'Finnusertop]]</span> ([[User talk': 1, 'talk]] ⋅ [[Special': 1, 'contribs]]) 06:51, 10 March 2016 (UTC)': 1, 'Famous and Other Notable Maltipoos]]" got added. Now if the dogs listed there had been in something like [[Old Yeller]] or [[Where the Red Fern Grows]] I\'d think they deserved to be included, because they\'d be notable. But they\'re just dogs that had Facebook and YouTube pages made for them by their owners, and the info is sourced off those social media pages. I removed this stuff once already and an IP added it back with the comment, "this editor’s [me] work reflects their bias". Should I remove it again? 
[[User': 1, 'from': 4, "'''''template parameters'''''}}'''\n\n*'''<code>wikidata page name</code>''' – unnamed positional parameter is the Wikidata page name (starting with Q). Default is the current page.\n*'''<code>coordinate parameters</code>''' – passed through to Template": 1, 'Usage]] at Template': 1, 'W<!--- FIXME': 1, 'fetch': 2, 'disply': 1, 'city_region': 1, 'vn': 2, 'on 1,000,000+ other pages]].)\n* Editors wanting to \'\'remove\'\' the coordinates appear not to have cited any Wikipedia policies at all that would exclude the coordinates.\nEven the closing statement says that there is a policy that would have the coordinates included in the article (even when an external organization wants them removed) but does not mention any policy that would exclude them.\n\nThe result of "no consensus" is not appropriate because it applies equal weighting to opinions that have no basis in policy, whereas those opinions should have been discarded; only those opinions based on policy should have been considered. I submit that result of the RFC should have been to \'\'include\'\' the coordinates because there are several policies and guidelines that say we should include them and explicitly say that we will not remove them at the request of an external organization. There are no policies that would exclude the coordinates from the article. [[User': 2, 'Mitch Ames]] ([[User talk': 2, 'talk]]) 08': 2, 'Only in death does duty end]] ([[User talk': 2, 'Mitch Ames]] that policy reasons were presented, and that they are aware of it. It is their choice to come here and state otherwise. Diffs: [//en.wikipedia.org/w/index.php?title': 2, 'Nabla]] ([[User talk': 2, 'talk]]) 09:52, 6 December 2016 (UTC)\n::I\'m not asserting that the "removers" did not present any policy, I\'m asserting that the removers did not present any policy \'\'that would exclude the coordinates from the article\'\'. 
[https://en.wikipedia.org/w/index.php?title': 2, 'Hobit]] ([[User talk': 2, 'talk]]) 14': 2, 'Nyttend]] ([[User talk': 2, 'Mr]][[user talk': 2, 'X]] 16': 2, 'type_event_region': 1, 'NRHPcoord]]" or "[[template': 1, 'Nrhpcoord]]" or "[[template': 1, 'nrhpcoord]]" or "[[template': 1, 'NRHPCOORD]]" or "[[template': 1, 'source1': 1, 'source2': 1, 'userdate': 1, 'user': 1, 'user-visited-site': 1, 'usernote': 1, 'type:landmark_region:FR_pop: _elevation:295_dim:1000_mapsize': 1, 'vtype': 1, 'pp': 1, 'region_IN': 5, 'qid': 2, 'type_river_region': 1, '{{#titleparts': 1, 'verb': 1, 'float': 1, 'nocat': 1, 'E are invalid. [[User': 1, 'Graeme Bartlett]] ([[User talk': 1, 'talk]]) 11:58, 26 January 2018 (UTC)': 1, 'soure': 1, 'regior': 2, 'soucre': 1, 'version': 2, 'Display': 1, 'diplay': 1, 'link': 1, 'dms': 1, 'disp': 1})
{kw for kw,count in keywords.items() if count > 10}
{'3', 'GNS]] coordinates adjusted using [[Google Maps]] and [http', 'GNS]] coordinates adjusted using [[Google Maps]], and [http', "The ''globe'' [[File", '_source', 'access-date', 'accessdate', 'author', 'date', 'dim', 'display', 'entrydate', 'first', 'format', 'globe', 'id', 'last', 'name', 'notes', 'p', 'pages', 'publisher', 'region', 'scale', 'source', 'title', 'type', 'upright', 'url', 'website', 'work', 'year', '{{#expr'}
Most of these look good, except for a few duplicates and some weird artifacts of the imperfect string processing.
entriesList = [
'accessdate',
'date',
'dim',
'display',
'elevation',
'format',
'globe',
'id',
'name',
'nosave',
'notes',
'publisher',
'reason',
'region',
'scale',
'source',
'title',
'type',
'upright',
'url',
'work']
for e in entriesList:
    print(e, 'TEXT,')
accessdate TEXT,
date TEXT,
dim TEXT,
display TEXT,
elevation TEXT,
format TEXT,
globe TEXT,
id TEXT,
name TEXT,
nosave TEXT,
notes TEXT,
publisher TEXT,
reason TEXT,
region TEXT,
scale TEXT,
source TEXT,
title TEXT,
type TEXT,
upright TEXT,
url TEXT,
work TEXT,
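Rather than pasting the printed columns into the CREATE TABLE by hand, the statement could also be assembled from entriesList directly. A minimal sketch (with an abbreviated entriesList; the real one has 21 entries):

```python
# Build the DDL string from the whitelist instead of pasting it in.
entriesList = ['accessdate', 'date', 'dim', 'display']  # abbreviated

fixed = ['coords TEXT',
         'lat REAL DEFAULT 0',
         'lon REAL DEFAULT 0',
         'page_num INTEGER PRIMARY KEY']
extra = [f"{e} TEXT DEFAULT ''" for e in entriesList]
ddl = 'CREATE TABLE coords (' + ', '.join(fixed + extra) + ')'
print(ddl)
```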
It would be nice to see examples of values for the keywords so we can see how they're used.
from numpy.random import shuffle
def findKeywordExample(kw, count=1, debug=False):
    kws = []
    coordStringsList = list(coordStrings.values())
    shuffle(coordStringsList)
    for cs in coordStringsList:
        if f'|{kw}=' in cs:
            kws.append(cs.split(f'|{kw}=')[1].split('|')[0])
            if debug:
                print(cs)
        elif f'|{kw}:' in cs:
            kws.append(cs.split(f'|{kw}:')[1].split('|')[0])
            if debug:
                print(cs)
        if len(kws) >= count:
            break
    return kws
findKeywordExample('name', 10, True)
coord|42.440556|-98.148083|type:landmark|name=Ashfall Fossil Beds
coord|42|37|37|N|85|1|28|W|name=Vermontville Opera House|region:US_type:landmark
coord|76|5|30|N|109|41|59|W|region:CA-NU_type:waterbody_scale:500000|display=inline, title|name=Eldridge Bay
coord|-34.42928|172.681582|format=dms|name=Start of SH 1N|type:landmark_region:NZ|display=inline
coord|60.872|-69.323|display=inline,title|name=Eider Islands
coord|55.3057|N|117.8961|W|name=Amerada Crown GF23-11|display=inline,title
coord|75|15|N|105|00|W|region:CA-NU_type:waterbody |display=inline,title|name=Byam Channel
coord|53.59526|-2.50612|type:landmark|name=Blundell Arms Public House
coord|48|07|49.22|N|07|21|22.32|E|type:city|name=Bennwihr Gare
Coord|32.1276|N|110.4317|W|type:mountain_region:US|name=Forest Hill (peak)
['Ashfall Fossil Beds', 'Vermontville Opera House', 'Eldridge Bay', 'Start of SH 1N', 'Eider Islands', 'Amerada Crown GF23-11', 'Byam Channel', 'Blundell Arms Public House', 'Bennwihr Gare', 'Forest Hill (peak)']
Now we can create all the columns en masse.
c.execute('''CREATE TABLE coords
(coords TEXT,
lat REAL DEFAULT 0,
lon REAL DEFAULT 0,
page_num INTEGER PRIMARY KEY,
accessdate TEXT DEFAULT '',
date TEXT DEFAULT '',
dim TEXT DEFAULT '',
display TEXT DEFAULT '',
elevation TEXT DEFAULT '',
format TEXT DEFAULT '',
globe TEXT DEFAULT '',
id TEXT DEFAULT '',
name TEXT DEFAULT '',
nosave TEXT DEFAULT '',
notes TEXT DEFAULT '',
publisher TEXT DEFAULT '',
reason TEXT DEFAULT '',
region TEXT DEFAULT '',
scale TEXT DEFAULT '',
source TEXT DEFAULT '',
title TEXT DEFAULT '',
type TEXT DEFAULT '',
upright TEXT DEFAULT '',
url TEXT DEFAULT '',
work TEXT DEFAULT '')
''')
<sqlite3.Cursor at 0x296b03e0810>
# c.execute('''DROP TABLE coords''')
def extract_lat_lon(coord_string):
    split = coord_string.split('|')
    coord_list = []
    for s in split:
        if ':' in s or '=' in s:
            break
        if 'Coord' not in s and 'LAT' not in s and 'LONG' not in s and 'coord' not in s:
            coord_list.append(s)
    return coord_list
begin = random.randint(0, len(coordStrings)-10)
for i in range(begin, begin+10):
    print(extract_lat_lon(list(coordStrings.values())[i]))
['41.4347', 'N', '25.1328', 'E']
['48.860', '2.327']
['50.82', '-1.45']
['49.49', '0.10']
['49', '26', '02', 'N', '0', '12', '24', 'E']
['47', '31', 'N', '11', '09', 'E']
['51', '30', '44', 'N', '00', '09', '48', 'W']
['51.4988', '-0.0901']
['39', '10', 'N', '26', '20', 'E']
[]
def convert_to_decimal(coord_list):
    # print(' '.join(coord_list), end='\t')
    coord_list = [s.strip().lower() for s in coord_list if s.strip() != '']
    if len(coord_list) < 2:
        return [0, 0]
    if len(coord_list) == 2:
        return [float(coord_list[0]), float(coord_list[1])]
    directions = 0
    for s in coord_list:
        s = s.strip().lower()
        if s and s.strip() in 'nesw':
            directions += 1
    if directions != 2:
        raise Exception(directions, "wrong number of directions for:", coord_list)
    lat = []
    lon = []
    creating_lat = True
    for s in coord_list:
        s = s.strip().lower()
        if s == '':
            continue
        if creating_lat:
            if s in 'ns':
                creating_lat = False
                while len(lat) < 3:
                    lat.append(0)
                if s == 'n':
                    lat.append(1)
                else:
                    lat.append(-1)
            else:
                if ',' in s:
                    s = s.replace(',', '.')
                lat.append(float(s))
        else:
            if s in 'ew':
                while len(lon) < 3:
                    lon.append(0)
                if s == 'e':
                    lon.append(1)
                else:
                    lon.append(-1)
            else:
                if ',' in s:
                    s = s.replace(',', '.')
                lon.append(float(s))
    return [
        (lat[0] + lat[1]/60 + lat[2]/3600) * lat[3],
        (lon[0] + lon[1]/60 + lon[2]/3600) * lon[3]
    ]
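The core formula here is the standard degrees/minutes/seconds to decimal conversion: value = (deg + min/60 + sec/3600) * sign, with sign +1 for N/E and -1 for S/W. A standalone sketch of just that formula (the helper name is my own, not part of the notebook's code):

```python
# Minimal DMS -> decimal degrees conversion, matching the arithmetic
# in convert_to_decimal's final return statement.
def dms_to_decimal(deg, minutes=0, seconds=0, hemisphere='n'):
    sign = 1 if hemisphere.lower() in ('n', 'e') else -1
    return (deg + minutes/60 + seconds/3600) * sign

# 47°36'35"N, 122°19'59"W
print(dms_to_decimal(47, 36, 35, 'N'))   # ≈ 47.6097
print(dms_to_decimal(122, 19, 59, 'W'))  # ≈ -122.3331
```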
count = 1000
begin = random.randint(0, len(coordStrings)-count)
for i in range(begin, begin+count):
    coord_list = extract_lat_lon(list(coordStrings.values())[i])
    try:
        result = convert_to_decimal(coord_list)
    except Exception as e:
        print(coord_list, e)
# print(result)
def insertKeywordDict(page_num, coords, kws, debug=False):
    allowed_keys = [kw for kw in kws.keys() if kw in entriesList]
    allowed_vals = [kws[kw] for kw in allowed_keys]
    try:
        lat,lon = convert_to_decimal(extract_lat_lon(coords))
    except:
        lat = lon = 0
    keys = ['coords', 'lat', 'lon', 'page_num'] + allowed_keys
    vals = [coords, lat, lon, page_num] + allowed_vals
    qstring = f'INSERT INTO coords ({",".join(keys)}) VALUES ({",".join(["?" for item in vals])})'
    if debug:
        print(qstring, vals)
    c.execute(qstring, vals)
# insertKeywordDict({'region': 'US-AK_type', 'display': 'title', 'notes': '<ref>{{Cite gnis'})
Grab a drink before running the next cell; it'll take a while.
start = 100_000
succeeded = 0
coordStringsList = list(coordStrings.items())
for i in range(start, len(coordStringsList)):
    page_idx,coordString = coordStringsList[i]
    try:
        keywords = getKeywords(coordString)
        keywords['title'] = idx_to_title[page_idx]
        insertKeywordDict(page_idx, coordString, keywords)
        succeeded += 1
    except sqlite3.IntegrityError:
        pass
    if i % 100 == 0:
        english.db.commit()
        print(f'{round(100*i/len(coordStrings), 2)} ({i}) succeeded: {succeeded}', end='\r')
99.99 (1152300) succeeded: 212499
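One reason this cell is slow is the commit every 100 rows. A hedged sketch of the same pattern using executemany with a single commit, which sqlite3 typically handles much faster; the toy schema and rows here are stand-ins, not the real coords table, and INSERT OR IGNORE plays the role of the except sqlite3.IntegrityError: pass above:

```python
import sqlite3

# Batch all inserts into one transaction instead of committing in chunks.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE coords (page_num INTEGER PRIMARY KEY, lat REAL, lon REAL)')

rows = [(66, 32.7, -86.7), (86, 36.7, 3.22), (66, 0.0, 0.0)]  # duplicate key on purpose
# OR IGNORE silently skips the primary-key collision, like the
# IntegrityError handler in the loop above.
db.executemany('INSERT OR IGNORE INTO coords VALUES (?, ?, ?)', rows)
db.commit()

print(db.execute('SELECT COUNT(*) FROM coords').fetchone()[0])  # 2
```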
# c.execute('''SELECT * FROM coords WHERE page_num > 19000000''').fetchall()
l = c.execute('''SELECT title FROM coords WHERE
lat BETWEEN 47.6 AND 47.7
AND lon BETWEEN -122.35 AND -122.34
''').fetchall()
l
[('Museum of Pop Culture',), ('Bill & Melinda Gates Foundation',), ('The Art Institute of Seattle',), ('Space Needle',), ('School of Visual Concepts',), ('Fremont, Seattle',), ('City University of Seattle',), ('Woodland Park Zoo',), ('Belltown, Seattle',), ('Fremont Troll',), ('Cranium, Inc.',), ('Seattle Aquarium',), ('The Real World: Seattle',), ('Fremont Bridge (Seattle)',), ('Aurora Bridge',), ('Attachmate',), ('Woodland Park (Seattle)',), ('Seattle Opera',), ('Denny Park (Seattle)',), ('Victor Steinbrueck Park',), ('Unexpected Productions',), ('Memorial Stadium (Seattle)',), ('Bad Animals Studio',), ('Westlake, Seattle',), ('B. F. Day Elementary School',), ('Cutter & Buck',), ('Mercer Arena',), ('Seattle Cinerama',), ('Area code 206',), ('Waiting for the Interurban',), ('2006 Seattle Jewish Federation shooting',), ("Beth's Cafe",), ('Pike Place Fish Market',), ('SkyCity',), ('Antioch University Seattle',), ('Moore Theatre',), ('McLeod Residence',), ('Flight to Mars (ride)',), ('Fourth and Blanchard Building',), ('Seattle Parks and Recreation',), ('The Crocodile',), ('American Seafoods',), ('InterConnection.org',), ('Green Lake Aqua Theater',), ('Pike Place Market',), ('A Place for Mom',), ('Waterfront Park (Seattle)',), ('Alaskan Way Viaduct replacement tunnel',), ('History House of Greater Seattle',), ('Gum Wall',), ('Alaska Trade Building',), ('Butterworth Building',), ('Swedish Cultural Center',), ('TheFilmSchool',), ('Eastern Congo Initiative',), ('The 5 Point Cafe',), ('Institute for Health Metrics and Evaluation',), ('Impinj',), ('Insignia Towers',), ('Bell Apartments',), ('Griffin College',), ('Lundeberg Derby Monument',), ('Wagner Houseboat',), ('Tilikum Place',), ('Seattle Great Wheel',), ('Via6 Towers',), ('Canlis',), ('Northwest Woodworkers Gallery',), ('Original Starbucks',), ('Pier 57 (Seattle)',), ('Fremont Brewing',), ('Bell Street Park',), ('Chief Seattle (sculpture)',), ('Arrivé',), ('Tower 12',), ('Barnes Building',), ('Calhoun Hotel',), 
('Left Bank Books',), ('McGuire Apartments',), ('The Emerald (building)',), ('Pike Street',), ('Chris Cornell (Marra)',)]
How many cities do we have?
l = c.execute('''SELECT title,coords FROM coords WHERE
type == "city"
''').fetchall()
len(l)
7924
What's the coverage for this tag?
[el[0] for el in l if 'Seattle' in el[0]]
[]
[el[0] for el in l if 'Portland' in el[0]]
['Portland, Indiana', 'South Portland, Maine', 'Portland, Pennsylvania']
[el[0] for el in l if 'Brooklyn' in el[0]]
['Sea Gate, Brooklyn']
Not great.
l = c.execute('''SELECT title,type,lat,lon FROM coords''').fetchall()
[el for el in l if 'Seattle' == el[0]]
[('Seattle', '', 47.609722222222224, -122.33305555555555)]
That's because Seattle isn't tagged as a city.
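Rather than pulling every row into Python and filtering there, SQLite can do the lookup directly with a parameterized query; a minimal sketch against a toy table with the same columns (the real `coords` table is the one built earlier in this notebook):

```python
import sqlite3

# Toy stand-in for the coords table built above.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE coords (title TEXT, type TEXT, lat REAL, lon REAL)')
db.executemany('INSERT INTO coords VALUES (?, ?, ?, ?)', [
    ('Seattle', '', 47.6097, -122.3331),
    ('Portland, Oregon', 'city(568380)_region', 45.52, -122.6819),
])

# Parameterized lookup: no million-row fetch into Python.
row = db.execute('SELECT title, type, lat, lon FROM coords WHERE title = ?',
                 ('Seattle',)).fetchone()
print(row)
```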
[el for el in l if "Portland, Oregon" == el[0]]
[('Portland, Oregon', 'city(568380)_region', 45.519999999999996, -122.68194444444445)]
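That `city(568380)_region` string suggests the type field sometimes embeds a population; a small regex can split it out (a sketch, and `parse_type` is my own name for it):

```python
import re

def parse_type(type_str):
    """Split a coord type like 'city(568380)_region' into its kind
    and optional embedded population."""
    m = re.match(r'([a-z0-9]+)(?:\((\d+)\))?', type_str)
    if not m:
        return None, None
    kind, pop = m.group(1), m.group(2)
    return kind, int(pop) if pop else None

print(parse_type('city(568380)_region'))  # ('city', 568380)
print(parse_type('adm1st_scale'))         # ('adm1st', None)
```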
[el for el in l if "San Francisco" == el[0]]
[('San Francisco', '', 37.78333333333333, -122.41666666666667)]
len(l)
1152376
Hmm, probably too many to do TF-IDF on all of them. Can we narrow it down?
l = c.execute('''SELECT title,coords FROM coords WHERE
display == "inline,title"
''').fetchall()
len(l)
602882
[el for el in l if 'Seattle' == el[0]]
[('Seattle', 'coord|47|36|35|N|122|19|59|W|region:US-WA|display=inline,title')]
[el for el in l if 'Portland, Oregon' == el[0]]
[('Portland, Oregon', 'coord|45|31|12|N|122|40|55|W|type:city(568380)_region:US-OR_source:gnis-1136645|display=inline,title')]
This narrows it down by roughly half (602,882 of the 1,152,376 rows), at least.
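To see what other `display` values exist, and how much each one would narrow things, a `GROUP BY` over the same column works; sketched here against a toy table:

```python
import sqlite3

# Toy stand-in for the coords table with just the display column.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE coords (title TEXT, display TEXT)')
db.executemany('INSERT INTO coords VALUES (?, ?)', [
    ('Seattle', 'inline,title'),
    ('Portland, Oregon', 'inline,title'),
    ('Fremont Troll', 'inline'),
    ('Gum Wall', ''),
])

# Count rows per display value, largest buckets first.
for display, n in db.execute(
        'SELECT display, COUNT(*) FROM coords GROUP BY display ORDER BY 2 DESC'):
    print(repr(display), n)
```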
l[:50]
[('Andorra', 'coord|42|30|N|1|31|E|display=inline,title'), ('Atlantic Ocean', 'coord|0|N|25|W|region:ZZ_type:waterbody|display=inline,title'), ('Aegean Sea', 'coord|39|N|25|E|type:waterbody_dim:500000|display=inline,title'), ('Amsterdam', 'coord|52|22|N|4|54|E|region:NL|display=inline,title'), ('Anatolia', 'Coord|39|N|35|E|type:country|display=inline,title'), ('Aberdeenshire', 'coord|57|9|3.6|N|2|7|22.8|W|type:adm2nd_region:GB|format=dms|display=inline,title'), ('Azincourt', 'coord|50.46|2.13|format=dms|display=inline,title'), ('Geography of Azerbaijan', 'Coord|40|30|N|47|30|E|type:country_region:AZ|display=inline,title'), ('Adelaide', 'coord|34|55|44|S|138|36|4|E|type:city_region:AU-SA|display=inline,title'), ('Athens', 'coord|37|59|02.3|N|23|43|40.1|E|format=dms|display=inline,title'), ('Ames, Iowa', 'coord|42|02|05|N|93|37|12|W|region:US-IA|display=inline,title'), ('Abensberg', 'coord|48|48|N|11|51|E|format=dms|display=inline,title'), ('Park Güell', 'Coord|41|24|49|N|2|09|10|E|region:ES-CT_scale:5000|display=inline,title'), ('Andes', 'coord|32|S|70|W|type:mountain|display=inline,title'), ('Anazarbus', 'coord|37|15|50|N|35|54|20|E|display=inline,title'), ('Abbotsford House', 'coord|55|35|59|N|2|46|55|W|region:GB|display=inline,title'), ('Abydos, Egypt', 'coord|26|11|06|N|31|55|08|E|display=inline,title'), ('Acapulco', 'coord|16|51|49|N|99|52|57|W|region:MX|display=inline,title'), ('Aachen', 'coord|50|46|32|N|06|05|01|E|format=dms|display=inline,title'), ('Aga Khan I', 'coord|latitude|longitude|type:landmark|display=inline,title'), ('Aga Khan III', 'coord|latitude|longitude|type:landmark|display=inline,title'), ('Ahmad Shah Durrani', 'coord|31|37|10|N|65|42|25|E|display=inline,title'), ('Aberdeen', 'coord|57.15|-2.11|type:city_region:GB-ABE|display=inline,title'), ('Algiers', 'coord|36|45|14|N|3|3|32|E|region:DZ_type:city|display=inline,title'), ('Amathus', 'coord|34|42|45|N|33|08|30|E|type:city_region:CY|display=inline,title'), ('Amphipolis', 
'coord|40|49|6|N|23|50|24|E|format=dms|display=inline,title'), ('Anah', 'coord|34|22|20|N|41|59|15|E|region:IQ|display=inline,title'), ('Andaman Islands', 'Coord|12|30|N|92|45|E|region:IN_type:isle|display=inline,title'), ('Astoria, Oregon', 'coord|46|11|20|N|123|49|16|W|type:city(10045)_region:US-OR_source:gnis-1117076|display=inline,title'), ('Abadan, Iran', 'coord|30|20|21|N|48|18|15|E|region:IR|display=inline,title'), ('Australian Capital Territory', 'coord|35|18|29|S|149|7|28|E|display=inline,title'), ('Austin, Texas', 'coord|30|16|N|97|44|W|region:US-TX|display=inline,title'), ('Acropolis of Athens', 'coord|37|58|15|N|23|43|34|E|type:landmark_region:GR_scale:2000|display=inline,title'), ('Acadia University', 'Coord|45|05|16|N|64|21|58|W|region:CA-NS_type:edu|display=inline,title'), ('Albion, Michigan', 'coord|42|14|48|N|84|45|12|W|region:US-MI|display=inline,title'), ('Abdul Rashid Dostum', 'Coord|LAT|LONG|display=inline,title'), ('Andhra Pradesh', 'coord|16.50|80.64|region:IN-AP_type:adm1st_dim:500000|display=inline,title'), ('Alicante', 'coord|38|20|43|N|0|28|59|W|region:ES_type:city|display=inline,title'), ('Aarau', 'coord|47|24|N|8|03|E|display=inline,title'), ('Canton of Aargau', 'coord|47|5|N|8|0|E|format=dms|display=inline,title'), ('Abadeh', 'coord|31|09|39|N|52|39|02|E|region:IR_type:city|display=inline,title'), ('Abakan', 'coord|53|43|N|91|28|E|display=inline,title'), ('Arc de Triomphe', 'coord|48.8738|2.2950|display=inline,title'), ('Aswan', 'coord|24|05|20|N|32|53|59|E|region:EG|display=inline,title'), ('Angus, Scotland', 'coord|56|40|N|2|55|W|display=inline,title|format=dms'), ('Abano Terme', 'coord|45|21|37|N|11|47|24|E|region:IT_type:city(19062)|display=inline,title'), ('Aeclanum', 'coord|41|3|14|N|15|0|40|E|display=inline,title'), ('Aegina', 'coord|37|45|N|23|26|E|format=dms|display=inline,title'), ('Ajaccio', 'coord|41.9267|8.7369|format=dms|display=inline,title'), ('Ajmer', 'coord|26.4499|N|74.6399|E|display=inline,title')]
I could also pull in this dataset of cities, which would narrow it down a lot further, but then I would probably have to do a bunch of fuzzy name matching :(
import pandas as pd
index_col = ['nn', 'name', 'name_ascii',
'other_names', 'lat', 'lon', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'region', 'date']
len(index_col)
19
cities_df = pd.read_csv('cities15000.tsv', sep='\t', header=None, names=index_col)
cities_df[cities_df.f.notna()].sample(10)
| nn | name | name_ascii | other_names | lat | lon | a | b | c | d | e | f | g | h | i | j | k | region | date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15964 | 1697175 | Olongapo | Olongapo | Bandaraya Olongapo,Ciudad ti Olongapo,Dakbayan... | 14.82917 | 120.28278 | P | PPLA3 | PH | NaN | 03 | 64 | 037107000 | NaN | 221178 | NaN | 11 | Asia/Manila | 2017-12-27 |
4879 | 2881485 | Landshut | Landshut | Gorad Landsgut,Landishuta,Landsgut,Landshuad,L... | 48.52961 | 12.16179 | P | PPLA2 | DE | NaN | 02 | 092 | 09261 | 09261000 | 60488 | NaN | 488 | Europe/Berlin | 2014-02-08 |
298 | 3855974 | Esquel | Esquel | EQS,Ehskel',Eskelis,Esquel,ai si ke er,ayskywl... | -42.91147 | -71.31947 | P | PPL | AR | NaN | 04 | 26035 | NaN | NaN | 28486 | NaN | 577 | America/Argentina/Catamarca | 2015-04-22 |
14950 | 1028434 | Quelimane | Quelimane | Gorad Kelimaneh,Kelimane,Kelimanė,Quelimane,UE... | -17.87861 | 36.88833 | P | PPLA | MZ | NaN | 09 | NaN | NaN | NaN | 188964 | NaN | 9 | Africa/Maputo | 2012-01-19 |
4007 | 3682108 | Garzón | Garzon | GLJ,Garzon,Garzón | 2.19593 | -75.62777 | P | PPL | CO | NaN | 16 | 41298 | NaN | NaN | 29451 | NaN | 834 | America/Bogota | 2017-06-21 |
3084 | 1529452 | Hoxtolgay | Hoxtolgay | Chia-shih-t'o-lo-kai,Chia-shih-t’o-lo-kai,Hesh... | 46.51872 | 86.00214 | P | PPLA4 | CN | NaN | 13 | NaN | NaN | NaN | 22000 | NaN | 804 | Asia/Urumqi | 2013-06-04 |
16582 | 756868 | Tomaszów Lubelski | Tomaszow Lubelski | Liublino Tomasuvas,Liublino Tomašuvas,Tomashev... | 50.44767 | 23.41616 | P | PPLA2 | PL | NaN | 75 | 0618 | 061801 | NaN | 20261 | 275.0 | 274 | Europe/Warsaw | 2010-10-14 |
10703 | 1269834 | Ichalkaranji | Ichalkaranji | Icalkarandzi,Ichalkaranji,aychalkaranjy,aychl ... | 16.69117 | 74.46054 | P | PPL | IN | NaN | 16 | 530 | NaN | NaN | 274383 | NaN | 561 | Asia/Kolkata | 2014-10-13 |
12517 | 248414 | Kurayyimah | Kurayyimah | Kareime,Kereime,Kuraiyima,Kuraymah,Kurayyimah,... | 32.27639 | 35.59938 | P | PPL | JO | NaN | 18 | NaN | NaN | NaN | 17837 | NaN | -183 | Asia/Amman | 2016-05-07 |
18765 | 65170 | Baardheere | Baardheere | NaN | 2.34464 | 42.27644 | P | PPLL | SO | NaN | 06 | NaN | NaN | NaN | 42240 | NaN | 97 | Africa/Mogadishu | 2006-01-27 |
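A pandas gotcha worth flagging here: comparing a column to `float('nan')` with `!=` never drops NaN rows, because NaN compares unequal to everything, including itself; `notna()` is the idiomatic filter. A small demonstration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'f': [64.0, np.nan, 26035.0]})

# != float('nan') keeps every row: NaN != anything is True,
# so this filter is a no-op.
print(len(df[df.f != float('nan')]))  # 3

# notna() actually drops the NaN row.
print(len(df[df.f.notna()]))  # 2
```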
Insert fuzzy name matching (if necessary)
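If it does turn out to be necessary, `difflib` from the standard library is a reasonable first pass; a sketch with toy title lists (the helper and the cutoff are my own choices, not tuned):

```python
import difflib

wiki_titles = ['Portland, Oregon', 'Seattle', 'Austin, Texas']
geonames_names = ['Portland', 'Seattle', 'Austin', 'Aarau']

def best_match(title, names, cutoff=0.8):
    # Wikipedia titles often carry ", State"-style disambiguators
    # that the GeoNames name column lacks; strip them first.
    base = title.split(',')[0]
    hits = difflib.get_close_matches(base, names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

for t in wiki_titles:
    print(t, '->', best_match(t, geonames_names))
```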