While looking at the diffs of the Wikipedia page about Love, I was surprised by the number and the specificity of the "vandalisms" on it. They make this page look more like a high school desk full of markings or a carved tree in the park. I find it more funny than damaging to the encyclopedia: a sign of the exteriority of this object of knowledge, far from being a pure ideal space of wisdom. Luckily, a lot of bots and users are working hard to delete these signs of love and make them invisible. I just wanted to see the proportions of the phenomenon and also explore this other side of collective intelligence. In my mind, using nltk would make it very easy, lower the supervision to a minimum, and also be a good exercise to start using that library.
First, we are going to load the diffs from a cache and use the wekeypedia python library to skip the data acquisition and parsing part, so we can jump straight into the fun of basic usage of the nltk tagger.
%run "libraries.ipynb"
common libraries loaded
def from_file(name):
    with codecs.open(name, "r", encoding="utf-8-sig") as f:
        data = json.load(f)

    return data
def list_revisions(page):
    return os.listdir("data/%s" % (page))
def load_revisions(s):
    p = wekeypedia.WikipediaPage(s)

    revisions_list = list_revisions(s)
    # strip the ".json" extension to keep only the revision ids
    revisions_list = map(lambda revid: revid.split(".")[0], revisions_list)

    revisions = { revid: from_file("data/%s/%s.json" % (s, revid)) for revid in revisions_list }

    return revisions
revisions = load_revisions("Love")
print "revisions: %s" % len(revisions)
revisions: 6324
page = wekeypedia.WikipediaPage("Love")
The first thing that needs to be done is to tokenize sentences with nltk.word_tokenize and then tag them with nltk.pos_tag, which uses the Penn Treebank tagset.
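For example (an illustration added here, not a cell from the original run; the exact tags can vary with the NLTK version and models installed), tagging a made-up sentence gives something like:

print nltk.pos_tag(nltk.word_tokenize("Sarah loves John forever"))
# expected: [('Sarah', 'NNP'), ('loves', 'VBZ'), ('John', 'NNP'), ('forever', 'RB')]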
We are going to use the tagging in a very basic way, by looking at sentences that include at least two proper nouns (NNP) and at least 3 words, keeping every variation of "x loves y", "x + y = love", or "love is about x and y". The pos_tag function gives back basic results. For a more accurate analysis, it would be more useful to use nltk.ne_chunk() and look for 2 x PERSON + 1 x VERB.
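As a reference, here is a rough sketch of that more accurate route (added as an illustration; it assumes NLTK 3's ne_chunk with the maxent_ne_chunker and words data downloaded, and is not used in the rest of the notebook):

def count_persons(sentence):
    # chunk named entities on top of the POS tags and count the PERSON subtrees
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    return len([ subtree for subtree in tree.subtrees() if subtree.label() == "PERSON" ])

print count_persons("Romeo loves Juliet")
# ideally 2, although the chunker is far from reliable on this kind of short input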
def i_love_u(pos_tags):
    return len([ t for t in pos_tags if t[1] == "NNP" and not("love" in t[0].lower()) ]) >= 2
We then make sure the sentence has at least 3 words but is also not too long: it is the usual edit&run signature. Some lovers have produced more elaborate declarations, but we will check those another time with other strategies.
def correct_size(pos_tags):
    return 2 < len(pos_tags) < 20
We also make sure the addition contains the word "love" at least once, whatever its inflection or position.
def contains_love(sentence):
    return "love" in sentence.lower()
We then compose all those conditions into one big chain of and.
def is_it_love(sentence):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(sentence))

    def i_love_u(pos_tags):
        return len([ t for t in pos_tags if t[1] == "NNP" and not("love" in t[0].lower()) ]) >= 2

    def correct_size(pos_tags):
        return 2 < len(pos_tags) < 20

    def contains_love(sentence):
        return "love" in sentence.lower()

    result = i_love_u(pos_tags) and correct_size(pos_tags) and contains_love(sentence)

    return result
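A quick sanity check with invented sentences (the tagger output can vary between NLTK versions, so treat the expected values as indicative):

print is_it_love("Sarah loves John forever")
# expected: True -- two proper nouns, short enough, contains "love"
print is_it_love("Love is a feeling of strong attachment")
# expected: False -- no proper noun besides "Love" itself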
def find_love_line(source, sentence):
    line = -1

    d = BeautifulSoup(source, 'html.parser')

    # find the diff row (<tr>) that contains the sentence
    tr = [ tr for tr in d.find_all("tr") if sentence in tr.get_text() ]

    # walk back to the closest row carrying a line header (<td class="diff-lineno">)
    for previous in tr[0].find_previous_siblings():
        if type(previous) == type(tr[0]) and len(previous.find_all("td", "diff-lineno")) > 0:
            line = previous.find("td").get_text()
            break

    # the header text looks like "Line 12:", keep only the number
    line = line.split(" ")[1]
    line = line[0:-1]

    return int(line)
def detect_love(revid):
    result = {
        "revid": revid,
        "love": [],
        "plusminus": {},
        "lines": []
    }

    diff_html = revisions[revid]["diff"]["*"]
    diff = page.extract_plusminus(diff_html)
    result["plusminus"] = diff

    rev_index = revisions.keys()
    print "\rrevision: %s/%s" % ( rev_index.index(revid), len(rev_index) ),

    # result["love"] = [ sentence for sentence in diff["added"] if is_it_love(sentence) ]
    for sentence in diff["added"]:
        if is_it_love(sentence):
            result["love"].append(sentence)
            result["lines"].append(find_love_line(diff_html, sentence))
            print " ♥︎",

    return result
# revlist = random.sample(revisions.keys(), 100)
revlist = revisions.keys()
results = [ detect_love(revid) for revid in revlist]
#results = [ detect_love(revid) for revid in [ "98452213" ] ]
print "\r ",
love = [ s for s in results if len(s["love"]) > 0 ]
print "♥︎" * len(love)
print len(love)
# print love
♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎ 684
final_result = []

for l in love:
    for s in l["love"]:
        # skip wiki markup leftovers (links, lists, section titles)
        if not("[" in s) and not("*" in s) and not("==" in s):
            final_result.append([ l["revid"], s ])
final_result = pd.DataFrame(final_result)
final_result.columns = ["revision id", "sentence"]
final_result.head()
print len(final_result)
final_result.to_csv("data/find_love.csv", encoding="utf-8")
396
The csv can be found on github and on google docs.
We then check our results manually and find ~70 false positives; the number of false negatives remains unknown. However, this is a solid base to start a semi-supervised machine learning classifier to find fancier grammatical structures.
You can find the final cleaned csv on github too.
final_result = pd.DataFrame.from_csv("data/find_love-checked.csv", encoding="utf-8")
final_result = final_result.drop(final_result[final_result["false positive"] == 1].index)
print len(final_result)
320
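As mentioned above, this hand-checked sample could seed a semi-supervised classifier. Here is a minimal sketch, not part of the original notebook: it assumes the checked csv keeps the "sentence" and "false positive" columns used above, that random, pandas and nltk are already loaded by libraries.ipynb, and it relies on a deliberately crude bag-of-POS-tags feature set.

def pos_features(sentence):
    # crude features: which Penn Treebank tags appear in the sentence
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    return { "has(%s)" % tag: True for (word, tag) in tags }

checked = pd.DataFrame.from_csv("data/find_love-checked.csv", encoding="utf-8")
labeled = [ (pos_features(row["sentence"]), row["false positive"] == 1) for _, row in checked.iterrows() ]

random.shuffle(labeled)
split = int(len(labeled) * 0.8)
classifier = nltk.NaiveBayesClassifier.train(labeled[:split])
print nltk.classify.accuracy(classifier, labeled[split:])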
In order to write all the marks of affection back into the Wikipedia page, we first need to extract where each one was added. The following implementation is not entirely accurate, since it retrieves the line number in the previous version of the page and not the current one. Since this is only for play purposes, we will not be bothered by that detail.
Retrieving the line number is relatively easy: we just need to extract the <td class="diff-lineno"> tags. For more information, you can still consult the previous notebook: parsing wikipedia diff.
def get_line_diff(revid):
    line = 0

    content = revisions[u"" + str(revid)]["diff"]["*"]
    html = BeautifulSoup(content, 'html.parser')

    # <td colspan="2" class="diff-lineno">
    line = html.find("td", class_="diff-lineno").get_text()
    line = line.split(" ")[1]
    line = line[0:-1]

    return line
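A quick usage check, reusing the revision id that appears commented out earlier, purely for illustration:

print get_line_diff("98452213")
# the line header of the first hunk of that diff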
# content = page.get_revision()
content = BeautifulSoup(page.get_revision(), 'html.parser')
def insert_love(content, line, text):
    content.insert(int(line), BeautifulSoup(text, 'html.parser'))
results_split = final_result.to_dict(orient="split")

for i in results_split["index"]:
    current_love = [ l for l in love if l["revid"] == str(i) ][0]

    revid = i
    sentence = results_split["data"][results_split["index"].index(i)][0]

    # yuck
    if sentence in current_love["love"]:
        line = current_love["lines"][current_love["love"].index(sentence)]
        tag = u"<span class=\"love\" data-revision_id=\"%s\">♥︎ %s ♥︎</span>" % (revid, sentence)
        insert_love(content, line, tag)

with codecs.open("outputs/findlove.html", "w", "utf-8-sig") as f:
    f.write(content.prettify())
display(HTML("<h1>Preview</h1>"))
display(HTML("<iframe src=outputs/findlove.html width=700 height=350></iframe>"))
As you can see, most of the love happens on the first lines of the page, but some lovers also venture into the depths of the page with more subtle tactics than the usual edit&run. Looking at the user ids, some of them do not even bother to do it anonymously.
You can also check the page with a better CSS and links that redirect to the corresponding revisions.
names = defaultdict(int)

for sentence in list(final_result["sentence"]):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    for w in pos_tags:
        if w[1] == "NNP":
            names[w[0].lower()] += 1

names = sorted(names.items(), key=lambda x: -x[1])

ignore = [ "love", "loves", "is", "<", ">", "in", "with", "}", "{", "(", ")", "=", "[", "]" ]
names = [ n for n in names if not(n[0] in ignore) ]
#names = { x[0]: x[1] for x in names if x[1] not in ignore }

for n in names:
    print "%s (%s)" % (n[0], n[1])
you (13) and (12) forever (10) u (10) example (9) god (8) john (8) the (8) sarah (7) more (7) yu (7) laura (6) simon (6) greek (6) x (6) anthony (6) danielle (5) helen (5) him (5) amanda (5) sex (5) josh (5) fisher (5) ali (4) emily (4) to (4) baby (4) my (4) fuck (4) a (4) alex (4) derek (4) nature (4) andrew (4) chemistry (4) true (4) nate (4) steph (4) kathleen (4) romantic (4) jessy (4) michael (3) ever (3) christian (3) lee (3) avery (3) mel (3) drew (3) boy (3) rory (3) chang (3) kristina (3) from (3) call (3) babe (3) /ref (3) help (3) stephen (3) hall (3) nick (3) mandy (3) muhammad (3) need (3) dom (3) abby (3) best (3) whole (3) tina (3) life (3) dylan (3) i (3) kelly (3) world (3) like (3) always (3) eric (3) much (3) jason (2) augustine (2) hilda (2) agron (2) go (2) hate (2) gabrielle (2) jessie (2) hannah (2) joey (2) cameron (2) victor (2) men (2) xharah (2) box (2) bible (2) nguyen (2) taub (2) nina (2) akil (2) alyssa (2) crazy (2) dolphin (2) poetry (2) schneider (2) over (2) blake (2) curtis (2) now (2) james (2) lucy (2) page=169 (2) girl (2) shannon (2) day (2) xxxmarc (2) jessica (2) lizzie (2) nobbs (2) david (2) cain (2) constantly (2) yair (2) ian (2) jummana (2) carole.. (2) saint (2) alan (2) hillary (2) jordan (2) constanza (2) roberts (2) lover (2) brandon (2) gill (2) geosits (2) m (2) chester (2) =d (2) duncan (2) 'erota (2) get (2) between (2) christopher (2) emo (2) daugherty (2) pants (2) matveet (2) amer (2) c (2) amazing (2) tarek (2) marc (2) bianchi (2) vince (2) andrea (2) will (2) it (2) kortney (2) brenna (2) madly (2) harris (2) albie (2) kyle (2) lauren (2) bryce (2) courtney (2) jake (2) armbruster (2) emma (2) rachael (2) dean (2) fucking (2) kenrick (2) brent (2) cooper (2) bobbi (2) by (2) on (2) brittany (2) hot (2) beth (2) happy (2) alexa (2) farina (2) caitlin (2) manion (2) an (2) nico (2) bianca (2) chris (2) natalie (2) all (1) cynthia (1) known (1) colleen (1) kate (1) cheney (1) edward (1) sona (1) word (1) mandi (1) joel (1) t. (1) soooo (1) dix (1) lisa (1) dhaliwal (1) carly (1) sign (1) poll (1) n (1) what (1) chicago (1) + (1) sooooooo (1) goes (1) angelone (1) petra (1) edited (1) bo (1) zach (1) chinar (1) darcie (1) strong (1) institute (1) k (1) experience (1) belford (1) bnmike (1) male (1) alexandria (1) barkman (1) sgj (1) bre (1) huzzah (1) would (1) m. (1) sutphin (1) jonathan (1) tell (1) jordyn (1) biatch (1) charity (1) ma (1) this (1) sucks (1) trang (1) whitten (1) rohan (1) cabeza (1) v (1) pleasurable (1) heart (1) waaaay (1) @ (1) genetailia (1) mohan (1) browne (1) chemical (1) ilir (1) pure (1) dannielle (1) mccoy (1) beach (1) aniket (1) ''i (1) pat (1) nympho (1) man (1) rodrigue (1) jodie (1) susan (1) chapman (1) kaity (1) so (1) grande (1) okcupid.com (1) hsnnu (1) singaporean (1) sanchez (1) september (1) soon (1) murray (1) shiels (1) scott (1) /a (1) mabel (1) sharmin (1) bulcock (1) ariam (1) mizzie (1) brook (1) piny (1) hoang (1) romeo (1) alexander (1) eugene (1) tattvasmasi (1) thei (1) olivia (1) adri (1) name (1) trac (1) betty (1) urieeeeeeeeeeeee (1) mean (1) el (1) neil (1) diana (1) hutsoall (1) steven (1) avisha (1) our (1) sexual (1) out (1) flaccid (1) daniel (1) she (1) ass (1) samya (1) jill (1) caro (1) keara (1) card (1) definition (1) kristyn (1) kirsty (1) language (1) thing (1) lindsey (1) austin (1) south (1) omgzorz (1) rachel (1) kaela (1) valentine (1) president (1) george (1) koh (1) little (1) garrick (1) anyone (1) bitz (1) tom (1) jen (1) hurt. 
(1) caitlyn (1) dolan (1) giselle (1) jamie (1) huw (1) shirley (1) vishal (1) than (1) ben (1) arun (1) sophie (1) feeling (1) stacey (1) are (1) fvnshdlhvb (1) sam (1) craig (1) love|url=http (1) snet-puss (1) ant (1) duvall (1) nikolay (1) have (1) goodie (1) well (1) boners (1) sherri (1) gilburt (1) connell (1) note (1) shamoon (1) bourgeois (1) green (1) dannille (1) brendon (1) who (1) skinner (1) nadine (1) marie (1) michelle (1) yang (1) tattvamasi (1) jocelyn (1) dot (1) thomas (1) christyy (1) joshi (1) kapel (1) naomi (1) limerence (1) scotti (1) it’s (1) babyyyyyyy (1) randy (1) black (1) rich (1) grutta (1) monica (1) dj (1) mortally (1) agapo (1) dustin (1) sanaz (1) schembri (1) yahoo.co.uk (1) joy (1) travis (1) leslie (1) sean (1) davyion (1) ari (1) juliet (1) chogan (1) temptation (1) hajir (1) stepehen (1) melissa (1) horrible (1) mum (1) latin (1) hazim (1) key (1) hurts (1) article (1) reaction (1) mindy (1) olga (1) keaton (1) s (1) annie (1) whorton (1) lovez (1) jeff (1) imadlak (1) leon (1) taylor (1) church (1) kayte.. (1) nicole (1) smith (1) robbie (1) mark (1) empty (1) sacha (1) website (1) gay (1) tomei (1) kandice (1) remy (1) undescribable (1) air (1) phuck (1) cox (1) eii (1) gupita (1) lacey (1) mitchell (1) robin (1) -justine (1) hewitt (1) lust (1) galipchak (1) im (1) heels (1) keerthana (1) whore (1) babii (1) mclaury (1) any (1) matthew (1) ruth (1) bunny (1) qaz3d (1) phi (1) goodwin (1) ella (1) blair (1) stacy (1) barber (1) kim (1) tung (1) navas (1) phu (1) ashley (1) com (1) obviously (1) kyran (1) lim (1) harney (1) ajitpal (1) ..i (1) being (1) human (1) paul (1) seductively (1) cocks (1) moretti (1) jose (1) death (1) bryn (1) tim (1) rawrrrrrrrrrrrrrrrrrrrrrrrrr (1) rose (1) nikki (1) bullshit (1) thermodynamics (1) maisey (1) real (1) aniika (1) monique (1) amy (1) bastards (1) holly (1) rebecca (1) adams (1) christina (1) christine (1) kennedy (1) danelle (1) patrick (1) ciardi (1) clay (1) thrower (1) hung (1) alexandra (1) jaimee (1) shea (1) shev (1) vsd (1) peter (1) scale (1) for (1) soto (1) gaskell (1) crossley (1) be (1) bb (1) a. (1) anything (1) of (1) hutchinson (1) amber (1) bowny (1) johns (1) “to (1) tajkowski (1) wade (1) contreras (1) your (1) katie (1) her (1) loh (1) there (1) /math (1) joseph (1) assholers (1) was (1) head (1) shane (1) bitches (1) melanie (1) hi (1) 'bold (1) line (1) botts (1) he (1) beckie (1) fanella (1) faggot (1) jenny (1) nazish.. (1) felix (1) chan (1) mike (1) erin (1) walters (1) sargeant (1) heather (1) no (1) djrttu (1) sharif (1) parikh (1) nat (1) love_life_advice (1) gemma (1) adam (1) shaun (1) walton (1) bonding (1) evan (1) jeremy (1) 'i (1) jones (1) casey (1) vice (1) together (1) corey (1) nelson (1) paige (1)