While looking at the diffs of the Wikipedia page about Love, I was surprised by the number and the specificity of the "vandalisms" on it. They make this page look more like a high school desk full of markings or a carved tree in the park. I find it more funny than damaging to the encyclopedia: a sign of the exteriority of this object of knowledge, far from being a pure ideal space of wisdom. Luckily, a lot of bots and users are working hard to delete these signs of love and make them invisible. I just wanted to see the proportions of the phenomenon and also explore this other side of collective intelligence. In my mind, using nltk would make it very easy, lower the supervision to a minimum, and also be a good exercise to start using that library.
First, we are going to load the diffs from a cache and use the wekeypedia python library to skip the data acquisition and parsing part, so we can jump straight into the fun of basic usage of the nltk tagger.
%run "libraries.ipynb"
common libraries loaded
def from_file(name):
    with codecs.open(name, "r", encoding="utf-8-sig") as f:
        data = json.load(f)

    return data
def list_revisions(page):
    return os.listdir("data/%s" % (page))
def load_revisions(s):
    p = wekeypedia.WikipediaPage(s)

    revisions_list = list_revisions(s)
    # strip the ".json" extension to keep only the revision ids
    revisions_list = map(lambda revid: revid.split(".")[0], revisions_list)

    revisions = { revid: from_file("data/%s/%s.json" % (s, revid)) for revid in revisions_list }

    return revisions
revisions = load_revisions("Love")
print "revisions: %s" % len(revisions)
revisions: 6324
page = wekeypedia.WikipediaPage("Love")
The first thing that needs to be done is to tokenize sentences with nltk.word_tokenize and then tag them with nltk.pos_tag, which uses the Penn Treebank tagset.
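For example (an illustration added here, not a cell from the original run; the exact tags can vary with the NLTK version and models installed), tagging a made-up sentence gives something like:

print nltk.pos_tag(nltk.word_tokenize("Sarah loves John forever"))
# expected: [('Sarah', 'NNP'), ('loves', 'VBZ'), ('John', 'NNP'), ('forever', 'RB')]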
We are going to use the tagging in a very basic way, by looking at sentences that include at least two proper nouns (NNP) and at least 3 words, keeping every variation of "x loves y", "x + y = love", or "love is about x and y". The pos_tag function gives back basic results. For a more accurate analysis, it would be more useful to use nltk.ne_chunk() and look for 2 x PERSON + 1 x VERB.
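As a reference, here is a rough sketch of that more accurate route (added as an illustration; it assumes NLTK 3's ne_chunk with the maxent_ne_chunker and words data downloaded, and is not used in the rest of the notebook):

def count_persons(sentence):
    # chunk named entities on top of the POS tags and count the PERSON subtrees
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    return len([ subtree for subtree in tree.subtrees() if subtree.label() == "PERSON" ])

print count_persons("Romeo loves Juliet")
# ideally 2, although the chunker is far from reliable on this kind of short input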
def i_love_u(pos_tags):
    return len([ t for t in pos_tags if t[1] == "NNP" and not("love" in t[0].lower()) ]) >= 2
We then make sure the sentence has at least 3 words but is also not too long: it is the usual edit&run signature. Some lovers have produced more elaborate declarations, but we will check those another time with other strategies.
def correct_size(pos_tags):
    return 2 < len(pos_tags) < 20
We also make sure the addition contains the word "love" at least once, whatever its inflection or position.
def contains_love(sentence):
    return "love" in sentence.lower()
We then compose all those conditions into one big chain of and.
def is_it_love(sentence):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(sentence))

    def i_love_u(pos_tags):
        return len([ t for t in pos_tags if t[1] == "NNP" and not("love" in t[0].lower()) ]) >= 2

    def correct_size(pos_tags):
        return 2 < len(pos_tags) < 20

    def contains_love(sentence):
        return "love" in sentence.lower()

    result = i_love_u(pos_tags) and correct_size(pos_tags) and contains_love(sentence)

    return result
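A quick sanity check with invented sentences (the tagger output can vary between NLTK versions, so treat the expected values as indicative):

print is_it_love("Sarah loves John forever")
# expected: True -- two proper nouns, short enough, contains "love"
print is_it_love("Love is a feeling of strong attachment")
# expected: False -- no proper noun besides "Love" itself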
def find_love_line(source, sentence):
    line = -1

    d = BeautifulSoup(source, 'html.parser')

    # find the diff row (<tr>) that contains the sentence
    tr = [ tr for tr in d.find_all("tr") if sentence in tr.get_text() ]

    # walk back to the closest row carrying a line header (<td class="diff-lineno">)
    for previous in tr[0].find_previous_siblings():
        if type(previous) == type(tr[0]) and len(previous.find_all("td", "diff-lineno")) > 0:
            line = previous.find("td").get_text()
            break

    # the header text looks like "Line 12:", keep only the number
    line = line.split(" ")[1]
    line = line[0:-1]

    return int(line)
def detect_love(revid):
    result = {
        "revid": revid,
        "love": [],
        "plusminus": {},
        "lines": []
    }

    diff_html = revisions[revid]["diff"]["*"]
    diff = page.extract_plusminus(diff_html)
    result["plusminus"] = diff

    rev_index = revisions.keys()
    print "\rrevision: %s/%s" % ( rev_index.index(revid), len(rev_index) ),

    # result["love"] = [ sentence for sentence in diff["added"] if is_it_love(sentence) ]
    for sentence in diff["added"]:
        if is_it_love(sentence):
            result["love"].append(sentence)
            result["lines"].append(find_love_line(diff_html, sentence))
            print " ♥︎",

    return result
# revlist = random.sample(revisions.keys(), 100)
revlist = revisions.keys()
results = [ detect_love(revid) for revid in revlist]
#results = [ detect_love(revid) for revid in [ "98452213" ] ]
print "\r ",
love = [ s for s in results if len(s["love"]) > 0 ]
print "♥︎" * len(love)
print len(love)
# print love
♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎♥︎ 684
final_result = []

for l in love:
    for s in l["love"]:
        # skip wiki markup leftovers (links, lists, section titles)
        if not("[" in s) and not("*" in s) and not("==" in s):
            final_result.append([ l["revid"], s ])
final_result = pd.DataFrame(final_result)
final_result.columns = ["revision id", "sentence"]
final_result.head()
print len(final_result)
final_result.to_csv("data/find_love.csv", encoding="utf-8")
396
The csv can be found on github and on google docs.
We then check our results manually and find ~70 false positives; the number of false negatives remains unknown. However, this is a solid base to start a semi-supervised machine learning classifier to find fancier grammatical structures.
You can find the final cleaned csv on github too.
final_result = pd.DataFrame.from_csv("data/find_love-checked.csv", encoding="utf-8")
final_result = final_result.drop(final_result[final_result["false positive"] == 1].index)
print len(final_result)
320
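As mentioned above, this hand-checked sample could seed a semi-supervised classifier. Here is a minimal sketch, not part of the original notebook: it assumes the checked csv keeps the "sentence" and "false positive" columns used above, that random, pandas and nltk are already loaded by libraries.ipynb, and it relies on a deliberately crude bag-of-POS-tags feature set.

def pos_features(sentence):
    # crude features: which Penn Treebank tags appear in the sentence
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    return { "has(%s)" % tag: True for (word, tag) in tags }

checked = pd.DataFrame.from_csv("data/find_love-checked.csv", encoding="utf-8")
labeled = [ (pos_features(row["sentence"]), row["false positive"] == 1) for _, row in checked.iterrows() ]

random.shuffle(labeled)
split = int(len(labeled) * 0.8)
classifier = nltk.NaiveBayesClassifier.train(labeled[:split])
print nltk.classify.accuracy(classifier, labeled[split:])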
In order to write all the marks of affection back into the Wikipedia page, we first need to extract where each one was added. The following implementation is not entirely accurate, since it retrieves the line number in the previous version of the page and not the current one. Since this is only for play purposes, we will not be bothered by that detail.
Retrieving the line number is relatively easy: we just need to extract the <td class="diff-lineno"> tags. For more information, you can still consult the previous notebook: parsing wikipedia diff.
def get_line_diff(revid):
    line = 0

    content = revisions[u"" + str(revid)]["diff"]["*"]
    html = BeautifulSoup(content, 'html.parser')

    # <td colspan="2" class="diff-lineno">
    line = html.find("td", class_="diff-lineno").get_text()
    line = line.split(" ")[1]
    line = line[0:-1]

    return line
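A quick usage check, reusing the revision id that appears commented out earlier, purely for illustration:

print get_line_diff("98452213")
# the line header of the first hunk of that diff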
# content = page.get_revision()
content = BeautifulSoup(page.get_revision(), 'html.parser')
def insert_love(content, line, text):
    content.insert(int(line), BeautifulSoup(text, 'html.parser'))
results_split = final_result.to_dict(orient="split")

for i in results_split["index"]:
    current_love = [ l for l in love if l["revid"] == str(i) ][0]

    revid = i
    sentence = results_split["data"][results_split["index"].index(i)][0]

    # yuck
    if sentence in current_love["love"]:
        line = current_love["lines"][current_love["love"].index(sentence)]
        tag = u"<span class=\"love\" data-revision_id=\"%s\">♥︎ %s ♥︎</span>" % (revid, sentence)
        insert_love(content, line, tag)

with codecs.open("outputs/findlove.html", "w", "utf-8-sig") as f:
    f.write(content.prettify())
display(HTML("<h1>Preview</h1>"))
display(HTML("<iframe src=outputs/findlove.html width=700 height=350></iframe>"))
As you can see, most of the love happens on the first lines of the page, but some lovers also venture into the depths of the page with more subtle tactics than the usual edit&run. Looking at the user ids, some of them do not even bother to do it anonymously.
You can also check the page with a better CSS and links that redirect to the corresponding revisions.
names = defaultdict(int)

for sentence in list(final_result["sentence"]):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    for w in pos_tags:
        if w[1] == "NNP":
            names[w[0].lower()] += 1

names = sorted(names.items(), key=lambda x: -x[1])

ignore = [ "love", "loves", "is", "<", ">", "in", "with", "}", "{", "(", ")", "=", "[", "]" ]
names = [ n for n in names if not(n[0] in ignore) ]
#names = { x[0]: x[1] for x in names if x[1] not in ignore }

for n in names:
    print "%s (%s)" % (n[0], n[1])
you (13) and (12) forever (10) u (10) example (9) god (8) john (8) the (8) sarah (7) more (7) yu (7) laura (6) simon (6) greek (6) x (6) anthony (6) danielle (5) helen (5) him (5) amanda (5) sex (5) josh (5) fisher (5) ali (4) emily (4) to (4) baby (4) my (4) fuck (4) a (4) alex (4) derek (4) nature (4) andrew (4) chemistry (4) true (4) nate (4) steph (4) kathleen (4) romantic (4) jessy (4) michael (3) ever (3) christian (3) lee (3) avery (3) mel (3) drew (3) boy (3) rory (3) chang (3) kristina (3) from (3) call (3) babe (3) /ref (3) help (3) stephen (3) hall (3) nick (3) mandy (3) muhammad (3) need (3) dom (3) abby (3) best (3) whole (3) tina (3) life (3) dylan (3) i (3) kelly (3) world (3) like (3) always (3) eric (3) much (3) jason (2) augustine (2) hilda (2) agron (2) go (2) hate (2) gabrielle (2) jessie (2) hannah (2) joey (2) cameron (2) victor (2) men (2) xharah (2) box (2) bible (2) nguyen (2) taub (2) nina (2) akil (2) alyssa (2) crazy (2) dolphin (2) poetry (2) schneider (2) over (2) blake (2) curtis (2) now (2) james (2) lucy (2) page=169 (2) girl (2) shannon (2) day (2) xxxmarc (2) jessica (2) lizzie (2) nobbs (2) david (2) cain (2) constantly (2) yair (2) ian (2) jummana (2) carole.. (2) saint (2) alan (2) hillary (2) jordan (2) constanza (2) roberts (2) lover (2) brandon (2) gill (2) geosits (2) m (2) chester (2) =d (2) duncan (2) 'erota (2) get (2) between (2) christopher (2) emo (2) daugherty (2) pants (2) matveet (2) amer (2) c (2) amazing (2) tarek (2) marc (2) bianchi (2) vince (2) andrea (2) will (2) it (2) kortney (2) brenna (2) madly (2) harris (2) albie (2) kyle (2) lauren (2) bryce (2) courtney (2) jake (2) armbruster (2) emma (2) rachael (2) dean (2) fucking (2) kenrick (2) brent (2) cooper (2) bobbi (2) by (2) on (2) brittany (2) hot (2) beth (2) happy (2) alexa (2) farina (2) caitlin (2) manion (2) an (2) nico (2) bianca (2) chris (2) natalie (2) all (1) cynthia (1) known (1) colleen (1) kate (1) cheney (1) edward (1) sona (1) word (1) mandi (1) joel (1) t. (1) soooo (1) dix (1) lisa (1) dhaliwal (1) carly (1) sign (1) poll (1) n (1) what (1) chicago (1) + (1) sooooooo (1) goes (1) angelone (1) petra (1) edited (1) bo (1) zach (1) chinar (1) darcie (1) strong (1) institute (1) k (1) experience (1) belford (1) bnmike (1) male (1) alexandria (1) barkman (1) sgj (1) bre (1) huzzah (1) would (1) m. (1) sutphin (1) jonathan (1) tell (1) jordyn (1) biatch (1) charity (1) ma (1) this (1) sucks (1) trang (1) whitten (1) rohan (1) cabeza (1) v (1) pleasurable (1) heart (1) waaaay (1) @ (1) genetailia (1) mohan (1) browne (1) chemical (1) ilir (1) pure (1) dannielle (1) mccoy (1) beach (1) aniket (1) ''i (1) pat (1) nympho (1) man (1) rodrigue (1) jodie (1) susan (1) chapman (1) kaity (1) so (1) grande (1) okcupid.com (1) hsnnu (1) singaporean (1) sanchez (1) september (1) soon (1) murray (1) shiels (1) scott (1) /a (1) mabel (1) sharmin (1) bulcock (1) ariam (1) mizzie (1) brook (1) piny (1) hoang (1) romeo (1) alexander (1) eugene (1) tattvasmasi (1) thei (1) olivia (1) adri (1) name (1) trac (1) betty (1) urieeeeeeeeeeeee (1) mean (1) el (1) neil (1) diana (1) hutsoall (1) steven (1) avisha (1) our (1) sexual (1) out (1) flaccid (1) daniel (1) she (1) ass (1) samya (1) jill (1) caro (1) keara (1) card (1) definition (1) kristyn (1) kirsty (1) language (1) thing (1) lindsey (1) austin (1) south (1) omgzorz (1) rachel (1) kaela (1) valentine (1) president (1) george (1) koh (1) little (1) garrick (1) anyone (1) bitz (1) tom (1) jen (1) hurt. 
(1) caitlyn (1) dolan (1) giselle (1) jamie (1) huw (1) shirley (1) vishal (1) than (1) ben (1) arun (1) sophie (1) feeling (1) stacey (1) are (1) fvnshdlhvb (1) sam (1) craig (1) love|url=http (1) snet-puss (1) ant (1) duvall (1) nikolay (1) have (1) goodie (1) well (1) boners (1) sherri (1) gilburt (1) connell (1) note (1) shamoon (1) bourgeois (1) green (1) dannille (1) brendon (1) who (1) skinner (1) nadine (1) marie (1) michelle (1) yang (1) tattvamasi (1) jocelyn (1) dot (1) thomas (1) christyy (1) joshi (1) kapel (1) naomi (1) limerence (1) scotti (1) it’s (1) babyyyyyyy (1) randy (1) black (1) rich (1) grutta (1) monica (1) dj (1) mortally (1) agapo (1) dustin (1) sanaz (1) schembri (1) yahoo.co.uk (1) joy (1) travis (1) leslie (1) sean (1) davyion (1) ari (1) juliet (1) chogan (1) temptation (1) hajir (1) stepehen (1) melissa (1) horrible (1) mum (1) latin (1) hazim (1) key (1) hurts (1) article (1) reaction (1) mindy (1) olga (1) keaton (1) s (1) annie (1) whorton (1) lovez (1) jeff (1) imadlak (1) leon (1) taylor (1) church (1) kayte.. (1) nicole (1) smith (1) robbie (1) mark (1) empty (1) sacha (1) website (1) gay (1) tomei (1) kandice (1) remy (1) undescribable (1) air (1) phuck (1) cox (1) eii (1) gupita (1) lacey (1) mitchell (1) robin (1) -justine (1) hewitt (1) lust (1) galipchak (1) im (1) heels (1) keerthana (1) whore (1) babii (1) mclaury (1) any (1) matthew (1) ruth (1) bunny (1) qaz3d (1) phi (1) goodwin (1) ella (1) blair (1) stacy (1) barber (1) kim (1) tung (1) navas (1) phu (1) ashley (1) com (1) obviously (1) kyran (1) lim (1) harney (1) ajitpal (1) ..i (1) being (1) human (1) paul (1) seductively (1) cocks (1) moretti (1) jose (1) death (1) bryn (1) tim (1) rawrrrrrrrrrrrrrrrrrrrrrrrrr (1) rose (1) nikki (1) bullshit (1) thermodynamics (1) maisey (1) real (1) aniika (1) monique (1) amy (1) bastards (1) holly (1) rebecca (1) adams (1) christina (1) christine (1) kennedy (1) danelle (1) patrick (1) ciardi (1) clay (1) thrower (1) hung (1) alexandra (1) jaimee (1) shea (1) shev (1) vsd (1) peter (1) scale (1) for (1) soto (1) gaskell (1) crossley (1) be (1) bb (1) a. (1) anything (1) of (1) hutchinson (1) amber (1) bowny (1) johns (1) “to (1) tajkowski (1) wade (1) contreras (1) your (1) katie (1) her (1) loh (1) there (1) /math (1) joseph (1) assholers (1) was (1) head (1) shane (1) bitches (1) melanie (1) hi (1) 'bold (1) line (1) botts (1) he (1) beckie (1) fanella (1) faggot (1) jenny (1) nazish.. (1) felix (1) chan (1) mike (1) erin (1) walters (1) sargeant (1) heather (1) no (1) djrttu (1) sharif (1) parikh (1) nat (1) love_life_advice (1) gemma (1) adam (1) shaun (1) walton (1) bonding (1) evan (1) jeremy (1) 'i (1) jones (1) casey (1) vice (1) together (1) corey (1) nelson (1) paige (1)