#!/usr/bin/env python # coding: utf-8 # # Parsing and decoding wikipedia page diffs # # - contact: tamkien@cri-paris.com # - github: https://github.com/WeKeyPedia/notebooks/ # # This notebook is a short helper for python beginners and wikipedia API diggers that want to learn how to use the result of queries towarded to diff between pages. We will also cover very basic usage of Natural Language Processing (NLP) using the [nltk library](http://www.nltk.org/). # # The code shown below have been produced within the [wekeypedia project](https://github.com/wekeypedia) to produce term-editor and term-page bipartite networks. Equivalent procedures have been implemented within our [python library](http://toolkit-python.readthedocs.org/references/page.html#retrieving-and-parsing-diff) # # Most likely you will perform queries on the [wikipedia API](http://www.mediawiki.org/wiki/API:Query) that will look something like: # # ```json # { # "format": "json", # "action": "query", # "titles": [page title], # "redirects":"true", # "prop": "info|revisions", # "inprop": "url", # "rvdiffto" : "prev" # } # ``` # # As this notebook is not about to make queries, we are going to use directly the wrappers that have been package within our [wekeypedia python library](https://github.com/wekeypedia/toolkit-python). There is a bundle object `wekeypedia.wikipedia.api` that allows to build queries and get back the json result. You can also use the [WikipediaPage.get_diff](http://toolkit-python.readthedocs.org/references/generated/wekeypedia.wikipedia.page.WikipediaPage.get_diff.html). # In[1]: import os, sys, pprint, random from collections import defaultdict from bs4 import BeautifulSoup import nltk from IPython.display import display, HTML sys.path.append(os.path.abspath('../../WKP-python-toolkit')) import wekeypedia # In[2]: p = wekeypedia.WikipediaPage("Love") revisions_list = p.get_revisions_list() # diff = p.get_diff(100000308) diff = p.get_diff(194033798) # ## Information extraction from the json response # # When you ask for a diff between two revisions, the wikipedia API will likely give you back something like that: # # ```html # # Line 172: # Line 172: # # #   #
''[[Adveṣa]]'' and ''[[metta|maitrī]]'' are benevolent love. This love is unconditional and requires considerable self-acceptance. This is quite different from the ordinary love, which is usually about attachment and sex, which rarely occur without self-interest. Instead, in Buddhism it refers to detachment and unselfish interest in others' welfare.
#   #
''[[Adveṣa]]'' and ''[[metta|maitrī]]'' are benevolent love. This love is unconditional and requires considerable self-acceptance. This is quite different from the ordinary love, which is usually about attachment and sex, which rarely occur without self-interest. Instead, in Buddhism it refers to detachment and unselfish interest in others' welfare.
# # #   # #   # # # # − #
The Bodhisattva ideal in Tibetan Buddhism involves the complete renunciation of oneself in order to take on the burden of a suffering world. The strongest motivation one has in order to take the path of the Bodhisattva is the idea of salvation within unselfish love for others.
# + #
The [[Bodhisattva]] ideal in Mahayana Buddhism involves the complete renunciation of oneself in order to take on the burden of a suffering world. The strongest motivation one has in order to take the path of the Bodhisattva is the idea of salvation within unselfish, altustic love for all sentient beings.
# # #   # #   # # # #   #
===Hindu===
#   #
===Hindu===
# # ``` # In[3]: display(HTML("

raw html query result

")) css = """ """ display(HTML(css)) display(HTML(diff)) # There is several kind of information we can extract. # # 1. Inline additions/deletions/substitutions to ``, ``, and various combinations of both within `` and `` tags # 2. Full block additionds and deletions enclosed within `` and `` tags # # The only tricky thing is to not register false positive because `class="diff-addedline"` and `class="diff-deletedline"` are also respectively used to show the previous state of an deletion or current state of an addition. That is why the following code target rows (``) instead of cells. The strategy is to keep only added blocks that are preceded by an empty row (``) before and deleted blocks that are followed by an empty cell. # In[4]: def extract(diff_html): diff = { "added": [], "deleted" : [] } d = BeautifulSoup(diff_html, 'html.parser') tr = d.find_all("tr") for what in [ ["added", "ins"], ["deleted", "del"] ]: a = [] # checking block # we also check this is not only context showing for non-substition edits a = [ t.find("td", "diff-%sline" % (what[0])) for t in tr if len(t.find_all(what[1])) == 0 and len(t.find_all("td", "diff-empty")) > 0 ] # checking inline a.extend(d.find_all(what[1])) # filtering empty extractions a = [ x for x in a if x != None ] # registering diff[what[0]] = [ tag.get_text() for tag in a ] return diff def print_plusminus_overview(diff): for minus in diff["deleted"]: print "- %s" % (minus) for plus in diff["added"]: print "+ %s" % (plus) display(HTML("

plus/minus overview

")) diff = extract(diff) print_plusminus_overview(diff) # ## Natural language processing # # We are now going to proceed to a little bit of language processing. NLTK provides very usefull starter tools to manipulate bits of natural language. The core of the workflow is about tokenization and normalization. # # The first stem is to be able to count words correctly, it is were normalization intervens: # # - **stemming** is the process of reducing a word to its roots. For example, you may want to transform "gods" to "god", "is" to "be", etc # - **lemmatization** is closely related to stemming. Whereas the first one is a context-free procedure, lemmatization take care of variables related to grammar like the position in the phrase to have a less agressive approach. # # Right now, we apply lemmatization without the grammatical information. This is just in order to prepare advanced NLP work. # In[5]: def normalize(word): lemmatizer = nltk.WordNetLemmatizer() stemmer = nltk.stem.porter.PorterStemmer() word = word.lower() word = stemmer.stem_word(word) word = lemmatizer.lemmatize(word) return word # The process of counting stems is mainly about mapping the result of the **tokenization** of plus/minus contents. Dividing sentences into parts and words can be a very tedious work without the right parser or if you are looking for a universal grammar. It is also very related to the language itself. For example parsing english or german is very different. For the moment, we are going to use the [Punkt tokenizer](http://www.nltk.org/api/nltk.tokenize.html) because it is now all about english sentences. # # Tokenization, stemming and lemmatization are very sensitive points. It is possible to develop more precise strategies depending on what you are looking for. We are going to let it fuzzy to give space to later use and keep a broad mindset about what can be done with diff information. Meanwhile, for counting purpose, the basic implementation of these methods are largely sufficient. # In[6]: def count_stems(sentences, inflections=None): stems = defaultdict(int) ignore_list = "{}()[]<>./,;\"':!?&#=*&%" if inflections == None: inflections = defaultdict(dict) for sentence in sentences: for word in nltk.word_tokenize(sentence): old = word word = normalize(word) if not(word in ignore_list): stems[word] += 1 # keeping track of inflection usages inflections[word].setdefault(old,0) inflections[word][old] += 1 return stems def print_plusminus_terms_overview(stems): print "\n%s|%s\n" % ("+"*len(stems["added"].items()), "-"*len(stems["deleted"].items())) def print_plusminus_terms(stems): for k in stems.keys(): display(HTML("

%s:

" % (k))) for term in stems[k]: print term # In[7]: inflections = defaultdict(dict) display(HTML("

plus/minus ---> terms

")) stems = {} stems["added"] = count_stems(diff["added"], inflections) stems["deleted"] = count_stems(diff["deleted"], inflections) print_plusminus_terms(stems) # ## inflections # # We have also kept trace of inflections. This is not very important over one diff but it is interesting if you have collected inflections over a large set of words. For example, you might want to use the most common inflection instead of the stem form to produce more readable/pretty words cloud. # In[8]: display(HTML("

inflections

")) for stem, i in inflections.iteritems(): print "[%s] %s" % (stem, ", ".join(map(lambda x: "%s (%s)" % (x[0], x[1]), i.items()))) # ## See also # # This procedure is extensively used in the [words of wisdom and love](words%20of%20wisdom%20and%20love.ipynb) notebook about counting reccuring terms in diff of love, ethics, wisdom and morality pages.