#!/usr/bin/env python
# coding: utf-8

# # Parsing and decoding Wikipedia page diffs
#
# - contact: tamkien@cri-paris.com
# - github: https://github.com/WeKeyPedia/notebooks/
#
# This notebook is a short helper for Python beginners and Wikipedia API diggers who want to learn how to use the results of queries for diffs between page revisions. We will also cover very basic usage of Natural Language Processing (NLP) using the [nltk library](http://www.nltk.org/).
#
# The code shown below has been produced within the [wekeypedia project](https://github.com/wekeypedia) to produce term-editor and term-page bipartite networks. Equivalent procedures have been implemented within our [python library](http://toolkit-python.readthedocs.org/references/page.html#retrieving-and-parsing-diff).
#
# Most likely you will perform queries on the [wikipedia API](http://www.mediawiki.org/wiki/API:Query) that look something like:
#
# ```json
# {
#   "format": "json",
#   "action": "query",
#   "titles": [page title],
#   "redirects": "true",
#   "prop": "info|revisions",
#   "inprop": "url",
#   "rvdiffto": "prev"
# }
# ```
#
# As this notebook is not about making queries, we will directly use the wrappers that have been packaged within our [wekeypedia python library](https://github.com/wekeypedia/toolkit-python). There is a bundle object `wekeypedia.wikipedia.api` that lets you build queries and get back the json result. You can also use the [WikipediaPage.get_diff](http://toolkit-python.readthedocs.org/references/generated/wekeypedia.wikipedia.page.WikipediaPage.get_diff.html) method.
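#
# As a minimal sketch of how the query parameters above could map to an actual request, the block below assembles them into a URL using only the standard library. The endpoint `https://en.wikipedia.org/w/api.php` and the helper names `build_diff_query` and `build_diff_url` are assumptions made for illustration; they are not part of the wekeypedia library, and no request is sent.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint; the notebook does not name one.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_diff_query(title):
    """Return the query parameters listed above for a single page title."""
    return {
        "format": "json",
        "action": "query",
        "titles": title,
        "redirects": "true",
        "prop": "info|revisions",
        "inprop": "url",
        "rvdiffto": "prev",
    }


def build_diff_url(title, endpoint=API_ENDPOINT):
    """Encode the parameters into a GET URL (hypothetical helper, no network access)."""
    return endpoint + "?" + urlencode(build_diff_query(title))


params = build_diff_query("Love")
url = build_diff_url("Love")
print(url)
```

# Passing the resulting URL to any HTTP client should return a json response of the kind discussed in the rest of this notebook.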

# In[1]:


import os, sys, pprint, random
from collections import defaultdict

from bs4 import BeautifulSoup
import nltk

from IPython.display import display, HTML

# Make the local checkout of the wekeypedia toolkit importable
sys.path.append(os.path.abspath('../../WKP-python-toolkit'))
import wekeypedia


# In[2]:


p = wekeypedia.WikipediaPage("Love")
revisions_list = p.get_revisions_list()

# diff = p.get_diff(100000308)
diff = p.get_diff(194033798)


# ## Information extraction from the json response
#
# When you ask for a diff between two revisions, the wikipedia API will likely give you back something like this:
#
# ```html
#