from IPython import display
display.Markdown('README.md')
This README is included in the pywikibot2gephi.ipynb Jupyter notebook, so you might as well head right over there.
These are my experiments in learning how to use any MediaWiki API through Pywikibot to collect data I am interested in processing as a network graph (originally in the convenient tool Gephi). Besides pretty pictures, network graphs offer powerful methods to visualize and elucidate data in ways that would otherwise be difficult. Two applications I have in mind are story narrative charts and an aid in corporate wiki management.
For the longest time, I've enjoyed playing with various types of network graphs, mainly using the revered open source tool Gephi. However, data acquisition and preparation is usually the challenge, so I've been wanting to get into programmatic ways of working with the data. I haven't so far explored lower-level tools like neo4j or Wikibase (the software which, together with MediaWiki, powers Wikidata, itself the structured data storage behind Wikipedia).
From the beginning, I've usually fed Gephi by creating some type of CSV/spreadsheet node list, edge list or adjacency matrix. With some learning effort, and assuming your data is in, for instance, Wikidata, you can feed it from the SemanticWebImport Gephi plugin and SPARQL queries (see my previous tutorial on that). There is also the option to simply scrape the web (that example using Beautiful Soup, Selenium, Pandas, NetworkX and Matplotlib), but that can easily get messy.
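For reference, a minimal node/edge list pair of the kind Gephi's spreadsheet importer accepts (it recognizes the `Id`/`Label` and `Source`/`Target` headers) can be produced with nothing but the standard library; the file names and the two example rows here are just for illustration:

```python
import csv
import os
import tempfile

# Illustrative data: two nodes and one edge between them
nodes = [(0, 'Eps1.1 ones-and-zer0es.mpeg'), (1, 'Elliot Alderson')]
edges = [(0, 1)]

outdir = tempfile.mkdtemp()

# Node list with the Id/Label headers Gephi expects
with open(os.path.join(outdir, 'nodes.csv'), 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Label'])
    writer.writerows(nodes)

# Edge list with the Source/Target headers Gephi expects
with open(os.path.join(outdir, 'edges.csv'), 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target'])
    writer.writerows(edges)
```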
Thus follows my current toolchain - there are lots of exciting data accessible through the MediaWiki APIs (not only Wikipedia but also sites in the Fandom/Wikia family, or perhaps your corporate wiki?). Pywikibot offers a convenient and well-used (though not as well documented?) Python wrapper for the MediaWiki API, handling authentication, parsing, caching and configurable throttling. igraph (n.b. the pip package is `python-igraph`) seems sensible and provides an interface to the Gephi GraphStreaming plugin. Jupyter notebooks are practically a given (see also my advent-of-code solutions, also in Jupyter notebooks).
As for potential paths of development, for when you don't have Gephi running alongside the notebook, one could opt for NetworkX as well as further use of IPython.display, or even try out something like pyvis (article here, see also their two previous articles).
Wikimedians may expect more Wikidata or even PAWS, but that's not where I'm going currently.
For inspiration, I draw mainly on Aidan Hogan et al.'s article "Knowledge Graphs (extended version)", as well as Alberto Cottica et al.'s presentation "People, words, data: Harvesting collective intelligence" and article "Semantic Social Networks: A Mixed Methods Approach to Digital Ethnography".
Create a virtual environment with `virtualenv ./venv`. Activate it like `source venv/bin/activate`. In it, `pip install -r requirements.txt`. Then open the notebook with `jupyter notebook pywikibot2gephi.ipynb`.
Alternatively, just click the "launch binder" button and mybinder.org together with repo2docker GitHub Actions should have all the requirements sorted for you in a jiffy (obviously a binder won't be able to connect to a local Gephi). Another option is to install the notebook in PAWS: A Web Shell, but I haven't tried that.
Cleaned up the project, wrote down (and reconstructed) this README. Currently working in the `dev3` branch, because I want things tidy before I set them in stone (a.k.a. `main`).
Created this repository and started my experiments. Yesterday I confirmed that by hard-coding our corporate SSO session cookie (figure out a way to pull it from the browser?) in `pywikibot.lwp` ("Set-Cookie3 format"), the regular MediaWiki API of the corporate wiki is usable. So while this is playing around for personal education, my hope is it could be quite useful also for various methods of corporate information and community management.
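For the record, the `pywikibot.lwp` file uses libwww-perl's "Set-Cookie3" cookie-jar format; a sketch, with entirely made-up cookie name, value and domain, might look like:

```
#LWP-Cookies-2.0
Set-Cookie3: SSOSESSION=0123456789abcdef; path="/"; domain=".wiki.example.com"; path_spec; domain_dot; secure; expires="2030-01-01 00:00:00Z"; version=0
```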
Imports are as expected. Note that the pip package for igraph is (for now) `python-igraph`. See the Usage section regarding general information on how you should and can run this code (as soon as it is described).
import pywikibot
import igraph as ig
import igraph.remote.gephi as igg
Pywikibot encodes its configuration for each MediaWiki it may operate on (Wikipedias in different languages, Wikidata, other MediaWikis etc.) in "families". Thus in `user-config.py` we have, as an example, encoded the basic configuration to work on the Mr Robot Fandom wiki:
display.Code('user-config.py')
mylang = 'mrrobot'
family = 'mrrobot'
usernames['mrrobot']['en'] = 'ExampleBot'
family_files['mrrobot'] = 'https://mrrobot.fandom.com/api.php'
Instantiate the pywikibot MediaWiki site as defined in `user-config.py`. In this case an `AutoFamily` is sufficient; otherwise consider the `families/` folder beneath where pywikibot is located.
site = pywikibot.Site()
Instantiate two `pywikibot.Page`s, corresponding to two pages we know exist in our example wiki. Note that I don't think any API calls are made yet; they happen only as you access attributes, and then results seem cached and throttled for sanity, with content fetching controllable or prefetched in page generators.
page1 = pywikibot.Page(site, 'Eps1.1 ones-and-zer0es.mpeg')
page2 = pywikibot.Page(site, 'Elliot Alderson')
Manually, for now, prepare two attribute dictionaries from each `Page`, to be added as nodes in our graph. Note that igraph will not take care of all compatibility issues with `GephiGraphStreamer`. Some I have come across are:

- attribute values may be `None` in igraph, but will be rejected by Gephi - this is particularly inconvenient if you increase the number of attributes at runtime

page1_attributes = {
'name': page1.title(),
# Note how several attributes/properties are methods, while this is an int
'pageid': page1.pageid,
'revision_count': page1.revision_count(),
# pywikibot frequently returns objects which may not be serializable
'namespace': str(page1.namespace()),
# again, here we convert Category objects into a string
'categories': ';'.join([category.title() for category in page1.categories()]),
# Contributors are a dict of usernames and number of revisions
'contributors': ';'.join(page1.contributors().keys()),
}
# We will have a function to perform this shortly
page2_attributes = {
'name': page2.title(),
'pageid': page2.pageid,
'revision_count': page2.revision_count(),
'namespace': str(page2.namespace()),
'categories': ';'.join([category.title() for category in page2.categories()]),
'contributors': ';'.join(page2.contributors().keys()),
}
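The comment above promises a function to build these dictionaries; a minimal sketch of what it could look like (the name `page_attributes` is my own choice), assuming the `Page`-like object offers the same methods used above:

```python
def page_attributes(page):
    """Flatten a pywikibot.Page into a Gephi-friendly attribute dict.

    Everything is reduced to int or str, since richer objects
    (Namespace, Category, ...) may not serialize for Gephi.
    """
    return {
        'name': page.title(),
        'pageid': page.pageid,
        'revision_count': page.revision_count(),
        # namespace() returns a Namespace object; stringify it
        'namespace': str(page.namespace()),
        # Category objects become a ;-separated string of titles
        'categories': ';'.join(c.title() for c in page.categories()),
        # contributors() is a dict of username -> revision count
        'contributors': ';'.join(page.contributors().keys()),
    }
```

With this, the two hand-written dictionaries above reduce to `page_attributes(page1)` and `page_attributes(page2)`.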
Instantiate our `igraph.Graph` object (again, I may be able to remove this ID generator and leave it to igraph). I'm curious whether this and `pywikibot.Site` can usefully be pickled to resume a previous session. Regardless, I am likely to write some experiment class to keep these for me.
g = ig.Graph()
# Seems igraph doesn't provide much convenience,
# better keep track of vertex IDs (edges will be fine)
vertex_ids = ig.UniqueIdGenerator()
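If igraph's `UniqueIdGenerator` is unfamiliar: indexing it with a new key hands out the next consecutive integer ID, while a key seen before returns its existing ID. A rough stdlib equivalent of that behavior (my own sketch, not igraph code):

```python
class SimpleIdGenerator:
    """Mimics the indexing behavior of igraph.UniqueIdGenerator."""

    def __init__(self):
        self._ids = {}

    def __getitem__(self, key):
        # Register unseen keys with the next consecutive ID
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]
```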
Perform some automated housekeeping on the page attribute dictionaries (and again, this is subject to change):
for attributes in page1_attributes, page2_attributes:
# Gephi expects vertex/node names in the `Label` field
attributes['Label'] = attributes['name']
# igraph.UniqueIdGenerator will retrieve an ID if key exists, or register the next one
attributes['id'] = vertex_ids[attributes['name']]
# ... so make sure to add the vertex if you generate IDs:
g.add_vertices(1, attributes)
Once added, the page attributes are now igraph vertices in the `igraph.Graph.vs` sequence property:
for v in g.vs:
print(v)
igraph.Vertex(<igraph.Graph object at 0x1151fb8b0>, 0, {'name': 'Eps1.1 ones-and-zer0es.mpeg', 'pageid': 2338, 'revision_count': 98, 'namespace': ':', 'categories': 'Category:Broadcast episodes;Category:Season 1;Category:Season 1 episodes;Category:Episodes', 'contributors': 'Devinthe66;Theropod from the North;LeverageGuru;Lilgroot;Muhdika;Homersprairies;Azerty8;Tsimaile;BerzekerLT;ToxicNutellaStudios;Sharkdavid77;Aaron Warner-Perfection;92.24.143.225;2.101.0.99;105.225.82.18;PLLLOVER1234;X-IT', 'Label': 'Eps1.1 ones-and-zer0es.mpeg', 'id': 0}) igraph.Vertex(<igraph.Graph object at 0x1151fb8b0>, 1, {'name': 'Elliot Alderson', 'pageid': 2167, 'revision_count': 212, 'namespace': ':', 'categories': 'Category:Allsafe Cybersecurity;Category:Hackers;Category:Season 1 characters;Category:Season 2 characters;Category:Season 3 characters;Category:Season 4 characters;Category:Characters;Category:Fsociety;Category:Major characters', 'contributors': 'Lilgroot;Lacunite;Scoothare;AnEvildoer;Skybluehaneul;Tanya AZian;LeverageGuru;Heliocopters;TheOriginalRiza32;Hecknawboi;Nad Sterk;Ithnam;Femtocell;Jakubgawka;Mayhem12367;Saturnkind;Protranslator;Empulsgfx;MonsterousMan;PhanAwesomeness;Original Authority;PhonixUK;Minirugman;Jon Ambrose 81;Monochromatic Bunny;GodVicious;HotheadedBirdy;F7 ACiD;Thootly;SomeCrazyObsessedFan;Tracadilla;HardLogic;ToxicSleeper;EmilyReedus-Dixon;Drvroom;Mrsrobot24;ToxicNutellaStudios;Shan282;Thegamingweirdo;199.119.233.205;72.191.69.125;99.248.157.182;199.7.157.101;Miokeyll;209.221.90.250;199.7.157.94;99.248.38.61;67.187.178.154;172.88.44.40;216.126.81.155;121.97.32.218;Pcnoic;104.177.94.189;66.85.139.244;197.35.7.173;Elliot Alderson;174.113.80.211;105.228.127.119;P-t-x;37.203.122.219;198.23.143.218;Mda228;217.208.115.32;73.132.17.165;108.9.159.237;JRob528;166.137.244.18;100.38.199.197;41.254.6.41;PLLLOVER1234;Zoeyadams;Spirit freak;146.115.155.4;Effectofthemassvariety;Vlop12;Sovq;X-IT;67.42.52.214', 'Label': 'Elliot Alderson', 'id': 1})
And once again keeping track of node IDs in an external generator and noting them in the `id` attribute, I check they are as expected in the graph, upon which we can add our first edge! `igraph.Graph.add_vertices` and `add_edges` are slightly particular about whether you add a whole sequence or a single object to the graph, so just note how they function.
for attributes in [page1_attributes, page2_attributes]:
assert attributes['name'] == g.vs[attributes['id']]['name']
print(f'Page "{attributes["name"]}" got the vertex ID we expected')
g.add_edge(page1_attributes['id'], page2_attributes['id'])
Page "Eps1.1 ones-and-zer0es.mpeg" got the vertex ID we expected Page "Elliot Alderson" got the vertex ID we expected
igraph.Edge(<igraph.Graph object at 0x1151fb8b0>, 0, {})
Pausing here for now - I would like to add the option to display the graph in the notebook when Gephi isn't available, as well as render to an image or GraphML, but that's for later.
import matplotlib.pyplot as plt
layout = g.layout(layout='circle')
fig, ax = plt.subplots()
ig.plot(g, layout=layout, target=ax, vertex_label=[v['name'] for v in g.vs])
# gephi = igg.GephiConnection()
# streamer = igg.GephiGraphStreamer()
# streamer.post(g, gephi)
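Given the `None`-rejection issue noted earlier, one might sanitize attribute dictionaries before streaming to Gephi. A sketch (the function name and the `''` fallback are my own choices, not anything igraph or Gephi prescribes):

```python
def sanitize_attributes(attributes, fallback=''):
    """Return a copy of an attribute dict with None values replaced.

    Gephi's graph streaming endpoint rejects null attribute values,
    so substitute a harmless fallback before posting.
    """
    return {k: (fallback if v is None else v) for k, v in attributes.items()}
```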
Similarly I should get acquainted with how I can pickle/cache `Site`s and `Graph`s:
# import pickle
# with open('mrrobotgraph.pickle', 'wb') as handle:
# pickle.dump(g, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('mrrobotgraph.pickle', 'rb') as handle:
# g = pickle.load(handle)
A generic status printout to keep at the bottom of the experiment:
print(f'vertices: {len(g.vs)}, edges: {len(g.es)}')
print(g.vs[0])
print(g.vs[1])
# print(g)
vertices: 2, edges: 1 igraph.Vertex(<igraph.Graph object at 0x1151fb8b0>, 0, {'name': 'Eps1.1 ones-and-zer0es.mpeg', 'pageid': 2338, 'revision_count': 98, 'namespace': ':', 'categories': 'Category:Broadcast episodes;Category:Season 1;Category:Season 1 episodes;Category:Episodes', 'contributors': 'Devinthe66;Theropod from the North;LeverageGuru;Lilgroot;Muhdika;Homersprairies;Azerty8;Tsimaile;BerzekerLT;ToxicNutellaStudios;Sharkdavid77;Aaron Warner-Perfection;92.24.143.225;2.101.0.99;105.225.82.18;PLLLOVER1234;X-IT', 'Label': 'Eps1.1 ones-and-zer0es.mpeg', 'id': 0}) igraph.Vertex(<igraph.Graph object at 0x1151fb8b0>, 1, {'name': 'Elliot Alderson', 'pageid': 2167, 'revision_count': 212, 'namespace': ':', 'categories': 'Category:Allsafe Cybersecurity;Category:Hackers;Category:Season 1 characters;Category:Season 2 characters;Category:Season 3 characters;Category:Season 4 characters;Category:Characters;Category:Fsociety;Category:Major characters', 'contributors': 'Lilgroot;Lacunite;Scoothare;AnEvildoer;Skybluehaneul;Tanya AZian;LeverageGuru;Heliocopters;TheOriginalRiza32;Hecknawboi;Nad Sterk;Ithnam;Femtocell;Jakubgawka;Mayhem12367;Saturnkind;Protranslator;Empulsgfx;MonsterousMan;PhanAwesomeness;Original Authority;PhonixUK;Minirugman;Jon Ambrose 81;Monochromatic Bunny;GodVicious;HotheadedBirdy;F7 ACiD;Thootly;SomeCrazyObsessedFan;Tracadilla;HardLogic;ToxicSleeper;EmilyReedus-Dixon;Drvroom;Mrsrobot24;ToxicNutellaStudios;Shan282;Thegamingweirdo;199.119.233.205;72.191.69.125;99.248.157.182;199.7.157.101;Miokeyll;209.221.90.250;199.7.157.94;99.248.38.61;67.187.178.154;172.88.44.40;216.126.81.155;121.97.32.218;Pcnoic;104.177.94.189;66.85.139.244;197.35.7.173;Elliot Alderson;174.113.80.211;105.228.127.119;P-t-x;37.203.122.219;198.23.143.218;Mda228;217.208.115.32;73.132.17.165;108.9.159.237;JRob528;166.137.244.18;100.38.199.197;41.254.6.41;PLLLOVER1234;Zoeyadams;Spirit freak;146.115.155.4;Effectofthemassvariety;Vlop12;Sovq;X-IT;67.42.52.214', 'Label': 'Elliot Alderson', 'id': 1})