-- A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel
Chapter is still in DRAFT stage
In this chapter we will introduce a new programming paradigm: Object Oriented Programming. We will build an application that constructs a social network and computes a graph of relations between people on Twitter. The nodes of the graph will be the Twitter users, and a directed edge indicates that one user speaks to another. Each edge will carry a weight representing the number of messages sent.
Given a Twitter corpus, we will extract who talks to whom. Whenever a connection is found, an edge is added to our graph, or an existing edge is strengthened.
Object oriented programming is a data-centered programming paradigm based on the idea of grouping data, and the functions that act on that data, in so-called classes. A class can be seen as a complex data type, a template if you will. Variables of that data type are said to be objects or instances of that class.
An example will clarify things:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
OK, several things happen here. We created a class Person with a function __init__. Functions whose names start and end with double underscores are special to Python and are connected with built-in aspects of the language. The initialisation function is called when an object of that class is created. Let's do so:
author = Person("Maarten", 30)
print("My name is " + author.name)
print("My age is " + str(author.age))
Functions within a class are called methods. The initialisation method assigns the two parameters that are passed to variables that belong to the object. Within a class definition, the object is always represented by self.
The first argument of a method is always the object itself (by convention named self), and it will always point to the instance of the class. This first argument, however, is never explicitly specified when you call the method; it is implicitly passed by Python itself. That is why you see a discrepancy between the number of arguments in the instantiation and in the class definition.
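To make the implicit self concrete, here is a minimal sketch. The class Greeter and its method greet are made up for this illustration and are not part of the course code. Calling a method on an instance gives the same result as calling it on the class and passing the instance explicitly:

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "Hello, I am " + self.name

g = Greeter("Maarten")
implicit = g.greet()          # Python passes g as self for us
explicit = Greeter.greet(g)   # the same call with self spelled out
print(implicit == explicit)
```

In everyday code you would only ever write the first form; the second is shown purely to reveal what Python does behind the scenes.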
Any variable or method in a class can be accessed using the period (.) syntax:
object.variable
or:
object.method
In the above example we printed the name and age. We can turn this into a method as well, allowing any person to introduce themselves. Let's extend our example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def introduceyourself(self):
        print("My name is " + self.name)
        print("My age is " + str(self.age))

author = Person("Maarten",30)
author.introduceyourself()
Do you see what happens here? Do you understand the role of self and the notation with the period?
Unbeknownst to you, we have already made use of countless objects and methods throughout this course. Things like strings, lists, sets and dictionaries are all objects! Isn't that a shock? :) The object oriented paradigm is ubiquitous in Python!
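A quick way to convince yourself is to inspect a few familiar values. The sketch below uses only built-in functions; the variable names are arbitrary:

```python
text = "humanities"
words = ["banana", "pear"]

# every value has a type, which is the class it is an instance of
print(type(text))
print(type(words))

# and every value has methods, called with the period syntax
shouted = text.upper()
words.append("orange")
print(shouted, words)

# all of them are instances of the base class object
print(isinstance(text, object) and isinstance(words, object))
```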
Add a variable gender (a string) to the Person class and adapt the initialisation method accordingly. Also add a method ismale() that uses this new information and returns a boolean value (True/False).
#adapt the code:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def introduceyourself(self):
        print("My name is " + self.name)
        print("My age is " + str(self.age))

author = Person("Maarten",30)
author.introduceyourself()
One of the neat things you can do with classes is that you can build more specialised classes on top of more generic classes. Person, for instance, is a rather generic concept. We can use this generic class to build a more specialised class Teacher, a person that teaches a course. If you use inheritance, everything that the parent class could do, the inherited class can do as well!
The syntax for inheritance is as follows; do not confuse it with parameters in a function/method definition. We also add an extra method stateprofession(), otherwise Teacher would be no different from Person:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def introduceyourself(self):
        print("My name is " + self.name)
        print("My age is " + str(self.age))

class Teacher(Person): #this class inherits the class above!
    def stateprofession(self):
        print("I am a teacher!")

author = Teacher("Maarten",30)
author.introduceyourself()
author.stateprofession()
If the class Person had already had a method stateprofession, it would have been overruled (we say overridden) by the one in the Teacher class. Edit the example above: add a stateprofession method to Person that prints something like "I have no profession! :'(" and see that nothing changes.
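The lookup rule behind this can be sketched in a stripped-down form: Python looks for a method in the class of the object first, and only then in the parent class. In this illustration the classes are reduced to the one method, and the method returns its text instead of printing it so that the result is easy to inspect:

```python
class Person:
    def stateprofession(self):
        return "I have no profession! :'("

class Teacher(Person):
    # same method name as in Person: this version wins for Teacher objects
    def stateprofession(self):
        return "I am a teacher!"

somebody = Person()
teacher = Teacher()
print(somebody.stateprofession())
print(teacher.stateprofession())
print(isinstance(teacher, Person))  # a Teacher is still a Person
```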
Instead of completely overriding a method, you can also call the method of the parent class. The following example contains modified versions of all methods and adds some extra methods and variables to keep track of the courses taught by the teacher. The edited methods call the method of the parent class to avoid repetition of code (one of the deadly sins of computer programming):
class Teacher(Person): #this class inherits the class above!
    def __init__(self, name, age):
        self.courses = [] #initialise a new variable
        super().__init__(name,age) #call the init of Person

    def stateprofession(self):
        print("I am a teacher!")

    def introduceyourself(self):
        super().introduceyourself() #call the introduceyourself() of the Person
        self.stateprofession()
        print("I teach " + str(self.nrofcourses()) + " course(s)")
        for course in self.courses:
            print("I teach " + course)

    def addcourse(self, course):
        self.courses.append(course)

    def nrofcourses(self):
        return len(self.courses)

author = Teacher("Maarten",30)
author.addcourse("Python")
author.introduceyourself()
If you write your own classes, you can define what needs to happen if an operator such as +, / or < is used on your class. You can also define what happens when the keyword in or built-in functions such as len() are used with your class. This allows for a very elegant way of programming. Each of these operators and built-in functions has an associated method which you can overload. All of these methods start and end, like __init__, with a double underscore.
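A minimal sketch of the idea for the + operator, before we turn to tweets. The class WordCount and its attribute count are hypothetical names chosen for this illustration; __add__ is the special method Python calls for +:

```python
class WordCount:
    def __init__(self, count):
        self.count = count

    def __add__(self, other):
        # called for: self + other
        return WordCount(self.count + other.count)

chapter1 = WordCount(1200)
chapter2 = WordCount(800)
total = chapter1 + chapter2
print(total.count)
```

Returning a new WordCount, rather than a bare number, keeps the result usable in further additions.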
For example, let's allow comparison of tweets using the '<' and '>' operators. The methods for these operators are __lt__ and __gt__ respectively; both take one argument, the other object to compare to. A tweet qualifies as greater than another if it is a newer, more recent tweet:
class Tweet:
    def __init__(self, message, time):
        self.message = message
        self.time = time # we will assume here that time is a numerical value

    def __lt__(self, other):
        return self.time < other.time

    def __gt__(self, other):
        return self.time > other.time

oldtweet = Tweet("this is an old tweet",20)
newtweet = Tweet("this is a new tweet",1000)
print(newtweet > oldtweet)
You may not yet see much use in this, but consider for example the built-in function sorted(). Having such methods defined now means we can sort our tweets! And because we defined the methods __lt__ and __gt__ based on time, it will automatically sort them on time, from old to new:
tweets = [newtweet,oldtweet]
for tweet in sorted(tweets):
    print(tweet.message)
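Other built-ins that compare items benefit too. The sketch below repeats a compact Tweet class so that it runs on its own. It also uses functools.total_ordering from the standard library, which is not used elsewhere in this chapter, to derive the remaining comparison operators from __eq__ and __lt__:

```python
from functools import total_ordering

@total_ordering
class Tweet:
    def __init__(self, message, time):
        self.message = message
        self.time = time  # assumed to be a number, as above

    def __eq__(self, other):
        return self.time == other.time

    def __lt__(self, other):
        return self.time < other.time

oldtweet = Tweet("this is an old tweet", 20)
newtweet = Tweet("this is a new tweet", 1000)
tweets = [newtweet, oldtweet]

newest_first = [t.message for t in sorted(tweets, reverse=True)]
print(newest_first)
print(max(tweets).message)   # the most recent tweet
print(newtweet >= oldtweet)  # >= is supplied by total_ordering
```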
Remember the in keyword, used for checking items in lists and keys in dictionaries? To recap:
fruits = ['banana','pear','orange']
print('pear' in fruits)
Overloading this operator is done using the __contains__ method. It takes as extra argument the item that is being searched for ('pear' in the above example). The method should return a boolean value. For tweets, let's implement support for the in operator and have it check whether a certain word is in the tweet.
class Tweet:
    def __init__(self, message, time):
        self.message = message
        self.time = time

    def __lt__(self, other):
        return self.time < other.time

    def __contains__(self, word):
        #Implement the method

tweet = Tweet("I just flushed my toilet", 0) #a Tweet object; the time value is arbitrary here
#now write code to check if the word "flushed" is in the tweet
#and print something nice if that's the case
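If you get stuck, the mechanism can be seen in a different, made-up class that does not give away the exercise. Vocabulary and its attribute words are hypothetical names, and the lowercasing is just one possible design choice:

```python
class Vocabulary:
    def __init__(self, words):
        self.words = set(words)

    def __contains__(self, word):
        # called for: word in vocabulary
        return word.lower() in self.words

vocab = Vocabulary(["banana", "pear", "orange"])
print("Pear" in vocab)
print("kiwi" in vocab)
```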
Remember how we can iterate over lists and dictionaries using a for loop? To recap:
fruits = ['banana','pear','orange']
for fruit in fruits:
    print(fruit)
We can do the same for our own objects: we can make them support iteration. This is done by overloading the __iter__ method. It takes no extra arguments and is most easily written as a generator, which, if you recall, means that you should use yield instead of return. Consider the following class TwitterUser: if we iterate over an instance of that class, we want to iterate over all tweets. To make it more fun, let's iterate in chronologically sorted order:
class TwitterUser:
    def __init__(self, name):
        self.name = name
        self.tweets = [] #This will be a list of all tweets, these should be Tweet objects

    def append(self, tweet):
        assert isinstance(tweet, Tweet) #this code will check if tweet is an instance
                                        #of the Tweet class. If not, an exception
                                        #will be raised
        #append the tweet to our list
        self.tweets.append(tweet)

    def __iter__(self):
        for tweet in sorted(self.tweets):
            yield tweet

tweeter = TwitterUser("proycon")
tweeter.append(Tweet("My peanut butter sandwich has just fallen bottoms-down",4))
tweeter.append(Tweet("Tying my shoelaces",2))
tweeter.append(Tweet("Wiggling my toes",3))
tweeter.append(Tweet("Staring at a bird",1))
for tweet in tweeter:
    print(tweet.message)
The method __len__ is invoked when the built-in function len() is used. We want it to return the number of tweets a user has. Implement it in the example above and then run the following test, which should print True if you did well:
print(len(tweeter) == 4)
Now we will turn to the practical assignment of this chapter: the extraction of a graph of who tweets whom. For this purpose we make available the dataset twitterdata.zip; download it and extract it in a location of your choice.
The program we are writing will consist of three classes: Tweet, TwitterUser and TwitterGraph. TwitterGraph will maintain a dictionary of users (TwitterUser); these are the nodes of our graph. TwitterUser will in turn maintain a list of tweets (Tweet). TwitterUser will also maintain a dictionary in which the keys are other users and the values indicate the weight of the relationship. This makes up the edges of our graph.
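To make the data structure concrete, the sketch below stores a tiny weighted, directed graph in plain dictionaries, without the classes of the assignment. The usernames and messages are invented for illustration:

```python
# (sender, recipient) pairs, one per message; invented example data
messages = [("anna", "bob"), ("anna", "bob"), ("bob", "anna"), ("anna", "carol")]

# graph[sender][recipient] = number of messages from sender to recipient
graph = {}
for sender, recipient in messages:
    edges = graph.setdefault(sender, {})
    edges[recipient] = edges.get(recipient, 0) + 1  # add or strengthen the edge

for sender, edges in graph.items():
    for recipient, weight in edges.items():
        print(sender, "->", recipient, "weight", weight)
```

The assignment wraps the same idea in classes, so that each user object carries its own dictionary of outgoing edges.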
You will not have to write everything from scratch: we provide a full skeleton in which you have to implement certain methods. We are going to use an external editor for this assignment. Copy the code below, edit it, and save it as tweetnet.py. When done, run the program from the command line, passing it one parameter, the directory where the txt files from twitterdata.zip can be found: python3 tweetnet.py /path/to/twitterdata/
#! /usr/bin/env python3
# -*- coding: utf8 -*-
import sys
import preprocess
class Tweet:
    def __init__(self, message, time):
        self.message = message
        self.time = time
class TwitterUser:
    def __init__(self, name):
        self.name = name
        self.tweets = [] #This will be a list of all tweets
        self.relations = {} #This will be a dictionary in which the keys are the names of other users (strings) and the values are the weight of the relation (an integer)

    def append(self, tweet):
        assert isinstance(tweet, Tweet) #this is a test, if tweet is not an instance
                                        #of Tweet, it will raise an Exception.
        self.tweets.append(tweet)

    def __iter__(self):
        #This function, a generator, should iterate over all tweets
        #<INSERT YOUR CODE HERE>

    def __hash__(self):
        #For an object to be usable as a dictionary key, it must have a hash method. Call the hash() function over something that uniquely defines this object
        #and thus can act as a key in a dictionary. In our case, the user name is good, as no two users will have the same name:
        return hash(self.name)

    def addrelation(self, user, graph):
        #user is a username (string); graph is the TwitterGraph, needed to check whether that user exists
        if user and user != self.name: #user must not be empty, and must not be the user itself
            if user in self.relations:
                #the user is already in our relations, strengthen the bond:
                self.relations[user] += 1
            elif user in graph:
                #the user exists in the graph, we can add a relation!
                self.relations[user] = 1
            #if the user does not exist in the graph, no relations will be added

    def computerelations(self, graph):
        for tweet in self:
            #tokenise the actual tweet content (use the tokeniser in preprocess!):
            tokens = #<INSERT YOUR CODE HERE>
            #Search for @username tokens, extract the username, and call self.addrelation(username, graph)
            #<INSERT YOUR CODE HERE>

    def printrelations(self):
        #print the relations, include both users and the weight
        #<INSERT YOUR CODE HERE>

    def gephioutput(self):
        #produce CSV output that gephi can import
        for recipient, weight in self.relations.items():
            for i in range(0, weight):
                yield self.name + "," + recipient
class TwitterGraph:
    def __init__(self, corpusdirectory):
        self.users = {} #initialisation of the dictionary that will store all twitter users:
                        #the keys are the usernames (strings), and the values are
                        #TwitterUser instances

        #Load the twitter corpus
        #tip: use preprocess.find_corpusfiles and preprocess.read_corpus_file,
        #do not use preprocess.readcorpus as that will include sentence segmentation
        #which we do not want

        #Each txt file contains the tweets of one user.
        #all files contain three columns, separated by a TAB (\t). The first column
        #is the user, the second the time, and the third is the tweet message itself.
        #Create Tweet instances for every line that contains a @ (ignore other lines
        #to conserve memory). Add those tweet instances to the right TwitterUser. Create
        #TwitterUser instances as new users are encountered.

        #self.users[user], with user being the username (string), should be an instance
        #of TwitterUser

        #<INSERT YOUR CODE HERE>

        #Compute relations between users
        for user in self:
            assert isinstance(user, TwitterUser)
            user.computerelations(self)

    def __contains__(self, user):
        #Does this user exist?
        return user in self.users

    def __iter__(self):
        #Iterate over all users
        for user in self.users.values():
            yield user

    def __getitem__(self, user):
        #Retrieve the specified user
        return self.users[user]
#this is the actual main body of the program. The program should be passed one parameter
#on the command line: the directory that contains the *.txt files from twitterdata.zip.

#We instantiate the graph, which will load and compute all relations
twittergraph = TwitterGraph(sys.argv[1])

#We output all relations:
for twitteruser in twittergraph:
    twitteruser.printrelations()

#And we output to a file so you can visualise your graph in the program GEPHI
with open('gephigraph.csv','wt',encoding='utf-8') as f:
    for twitteruser in twittergraph:
        for line in twitteruser.gephioutput():
            f.write(line + "\n")
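As a hint for the parts you have to fill in, the sketch below shows one way to split a tab-separated line and find @username mentions. It is an illustration under assumptions, not the reference solution: the sample line, the username alice and the time value are invented, and it uses a regular expression from the standard re module instead of the tokeniser in preprocess that the skeleton asks for.

```python
import re

# invented example line: user, time and message separated by TABs
line = "proycon\t12345\t@alice have you seen this, @proycon?\n"

# split into at most three columns, so TABs inside the message survive
user, time, message = line.rstrip("\n").split("\t", 2)

# a mention is an @ followed by letters, digits or underscores
mentions = re.findall(r"@(\w+)", message)

# skip empty names and mentions of the user themselves, as addrelation() does
others = [name for name in mentions if name and name != user]
print(user, time, others)
```

In the assignment itself, each name found this way would be handed to addrelation() rather than collected in a list.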
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.