Welcome to chapter two of Text Mining for Historians. This Notebook takes a closer look at a few very basic concepts for coding in Python.
#from IPython.display import HTML
#HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/S_f2qV2_U00?rel=0&controls=0&showinfo=0" frameborder="0" allowfullscreen></iframe>')
The overarching aim of Part I of the course is to process large collections of text documents. But first we have to explain the basic elements of Python syntax. Then we show you how to read one document and process its contents, after which we demonstrate working at the collection level. Once you have established what to do with a single document, scaling up your research is rather straightforward (unless you work with really, really big corpora of multiple billions of words, but that is outside the scope of this course).
Please be patient at the start: we have to begin by discussing a few basic elements of Python syntax before we begin working with textual data. We start very slowly, but speed up considerably in the next section, promised!
Just to check if your Notebook works, run the cell below, which should print "It works".
print('It works')
It works
So let's start with the very basics.
Python makes a distinction between text and numbers, which it interprets as values belonging to different data types. You can enter any number in the following code cell and it will be shown below the cell.
8
8
In a Notebook, the last element of a code cell is always returned. You can also use the `print()` function to make this explicit.
print(8)
8
This looks the same; however, when you add more lines to one cell the difference becomes clear. The cell below returns only 6:
4
50
6
6
This one, by contrast, prints the number `4` and eventually returns `6`. Note that Python evaluates the lines one after the other.
print(4)
50
6
4
6
print()
Try it yourself: in the cell below, you are confronted with your first comment, `# Enter your favorite number here`.
Comments are marked by hashtags. Python ignores everything that follows a `#` on the same line. This is often useful when we want to explain our code in more detail. In these Notebooks we will use comments extensively to guide you through the Python code.
However, "real" code is often less well documented: only passages that prove particularly tricky are commented.
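A minimal sketch of the places where a comment can appear. The numbers are arbitrary and only serve as illustration:

```python
# A comment can take up a whole line; Python skips it entirely.
print(8)  # A comment can also follow code on the same line.

# "Commenting out" a line switches it off without deleting it:
# print(50)
print(6)
```

Running this prints 8 and 6. The line containing 50 is never executed because it sits behind a `#`.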
Returning to the exercise:
First, remove the `#` and run the cell. This should raise a `SyntaxError`. We tell you more about errors later. For the moment, just observe that Python here draws your attention to the fact that your code is syntactically incorrect.
Then replace `Enter your favorite number here` with a number and run the cell again. This time the error should disappear and the number you entered should be printed below the cell.
# Enter your favorite number here
You can ask Python to make the data type of a value explicit by using the `type()` function.
Functions are an essential component of almost any program. So far we've only used the `print()` function (rather sneakily) and we will discuss functions more extensively later. For now, only pay attention to the form of the expression (the name `type` followed by `8` enclosed in parentheses) and what it returns (`int`).
Basically, the expression `type(8)` gives you the data type of a value. More formally, it shows that `8` is an instance of the `int` (short for integer) class (just to mention it; you can forget about the technical jargon).
type(8)
int
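Outside a Notebook, for instance in a script, a value is only shown when you print it. As a small sketch (the numbers are arbitrary), printing the result of `type()` displays the class in a slightly longer form than the Notebook output above:

```python
# type() works on any integer, whatever its size or sign
print(type(8))      # <class 'int'>
print(type(-3))     # <class 'int'>
print(type(2020))   # <class 'int'>
```

`<class 'int'>` and the shorter `int` shown by the Notebook mean the same thing.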
Now look at the cell below. You'll notice that here the number is surrounded by single quotation marks. These indicate that, in this case, the value is a string and not a number.
'8'
'8'
Use the `type()` function to give you the data type of the string `'8'`.
# Check this using the type function, i.e. type('8')
Text in Python is usually represented as a string (or `str`). For example, below I enter my first name as a string.
'Kaspar'
'Kaspar'
If I remove the quotation marks, I get a `NameError`. This is because Python now thinks `Kaspar` refers to a variable and not a string. (We will discuss this distinction soon, don't worry. The important thing here is not to forget to put quotation marks (single or double!) around words so they are read as a string.)
Kaspar
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-17-edfa14f4a91d> in <module>
----> 1 Kaspar
NameError: name 'Kaspar' is not defined
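As a quick check of the remark about single and double quotation marks, the sketch below compares both spellings of the same name. The `==` comparison used here is not covered in this chapter; it simply asks whether two values are equal.

```python
# Single and double quotation marks create the same string
print('Kaspar' == "Kaspar")   # True
print(type("Kaspar"))         # <class 'str'>
```

Either style works, as long as the opening and closing quotation marks match.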
Create a new code cell below and print your name.
Just returning values you enter manually isn't very useful. But in Python you can manipulate these values by performing operations on them, such as addition (`+`) and subtraction (`-`).
Operations work differently for each data type (i.e. for integers or strings).
For example, you can use Python as a simple calculator: summing 2 and 3 returns 5.
2+3
5
You can also add strings together but this will return a different result.
'2'+'3'
'23'
Can you explain what happened and why the results are different?
What happens when you use subtraction on 2 and 3 (first as integers, then as strings)?
# Experiment here
Please note that these operations return a value of the same data type (which is not always the case, so be careful).
type(2+3)
int
type('2'+'3')
str
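The warning in parentheses can be illustrated with division. The sketch below uses arbitrary numbers and the `/` operator, which this chapter has not introduced: dividing two integers returns a value of type `float` (a number with a decimal point), not `int`.

```python
# Addition of two integers stays an integer
print(type(2 + 3))   # <class 'int'>

# Division of two integers returns a float, even when the result is whole
print(6 / 3)         # 2.0
print(type(6 / 3))   # <class 'float'>
```

So it is worth checking with `type()` whenever you are unsure what an operation gives back.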
## Breakout
A powerful feature of a programming language is the ability to store and manipulate variables. A variable is a name that refers to a value. The assignment statement creates new variables and relates them to concrete values. Instead of passing a value directly as an argument to the `print()` function, we can store it by creating a variable that refers to it, in this case the string `'Hello World.'`.
# declare a variable
x = 'Hello World.'
# print what is in the box
print(x)
Variable assignment contains a `=`. This statement declares that `x` now refers to the string `'Hello World.'`. Variables are also often described as boxes that store information (even though the metaphor is technically less correct, it is still useful, and I won't mind if it's easier to remember).
Variable assignment follows the syntax `variable_name = value`.
Please note that quotation marks only appear to the right of the `=` sign, because we want the variable with the name `x` to refer to the string `'Hello World.'`.
Of course, you can also assign integers to variables.
# declare a variable
x = 15
y = 22
# print what is in the box
print(x)
print(y)
print(x+y)
15
22
37
In the code block above, two things happen. First, we fill `x` with a value, in our case `15`, and `y` with `22`. Second, we print them in turn and then print the sum of `x` and `y`, which looks like high-school math.
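A variable can also refer to the result of an operation. In the sketch below, the name `total` is hypothetical and not part of the original example:

```python
x = 15
y = 22

# Store the result of the addition under a new name
total = x + y
print(total)        # 37

# The variable behaves exactly like the value it refers to
print(type(total))  # <class 'int'>
```

Storing the sum this way lets you reuse it later without repeating the calculation.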
One last remark about variables before ending this notebook: you can change the value of a variable, i.e. make it refer to something else. Below, the variable `text` first refers to the string `'Hello, World!'` and then we change it to make it refer to `'Hello, Planet!'`.
text = 'Hello, World!'
print(text)
text = 'Hello, Planet!'
print(text)
Hello, World!
Hello, Planet!
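Reassignment can also build on the value a variable currently refers to. A minimal sketch, using the hypothetical names `year` and `greeting`:

```python
year = 1990
year = year + 1    # the right-hand side is evaluated first, then year is updated
print(year)        # 1991

greeting = 'Hello'
greeting = greeting + ', Planet!'   # strings can be extended the same way
print(greeting)    # Hello, Planet!
```

In both cases the old value is replaced: after the second assignment, the name no longer refers to what it referred to before.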
Create and print two variables, one containing your name (string) and another containing your year of birth (integer).