Introduction to Python for Data Sciences
Franck Iutzeler
Python is a widely used programming language. Its versatility makes it popular in many different domains.
It is an interpreted language, meaning that the code is not compiled ahead of time into an executable but is run by a Python interpreter.
See https://www.python.org/about/gettingstarted/ for how to install Python (but it is probably already installed).
In Data Science, it is common to use Anaconda to download and install Python and its environment (see also the quickstart).
Several options exist, more or less user-friendly.
The Python shell can be launched by typing the command python in a terminal (this works on Linux, Mac, and Windows with PowerShell). To exit it, type exit().
Warning: Python (version 2.x) and Python3 (version 3.x) coexist on some systems as two different programs. The differences appear small but are real, and Python 2 is no longer supported. To be sure to use Python 3, you can type python3.
From the shell, you can enter Python code that is executed as soon as you press Enter. As long as you stay in the same shell, you keep your variables, but as soon as you exit it, everything is lost. It might not be the best option...
You can write your code in a file and then execute it with Python. The extension of Python files is typically .py.
If you create a file test.py (using any text editor) containing the following code:
a = 10
a = a + 7
print(a)
Then, you can run it using the command python test.py in a terminal from the same folder as the file.
This is a convenient solution to run some code but it is probably not the best way to code.
You can edit your Python code files with IDEs that offer debuggers, syntax checking, etc. Two popular examples are:
Jupyter notebooks are browser-based notebooks for Julia, Python, and R. They correspond to .ipynb files. The main features of Jupyter notebooks are:
In a terminal, enter python -m pip install notebook or simply pip install notebook.
Note: Anaconda comes with notebooks; they can be launched directly from the Navigator.
To launch Jupyter, enter jupyter notebook.
This starts a kernel (a process that runs and interfaces the notebook content with an (i)Python shell) and opens a tab in the browser. The whole interface of Jupyter notebook is web-based and can be accessed at the address http://localhost:8888 .
Then, you can either create a new notebook or open a notebook (.ipynb file) from the current folder.
Note: Closing the tab does not terminate the notebook; it can still be accessed at the above address. To terminate it, use the interface (File -> Close and Halt) or type Ctrl+C in the terminal running the kernel.
Without any installation, you can:
Notebook documents contain the inputs and outputs of an interactive Python shell as well as additional text that accompanies the code but is not meant for execution. In this way, notebook files can serve as a complete computational record of a session, interleaving executable code with explanatory text, mathematics, and representations of resulting objects. These documents are saved with the .ipynb extension.
Notebooks may be exported to a range of static formats, including HTML (for example, for blog posts), LaTeX, PDF, etc. by File->Download as.
You can open a notebook from the file explorer of the Home (welcome) tab or using File->Open from an opened notebook. To create a new notebook, use the New button at the top right of the Home (welcome) tab or File->New Notebook from an opened notebook; you will be asked for the programming language.
You can modify the title (that is, the file name) by clicking on it next to the Jupyter logo. Notebooks are a succession of cells, which can be of four types:
- code for Python code (as in IPython)
- markdown for text in Markdown formatting (see this Cheatsheet). You may additionally use HTML and LaTeX math formulas.
- raw and heading are less used, for raw text and titles
You can edit a cell by double-clicking on it.
You can run a cell by using the menu or typing Ctrl+Enter (you can also run all cells, or all cells above a certain point). If it is a text cell, it will be formatted. If it is a code cell, it will run as if it were entered in an IPython shell, which means that all previous actions, functions, and variables defined are persistent. To get a clean slate, you have to restart the kernel by using Kernel->Restart.
- Tab autocompletes
- Shift+Tab gives the docstring of the input function
- ? returns the help
2 + 2 + 1 # comment
5
a = 4
print(a)
print(type(a))
4 <class 'int'>
a,x = 4, 9000
print(a)
print(x)
4 9000
Variable names can contain a-z, A-Z, 0-9 and some special characters such as _ but must always begin with a letter or an underscore. By convention, variable names are lowercase.
Variables are dynamically typed in Python, which means that their type is deduced from the context: the initialization or the types of the variables used for its computation. Observe the following example.
print("Integer")
a = 3
print(a,type(a))
print("\nFloat")
b = 3.14
print(b,type(b))
print("\nComplex")
c = 3.14 + 2j
print(c,type(c))
print(c.real,type(c.real))
print(c.imag,type(c.imag))
Integer 3 <class 'int'> Float 3.14 <class 'float'> Complex (3.14+2j) <class 'complex'> 3.14 <class 'float'> 2.0 <class 'float'>
This typing can lead to some variables having unwanted types, which can be resolved by casting:
d = 1j*1j
print(d,type(d))
d = d.real
print(d,type(d))
d = int(d)
print(d,type(d))
(-1+0j) <class 'complex'> -1.0 <class 'float'> -1 <class 'int'>
e = 10/3
print(e,type(e))
f = (10/3)/(10/3)
print(f,type(f))
f = int((10/3)/(10/3))
print(f,type(f))
3.3333333333333335 <class 'float'> 1.0 <class 'float'> 1 <class 'int'>
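As a complement to the casting examples above, it may help to note that int() truncates toward zero rather than rounding. The following is a minimal sketch; the variable names are illustrative and not part of the course material.

```python
# int() drops the fractional part (truncation toward zero)
x = 2.7
truncated = int(x)       # fractional part removed
rounded = round(x)       # nearest integer
negative = int(-2.7)     # truncation toward zero, not toward minus infinity

# Casting also converts a numeric string to a number
from_text = float("3.14")

print(truncated, rounded, negative, from_text)
```

When the nearest integer is wanted, round is usually the safer choice before casting.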
The usual operations are:
- + and - (addition, subtraction)
- * and / (multiplication, division)
- ** (power)
- % (modulo)
print(7 * 3., type(7 * 3.)) # int x float -> float
21.0 <class 'float'>
print(3/2, type(3/2)) # Warning: int in Python 2, float in Python 3
print(3/2., type(3/2.)) # To be sure
1.5 <class 'float'> 1.5 <class 'float'>
print(2**10, type(2**10))
1024 <class 'int'>
print(8%2, type(8%2))
0 <class 'int'>
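Since / always returns a float in Python 3, integer division is obtained with the // operator. The sketch below (with illustrative variable names) shows how it pairs with the modulo operator; it is only meant as a quick illustration of these two operators.

```python
# Integer (floor) division and remainder
q = 7 // 2               # quotient
r = 7 % 2                # remainder
q2, r2 = divmod(7, 2)    # both at once, as a tuple

# Floor division rounds toward minus infinity
neg = -7 // 2

print(q, r, (q2, r2), neg, type(q))
```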
Boolean is the type of a variable that is either True or False; booleans are thus extremely useful when coding.
- They can be obtained by comparisons >, >= (greater, greater or equal), <, <= (smaller, smaller or equal), or ==, != (equal, different).
- They can be combined with the logical operators and, not, or.
print('2 > 1\t', 2 > 1)
print('2 > 2\t', 2 > 2)
print('2 >= 2\t',2 >= 2)
print('2 == 2\t',2 == 2)
print('2 == 2.0',2 == 2.0)
print('2 != 1.9',2 != 1.9)
2 > 1 True 2 > 2 False 2 >= 2 True 2 == 2 True 2 == 2.0 True 2 != 1.9 True
print(True and False)
print(True or True)
print(not False)
False True True
Lists are the basic type for sequences of variables in Python; they are themselves a variable type.
- They are written with the syntax [ ... , ... ]
- Indices begin at 0 (l[0] is the first element of l)
Warning: Another type called tuple, with the syntax ( ... , ... ), exists in Python. It has almost the same structure as a list, with the notable exception that one cannot add, remove, or modify the elements of a tuple. We will see tuples briefly later.
l = [1, 2, 3, [4,8] , True , 2.3]
print(l, type(l))
[1, 2, 3, [4, 8], True, 2.3] <class 'list'>
print(l[0],type(l[0]))
print(l[3],type(l[3]))
print(l[3][1],type(l[3][1]))
1 <class 'int'> [4, 8] <class 'list'> 8 <class 'int'>
print(l)
print(l[4:]) # l[4:] is l from the position 4 (included)
print(l[:5]) # l[:5] is l up to position 5 (excluded)
print(l[4:5]) # l[4:5] is l between 4 (included) and 5 (excluded) so just 4
print(l[1:6:2]) # l[1:6:2] is l between 1 (included) and 6 (excluded) by steps of 2 thus 1,3,5
print(l[::-1]) # reversed order
print(l[-1]) # last element
[1, 2, 3, [4, 8], True, 2.3] [True, 2.3] [1, 2, 3, [4, 8], True] [True] [2, [4, 8], 2.3] [2.3, True, [4, 8], 3, 2, 1] 2.3
One can easily add, insert, remove, count, or test if an element is in a list.
l.append(10) # Add an element to l (the list is not copied, it is actually l that is modified)
print(l)
[1, 2, 3, [4, 8], True, 2.3, 10]
l.insert(1,'u') # Insert an element at position 1 in l (the list is not copied, it is actually l that is modified)
print(l)
[1, 'u', 2, 3, [4, 8], True, 2.3, 10]
l.remove(10) # Remove the first occurrence of 10 from l
print(l)
[1, 'u', 2, 3, [4, 8], True, 2.3]
print(len(l)) # length of a list
print(2 in l) # test if 2 is in l
7 True
Lists are pointer-like types, meaning that if you write l2=l, you do not copy l to l2 but rather copy the pointer, so modifying one will modify the other.
The proper way to copy a list is to build a new one with list(l) or with the dedicated copy method of list variables.
l2 = l
l.append('Something')
print(l,l2)
[1, 'u', 2, 3, [4, 8], True, 2.3, 'Something'] [1, 'u', 2, 3, [4, 8], True, 2.3, 'Something']
l3 = list(l) # l.copy() works in Python 3
l.remove('Something')
print(l,l3)
[1, 'u', 2, 3, [4, 8], True, 2.3] [1, 'u', 2, 3, [4, 8], True, 2.3, 'Something']
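One caveat worth illustrating: list(l) and l.copy() only copy the outer list, so nested lists (such as [4, 8] above) are still shared. The sketch below, with illustrative variable names, contrasts this with copy.deepcopy from the standard library.

```python
import copy

original = [1, [4, 8]]
shallow = list(original)          # new outer list, same inner list
deep = copy.deepcopy(original)    # inner lists are copied as well

original[1].append(15)            # modify the nested list in place

print(shallow)  # the nested list is shared, so the change is visible
print(deep)     # fully independent copy, unchanged
```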
You can have empty lists and concatenate lists by simply using the + operator, or even repeat them with *.
l4 = []
l5 =[4,8,10.9865]
print(l+l4+l5)
print(l5*3)
[1, 'u', 2, 3, [4, 8], True, 2.3, 4, 8, 10.9865] [4, 8, 10.9865, 4, 8, 10.9865, 4, 8, 10.9865]
Tuples are written with the syntax (...,...) or simply with commas. They cannot be changed once created.
t = (1,'b',876876.908)
print(t,type(t))
print(t[0])
(1, 'b', 876876.908) <class 'tuple'> 1
a,b = 12,[987,98987]
u = a,b
print(a,b,u)
12 [987, 98987] (12, [987, 98987])
try:
    u[1] = 2
except Exception as error:
    print(error)
'tuple' object does not support item assignment
Dictionaries associate keys with values and are written with the syntax {key1 : value1, ...}.
This type is often used as a return type in libraries.
d = {"param1" : 1.0, "param2" : True, "param3" : "red"}
print(d,type(d))
{'param1': 1.0, 'param2': True, 'param3': 'red'} <class 'dict'>
print(d["param1"])
d["param1"] = 2.4
print(d)
1.0 {'param1': 2.4, 'param2': True, 'param3': 'red'}
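A few other common dictionary operations can be sketched as follows. The dictionary params and the key "param4" are illustrative assumptions, chosen to mirror the example above.

```python
params = {"param1": 1.0, "param2": True}

params["param3"] = "red"               # assigning to a new key adds it
has_p2 = "param2" in params            # membership is tested on the keys
missing = params.get("param4", None)   # get avoids a KeyError if the key is absent

print(params, has_p2, missing, len(params))
```

Accessing params["param4"] directly would raise a KeyError, which is why get with a default value is often preferred for optional entries.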
Warning: text formatting, and notably the print function, is one of the major differences between Python 2 and Python 3. The method presented here is clean and works in both versions.
s = "test"
print(s,type(s))
test <class 'str'>
print(s[0])
print(s + "42")
t test42
print(s,42)
print(s+"42")
test 42 test42
try:
    print(s+42)
except Exception as error:
    print(error)
can only concatenate str (not "int") to str
The format method
print( "test {}".format(42) )
test 42
print( "test with an int {:d}, a float {} (or {:e} which is roughly {:.1f})".format(4 , 3.141 , 3.141 , 3.141 ))
test with an int 4, a float 3.141 (or 3.141000e+00 which is roughly 3.1)
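For code that only targets Python 3.6 or later, formatted string literals (f-strings) accept the same format specifications as the format method. This is a sketch for comparison only; unlike format, it does not work in Python 2.

```python
n, x = 4, 3.141

old_style = "int {:d}, float {:.1f}".format(n, x)
new_style = f"int {n:d}, float {x:.1f}"   # same format specs, values inlined

print(old_style)
print(new_style)
```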
statement1 = False
statement2 = False
if statement1:
    print("statement1 is True")
elif statement2:
    print("statement2 is True")
else:
    print("statement1 and statement2 are False")
statement1 and statement2 are False
statement1 = statement2 = True
if statement1:
    if statement2:
        print("both statement1 and statement2 are True")
both statement1 and statement2 are True
if statement1:
    if statement2: # Bad indentation!
    #print("both statement1 and statement2 are True") # Uncommenting would cause an error
        print("here it is ok")
    print("after the previous line, here also")
here it is ok after the previous line, here also
statement1 = True
if statement1:
    print("printed if statement1 is True")
    print("still inside the if block")
printed if statement1 is True still inside the if block
statement1 = False
if statement1:
    print("printed if statement1 is True")
print("outside the if block")
outside the if block
The syntax of for loops is for x in something: followed by an indented block (by convention, four spaces) which contains what will be executed at each iteration.
The something above can be of different natures: list, dictionary, etc.
for x in [1, 2, 3]:
    print(x)
1 2 3
sentence = ""
for word in ["Python", "for", "data", "Science"]:
    sentence = sentence + word + " "
print(sentence)
Python for data Science
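The examples above loop over lists. As a sketch of the dictionary case mentioned earlier (the dictionary d and the list names are illustrative), iterating directly over a dictionary yields its keys, while items() yields key-value pairs.

```python
d = {"two": 4, "three": 9}

keys = []
for key in d:                   # iterating over a dict yields its keys
    keys.append(key)

pairs = []
for key, value in d.items():    # items() yields (key, value) tuples
    pairs.append((key, value))

print(keys)
print(pairs)
```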
A useful function is range, which generates sequences of numbers that can be used in loops.
print("Range (from 0) to 4 (excluded) ")
for x in range(4):
    print(x)
print("Range from 2 (included) to 6 (excluded) ")
for x in range(2,6):
    print(x)
print("Range from 1 (included) to 12 (excluded) by steps of 3 ")
for x in range(1,12,3):
    print(x)
Range (from 0) to 4 (excluded) 0 1 2 3 Range from 2 (included) to 6 (excluded) 2 3 4 5 Range from 1 (included) to 12 (excluded) by steps of 3 1 4 7 10
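In Python 3, range does not build a list: it is a lightweight object that produces its values on demand. A short sketch (variable names are illustrative) of how to inspect it or turn it into a list when needed:

```python
r = range(1, 12, 3)

values = list(r)       # materialize the sequence as a list
count = len(r)         # number of elements, computed without building a list
contains_7 = 7 in r    # membership test

print(r, values, count, contains_7)
```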
If the index is needed along with the value, the function enumerate is useful.
for idx, x in enumerate(range(-3,3)):
    print(idx, x)
0 -3 1 -2 2 -1 3 0 4 1 5 2
Similarly to for loops, the syntax is while condition: followed by an indented block which contains what will be executed as long as the condition holds.
i = 0
while i<5:
    print(i)
    i+=1
0 1 2 3 4
When a command may fail, you can try to execute it and optionally catch the Exception (i.e. the error).
a = [1,2,3]
print(a)
try:
    a[1] = 3
    print("command ok")
except Exception as error:
    print(error)
print(a) # The command went through
try:
    a[6] = 3
    print("command ok")
except Exception as error:
    print(error)
print(a) # The command failed
[1, 2, 3] command ok [1, 3, 3] list assignment index out of range [1, 3, 3]
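Catching the generic Exception, as above, is convenient but can hide unrelated bugs. A possible refinement, sketched below with an illustrative log list, is to catch only the expected error type and to use the optional else and finally clauses.

```python
a = [1, 2, 3]
log = []

try:
    a[6] = 3
except IndexError as error:       # only index errors are caught here
    log.append("IndexError: {}".format(error))
else:
    log.append("no error")        # runs only if the try block succeeded
finally:
    log.append("done")            # always runs, error or not

print(log)
```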
In Python, a function is defined as def function_name(function_arguments): followed by an indented block containing the body of the function. (No return value is declared a priori.)
def fun0():
    print("\"fun0\" just prints")
fun0()
"fun0" just prints
A docstring can be added to document the function; it will appear when calling help.
def fun1(l):
    """
    Prints a list and its length
    """
    print(l, " is of length ", len(l))
fun1([1,'iuoiu',True])
[1, 'iuoiu', True] is of length 3
help(fun1)
Help on function fun1 in module __main__: fun1(l) Prints a list and its length
return outputs a variable, tuple, dictionary, ...
def square(x):
    """
    Return x squared.
    """
    return x ** 2
help(square)
res = square(12)
print(res)
Help on function square in module __main__: square(x) Return x squared. 144
def powers(x):
    """
    Return the first powers of x.
    """
    return x ** 2, x ** 3, x ** 4
help(powers)
Help on function powers in module __main__: powers(x) Return the first powers of x.
res = powers(12)
print(res, type(res))
(144, 1728, 20736) <class 'tuple'>
two,three,four = powers(3)
print(three,type(three))
27 <class 'int'>
def powers_dict(x):
    """
    Return the first powers of x as a dictionary.
    """
    return {"two": x ** 2, "three": x ** 3, "four": x ** 4}
res = powers_dict(12)
print(res, type(res))
print(res["two"],type(res["two"]))
{'two': 144, 'three': 1728, 'four': 20736} <class 'dict'> 144 <class 'int'>
It is possible to give default values to arguments and to pass arguments by name, in any order.
def fancy_power(x, p=2, debug=False):
    """
    Here is a fancy version of power that computes the square of the argument or other powers if p is set
    """
    if debug:
        print( "\"fancy_power\" is called with x =", x, " and p =", p)
    return x**p
print(fancy_power(5))
print(fancy_power(5,p=3))
25 125
res = fancy_power(p=8,x=2,debug=True)
print(res)
"fancy_power" is called with x = 2 and p = 8 256
Classes are at the core of object-oriented programming; they are used to represent an object with related attributes (variables) and methods (functions).
They are defined like functions but with the keyword class: class my_class(object): followed by an indentation. The definition of a class usually contains some methods:
- the first argument of each method is self, a reference to the object itself
- __init__: method executed at the creation of the object
- __str__: method executed to represent the object as a string, for instance when the object is passed to the function print
class Point(object):
    """
    Class of a point in the 2D plane.
    """
    def __init__(self, x=0.0, y=0.0):
        """
        Creation of a new point at position (x, y).
        """
        self.x = x
        self.y = y
    def translate(self, dx, dy):
        """
        Translate the point by (dx , dy).
        """
        self.x += dx
        self.y += dy
    def __str__(self):
        return "Point: ({:.2f}, {:.2f})".format(self.x, self.y)
p1 = Point()
print(p1)
p1.translate(3,2)
print(p1)
p2 = Point(1.2,3)
print(p2)
Point: (0.00, 0.00) Point: (3.00, 2.00) Point: (1.20, 3.00)
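Methods can also take other objects as arguments. As a sketch, the version of Point below adds a hypothetical distance_to method (it is not part of the class above) that computes the Euclidean distance between two points.

```python
import math

class Point(object):
    """Point in the 2D plane, with the same attributes as the class above."""
    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y

    def distance_to(self, other):
        """Euclidean distance to another Point (hypothetical helper)."""
        return math.hypot(self.x - other.x, self.y - other.y)

    def __str__(self):
        return "Point: ({:.2f}, {:.2f})".format(self.x, self.y)

p1 = Point()
p2 = Point(3, 4)
print(p2, "is at distance", p1.distance_to(p2), "from", p1)
```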
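open returns a file object, and is most commonly used with two arguments: open(filename, mode).
The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used (optional; 'r' will be assumed if it is omitted):
- 'r' when the file will only be read
- 'w' for only writing (an existing file with the same name will be erased)
- 'a' for appending at the end of the file
- 'r+' for both reading and writing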
f = open('./data/test.txt', 'w')
print(f)
<_io.TextIOWrapper name='./data/test.txt' mode='w' encoding='UTF-8'>
f.write(string) writes the contents of string to the file.
f.write("This is a test\n")
15
f.close()
Warning: For the file to be actually written and to be safely opened and modified again, it is essential to close the file handle with f.close().
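A common way to make sure the file is always closed is the with statement, which closes the file when its block ends, even if an error occurs. The sketch below writes to a temporary folder (an assumption made so that the example runs anywhere) instead of ./data.

```python
import os
import tempfile

# A temporary folder stands in for ./data (assumption for this sketch)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "test.txt")

with open(path, "w") as f:
    f.write("This is a test\n")
# Leaving the with block closes the file automatically

with open(path, "r") as f:
    content = f.read()

print(f.closed, repr(content))
```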
f.read() will read an entire file and put the pointer at the end.
f = open('./data/text.txt', 'r')
f.read()
'This is an example file\nMade specially for this course\nThis is already the third line\nLine 4\nTHE END\n'
f.read()
''
The end of the file has been reached, so the command returns ''.
To get back to the top, use f.seek(offset, from_what). The position is computed by adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point. Thus f.seek(0) goes to the top.
f.seek(0)
0
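Note that in text mode only seeks relative to the beginning of the file are allowed (apart from seeking to the very end with f.seek(0, 2)). The other reference points can be used in binary mode, as in the sketch below; the temporary file and its two-line content are assumptions made for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:          # binary mode, so byte counts are exact
    f.write(b"Line 1\nLine 2\n")

with open(path, "rb") as f:          # binary mode: all reference points allowed
    f.seek(-7, 2)                    # 7 bytes before the end of the file
    tail = f.read()
    f.seek(0)                        # back to the top
    head = f.readline()
    position = f.tell()              # current offset in bytes

print(tail, head, position)
```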
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string.
f.readline()
'This is an example file\n'
f.readline()
'Made specially for this course\n'
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:
f.seek(0)
for line in f:
    print(line)
f.close()
This is an example file Made specially for this course This is already the third line Line 4 THE END
Exercise 1: Odd or Even
The code snippet below enables the user to enter a number. Check if this number is odd or even. Optionally, handle bad inputs (characters, floats, signs, etc.).
num = input("Enter a number: ")
print(num)
Enter a number: 3 3
Exercise 2: Fibonacci
The Fibonacci sequence is a sequence of numbers where the next number in the sequence is the sum of the previous two numbers in the sequence. The sequence looks like this: 1, 1, 2, 3, 5, 8, 13. Write a function that generates a given number of elements of the Fibonacci sequence.
Exercise 3: Implement quicksort
The wikipedia page describing this sorting algorithm gives the following pseudocode:
function quicksort('array')
    if length('array') <= 1
        return 'array'
    select and remove a pivot value 'pivot' from 'array'
    create empty lists 'less' and 'greater'
    for each 'x' in 'array'
        if 'x' <= 'pivot' then append 'x' to 'less'
        else append 'x' to 'greater'
    return concatenate(quicksort('less'), 'pivot', quicksort('greater'))
Create a function that sorts a list using quicksort.
def quicksort(l):
    # ...
    return None
res = quicksort([-2, 3, 5, 1, 3])
print(res)
None
Exercise 4: Project Euler
Project Euler is a competitive programming website, mainly based on cleverly solving mathematical problems that would otherwise be computation-intensive. It is a good way to learn a new scientific programming language. Problem 1 reads:
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
Write a script that solves this problem.
You can continue by solving other problems. The first ones (e.g. 4, 31) are the easiest.