Introduction to Python for Data Sciences

0 - Installation and Quick Start¶

Python is a widely used programming language. Its versatility makes it popular in many different domains.

It is an interpreted language, meaning that the code is not compiled ahead of time into a standalone executable but is executed by a running Python interpreter.

Installation¶

See https://www.python.org/about/gettingstarted/ for how to install Python (but it is probably already installed).

In Data Science, it is common to use Anaconda to download and install Python and its environment (see also the quickstart).

Writing Code¶

Several options exist, more or less user-friendly.

In the python shell¶

The python shell can be launched by typing the command python in a terminal (this works on Linux, Mac, and Windows with PowerShell). To exit it, type exit().

Warning: Python 2 (version 2.x) and Python 3 (version 3.x) coexist on some systems as two different programs. The differences appear small but are real, and Python 2 is no longer supported. To be sure to use Python 3, you can type python3.

From the shell, you can enter Python code that is executed on the fly as you press Enter. As long as you stay in the same shell, you keep your variables, but as soon as you exit it, everything is lost. It might not be the best option for anything longer than a few lines.

From a file¶

You can write your code in a file and then execute it with Python. The extension of Python files is typically .py.

If you create a file test.py (using any text editor) containing the following code:

a = 10
a = a + 7

print(a)

Then, you can run it using the command python test.py in a terminal from the same folder as the file.

This is a convenient solution to run some code, but it is probably not the best way to write it.

Using an integrated development environment (IDE)¶

You can edit your Python code files with IDEs that offer debuggers, syntax checking, etc. Two popular examples are:

Spyder, which is quite similar to MATLAB or RStudio

VS Code (https://code.visualstudio.com/), which has very good Python integration while not being restricted to it

Jupyter notebooks¶

Jupyter notebooks are browser-based notebooks for Julia, Python, and R; they correspond to .ipynb files. The main features of Jupyter notebooks are:

In-browser editing for code, with automatic syntax highlighting, indentation, and tab completion/introspection.
The ability to execute code from the browser and plot inline.
In-browser editing for rich text using the Markdown markup language.
The ability to include mathematical notation within markdown cells using LaTeX, and rendered natively by MathJax.

Installation¶

In a terminal, enter python -m pip install notebook or simply pip install notebook.

Note: Anaconda comes with notebooks; they can be launched directly from the Navigator.

Use¶

To launch Jupyter, enter jupyter notebook.

This starts a kernel (a process that runs and interfaces the notebook content with an IPython shell) and opens a tab in the browser. The whole interface of Jupyter notebook is web-based and can be accessed, by default, at the address http://localhost:8888.

Then, you can either create a new notebook or open an existing notebook (.ipynb file) from the current folder.

Note: Closing the tab does not terminate the notebook; it can still be accessed at the above address. To terminate it, use the interface (File -> Close and Halt) or type Ctrl+C in the terminal where Jupyter was launched.

Remote notebook execution¶

Without any installation, you can:

view notebooks using NBViewer
fully interact with notebooks (create/modify/run) using UGA's Jupyter hub, Binder or Google Colab

Interface¶

Notebook documents contain the inputs and outputs of an interactive Python shell as well as additional text that accompanies the code but is not meant for execution. In this way, notebook files can serve as a complete computational record of a session, interleaving executable code with explanatory text, mathematics, and representations of resulting objects. These documents are saved with the .ipynb extension.

Notebooks may be exported to a range of static formats, including HTML (for example, for blog posts), LaTeX, PDF, etc. using File->Download as.

Accessing notebooks¶

You can open a notebook from the file explorer of the Home (welcome) tab or using File->Open from an open notebook. To create a new notebook, use the New button at the top right of the Home (welcome) tab or File->New Notebook from an open notebook; you will be asked for the programming language.

Editing notebooks¶

You can modify the title (that is, the file name) by clicking on it next to the Jupyter logo. A notebook is a succession of cells, which can be of four types:

code for Python code (as in IPython)
markdown for text in Markdown formatting (see this Cheatsheet). You may additionally use HTML and LaTeX math formulas.
raw and heading are less used, for raw text and titles

Cells¶

You can edit a cell by double-clicking on it. You can run a cell by using the menu or typing Ctrl+Enter (you can also run all cells, or all cells above a certain point). If it is a text cell, it will be formatted. If it is a code cell, it will run as if it were entered in an IPython shell, which means that all previous actions and all functions and variables defined so far are persistent. To get a clean slate, you have to restart the kernel by using Kernel->Restart.

Useful commands¶

Tab autocompletes
Shift+Tab gives the docstring of the input function
? returns the help

1- Numbers and Variables¶

Variables¶

In [1]:

2 + 2 + 1 # comment

Out[1]:

5

In [2]:

a = 4
print(a)
print(type(a))

4
<class 'int'>

In [3]:

a,x = 4, 9000
print(a)
print(x)

4
9000

Variable names can contain a-z, A-Z, 0-9 and the underscore _, but cannot begin with a digit. By convention, variable names are lowercase.
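As a quick check of the naming rules, the sketch below relies on the built-in str.isidentifier method and the standard keyword module. The candidate names are made up for illustration. Note that a leading underscore is accepted, while a leading digit, a dash, or a reserved word such as for is not.

```python
import keyword

# Hypothetical candidate names, chosen only to illustrate the rules
candidates = ["speed", "speed_2", "_hidden", "2fast", "my-var", "for"]

valid = {}
for name in candidates:
    # A usable name is a syntactically valid identifier that is not a reserved keyword
    valid[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(name, valid[name])
```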

Types¶

Variables are dynamically typed in Python, which means that their type is not declared but deduced from the context: the initialization or the types of the variables used for its computation. Observe the following example.

In [4]:

print("Integer")
a = 3
print(a,type(a))

print("\nFloat")
b = 3.14
print(b,type(b))

print("\nComplex")
c = 3.14 + 2j
print(c,type(c))
print(c.real,type(c.real))
print(c.imag,type(c.imag))

Integer
3 <class 'int'>

Float
3.14 <class 'float'>

Complex
(3.14+2j) <class 'complex'>
3.14 <class 'float'>
2.0 <class 'float'>

This typing can lead to some variables having unwanted types, which can be resolved by casting:

In [5]:

d = 1j*1j
print(d,type(d))
d = d.real
print(d,type(d))
d = int(d)
print(d,type(d))

(-1+0j) <class 'complex'>
-1.0 <class 'float'>
-1 <class 'int'>

In [6]:

e = 10/3
print(e,type(e))
f = (10/3)/(10/3)
print(f,type(f))
f = int((10/3)/(10/3))
print(f,type(f))

3.3333333333333335 <class 'float'>
1.0 <class 'float'>
1 <class 'int'>
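To complement the casts above, here is a small sketch of a few other conversions between built-in types. The values are arbitrary; the point to remember is that int() truncates toward zero rather than rounding.

```python
# int() truncates toward zero, it does not round
print(int(3.9), int(-3.9))      # 3 -3

# round() gives the nearest integer
print(round(3.9), round(-3.9))  # 4 -4

# Strings can be converted when their content is a valid number
n = int("42")
x = float("3.5")
print(n + 1, x * 2)             # 43 7.0

# Converting a number back to a string
s = str(n)
print(s, type(s))               # 42 <class 'str'>
```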

Operation on numbers¶

The usual operations are

Multiplication and Division with respectively * and /
Exponent with **
Modulo with %

In [7]:

print(7 * 3., type(7 * 3.))  # int x float -> float

21.0 <class 'float'>

In [8]:

print(3/2, type(3/2))  # Warning: int in Python 2,  float in Python 3
print(3/2., type(3/2.)) # To be sure

1.5 <class 'float'>
1.5 <class 'float'>

In [9]:

print(2**10, type(2**10))  

1024 <class 'int'>

In [10]:

print(8%2, type(8%2))  

0 <class 'int'>
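Alongside / and %, Python 3 also provides the floor division operator // and the built-in divmod function, which are not covered in the list above. The following sketch, with arbitrary operands, shows how they relate.

```python
q = 7 // 2                  # floor division: 3
r = 7 % 2                   # remainder: 1
print(q, r, divmod(7, 2))   # 3 1 (3, 1)

# With a negative operand, // rounds toward minus infinity
print(-7 // 2, -7 % 2)      # -4 1

# Floor division involving a float returns a float
print(7 // 2.0, type(7 // 2.0))   # 3.0 <class 'float'>
```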

Booleans¶

Boolean is the type of a variable that is either True or False; Booleans are thus extremely useful when coding.

They can be obtained by comparisons >, >= (greater, greater or equal), <, <= (smaller, smaller or equal) or equality tests ==, != (equal, different).
They can be manipulated by the logical operations and, not, or.

In [11]:

print('2 > 1\t', 2 > 1)   
print('2 > 2\t', 2 > 2) 
print('2 >= 2\t',2 >= 2) 
print('2 == 2\t',2 == 2) 
print('2 == 2.0',2 == 2.0) 
print('2 != 1.9',2 != 1.9) 

2 > 1	 True
2 > 2	 False
2 >= 2	 True
2 == 2	 True
2 == 2.0 True
2 != 1.9 True

In [12]:

print(True and False)
print(True or True)
print(not False)

False
True
True
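Comparisons and logical operations can be combined. The sketch below, using an arbitrary value x, shows a chained comparison and a compound condition, as one possible illustration.

```python
x = 7  # arbitrary value chosen for illustration

in_range = 0 < x < 10                    # same as (0 < x) and (x < 10)
even_or_big = (x % 2 == 0) or (x > 5)    # True as soon as one side is True
print(in_range, even_or_big)             # True True

# Booleans behave like the integers 1 and 0 in arithmetic
print(True + True)                       # 2
```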

Lists¶

Lists are the base element for sequences of variables in Python; they are themselves a variable type.

The syntax to write them is [ ... , ... ]
The elements need not all have the same type
The indices begin at $0$ (l[0] is the first element of l)
Lists can be nested (lists of lists of ...)

Warning: Another type called tuple, with the syntax ( ... , ... ), exists in Python. It has almost the same structure as a list, with the notable exception that one cannot add, remove, or replace elements of a tuple. We will see tuples briefly later.

In [13]:

l = [1, 2, 3, [4,8] , True , 2.3]
print(l, type(l))

[1, 2, 3, [4, 8], True, 2.3] <class 'list'>

In [14]:

print(l[0],type(l[0]))
print(l[3],type(l[3]))
print(l[3][1],type(l[3][1]))

1 <class 'int'>
[4, 8] <class 'list'>
8 <class 'int'>

In [15]:

print(l)
print(l[4:]) # l[4:] is l from the position 4 (included)
print(l[:5]) # l[:5] is l up to position 5 (excluded)
print(l[4:5]) # l[4:5] is l between 4 (included) and 5 (excluded) so just 4
print(l[1:6:2])  # l[1:6:2] is l between 1 (included) and 6 (excluded) by steps of 2 thus 1,3,5
print(l[::-1]) # reversed order
print(l[-1]) # last element

[1, 2, 3, [4, 8], True, 2.3]
[True, 2.3]
[1, 2, 3, [4, 8], True]
[True]
[2, [4, 8], 2.3]
[2.3, True, [4, 8], 3, 2, 1]
2.3

Operations on lists¶

One can easily add, insert, remove, count, or test whether an element is in a list.

In [16]:

l.append(10)   # Add an element to l (the list is not copied, it is actually l that is modified)
print(l)

[1, 2, 3, [4, 8], True, 2.3, 10]

In [17]:

l.insert(1,'u')   # Insert an element at position 1 in l (the list is not copied, it is actually l that is modified)
print(l)

[1, 'u', 2, 3, [4, 8], True, 2.3, 10]

In [18]:

l.remove(10) # Remove the first element 10 of l 
print(l)

[1, 'u', 2, 3, [4, 8], True, 2.3]

In [19]:

print(len(l)) # length of a list
print(2 in l)  # test if 2 is in l

7
True
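Counting and a few related list methods can be sketched as follows. The list values is hypothetical and only serves the illustration; count, index and pop are standard list methods.

```python
values = [3, 1, 3, 7, 3]   # hypothetical list for illustration

print(values.count(3))     # number of occurrences of 3 -> 3
print(values.index(7))     # position of the first 7 -> 3

last = values.pop()        # removes and returns the last element
print(last, values)        # 3 [3, 1, 3, 7]

first = values.pop(0)      # removes and returns the element at position 0
print(first, values)       # 3 [1, 3, 7]
```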

Handling lists¶

Lists are pointer-like types, meaning that if you write l2=l, you do not copy l to l2 but rather copy the pointer, so modifying one will modify the other.

The proper way to copy a list is to build a new one, with list(l) or with the dedicated copy method of list variables.

In [20]:

l2 = l 
l.append('Something')
print(l,l2)

[1, 'u', 2, 3, [4, 8], True, 2.3, 'Something'] [1, 'u', 2, 3, [4, 8], True, 2.3, 'Something']

In [21]:

l3 = list(l) # l.copy() works in Python 3 
l.remove('Something')
print(l,l3)

[1, 'u', 2, 3, [4, 8], True, 2.3] [1, 'u', 2, 3, [4, 8], True, 2.3, 'Something']
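One caveat: list(l) and l.copy() make a shallow copy, so nested lists are still shared between the original and the copy. The sketch below, on a hypothetical nested list, contrasts this with copy.deepcopy from the standard copy module.

```python
import copy

original = [1, [4, 8]]           # hypothetical nested list
shallow = list(original)         # new outer list, same inner list
deep = copy.deepcopy(original)   # inner lists are copied too

original[1].append(99)           # modify the nested list in place

print(original)  # [1, [4, 8, 99]]
print(shallow)   # [1, [4, 8, 99]]  (the inner list is shared)
print(deep)      # [1, [4, 8]]      (fully independent)
```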

You can have empty lists and concatenate lists by simply using the + operator, or even repeat them with * .

In [22]:

l4 = []
l5 =[4,8,10.9865]
print(l+l4+l5)
print(l5*3)

[1, 'u', 2, 3, [4, 8], True, 2.3, 4, 8, 10.9865]
[4, 8, 10.9865, 4, 8, 10.9865, 4, 8, 10.9865]

Tuples, Dictionaries [*]¶

Tuples are similar to lists but are created with (...,...) or simply commas. They cannot be changed once created.

In [23]:

t = (1,'b',876876.908)
print(t,type(t))
print(t[0])

(1, 'b', 876876.908) <class 'tuple'>
1

In [24]:

a,b = 12,[987,98987]
u = a,b
print(a,b,u)

12 [987, 98987] (12, [987, 98987])

In [25]:

try:
    u[1] = 2
except Exception as error: 
    print(error)

'tuple' object does not support item assignment

Dictionaries store key-value pairs with the syntax {key1 : value1, ...}

This type is often used as a return type in libraries.

In [26]:

d = {"param1" : 1.0, "param2" : True, "param3" : "red"}
print(d,type(d))

{'param1': 1.0, 'param2': True, 'param3': 'red'} <class 'dict'>

In [27]:

print(d["param1"])
d["param1"] = 2.4
print(d)

1.0
{'param1': 2.4, 'param2': True, 'param3': 'red'}
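A few other common dictionary operations are sketched below on a sample dictionary (the keys are reused from above for convenience): adding a key, testing membership, reading with a default value, iterating, and removing a key.

```python
params = {"param1": 1.0, "param2": True}   # sample dictionary

params["param3"] = "red"           # add a new key
print("param2" in params)          # membership tests look at the keys -> True
print(params.get("missing", 0))    # default value instead of a KeyError -> 0

for key, value in params.items():  # iterate over key-value pairs
    print(key, "->", value)

removed = params.pop("param1")     # remove a key and get its value
print(removed, list(params.keys()))   # 1.0 ['param2', 'param3']
```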

Strings and text formatting¶

Strings are delimited with single or double quotes. They can broadly be handled the same way as lists (indexing, slicing, concatenation; see above), except that they cannot be modified in place.
print displays (tuples of) variables (not necessarily strings).
To include variables in a string, it is preferable to use the format method.

Warning: text formatting, and notably the print function, is one of the major differences between Python 2 and Python 3. The method presented here is clean and works in both versions.

In [28]:

s = "test"
print(s,type(s))

test <class 'str'>

In [29]:

print(s[0])
print(s + "42")

t
test42

In [30]:

print(s,42)
print(s+"42")

test 42
test42

In [31]:

try:
    print(s+42)
except Exception as error: 
    print(error)

can only concatenate str (not "int") to str

The format method

In [32]:

print( "test {}".format(42) )

test 42

In [33]:

print( "test with an int {:d}, a float {} (or {:e} which is roughly {:.1f})".format(4 , 3.141 , 3.141 , 3.141 ))

test with an int 4, a float 3.141 (or 3.141000e+00 which is roughly 3.1)
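In Python 3.6 and later, formatted string literals (f-strings) accept the same format specifications as the format method, with the variables written directly inside the braces. They do not work in Python 2. The sketch below, with sample values, compares the two forms.

```python
value = 3.141   # sample values for illustration
count = 4

with_format = "int {:d}, float {:.1f}".format(count, value)
with_fstring = f"int {count:d}, float {value:.1f}"

print(with_format)    # int 4, float 3.1
print(with_fstring)   # int 4, float 3.1
```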

2- Branching and Loops¶

If, Elif, Else¶

In Python, branching is written if condition: (mind the :) followed by an indented block (by convention, four spaces) containing what is executed if the condition is true. Indentation is essential and at the core of Python.

In [34]:

statement1 = False
statement2 = False

if statement1:
    print("statement1 is True")
elif statement2:
    print("statement2 is True")
else:
    print("statement1 and statement2 are False")

statement1 and statement2 are False

In [35]:

statement1 = statement2 = True

if statement1:
    if statement2:
        print("both statement1 and statement2 are True")

both statement1 and statement2 are True

In [36]:

if statement1:
    if statement2:
    # print("both statement1 and statement2 are True") # Bad indentation! Uncommenting this line would cause an error
        print("here it is ok")
    print("after the previous line, here also")

here it is ok
after the previous line, here also

In [37]:

statement1 = True 

if statement1:
    print("printed if statement1 is True")
    
    print("still inside the if block")

printed if statement1 is True
still inside the if block

In [38]:

statement1 = False 

if statement1:
    print("printed if statement1 is True")
    
print("outside the if block")

outside the if block

For loop¶

The syntax of for loops is for x in something: followed by an indented block containing what will be executed at each iteration.

The something above can be of different natures: list, dictionary, etc.

In [39]:

for x in [1, 2, 3]:
    print(x)

1
2
3

In [40]:

sentence = ""
for word in ["Python", "for", "data", "Science"]:
    sentence = sentence + word + " "
print(sentence)

Python for data Science
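The loops above iterate over lists. As a brief sketch on made-up data, the same syntax also works on a dictionary (which yields its keys) and on a string (which yields its characters).

```python
ages = {"Ada": 36, "Alan": 41}   # hypothetical data

for name in ages:                # iterating over a dictionary yields its keys
    print(name, ages[name])

letters = []
for char in "abc":               # iterating over a string yields its characters
    letters.append(char.upper())
print(letters)                   # ['A', 'B', 'C']
```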

A useful function is range, which generates sequences of numbers that can be used in loops.

In [41]:

print("Range (from 0) to 4 (excluded) ")
for x in range(4): 
    print(x)   

print("Range from 2 (included) to 6 (excluded) ")
for x in range(2,6): 
    print(x)

print("Range from 1 (included) to 12 (excluded) by steps of 3 ")
for x in range(1,12,3): 
    print(x)

Range (from 0) to 4 (excluded) 
0
1
2
3
Range from 2 (included) to 6 (excluded) 
2
3
4
5
Range from 1 (included) to 12 (excluded) by steps of 3 
1
4
7
10

If the index is needed along with the value, the function enumerate is useful.

In [42]:

for idx, x in enumerate(range(-3,3)):
    print(idx, x)

0 -3
1 -2
2 -1
3 0
4 1
5 2

While loop¶

Similarly to for loops, the syntax is while condition: followed by an indented block containing what will be executed as long as the condition is true.

In [43]:

i = 0

while i<5:
    print(i)
    i+=1

0
1
2
3
4

Try [*]¶

When a command may fail, you can try to execute it and optionally catch the Exception (i.e. the error).

In [44]:

a = [1,2,3]
print(a)

try:
    a[1] = 3
    print("command ok")
except Exception as error: 
    print(error)
    
print(a) # The command went through

try:
    a[6] = 3
    print("command ok")
except Exception as error: 
    print(error)
    
print(a) # The command failed

[1, 2, 3]
command ok
[1, 3, 3]
list assignment index out of range
[1, 3, 3]
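Rather than catching every Exception, it is often clearer to catch only the error you expect. The sketch below uses a hypothetical helper, safe_divide, to show a specific except clause together with the optional else and finally clauses.

```python
def safe_divide(a, b):
    """Hypothetical helper: return a / b, or None if b is zero."""
    try:
        result = a / b
    except ZeroDivisionError as error:
        print("cannot divide:", error)
        return None
    else:
        # Runs only when no exception was raised
        return result
    finally:
        # Runs in every case, even when a return statement is executed
        print("division attempted")

print(safe_divide(6, 3))   # 2.0
print(safe_divide(1, 0))   # None
```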

3- Functions¶

In Python, a function is defined as def function_name(function_arguments): followed by an indented block containing the body of the function. (Return values are not declared in the definition; without a return statement, the function returns None.)

In [45]:

def fun0():
    print("\"fun0\" just prints")

fun0()

"fun0" just prints

A docstring can be added to document the function; it will appear when calling help.

In [46]:

def fun1(l):
    """
    Prints a list and its length
    """
    print(l, " is of length ", len(l))
    
fun1([1,'iuoiu',True])

[1, 'iuoiu', True]  is of length  3

In [47]:

help(fun1)

Help on function fun1 in module __main__:

fun1(l)
    Prints a list and its length

Outputs¶

return outputs a variable, tuple, dictionary, ...

In [48]:

def square(x):
    """
    Return x squared.
    """
    return(x ** 2)

help(square)
res = square(12)
print(res)

Help on function square in module __main__:

square(x)
    Return x squared.

144

In [49]:

def powers(x):
    """
    Return the first powers of x.
    """
    return(x ** 2, x ** 3, x ** 4)

help(powers)

Help on function powers in module __main__:

powers(x)
    Return the first powers of x.

In [50]:

res = powers(12)
print(res, type(res))

(144, 1728, 20736) <class 'tuple'>

In [51]:

two,three,four = powers(3)
print(three,type(three))

27 <class 'int'>

In [52]:

def powers_dict(x):
    """
    Return the first powers of x as a dictionary.
    """
    return {"two": x ** 2, "three": x ** 3, "four": x ** 4}


res = powers_dict(12)
print(res, type(res))
print(res["two"],type(res["two"]))

{'two': 144, 'three': 1728, 'four': 20736} <class 'dict'>
144 <class 'int'>

Arguments¶

It is possible to

Give the arguments in any order, provided that you write the corresponding argument name
Set default values for arguments so that they become optional

In [53]:

def fancy_power(x, p=2, debug=False):
    """
    Here is a fancy version of power that computes the square of the argument or other powers if p is set
    """
    if debug:
        print( "\"fancy_power\" is called with x =", x, " and p =", p)
    return(x**p)

In [54]:

print(fancy_power(5))
print(fancy_power(5,p=3))

25
125

In [55]:

res = fancy_power(p=8,x=2,debug=True)
print(res)

"fancy_power" is called with x = 2  and p = 8
256

4- Classes [*]¶

Classes are at the core of object-oriented programming; they are used to represent an object with related attributes (variables) and methods (functions).

They are defined like functions but with the keyword class: class my_class(object): followed by an indented block. The definition of a class usually contains some methods:

The first argument of a method is, by convention, named self and refers to the object itself.
Some method names have a specific meaning:
- __init__: method executed at the creation of the object
- __str__ : method executed to represent the object as a string, for instance when the object is passed to the function print

In [56]:

class Point(object):
    """
    Class of a point in the 2D plane.
    """
    def __init__(self, x=0.0, y=0.0):
        """
        Creation of a new point at position (x, y).
        """
        self.x = x
        self.y = y
        
    def translate(self, dx, dy):
        """
        Translate the point by (dx , dy).
        """
        self.x += dx
        self.y += dy
        
    def __str__(self):
        return("Point: ({:.2f}, {:.2f})".format(self.x, self.y))

In [57]:

p1 = Point()
print(p1)

p1.translate(3,2)
print(p1)

p2 = Point(1.2,3)
print(p2)

Point: (0.00, 0.00)
Point: (3.00, 2.00)
Point: (1.20, 3.00)
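Other special method names exist. As an illustration, the sketch below defines a hypothetical variant of the Point class above, named Point2D, with an ordinary method distance_to and the special method __eq__, which is called by the == operator.

```python
import math

class Point2D(object):
    """Hypothetical variant of the Point class above, for illustration."""
    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y

    def distance_to(self, other):
        """Euclidean distance between self and another point."""
        return math.hypot(self.x - other.x, self.y - other.y)

    def __eq__(self, other):
        """Called by ==; two points are equal if their coordinates match."""
        return (self.x, self.y) == (other.x, other.y)

a = Point2D(0, 0)
b = Point2D(3, 4)
print(a.distance_to(b))             # 5.0
print(a == b, b == Point2D(3, 4))   # False True
```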

5- Reading and writing files¶

open returns a file object, and is most commonly used with two arguments: open(filename, mode).

The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used (it is optional; 'r' is assumed if it is omitted):

'r' when the file will only be read
'w' for only writing (an existing file with the same name will be erased)
'a' opens the file for appending; any data written to the file is automatically added to the end

In [58]:

f = open('./data/test.txt', 'w')
print(f)

<_io.TextIOWrapper name='./data/test.txt' mode='w' encoding='UTF-8'>

f.write(string) writes the contents of string to the file.

In [59]:

f.write("This is a test\n")

Out[59]:

15

In [60]:

f.close()

Warning: For the file to be actually written, and to be able to open and modify it again safely, it is essential to close the file handle with f.close().
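A common way to make sure the file is always closed is the with statement, which closes the file automatically at the end of the block. The sketch below writes to a temporary folder (an assumption made so that it runs without a ./data directory) and reads the content back.

```python
import os
import tempfile

# Temporary location so the sketch does not depend on a ./data folder
path = os.path.join(tempfile.mkdtemp(), "test.txt")

with open(path, "w") as f:
    f.write("This is a test\n")
# The file is closed here, even if an error occurred inside the block
print(f.closed)          # True

with open(path, "r") as f:
    content = f.read()
print(repr(content))     # 'This is a test\n'
```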

f.read() will read an entire file and put the pointer at the end.

In [61]:

f = open('./data/text.txt', 'r')
f.read()

Out[61]:

'This is an example file\nMade specially for this course\nThis is already the third line\nLine 4\nTHE END\n'

In [62]:

f.read()

Out[62]:

''

The end of the file has been reached, so the command returns ''.

To get to the top, use f.seek(offset, from_what). The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point. Thus f.seek(0) goes to the top.
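The three reference points can be sketched as follows. The example uses a hypothetical temporary file opened in binary mode, because in text mode only seeks relative to the beginning of the file are allowed (apart from f.seek(0, 2), which goes to the very end).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek_demo.bin")  # hypothetical file
with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "rb") as f:
    f.seek(3)                 # 3 bytes from the beginning (from_what = 0)
    print(f.read(2))          # b'34'
    f.seek(2, 1)              # 2 bytes forward from the current position
    print(f.read(1))          # b'7'
    f.seek(-3, 2)             # 3 bytes before the end of the file
    tail = f.read()
    print(tail)               # b'789'
    position = f.tell()       # current position after reading to the end
    print(position)           # 10
```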

In [63]:

f.seek(0)

Out[63]:

0

f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string.

In [64]:

f.readline()

Out[64]:

'This is an example file\n'

In [65]:

f.readline()

Out[65]:

'Made specially for this course\n'

For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

In [66]:

f.seek(0)
for line in f:
    print(line)

f.close()

This is an example file

Made specially for this course

This is already the third line

Line 4

THE END
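The blank lines in the output above appear because each line keeps its trailing newline and print adds another one. A minimal sketch of one way to avoid this is shown below; it uses io.StringIO as an in-memory stand-in for a file, with made-up content.

```python
import io

# In-memory stand-in for a text file (made-up content)
f = io.StringIO("This is an example file\nLine 2\nTHE END\n")

lines = []
for line in f:
    clean = line.rstrip("\n")   # drop the newline kept at the end of each line
    lines.append(clean)
    print(clean)

f.close()
```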

6- Exercises¶

Exercise 1: Odd or Even

The code snippet below enables the user to enter a number. Check if this number is odd or even. Optionally, handle bad inputs (characters, floats, signs, etc.).

In [67]:

num = input("Enter a number: ")
print(num)

Enter a number: 3
3

In [ ]:

Exercise 2: Fibonacci

The Fibonacci sequence is a sequence of numbers where the next number in the sequence is the sum of the previous two numbers in the sequence. The sequence looks like this: 1, 1, 2, 3, 5, 8, 13. Write a function that generates a given number of elements of the Fibonacci sequence.

In [ ]:

Exercise 3: Implement quicksort

The wikipedia page describing this sorting algorithm gives the following pseudocode:

function quicksort('array')
   if length('array') <= 1
        return 'array'
   select and remove a pivot value 'pivot' from 'array'
   create empty lists 'less' and 'greater'
   for each 'x' in 'array'
       if 'x' <= 'pivot' then append 'x' to 'less'
       else append 'x' to 'greater'
   return concatenate(quicksort('less'), 'pivot', quicksort('greater'))

Create a function that sorts a list using quicksort

In [68]:

def quicksort(l):
    # ...
    return None

res = quicksort([-2, 3, 5, 1, 3])
print(res)

None

Exercise 4: Project Euler

Project Euler is a competitive programming website, mainly based on finding clever solutions to mathematical problems that would otherwise be computation-intensive. It is a good way to learn a new scientific programming language. Problem 1 reads

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Write a script that solves this problem.

In [ ]:

You can continue by solving other problems. The first ones (e.g. 4, 31) are the easiest.

In [ ]: