Imported stuff from python_tutorial.ipynb.
Python is a popular language for data analysis because of the numerous functions it provides for data management, data visualization, and statistics.
Learning a few basic Python constructs like the for loop
will enable you to simulate probability distributions and experimentally verify how statistics procedures work.
This is a really big deal!
It's good to know the statistical formulas and recipes,
but it's even better when you can run your own simulations and check when the formulas work and when they fail.
Once you learn the basics of Python syntax,
you'll have access to the best-in-class tools for
data management (Pandas, see pandas_tutorial.ipynb),
data visualization (Seaborn, see seaborn_tutorial.ipynb),
and statistics (scipy and statsmodels).
Don't worry, there won't be any advanced math, just sums, products, exponents, logs, and square roots.
Nothing fancy, I promise.
If you've ever created a formula in a spreadsheet,
then you're familiar with all the operations we'll see.
In a spreadsheet formula you'd use SUM(...); in Python we write sum(...).
You see, it's nothing fancy.
Yes, there will be a lot of code (paragraphs of Python commands) in this tutorial, but you can totally handle this. If you ever start to freak out and think "OMG this is too complicated!", remember that Python is just a fancy calculator.
All in all, learning Python gives you lots of tools to help you understand math and science topics.
take advantage of everything Python has to offer for data analysis and statistics.
You can run JupyterLab on your computer or run JupyterLab on a remote server using a binder link.
Alternatives: If you don't want to install anything on your computer yet, you have two other options for playing with this notebook:
Live Code option to make all the cells in this notebook interactive.
The notebook interface offers many useful features, but for now, I just want you to think of notebooks as an easy way to run Python code.
Here is another example of an expression that uses the function len to compute the length of a list of numbers.
len([1, 2, 3])
3
Here the function len received the list of numbers [1, 2, 3] as input, and produced 3 as output, which is the length of the list.
To store the result of an expression as a variable,
we use the assignment operator = as follows, from left to right:
=
(which stands for assign to)
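The code cell under discussion is not reproduced here, so the following is only a sketch of what such a cell could look like. The variable names come from the text; all the values are assumptions chosen for illustration.

```python
# Hypothetical values; only the variable names come from the text.
score = 98                             # an integer
average = 77.5                         # a float
scores = [61, 79, 98, 72]              # a list of integers
message = "Hello everyone"             # a string
above_the_average = score > average    # a boolean (True or False)
```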
Running the above code cell doesn't print anything,
because we have only defined variables: score, average, scores, message, and above_the_average,
but we didn't display them.
We'll use the generic variable name obj to refer to an object of any type.
Notebooks allow us to run Python commands interactively, which is the best way to learn! Try some Python commands to get a feeling for how code cells work.
Remember you can click the play button in the toolbar (the first button in box (4) in the screenshot) or use the keyboard shortcut SHIFT+ENTER to run the code.
I encourage you to play around with the notebook execution buttons in box (4).
The Python syntax for functions is inspired by the syntax used for math functions, so we'll start with a quick overview of the concept of a function in mathematics. The convention in math is to call functions with single letters like $f$, $g$, $h$, etc. We denote the function's inputs as $x$ and its outputs as $y$.
We define the function $f$ by writing an expression to compute for a given input $x$.
$$ y = f(x) = \text{some expression involving } x $$
For example, $f(x) = 2x+3$.
Once we have defined the function $f$, we can evaluate the function for any possible input $x$. For example, the value of the function $f$ when $x=5$ is denoted $f(5)$ and is equal to $2(5) +3 = 10 + 3 = 13$. In other words, $f(5) = 13$.
Functions in Python are similar to functions in math:
...
In the above example, we intentionally chose the function name, and the names of its input and output, to highlight the connection with the math function example we saw earlier.
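A function definition matching this description could look like the sketch below. It assumes the math function $f(x) = 2x+3$ from earlier, and the names f, x, and y are chosen to mirror the math notation.

```python
def f(x):
    # compute the output y from the input x
    y = 2*x + 3
    return y

# Evaluate the function at the input x=5, like f(5) in math notation
f(5)
```

Calling f(5) gives 13, which agrees with the calculation $2(5) + 3 = 13$ done by hand.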
plotnine: another high-level library for data visualization based on the grammar of graphics principles
scikit-learn: tools and algorithms for machine learning
We'll cover all essential topics required to get to know Python, including:
Getting started: where we'll install the JupyterLab Desktop coding environment.
Expressions and variables: basic building blocks of any program.
Getting comfortable with Python: looking around and getting help.
Lists and for loops: repeating steps and procedures.
Functions: reusable code blocks.
Other data structures: sets, tuples, etc.
Objects and classes: creating custom objects.
Python grammar and syntax: review of all the syntax.
Python libraries and modules: learn why people say Python comes with "batteries included"
After you're done with this tutorial, you'll be ready to read the other two:
It's important for you to try solving the exercises that you'll encounter as you read along. The exercises are a great way to practice what you've been learning.
## ALT. display both value and type on the same line (as a tuple)
# score, type(score)
Python is a "civilized" language
We'll now learn about some of these tools, including "doc strings" (help menus) and the different ways of learning
what attributes and methods are available to use.
Above all, Python has a culture of being beginner friendly so
This combination of tools allows programmers to answer common questions about Python objects and functions without leaving the JupyterLab environment. Basically, in Python all the necessary info is accessible directly in the coding environment. For example, at the end of this section you'll be able to answer the following questions on your own:
print expect?
print accept?
obj have?
More than 50% of any programmer's time is spent looking at help info and trying to understand the variables, functions, objects, and methods they are working with, so it's important for you to learn these meta-skills.
You can also add longer, multi-line comments using triple-quoted text.
"""
This is a longer comment,
which is written on two lines.
"""
'\nThis is a longer comment,\nwhich is written on two lines.\n'
The doc-strings we talked about earlier
were created by this kind of multi-line string included in the source code of the functions abs, len, sum, print, etc.
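To see the mechanism at work, here is a small sketch that defines a hypothetical function double with a triple-quoted string as its first statement, then reads that text back through the __doc__ attribute. The function name and the wording of its doc-string are made up for this illustration.

```python
def double(x):
    """
    Return two times the input x.
    """
    return 2 * x

# The triple-quoted string is stored on the function object
print(double.__doc__)
```

Calling help(double) displays the same text, which is how the help menus for built-in functions are produced.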
#### More exceptions
# ValueError
# int("zz")
# ZeroDivisionError
# 5/0
# KeyError
# d = {}
# d["zz"]
# ImportError
# from math import zz
# AttributeError
# "hello".zz
The computer doesn't like what you entered. The output is a big red box, that tells you your input was REJECTED!
If you type invalid syntax, refer to variables that don't exist, or otherwise input something that Python doesn't like, Python will throw an "exception," which is like saying "Yo, I don't understand this, or I can't run this, or the code refers to some data that doesn't exist, etc." You'll see the name of the error that occurred and a message to explain what went wrong.
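Exceptions don't have to end in a big red box. As a sketch (this technique is not part of the examples above), the code below triggers two of the listed errors on purpose inside try/except blocks, then prints the name of each exception together with its message.

```python
# Trigger a ValueError on purpose and catch it
try:
    int("zz")
except ValueError as e:
    error_name = type(e).__name__   # the name of the exception class
    error_message = str(e)          # the explanation of what went wrong
    print(error_name, "-", error_message)

# Same idea for division by zero
try:
    5 / 0
except ZeroDivisionError as e:
    print(type(e).__name__, "-", e)
```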
abs function
Let's say you're reading some Python code written by a friend,
and they use the function abs in their code.
Suppose you've never seen this function before,
and you have no idea what it does.
# put cursor in the middle of function and press SHIFT+TAB
abs(-3)
3
We can also obtain the same information using the help() function on abs.
help(abs)
Help on built-in function abs in module builtins:

abs(x, /)
    Return the absolute value of the argument.
The help menu tells you that abs(x) is the absolute value function,
which is written $|x|$ in math notation, and defined as
$$ |x| = \begin{cases} x & \text{if } x \geq 0, \\ -x & \text{if } x < 0. \end{cases} $$
We refer to the help menu associated with an object as its "doc string",
since the information is stored as obj.__doc__.
abs.__doc__
'Return the absolute value of the argument.'
We've already used both type and print, so there is nothing new here.
I just wanted to remind you that you can always use these functions as a first line of inspection.
obj = 3
print(obj)
3
type(obj)
int
repr(obj)
'3'
Side note: You can think of a list as a special type of dictionary that has the integers 0, 1, 2, etc. as keys. Alternatively, you can think of dictionaries as "fancy lists" that allow keys to be arbitrary instead of being limited to sequential integer indices.
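The analogy can be made concrete with a short sketch. The names letters_list and letters_dict and their contents are made up for this illustration.

```python
letters_list = ["a", "b", "c"]
letters_dict = {0: "a", 1: "b", 2: "c"}

# Both containers are accessed using the same square-bracket syntax
letters_list[1], letters_dict[1]
```

Both lookups return "b". The difference is that the dictionary could just as well use keys like "first" or 3.14, while list indices are always 0, 1, 2, and so on.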
Recall the general syntax of an assignment statement is as follows:
<place> = <some expression>
In the above examples, the <place> refers to the location inside the profile dictionary identified by a particular key. In the first example,
we assigned the value 77 to the place profile["score"],
which modified the value that was previously stored there. In the second example we assigned the value 42 to the new place profile["age"], so Python created it.
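The examples being discussed are not reproduced in this section, so here is a minimal sketch of what they could look like. The keys "score" and "age" and the values 77 and 42 come from the text; the starting contents of profile are an assumption.

```python
# Hypothetical starting dictionary
profile = {"name": "Julie", "score": 98}

profile["score"] = 77   # the key exists: the old value 98 is replaced
profile["age"] = 42     # the key doesn't exist yet: Python creates it

profile
```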
True and True, True and False, False and True, False and False
(True, False, False, False)
True or True, True or False, False or True, False or False
(True, True, True, False)
We can also use the if-else keywords to write conditional expressions on a single line.
The general syntax for these is:
<value1> if <condition> else <value2>
This expression evaluates to <value1> if <condition> is True,
else it evaluates to <value2> (when <condition> is False).
temp = 25
msg = "It's hot!" if temp > 22 else "It's OK."
msg
"It's hot!"
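For comparison, here is a sketch of the same logic written as a regular multi-line if-else statement. It produces the same value of msg as the single-line conditional expression.

```python
temp = 25

if temp > 22:
    msg = "It's hot!"
else:
    msg = "It's OK."

msg
```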
Recall you can see a complete list of all the methods on list objects by typing scores. (the name of the list followed by a dot),
then pressing the TAB button to trigger the auto-complete suggestions.
Uncomment the following code block,
place your cursor after the dot,
and try pressing TAB to see what happens.
# scores.
Exercise 12: The default behaviour of the method .sort() is to
sort the elements in increasing order.
Suppose you want to sort the elements in decreasing order instead.
You can pass a keyword argument to the method .sort()
to request the sorting be done in "reverse" order (decreasing instead of increasing).
Consult the docstring of the .sort() method to find the name of the keyword argument that does this,
then modify the code below to sort the elements of the list scores in decreasing order.
scores = [61, 79, 98, 72]
scores.sort()
scores
[61, 72, 79, 98]
#@titlesolution Exercise 12 sorted-reverse
# help(scores.sort)
scores.sort(reverse=True)
scores
[98, 79, 72, 61]
In a Jupyter notebook,
you can run the command %whos
to print all the variables and functions that are defined in the current namespace.
# %whos
Exercise 7: Display the doc-string of the function sum.
#@titlesolution Exercise 7 help-sum
help(sum)
Help on built-in function sum in module builtins:

sum(iterable, /, start=0)
    Return the sum of a 'start' value (default: 0) plus an iterable of numbers

    When the iterable is empty, return the start value.
    This function is intended specifically for use with numeric values and may
    reject non-numeric types.
Exercise 8: Display the doc-string of the function print.
#@titlesolution Exercise 8 help-print
help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
We can choose a different separator between arguments of the print function
by specifying the value for the keyword argument sep.
x = 3
y = 2.3
print(x, y)
3 2.3
print(x, y, sep=" --- ")
3 --- 2.3
The formula for the sample standard deviation of a list of numbers is: $$ \text{std}(\textbf{x}) = s = \sqrt{ \tfrac{1}{n-1}\sum_{i=1}^n (x_i-\overline{x})^2 } = \sqrt{ \tfrac{1}{n-1}\left[ (x_1-\overline{x})^2 + (x_2-\overline{x})^2 + \cdots + (x_n-\overline{x})^2\right]}. $$
Note the division is by $(n-1)$ and not $n$. Strange, no? You'll have to wait until stats to see why this is the case.
Write a function std(numbers) that computes the sample standard deviation.
import math

def mean(numbers):
    return sum(numbers)/len(numbers)

def std(numbers):
    """
    Computes the sample standard deviation (square root of the sample variance)
    using a for loop.
    """
    avg = mean(numbers)
    total = 0
    for number in numbers:
        total = total + (number-avg)**2
    var = total/(len(numbers)-1)
    return math.sqrt(var)
numbers = list(range(0,100))
std(numbers)
29.011491975882016
# compare to known good function...
import statistics
statistics.stdev(numbers)
29.011491975882016
Functions! Finally we get to the good stuff!
Functions allow us to package chunks of code that we can later reuse in other programs.
# # ALT. using the `range` function
# numbers = range(1,6)
# squares = [n**2 for n in numbers]
# squares
Python sets are a representation of the mathematical sets, that is, collections of elements.
s = set()
s
set()
s.add(3)
s.add(5)
s
{3, 5}
3 in s
True
print("The set s contains the elements:")
for el in s:
    print(el)
The set s contains the elements:
3
5
Sets are sometimes useful when we want to keep track of which elements have been encountered, but don't care how many times.
s.add(3)
s.add(3)
s
{3, 5}
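As a sketch of this use case, the code below goes through a list that contains repeated entries and records which names have been encountered. The list visits and its contents are made-up data.

```python
visits = ["ana", "bob", "ana", "cy", "bob", "ana"]   # made-up data

seen = set()
for name in visits:
    seen.add(name)      # adding a name that is already in the set has no effect

len(visits), len(seen)
```

The list has six entries, but the set ends up with only three elements, one for each distinct name.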
Tuples are similar to lists but with fewer features: once a tuple is created, its elements cannot be changed.
2,3
(2, 3)
(2,3)
(2, 3)
We can use the tuple syntax to assign to multiple variables on a single line:
x, y = 3, 4
We can also use tuples to "swap" two values.
# Swap the contents of the variables x and y
tmp = x
x = y
y = tmp
# Equivalent operation on one line
x, y = y, x
The Python keyword class can be used to define new kinds of objects.
Let's create a custom class of objects Interval
that represent intervals of real numbers like $[a,b] \subset \mathbb{R}$.
We want to be able to use the new interval objects in if statements
to check if a number $x$ is in the interval $[a,b]$ or not.
Recall the in operator that we can use to check if an element is part of a list:
>>> 3 in [1,2,3,4]
True
We want the new objects of type Interval to support the same kind of membership test.
Example usage:
>>> 3 in Interval(2,4)
True
>>> 5 in Interval(2,4)
False
The expression x in Y corresponds to calling the method __contains__ on the container object Y:
Y.__contains__(x)
and it will return a boolean value (True or False).
If we want to support checks like 3 in Interval(2,4),
we therefore have to implement the method __contains__ on the Interval class.
class Interval:
    """
    Object that embodies the mathematical concept of an interval.
    `Interval(a,b)` is equivalent to math interval [a,b] = {𝑥 | 𝑎 ≤ 𝑥 ≤ 𝑏}.
    """

    def __init__(self, lowerbound, upperbound):
        """
        This method is called when the object is created, and is used to
        set the object attributes from the arguments passed in.
        """
        self.lowerbound = lowerbound
        self.upperbound = upperbound

    def __str__(self):
        """
        Return a representation of the interval as a string like "[a,b]".
        """
        return "[" + str(self.lowerbound) + "," + str(self.upperbound) + "]"

    def __contains__(self, x):
        """
        This method is called to check membership using the `in` keyword.
        """
        return self.lowerbound <= x and x <= self.upperbound

    def __len__(self):
        """
        This method will get called when you call `len` on the object.
        """
        return self.upperbound - self.lowerbound
Create an object that corresponds to the interval $[2,4]$.
interval2to4 = Interval(2,4)
interval2to4
<__main__.Interval at 0x1041639d0>
type(interval2to4)
__main__.Interval
str(interval2to4)
'[2,4]'
3.3 in interval2to4
True
1 in interval2to4
False
len(interval2to4)
2
If you're on macOS or Linux you can ignore this section—skip to the next section Data management with Pandas.
File paths on Windows use the backslash character (\) as path separator,
while UNIX operating systems like Linux and macOS use the forward slash / as path separator.
If you're on Windows, you'll need to manually edit the code examples below to make them work by replacing all occurrences of "/" with "\\". The double backslash is required to get a literal backslash because the character \ has special meaning as an escape character.
import os

if os.path.sep == "/":
    print("You're on a UNIX system (Linux or macOS).")
    print("Enjoy civilization!")
elif os.path.sep == "\\":
    print("You're on Windows so you should use \\ as path separator.")
    print("Replace any occurrence of / (forward slash) in paths with \\\\ (double-backslash).")
You're on a UNIX system (Linux or macOS).
Enjoy civilization!
The current working directory is a path on your computer where this notebook is running. The code cell below shows how you can get your current working directory.
os.getcwd()
'/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/tutorials'
You're in the notebooks/ directory, which is inside the parent directory noBSstats/.
The datasets we'll be using in this notebook are located in the datasets/ directory, which is a sibling of the notebooks/ directory, inside the parent noBSstats/. To access the data file players.csv in the datasets/ directory from the current directory, we must specify a path that includes the .. directive (go to parent), then go into the datasets directory, then open the file players.csv.
This combination of "directions" for getting to the file will look different if you're on a Windows system or on a UNIX system. The code below shows the correct path you should access.
if os.path.sep == "/":
    # UNIX path separators
    path = "../datasets/players.csv"
else:
    # Windows path separators
    path = "..\\datasets\\players.csv"
print("The path to the file players.csv in the datasets/ directory is")
path
The path to the file players.csv in the datasets/ directory is
'../datasets/players.csv'
All the code examples provided below assume you're on a UNIX system, so if you're on Windows you'll need to manually modify them to use double backslashes in path strings for the code to work.
# ALT.
import os
import pandas as pd
filepath = os.path.join("..", "datasets", "players.csv")
players = pd.read_csv(filepath)
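Another option, not used in this notebook, is the pathlib module from the Python standard library. The sketch below builds the same relative path using the / operator on Path objects, which picks the right separator for the operating system the code is running on.

```python
import os
from pathlib import Path

# Build the path piece by piece; no separators are hard-coded
filepath = Path("..") / "datasets" / "players.csv"

# Converting to a string gives the same result as os.path.join
str(filepath) == os.path.join("..", "datasets", "players.csv")
```

Functions that expect a file path, such as open and the Pandas function read_csv, accept Path objects directly, so the conversion to a string is only needed for display or comparison.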
The code "".join(msgs) can be used to concatenate a list of strings msgs.
msgs = ["Hello", "Hi", "Whatsup?"]
"".join(msgs)
'HelloHiWhatsup?'
# join together using one space " " as separator
" ".join(msgs)
'Hello Hi Whatsup?'
open and the readlines() method
TODO: create a file called story.txt in the current working directory,
and write a few lines in it.
The code examples below are based on this sample story.txt
(use save as if you want to have the same file).
file = open("story.txt")
lines = file.readlines()
lines
['This is a short story.\n', 'It is very short.\n', 'It has only four lines.\n', 'It ends with the word cat.\n']
for line in lines:
    print(line)
This is a short story.

It is very short.

It has only four lines.

It ends with the word cat.

# we can pass a custom `end` keyword argument to avoid double newlines:
for line in lines:
    print(line, end="")
This is a short story.
It is very short.
It has only four lines.
It ends with the word cat.
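The examples above never close the file. A common pattern is to open files inside a with block, which closes the file automatically when the block ends. The sketch below first writes the sample story to a temporary directory (an assumption made so the example is self-contained), then reads it back and strips the trailing newline from each line.

```python
import os
import tempfile

story = (
    "This is a short story.\n"
    "It is very short.\n"
    "It has only four lines.\n"
    "It ends with the word cat.\n"
)

# Write the sample text to a temporary location
tmpdir = tempfile.mkdtemp()
storypath = os.path.join(tmpdir, "story.txt")
with open(storypath, "w") as file:
    file.write(story)

# Read the lines back; the file is closed when the `with` block ends
with open(storypath) as file:
    lines = [line.rstrip("\n") for line in file]

for line in lines:
    print(line)
```

Since the newline characters were removed when reading, print(line) no longer produces double newlines.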
Exercise: write the code that counts the number of words in the string text.
Here are some examples:
"Hello world" is 2.
"Whether it is nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles, and by opposing end them?" is 30.
len.__doc__ is 8.
Hint: the string method .split() might come in handy.
text = "Hello world"
# write here the code that counts the number of words in `text`
# ...
#@titlesolution
text = "Hello world"
#
# text = len.__doc__
#
# text = """Whether it is nobler in the mind to suffer
# the slings and arrows of outrageous fortune,
# or to take arms against a sea of troubles,
# and by opposing end them?"""
words = text.split()
wordcount = 0
for word in words:
    wordcount = wordcount + 1
wordcount
2
Exercise: write the Python code that opens the file story.txt, reads the contents of the file into a string text, then computes its word count.
Hint: reuse the code we saw earlier for opening a file.
Hint 2: try the .read() method on the file object.
Hint 3: reuse the code you wrote earlier for doing the word count of the string text.
file = open("story.txt")
text = file.read()
# ... (continue your code here)
#@titlesolution
file = open("story.txt")
text = file.read()
words = text.split()
wordcount = 0
for word in words:
    wordcount = wordcount + 1
wordcount
20
Given the string text, we want to count the number of occurrences of each word in the text.
The words "HELLO", "Hello", "Hello," should all be considered the same as "hello".
In other words, we want to convert all letters to lowercase and strip punctuation signs.
Given the string "HELLO Hello Hello, hello",
the bag of words representation corresponds to the dictionary
wcounts = {"hello":4}
text = "Hello world"
#
# text = len.__doc__
#
# text = """Whether it is nobler in the mind to suffer
# the slings and arrows of outrageous fortune,
# or to take arms against a sea of troubles,
# and by opposing end them?"""
words = text.split()
clean_words = [word.strip(",.?") for word in words]
words_lower = [word.lower() for word in clean_words]
wcounts = {}
for word in words_lower:
    if word not in wcounts:
        wcounts[word] = 0
    wcounts[word] += 1
wcounts
{'hello': 1, 'world': 1}
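Counting occurrences is a common enough task that the Python standard library provides a shortcut for it. As a sketch of an alternative (not the approach used above), the class Counter from the collections module builds the same word counts in one step.

```python
from collections import Counter

text = "HELLO Hello Hello, hello"

# Same cleanup steps as before: split, strip punctuation, lowercase
words = [word.strip(",.?").lower() for word in text.split()]

wcounts = Counter(words)
dict(wcounts)
```

For this input the result is the dictionary {"hello": 4}, which matches the bag of words representation described earlier.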
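Exercise: Write the Python code that computes the final score for each student, based on the data in this spreadsheet. The final score is computed as:
50% homeworks (five homeworks, each out of 20, so their sum is out of 100)
20% midterm (out of 100)
30% final (out of 100)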
Once you get the score for each student,
you should also convert it to a letter grade.
Print the student name, their final score, and their letter grade.
Hint: use a for loop to iterate over all students
Hint 2: reuse the code you wrote earlier for converting a numeric score to a letter grade
# via https://stackoverflow.com/a/33727897/127114
#
url_tpl = "https://docs.google.com/spreadsheets/d/{key}/gviz/tq?tqx=out:csv&sheet={name}"
sheet_id = "1_DRn3FXpLERVhO71pHsYbf_jwxQF8p54M6Niy3If3x0"
sheet_name = "Grades"
url = url_tpl.format(key=sheet_id, name=sheet_name)
import requests
response = requests.get(url)
# response.text
# Convert CSV text (contents of a file) to a list of dictionaries
import csv, io
studentsf = io.StringIO(response.text)
rows = list(csv.DictReader(studentsf))
# rows
for row in rows:
    # row is a dict containing a student's results
    # the keys are ['id', 'name', 'hw1', 'hw2', 'hw3', 'midterm',
    #               'hw4', 'hw5', 'final', 'score', 'grade']
    # the values are strings
    name = row["name"]     # access value under the key "name" in this row
    hw1 = int(row["hw1"])  # access key "hw1" and convert it to `int`
    print(type(row), len(row), "name =", name, "got", hw1, "on the first homework")
    # continue your code at the ... below
    # PART 1: computing the final score
    # PART 2: assigning a letter
    # ...
<class 'dict'> 11 name = Haydon Jeffery got 12 on the first homework
<class 'dict'> 11 name = Julie Beattie got 13 on the first homework
<class 'dict'> 11 name = Malachy Hull got 15 on the first homework
<class 'dict'> 11 name = Sheila Solis got 14 on the first homework
<class 'dict'> 11 name = Joni Rowe got 12 on the first homework
<class 'dict'> 11 name = Husna Millar got 11 on the first homework
<class 'dict'> 11 name = Tonya Fleming got 11 on the first homework
<class 'dict'> 11 name = Jak Rennie got 19 on the first homework
<class 'dict'> 11 name = Noor Odonnell got 14 on the first homework
<class 'dict'> 11 name = Krystal Dickerson got 13 on the first homework
<class 'dict'> 11 name = Joe Pickett got 3 on the first homework
<class 'dict'> 11 name = Alicia Rosario got 10 on the first homework
<class 'dict'> 11 name = Ailish Hensley got 13 on the first homework
<class 'dict'> 11 name = Aliyah Duncan got 12 on the first homework
<class 'dict'> 11 name = Jad Kumar got 15 on the first homework
<class 'dict'> 11 name = Margaret Parry got 14 on the first homework
<class 'dict'> 11 name = Danica Chen got 11 on the first homework
<class 'dict'> 11 name = Jose Hernandez got 13 on the first homework
<class 'dict'> 11 name = Rimsha Carlson got 20 on the first homework
<class 'dict'> 11 name = Giselle Thompson got 18 on the first homework
#@titlesolution
import csv
import io
studentsf = io.StringIO(response.text)
rows = list(csv.DictReader(studentsf))
for row in rows:
    # row is a dict containing a student's results
    # the keys are ['id', 'name', 'hw1', 'hw2', 'hw3', 'midterm',
    #               'hw4', 'hw5', 'final', 'score', 'grade']
    # the values are strings
    name = row["name"]  # access value under the key "name" in this row
    print("Processing results of", name, "...")

    # PART 1: computing the final score
    ####################################################################################
    # Convert all available student results to integers
    hw1 = int(row["hw1"])           # out of 20
    hw2 = int(row["hw2"])           # out of 20
    hw3 = int(row["hw3"])           # out of 20
    midterm = int(row["midterm"])   # out of 100
    hw4 = int(row["hw4"])           # out of 20
    hw5 = int(row["hw5"])           # out of 20
    final = int(row["final"])       # out of 100
    # compute homeworks average out of 100,
    # which is simple because each homework is out of 20 and there are 5 of them
    homeworks = hw1 + hw2 + hw3 + hw4 + hw5
    # we now need to make a "mix" of homeworks, midterm, and final
    # to create the student's final score (out of 100)
    score = 0.5*homeworks + 0.2*midterm + 0.3*final
    print(" - final score = ", score)

    # PART 2: assigning a letter
    ####################################################################################
    if score >= 85:
        grade = "A"
    elif score >= 80:
        grade = "A-"
    elif score >= 75:
        grade = "B+"
    elif score >= 70:
        grade = "B"
    elif score >= 65:
        grade = "B-"
    elif score >= 60:
        grade = "C+"
    elif score >= 55:
        grade = "C"
    elif score >= 50:
        grade = "D"
    else:
        grade = "F"
    print(" - final grade = ", grade)

    # PART 3: (optional) save the results in the `row` dictionary
    ####################################################################################
    row["score"] = score
    row["grade"] = grade
Processing results of Haydon Jeffery ...
 - final score = 79.39999999999999
 - final grade = B+
Processing results of Julie Beattie ...
 - final score = 69.19999999999999
 - final grade = B-
Processing results of Malachy Hull ...
 - final score = 78.8
 - final grade = B+
Processing results of Sheila Solis ...
 - final score = 77.8
 - final grade = B+
Processing results of Joni Rowe ...
 - final score = 81.6
 - final grade = A-
Processing results of Husna Millar ...
 - final score = 77.89999999999999
 - final grade = B+
Processing results of Tonya Fleming ...
 - final score = 74.69999999999999
 - final grade = B
Processing results of Jak Rennie ...
 - final score = 86.2
 - final grade = A
Processing results of Noor Odonnell ...
 - final score = 79.5
 - final grade = B+
Processing results of Krystal Dickerson ...
 - final score = 78.2
 - final grade = B+
Processing results of Joe Pickett ...
 - final score = 46.0
 - final grade = F
Processing results of Alicia Rosario ...
 - final score = 77.3
 - final grade = B+
Processing results of Ailish Hensley ...
 - final score = 76.8
 - final grade = B+
Processing results of Aliyah Duncan ...
 - final score = 77.8
 - final grade = B+
Processing results of Jad Kumar ...
 - final score = 81.5
 - final grade = A-
Processing results of Margaret Parry ...
 - final score = 76.5
 - final grade = B+
Processing results of Danica Chen ...
 - final score = 79.4
 - final grade = B+
Processing results of Jose Hernandez ...
 - final score = 76.19999999999999
 - final grade = B+
Processing results of Rimsha Carlson ...
 - final score = 90.0
 - final grade = A
Processing results of Giselle Thompson ...
 - final score = 89.4
 - final grade = A
# # Display last row
# rows[-1]
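Scores like 79.39999999999999 show up because floating-point numbers cannot represent most decimal fractions exactly. Here is a small sketch of two ways to tidy up the display, using the built-in function round and f-string formatting. The value is copied from the first student's result in the output of the solution.

```python
score = 79.39999999999999

rounded = round(score, 1)    # a float rounded to one decimal place
label = f"{score:.1f}"       # a string formatted with one decimal place

print(" - final score =", rounded)
```

Rounding only for display is usually the safer choice, because the letter-grade thresholds are then still applied to the unrounded score.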
Lists of booleans can be "joined" together
using and or or operations,
by calling the all and any list-related built-in functions.
all(conditions): and-together all elements in a list of conditions
any(conditions): or-together all elements in a list of conditions
# list of three conditions, all being True
alltrue = [True, True, True]
# list of conditions where only one condition is True
onetrue = [True, False, False]
# list of conditions that are all False
allfalse = [False, False, False]
all(alltrue), all(onetrue), all(allfalse)
(True, False, False)
any(alltrue), any(onetrue), any(allfalse)
(True, True, False)
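In practice, the list of conditions is usually computed from data rather than typed in by hand. The sketch below uses made-up scores and a passing threshold of 50 (both are assumptions) to check whether every student passed, and whether anyone scored above 90.

```python
scores = [61, 79, 98, 72]   # made-up data

# Build a list of booleans, one per score
passed = [score >= 50 for score in scores]

# The conditions can also be generated inline, without creating a list first
all(passed), any(score > 90 for score in scores)
```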
head
We often want to print the first few lines from a file to see what data it contains.
# TODO
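Since this section is still a TODO, here is one possible sketch. The function name head comes from the section title; the function's signature, its default of five lines, and the temporary sample file are all assumptions made for this illustration.

```python
import os
import tempfile

def head(filepath, n=5):
    """
    Return the first `n` lines of the file at `filepath` as a list of strings.
    """
    lines = []
    with open(filepath) as file:
        for line in file:
            if len(lines) >= n:
                break   # stop reading once we have n lines
            lines.append(line.rstrip("\n"))
    return lines

# Create a small sample file to try the function on
samplepath = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(samplepath, "w") as file:
    file.write("x,y\n1,2\n3,4\n5,6\n")

head(samplepath, n=2)
```

The loop stops as soon as n lines have been collected, so the function doesn't need to read the whole file, which matters when the file is large.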
Python supports an alternative syntax for defining functions, which is sometimes used to define simple functions. For example, let's say you need a function that computes the square of the input, which is written as $f(x) = x^2$ in math notation.
You can use the standard Python syntax ...
def f(x):
    return x**2
f
... or you can use the lambda expression for defining that function:
lambda x: x**2
The general syntax is lambda <function inputs>: <function output value>.
This lambda-shortcut for defining functions is useful
when calling other functions that expect functions as their arguments.
To illustrate what is going on,
let's define a Python function plot_function(f) that plots
the graph of the function f it receives as its input.
The graph of the function $f(x)$ is the set of points $(x,f(x))$ in the Cartesian plane, over an interval of $x$ inputs (by default, we'll use $x=-10$ as the starting point and $x=10$ as the end point).
import numpy as np
import matplotlib.pyplot as plt
def plot_function(f, xlims=[-10,10]):
    xstart, xend = xlims
    xs = np.linspace(xstart, xend, 1000)
    ys = [f(x) for x in xs]
    plt.plot(xs,ys)
If we want to use the function plot_function
to plot the graph of the function $f(x)=x^2$,
we can define a Python function f using the standard def-syntax
and then pass the function f to plot_function to generate the graph,
as shown below:
def f(x):
    return x**2
plot_function(f)
Alternatively,
we can use the lambda-shortcut Python syntax to define the function inline,
when calling plot_function.
plot_function(lambda x: x**2)
The lambda-expression lambda x: x**2
is equivalent to the Python function f
we defined using the two-line def-statement.
Both ways of defining f are the same type of object:
type(f)
function
type(lambda x: x**2)
function
Since f and lambda x: x**2 are both expressions of the type function,
we can call both of them the same way (by passing in the argument in parentheses).
For example, if we want to evaluate the function $f(x)$ at the input $x=3$,
we can call f(3) ...
f(3)
9
... or we could define the function $f(x)=x^2$ using an inline lambda expression,
then call the result of the lambda expression by passing in the argument in parentheses.
(lambda x: x**2)(3)
9
The lambda-shortcut for defining functions is not used often,
but sometimes it is very convenient to be able to define a function inline,
so I want you to be familiar with this syntax.
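One place where inline functions come up in practice is the key argument of the built-in function sorted, which expects a function that tells it what to sort by. The sketch below uses made-up data to sort a list of (name, score) tuples by score.

```python
results = [("ana", 72), ("bob", 98), ("cy", 61)]   # made-up data

# The lambda picks out the score (the second element) of each tuple
by_score = sorted(results, key=lambda pair: pair[1])
by_score
```

Without the key argument, sorted would compare the tuples starting from their first element, which would sort the list by name instead.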
Consider the math function $f(x)=x^2$. We'll identify the output of the function as the variable $y$.
In Python, the function $f(x)=x^2$ is
def f(x):
    return x**2
Suppose we want to compute the output values $y=f(x)$
for each $x$ in the list of values $[1,2,3,4]$,
which we'll call xs in the code.
xs = [1,2,3,4]
Option A You can use a for loop to compute the function output $y=f(x)$ for every $x$ in the list of values. First we create an empty list ys to store the outputs, then we .append the values to it one-by-one as we go through the for loop.
ys = []
for x in xs:
    y = f(x)
    ys.append(y)
ys
[1, 4, 9, 16]
Option B We can shorten the code using the list comprehension syntax:
ys = [f(x) for x in xs]
ys
[1, 4, 9, 16]
Option C
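A third alternative would be to use the function map(f,xs),
which applies f to every x in the list xs.
In Python 3, map returns a lazy iterator rather than a list,
so we pass the result to list to obtain the outputs f(x) as a list.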
ys = map(f, xs)
list(ys)
[1, 4, 9, 16]
# ALT. we can specify the function argument to map as a lambda expression
# list(map(lambda x: x**2, xs))
This notion of obtaining ys from xs for entire lists of values,
instead of individual values like x and y, is super useful.
We'll see this idea come up again later on in this tutorial
when we discuss the Python module NumPy,
which allows you to do math operations with "universal functions"
that do the right thing whether you input a number x,
a list of numbers xs, or even more complicated data structures (e.g. two-dimensional matrices, or higher-dimensional tensors).
Notebooks are an example of "interactive" use of the Python interpreter.
You enter a command like 2+3
in a code cell,
press SHIFT+ENTER to run the code,
and you see the result.
There are several different ways you can access the Python interpreter.
python
shell.
This is what you get if you install Python on your computer.
You can open a command prompt (terminal or cmd.exe) and type in the
command python
to start the interactive Python shell.
> python
Python 3.6.9 (default, Oct 6 2019, 21:40:49)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 2+3
5
>>>
ipython
shell.
This is a fancier shell with line numbering and
many helpful commands.
> ipython
Python 3.6.9 (default, Oct 6 2019, 21:40:49)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: 2+3
Out[1]: 5
In [2]:
Jupyter notebooks are web-based coding environments that allow you
to mix code cells and Markdown cells to create "code documents."
Notebook files have an extension .ipynb
and can be created using JupyterLab.
Several other systems like nbviewer, GitHub, VSCode, and Google Colab
can also be used to "open" notebooks for viewing or to "run" them interactively.
Colab notebooks. Google operates a service called "Google Colaboratory" (Colab for short) that allows you to run Python code as Colab notebooks.
Note the "Python calculator" functionality works the same way in each case.
The basic Python shell, the fancy ipython
shell, and the notebook interface
all offer a place to input your commands:
they READ your command input,
EVALUATE it (i.e. run it),
and PRINT the output of the command's execution.
At the end of the READ-EVAL-PRINT steps,
the Python interpreter goes back into "listening mode,"
waiting for your next command input.
The overall behaviour of the Python interpreter is an example of the
READ-EVAL-PRINT Loop (REPL) that appears in professional human-computer interfaces.
Other examples of REPLs include the command line prompt (terminal on UNIX or cmd.exe
on Windows),
database prompts,
the JavaScript console in your browser,
the Ruby interactive console irb
,
and any other interface which accepts commands.
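To make the READ-EVAL-PRINT steps concrete, here is a toy version of the loop written in Python itself. This is only a sketch and not how the real interpreter is implemented: the helper tiny_repl is hypothetical, it reads commands from a list instead of the keyboard, and it uses the built-in function eval, which only handles expressions like 2+3 (not statements like x = 5).

```python
def tiny_repl(commands):
    """Toy READ-EVAL-PRINT loop over a list of command strings (hypothetical helper)."""
    outputs = []
    for command in commands:        # LOOP over the incoming commands
        print(">>>", command)       # READ: show the command we received
        result = eval(command)      # EVAL: run the expression
        print(result)               # PRINT: display the result
        outputs.append(result)
    return outputs

outputs = tiny_repl(["2+3", "2**10", "sum([1, 2, 3])"])
```

Each pass through the for loop corresponds to one round of typing a command and seeing its result.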
Given this multitude of choices, we've opted to use a Jupyter notebook to present this tutorial. Keep in mind you could run all the code examples in the python shell, the ipython shell, or a Colab notebook.
While we're on the topic of running Python code, let's briefly mention the other ways Python applications can operate. This is completely out of scope for the remainder of the discussion in this tutorial, since we're just using Python as a fancy calculator, but I thought I'd mention some of the other uses of Python code.
import io
import pandas as pd
data2 = io.StringIO("""
student_ID,background,curriculum,effort,score
1,arts,,10.96,75
2,science,lecture,8.69,75
""")
df2 = pd.read_csv(data2)
df2
  | student_ID | background | curriculum | effort | score |
---|---|---|---|---|---|
0 | 1 | arts | NaN | 10.96 | 75 |
1 | 2 | science | lecture | 8.69 | 75 |
argparse
)
Tricks:
enumerate: provides an index when iterating through a list.
zip: allows you to iterate over multiple lists in parallel.
enumerate to get for-loop with index
Use enumerate(somelist) to iterate over tuples (index, value), one for each value in the list somelist.
In each iteration, index tells you the position of value in the list.
scores = [61, 79, 98, 72]
list(enumerate(scores))
[(0, 61), (1, 79), (2, 98), (3, 72)]
# example usage
for idx, score in enumerate(scores):
    # this for loop has two variables: idx and score
    print("Processing score", score, "which is at index", idx, "in the list")
Processing score 61 which is at index 0 in the list
Processing score 79 which is at index 1 in the list
Processing score 98 which is at index 2 in the list
Processing score 72 which is at index 3 in the list
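If you would rather count from 1 instead of 0, enumerate accepts an optional start argument. Here is a small sketch that uses the same scores as above; the name numbered is just for illustration.

```python
scores = [61, 79, 98, 72]

# start=1 makes the counter begin at 1 instead of 0
numbered = list(enumerate(scores, start=1))
print(numbered)
```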
zip
Use zip(list1,list2)
to get an iterator over tuples (value1, value2)
,
where value1
and value2
are elements taken from list1
and list2
,
in parallel, one at a time.
The name "zip" is a reference to the way a zipper joins together the teeth on its two sides as it closes.
# example 1
list( zip([1,2,3], ["a","b","c"]) )
[(1, 'a'), (2, 'b'), (3, 'c')]
# example 2
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list(zip(list1, list2))
[(1, 4), (2, 5), (3, 6)]
# compute the sum of the matching values in two lists
for value1, value2 in zip(list1, list2):
    print("The sum of", value1, "and", value2, "is", value1+value2)
The sum of 1 and 4 is 5
The sum of 2 and 5 is 7
The sum of 3 and 6 is 9
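Two more things about zip are worth knowing: it stops as soon as the shorter list runs out, and its pairs can be passed to dict to build a dictionary. The sketch below uses made-up lists names and scores to show both behaviours.

```python
names = ["Julie", "Omar", "Li"]      # hypothetical example data
scores = [98, 72]                    # one score is missing

pairs = list(zip(names, scores))     # zip stops at the end of the shorter list
print(pairs)

lookup = dict(zip(names, scores))    # build a dictionary from the (name, score) pairs
print(lookup)
```

Note that "Li" silently disappears from the results, so it's a good idea to check that your lists have the same length before zipping them.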
functools.partial
for pre-filling some of the arguments of a function (partial application, loosely called currying), e.g. sample-generator callables
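Here is a rough sketch of what this could look like. The function generate_sample and its arguments are hypothetical names invented for this illustration; only functools.partial and random.Random come from the Python standard library. We use partial to fix the parameters mu, sigma, and rng, which gives us a new callable gen that only needs the sample size n.

```python
from functools import partial
import random

def generate_sample(n, mu, sigma, rng):
    """Return a list of n random values from a normal distribution (hypothetical helper)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(42)   # random number generator with a fixed seed

# pre-fill mu, sigma, and rng; the result is a function of n only
gen = partial(generate_sample, mu=100, sigma=15, rng=rng)

sample = gen(5)           # same as generate_sample(5, mu=100, sigma=15, rng=rng)
print(sample)
```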
The term "iterable" is used in Python to refer to any list-like object that can be used in a for
-loop.
Examples of iterables:
range
(lazy generator for lists of integers)
range(0, 4)
range(0, 4)
list(range(0, 4))
[0, 1, 2, 3]
profile = {"first_name":"Julie", "last_name":"Tremblay", "score":98}
list(profile.keys())
['first_name', 'last_name', 'score']
# ALT.
list(profile)
['first_name', 'last_name', 'score']
list(profile.values())
['Julie', 'Tremblay', 98]
list(profile.items())
[('first_name', 'Julie'), ('last_name', 'Tremblay'), ('score', 98)]
We'll talk more about dictionaries later on.
Under the hood, Python uses all kinds of list-like data structures called "iterables". We don't need to talk about these or understand how they work; all you need to know is that they behave like lists.
In the code examples above,
we converted several fancy list-like data structures into ordinary lists,
by wrapping them in a call to the function list
,
in order to display the results.
Let's look at why we need to use list(iterable)
when printing,
instead of just iterable
.
For example,
the set of keys for a dictionary is a dict_keys
iterable object:
profile.keys()
dict_keys(['first_name', 'last_name', 'score'])
type(profile.keys())
dict_keys
I know, right? What the hell is dict_keys
?
I certainly don't want to have to explain that...
... so instead, you'll see this in the code:
list(profile.keys())
['first_name', 'last_name', 'score']
type(list(profile.keys()))
list
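The conversion to a list is only needed for display purposes. As a quick check, the sketch below shows that a for loop and built-in functions like sorted and len work directly on the dict_keys object, without calling list first. The variable shouted is a made-up name for this example.

```python
profile = {"first_name": "Julie", "last_name": "Tremblay", "score": 98}

# a for loop can go through the dict_keys iterable directly
shouted = []
for key in profile.keys():
    shouted.append(key.upper())
print(shouted)

# other functions that accept iterables work too
print(sorted(profile.keys()))
print(len(profile.keys()))
```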
functions with *args
and **kwargs
arguments
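As a brief sketch of this syntax: a function defined with *args accepts any number of positional arguments, which it receives as a tuple, and **kwargs collects any keyword arguments into a dictionary. The function describe below is a hypothetical example, not something from a library.

```python
def describe(*args, **kwargs):
    """Hypothetical function that accepts any number of arguments."""
    # args is a tuple that contains the positional arguments
    # kwargs is a dictionary that contains the keyword arguments
    return len(args), sorted(kwargs.keys())

result = describe(1, 2, 3, name="Julie", score=98)
print(result)
```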
Coding (a.k.a. programming, software engineering, or software development) is a broad topic that is out of scope for this short tutorial. If you're interested in learning more about coding, see the article What is code? by Paul Ford. Think mobile apps, web apps, APIs, algorithms, CPUs, GPUs, SysOps, etc. Basic coding skills open the door to many applications; they're almost like reading and writing skills.
Learning programming usually takes several years, but you don't need to become a professional coder to start using Python for simple tasks, the same way you don't need to become a professional author to use writing for everyday tasks. If you reached this far in the tutorial, you know enough about basic Python to continue your journey.
In particular, you can read the other two tutorials that appear in the No Bullshit Guide to Statistics:
TODO functions are verbs
In Python, we generally prefer to use more descriptive names (whole words) for function names and their inputs, as illustrated in the next example.
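Here is one possible illustration of the difference, which contrasts the math-style names f and x with a function whose name says what it does. The function compute_mean and the variable mean_score are names I made up for this sketch.

```python
# short, math-style names
def f(x):
    return x**2

# descriptive names for the function and its input
def compute_mean(values):
    return sum(values) / len(values)

scores = [61, 79, 98, 72]
mean_score = compute_mean(scores)
print(mean_score)
```

Someone reading compute_mean(scores) can guess what the code does without looking up the function definition.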
Exercise 13: Replace the ...
s in the following code cell
with comments that explain the calculation
"adding 10% tax to a purchase that costs $57"
that is being computed.
cost = 57.00 # ...
taxes = 0.10 * cost # ...
total = cost + taxes # ...
total # ...
62.7
#@titlesolution Exercise 13 cost-plus-taxes-total
cost = 57.00 # price before taxes
taxes = 0.10 * cost # 10% taxes = 0.1 times the cost
total = cost + taxes # add taxes to cost and store the result in total
total # print the total
62.7
All the Python variables we've been using until now are different kinds of "objects." An object is the most general-purpose "container" for data, which also provides functions for manipulating the data it holds.
In particular:
msg = "Hello world!"
type(msg)
str
# Uncomment the next line and press TAB after the dot
# msg.
# Methods:
msg.upper()
msg.lower()
msg.__len__()
msg.isascii()
msg.startswith("He")
msg.endswith("!")
True
filename = "message.txt"
file = open(filename, "w")
type(file)
_io.TextIOWrapper
# Uncomment the next line and press TAB after the dot
# file.
# Attributes:
file.name
file.mode
file.encoding
'UTF-8'
# Methods:
file.write("Hello world\n")
file.writelines(["line2", "and line3."])
file.flush()
file.close()
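To check what these write calls produce, we can read the file back. The sketch below repeats the same writes, but makes two changes that are my own choices and not part of the example above: it saves the file in a temporary folder so it won't overwrite anything on your computer, and it opens the file inside a with statement, which closes the file automatically.

```python
import os
import tempfile

# write to a temporary folder so we don't overwrite any real files
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "message.txt")

with open(filename, "w") as file:               # file is closed when the block ends
    file.write("Hello world\n")
    file.writelines(["line2", "and line3."])    # note: no newlines are added

with open(filename, "r") as file:
    contents = file.read()

print(repr(contents))
```

Note that writelines does not insert line breaks between the strings, so you have to include the "\n" yourself if you want each string on its own line.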
Let's go over some of the things we skipped in the tutorial, because they were not essential for getting started. Now that you know a little bit about Python, it's worth mentioning some of these details, since it's useful context to see how this "Python calculator" business works.