Prerequisites
Outcomes
Welcome to the start of your path to learning how to work with data in the Python programming language!
A programming language is, loosely speaking, a structured subset of natural
language (words) and special characters (e.g. ,
or {
) that allow humans
to describe operations they would like their computer to perform on their behalf.
The programming language translates these words and symbols into instructions the computer can execute.
Among the hundreds of programming languages to choose from, we chose to teach you Python for the following reasons:
However, the general purpose nature of Python comes at a cost: it is often said that Python is “the best language for nothing but the second best language for everything”.
We aren’t sure this is true, but a more optimistic view of that quote is that Python is a great language to have in your toolbox to solve all sorts of problems and patch them together.
A versatile “second-best” language might be the best one to learn first.
Some other languages to consider:
Another consideration for programming language choice is runtime performance. On this dimension, Python, R, and Matlab can be slow for certain types of tasks.
Luckily, this will not be an issue for data science and the types of analysis we will do in this course, because most of the data analytics packages in Python (and R) rely on high-performance code written in other languages in the background.
If you are writing more traditional scientific/technical computing in Python, there are things that can help make Python faster in some situations, but another language like Julia may be a better fit.
Software development has changed radically in the last decade, increasingly becoming a process of stitching together both established high quality libraries, and state-of-the-art research projects.
A major disadvantage of Matlab, Stata, and other proprietary languages is that they are not open-source, and unable to work within this new paradigm.
Forgetting the cost for a moment, the benefits of using an open-source language are pragmatic rather than ideological.
These materials are meant to be interacted with, not passively read.
To help you do this, we use a software called Jupyter and files known as Jupyter notebooks which allow us to bundle a mixture of text, code, and code output together.
In fact, right now you are either directly reading a Jupyter notebook or a website that was generated from a Jupyter notebook.
We will refer to two components of Jupyter’s software: JupyterLab and Jupyter Notebook.
JupyterLab
JupyterLab is a software that runs in your browser and allows you to do a variety of things such as: edit text, view files, and (most importantly) work with Jupyter notebooks.
Jupyter Notebook
This is the actual file that allows you to mix code and text.
The content inside a Jupyter notebook is organized into cells.
Cells can have inputs and outputs.
There are two main types of cells:
Below is an image that demonstrates what a Jupyter Notebook looks like:
Notice a few things about this image:
[ ]:
to the left of them and have a darker background than the
surrounding area.[ ]:
box and have no
corresponding output.[#]:
to the left of them (where #
is a number) and, depending
on what the code in that cell does, may or may not have an output.Being able to include both text and code allows us to do interesting computations and explain them.
This combination has caused leading companies like Netflix and Bloomberg to adopt Jupyter as a tool of choice for data analytics and reporting.
We will follow in their path and leverage Jupyter Notebook throughout these materials.
The interactivity of a Jupyter notebook is driven by two main components:
We will edit the content of the notebook and request code execution from the web GUI.
The Jupyter application will then ask the server to execute the code and send the results back to the GUI.
You can choose to interact with these materials from two server environments:
Cloud Computing
A cloud solution provides a pre-installed environment for you.
As long as you have an internet connection, you will be able to interact with the lectures through a few cloud computing options.
We try to ensure that you’ll be able to run any of the lectures from each of the cloud options, but because these services are hosted by others, we cannot provide any guarantees.
Using the cloud is a great option if
out without any additional commitment.
with our lectures — we often take this route with our colleagues over coffee (or other conversation stimulants).
If you would like to work on these lectures from the cloud, please read the instructions for getting set up with cloud computing.
These instructions describe several of the possible computing environments that you can choose, discuss their pros and cons, and explain what JupyterHub is.
After reading the instructions, return to this page and proceed with the Jupyter Basics section.
Local Installation
With a local installation, you will install the required software onto your own computer.
This is typically a straightforward task and, once you have done this, you will be able to run the lecture code (and any other code you write for your own projects!) on your personal computer.
If you are confident that these are skills you would like to acquire and are willing to have the software installed on your computer, then this is a great option.
If you would like to work from a local installation, please read the instructions in local installation instructions page.
These instructions will walk you through the installation procedure, help with some basic setup, and show you how to open JupyterLab.
Once you have completed installing the software, return to this page and proceed with the Jupyter Basics section.
Now that you can open a JupyterLab instance on the cloud or on your own computer, we can talk about how you should use them.
Note, not all of this will apply if you are using the Google Colab cloud server since they are running a modified version of Jupyter. For more help on using Colab, see their help menu.
JupyterLab Dashboard
When you open a new session in Jupyter, you will be taken to the JupyterLab dashboard page.
This page shows the file system of the machine running the Jupyter server and allows you to navigate and open particular Jupyter Notebooks (or other types of files!).
The dashboard page is shown below.
You can open existing files or change folders by double clicking on them in the left panel of the dashboard (similar to a file explorer you would find on your computer).
You can create new notebooks by clicking Python 3
in the Launcher (see red square in the image).
If you don’t see the Launcher as one of your tabs, you can open it by clicking the +
at the top
of the file explorer section of the JupyterLab dashboard (see the red circle in the image).
Editing Jupyter Notebooks
Once you have opened a particular notebook, you can be in one of two “edit modes”.
itself. For example changing the order of cells, creating a new cell, etc…
a
adds a new cell above the current cell, b
adds one below the current
cell, and dd
deletes the current cell.k
) changes the selected cell to the cell above the
current one and down arrow (or j
) changes to the cell below.Some useful commands:
Shift + Enter
(meaning Shift
and Enter
at
the same time)See exercise 1 in the exercise list.
Advanced Usage and Getting Help
For more help with JupyterLab and Jupyter Notebook, see the user guides:
In the code cell below (notice the [ ]:
to the left) type a quote ("
), your name,
then another quote ("
) and evaluate the cell
# code here!