
Why Python?¶

Python is now the second most popular language on GitHub, after only JavaScript.

GitHub Languages

Jupyter notebook usage has grown by over 100% every year for the last three years! (in 2019)
Still in the top 10 growing languages! (in 2022)

GitHub Language Growth

State of the Octoverse, 2019 - 2022

Why Python?¶

PyPI Languages

PyPL rankings of some of the most popular languages for data science. Quote: "Worldwide, Python is the most popular language, Python grew the most in the last 5 years (6.9%)" (March 2023)

Timeline of Python¶

1994: Python 1.0 released
1995: First array package: Numeric
2003: Matplotlib
2005: Numeric and numarray merged into NumPy
2008: Pandas introduced, Python 3 released
2012: The Anaconda python distribution was born
2014: IPython produces the Jupyter project and notebook
2016: LIGO's discovery was shown in a Jupyter Notebook, and was written in Python
2017: Google releases TensorFlow 1.0
2019: All Machine Learning libraries are primarily or exclusively used through Python
2020: Python 2 died, long live Python 3.6+!
2022: The Faster CPython project provides a 25% speedup in 3.11!

Timeline of Python, key points¶

2005: NumPy¶

Merged two competing codebases, created single ecosystem

2008: Pandas¶

Took on specialized statistics languages (like R) with a library in a general purpose language
Pioneered "Pythonic" shortcuts, breaking down traditional design barriers

2014: Jupyter¶

The notebook format, with code, outputs, and descriptions interleaved, became multilingual

Python vs. a compiled language¶

Python is an interpreted language. When we talk about Python, we usually mean CPython, which is not even Just In Time (JIT) compiled; it's purely interpreted.

TLDR: Python is slow.

Hundreds to thousands of times slower than C/C++/Fortran/Go/Swift/Rust/Haskell... You get the point.

Python is like a car. Compiled languages are like a plane.

So why use it?

A hybrid approach¶

If you want to get to South America, the fastest way to do so is take a car to get to the airport to catch a plane.

Same idea for Python and compiled languages. You can do the big, common, easy tasks in compiled languages, and steer them with Python.

And, as you'll see today, that's easier than you think!

Mini-courses¶

High Performance Python: CPU¶

Today's class
How to make Python code fast without fully leaving Python

High Performance Python: GPU ¶

Using a GPU to accelerate code
Using accelerators to boost your code

Compiled code & Python ¶

How to interface and accelerate with compiled code (mostly C++)

Lessons¶

00 Intro: The introduction
01 Fractal accelerate: A look at a fractal computation, and ways to accelerate it with NumPy changes, numexpr, and Numba.
- 01b Fractal interactive: An interactive example using Numba.
02 Temperatures: A look at reading files and array manipulation in NumPy and Pandas.
03 MCMC: A Markov Chain Monte Carlo generator (and Metropolis generator) in Python and Numba, with a focus on profiling.
04 Runge-Kutta: Implementing a popular integration algorithm in NumPy and Numba.
05 Distributed: An exploration of ways to break up code (fractal) into chunks for multithreading, multiprocessing, and Dask distribution.
06 Jax: A look at Google's JAX.
- 06b Jax: More JAX.
07 Callables: A look at SciPy's LowLevelCallable, and how to implement one with Numba.
08 Pandas COVID data: A further look at Pandas for a COVID dataset.

We may not go through these in order.

Survey¶

Before we finish, please complete the survey. We will give you some time near the end to fill it out.

In [ ]:

%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'

Background¶

5-minute pause: Please look through the following text, or ask for help getting set up. We will go over this quickly after the pause. Most of it should be review, except for some ufunc specifics.

Python lists/tuples can contain any Python object, so they waste memory and have a scattered memory layout:

In [ ]:

import math

import numpy as np

In [ ]:

lst = [1, "hi", 3.0, "🤣"]

Each Python object stores at least a type and a reference count. They can be different sizes, so Python has to chase pointers down to get them. NumPy introduced an array class:

In [ ]:

arr = np.array([1, 2, 3, 4])

The array object is a normal Python object (with refcounts and such), but the items inside it are stored nicely packed in memory, with a single "dtype" for all the data. You can use dtype=object, but if it is anything else, this is much nicer than pure Python for larger amounts of data. All the standard datatypes are present, rather than the simple 64-bit float and unlimited int that regular Python provides.
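
As a rough illustration of the difference in storage, the sketch below compares a list of floats with the equivalent array. The names `floats`, `packed`, and `small` are made up for this example, and the exact byte counts reported by `sys.getsizeof` depend on the Python build, so none are quoted.

```python
import sys

import numpy as np

# A list stores pointers to separate, full Python objects.
floats = [1.0, 2.0, 3.0, 4.0]
list_bytes = sys.getsizeof(floats) + sum(sys.getsizeof(x) for x in floats)

# An array stores the raw values back to back, with one dtype for all of them.
packed = np.array(floats)
print(packed.dtype, packed.itemsize, packed.nbytes)

# Other fixed-width types are available too, such as 8-bit integers.
small = np.array([1, 2, 3, 4], dtype=np.int8)
print(small.dtype, small.nbytes)

print(f"list plus its floats: {list_bytes} bytes; array data: {packed.nbytes} bytes")
```

The array still carries a fixed overhead for the array object itself, so the savings only become significant for larger amounts of data.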

NumPy provides "array" processing, where operations and functions are applied to whole arrays rather than in loops. This usually lets the looping happen in a compiled language, skipping the type lookups and such that Python would have to do. To facilitate this, NumPy introduced UFuncs, Generalized UFuncs, and functions that operate on arrays. The NumPy developers also helped Python 3 come up with a memory buffer interface for sharing data between libraries without depending on NumPy, and an overload system for UFuncs (1.13) and later array functions (1.17).
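
A minimal sketch of the memory buffer interface, assuming the standard library's `array.array` as the non-NumPy side. The names `raw`, `view`, and `mv` are only for illustration.

```python
import array

import numpy as np

# array.array is a standard-library container that exposes the buffer protocol.
raw = array.array("d", [1.0, 2.0, 3.0])

# NumPy wraps the same memory without copying it.
view = np.frombuffer(raw, dtype=np.float64)
view[0] = 10.0
print(raw[0])  # the change is visible through the original object

# Going the other way, memoryview reads a NumPy array's layout.
mv = memoryview(np.zeros((2, 3), dtype=np.float64))
print(mv.format, mv.shape, mv.itemsize)
```

Neither side needs to know the other's class; they only agree on the buffer description (format, shape, item size).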

Out of all of that, let's peek at a UFunc:

In [ ]:

vals = np.linspace(0, np.pi, 9)

# Ufunc: np.sin
print(np.sin(vals))

np.sin is a UFunc. It can be called on an array of any dimension, and it returns an array of the same shape, with the function (sin, in this case) applied to each element. If it took multiple arguments, each could be N-dimensional, and the output would be the broadcast combination of the inputs (failing if they are not compatible). There is a set of standard arguments, such as out= (use an existing array for the output), where= (mask items), casting, order, dtype, and subok. You can also call a set of standard methods, such as accumulate, at, outer, reduce, and reduceat, though some do not work on all ufuncs. There are some properties, too.
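
A short sketch of broadcasting, where=, and a few of the ufunc methods, using np.add, np.sqrt, and np.multiply. The array names are made up for this example.

```python
import numpy as np

col = np.array([[0.0], [10.0], [20.0]])  # shape (3, 1)
row = np.array([1.0, 2.0, 3.0, 4.0])  # shape (4,)

# Binary ufunc: the inputs broadcast to a common shape, here (3, 4).
grid = np.add(col, row)
print(grid.shape)

# where= computes only the selected items; out= supplies the values for the rest.
masked = np.sqrt(row, where=row > 2, out=np.zeros_like(row))
print(masked)

# Standard ufunc methods.
total = np.add.reduce(row)  # sum of all items
running = np.add.accumulate(row)  # running sum
table = np.multiply.outer(row, row)  # every pairwise product
print(total, running, table.shape)
```

Without out=, the items skipped by where= are left uninitialized, which is why the sketch passes a zero-filled array.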

Let's use out= to pre-allocate our own output:

In [ ]:

vals = np.linspace(0, np.pi, 9)
out = np.empty_like(vals)
np.sin(vals, out=out)
print(out)

The operators on arrays, along with most of the methods on arrays, are actually ufuncs and array functions defined elsewhere in NumPy:

In [ ]:

out_simple = vals + vals

out_inplace = np.empty_like(vals)
np.add(vals, vals, out=out_inplace)

np.testing.assert_array_equal(out_simple, out_inplace)

We will treat the simple form, array manipulation with the basic operators, as the baseline. There is a "simpler" baseline, or maybe just an older one: explicit loops over arrays. I think most people who have learned Python in the last few years start quite early with array programming, so that is the more familiar form, and we will start there.

In [ ]:

# Array looping method, do not use

vals = np.linspace(0, np.pi, 9)
out = []
for val in vals:
    out.append(math.sin(val))
print(out)
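
To see why the loop is discouraged, the sketch below checks that both approaches agree and then times them on a larger array. The helper `loop_sin` and the array size are choices made for this example; timings vary by machine, so none are quoted.

```python
import math
import timeit

import numpy as np

samples = np.linspace(0, np.pi, 100_000)


def loop_sin(values):
    """Pure-Python loop: one math.sin call and one append per element."""
    result = []
    for value in values:
        result.append(math.sin(value))
    return result


# Both approaches should agree to floating-point precision.
np.testing.assert_allclose(loop_sin(samples), np.sin(samples), rtol=1e-7, atol=1e-12)

# Time three runs of each version.
loop_time = timeit.timeit(lambda: loop_sin(samples), number=3)
ufunc_time = timeit.timeit(lambda: np.sin(samples), number=3)
print(f"loop: {loop_time:.4f} s, ufunc: {ufunc_time:.4f} s")
```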

Interesting projects¶

I am part of Scikit-HEP, a project to build tools for High Energy Physicists in Python. Some of the projects are applicable outside of HEP:

AwkwardArray: Jagged array structures
Vector: A package for 2D, 3D, and Lorentz vectors
boost-histogram: A compiled package for powerful, fast histograms in Python
- hist, a package for fast analysis and plotting of histograms (in development)
iMinuit: A powerful minimization package (used in HEP and Astrophysics)

Other projects I am a developer on:

scikit-build: A build backend for CMake code in Python
pybind11: Python Bindings in pure C++11+, no other tool needed!
build: Build wheels and SDists for Python.
cibuildwheel: Build redistributable binary wheels for Python!
Plumbum: A toolkit for bash-like scripting in Python
CLI11: A command line parser for C++11
