Python is now the second most popular language on GitHub, behind only JavaScript.
PYPL rankings of some of the most popular languages for data science. Quote: "Worldwide, Python is the most popular language, Python grew the most in the last 5 years (6.9%)" (March 2023)
Python is an interpreted language. When we talk about Python, we usually mean CPython, which is not even Just In Time (JIT) compiled; it's purely interpreted.
TLDR: Python is slow.
Hundreds to thousands of times slower than C/C++/Fortran/Go/Swift/Rust/Haskell... You get the point.
Python is like a car. Compiled languages are like a plane.
So why use it?
If you want to get to South America, the fastest way to do so is take a car to get to the airport to catch a plane.
The same idea applies to Python and compiled languages. You can do the big, common, easy tasks in compiled code, and steer them with Python.
And, as you'll see today, that's easier than you think!
We may not go through these in order.
Before we finish, please complete the survey. We will give you some time near the end to fill it out.
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
5-minute pause: Please look through the following text, or ask for help getting set up. We will go over this quickly after the pause. Most of it should be review, except for some ufunc specifics.
Python lists/tuples can contain any Python object, so they waste memory and scatter their contents across it:
import math
import numpy as np
lst = [1, "hi", 3.0, "🤣"]
Each Python object stores at least a type and a reference count. They can be different sizes, so Python has to chase pointers down to get them. NumPy introduced an array class:
arr = np.array([1, 2, 3, 4])
The array object is a normal Python object (with refcounts and such), but the items inside it are stored tightly packed in memory, with a single "dtype" for all the data. You can use dtype=object, but with any other dtype this is much nicer than pure Python for larger amounts of data. All the standard datatypes are present, rather than just the 64-bit float and unlimited-precision int that regular Python provides.
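As a rough illustration of the packed layout described above, the following sketch compares the size of one Python int object with the per-item storage of an array. The variable names are made up for this example, and the size of a Python int is a CPython implementation detail, so it is printed rather than stated.

```python
import sys

import numpy as np

lst = [1, 2, 3, 4]
arr = np.array(lst, dtype=np.int64)

# Every Python int is a full object (type pointer, refcount, value).
per_object = sys.getsizeof(lst[0])

# The array stores raw 8-byte integers back to back.
print(arr.dtype, arr.itemsize, arr.nbytes)
print("bytes per Python int object:", per_object)

# A smaller dtype shrinks the storage further.
small = arr.astype(np.int8)
print(small.dtype, small.nbytes)
```

The list additionally stores one pointer per element, which is not counted here.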
NumPy provides "array" processing, where operations and functions are applied to whole arrays rather than in Python loops. This usually lets the looping happen in a compiled language, skipping the type lookups and other per-element work Python would have to do. To facilitate this, NumPy introduced UFuncs, Generalized UFuncs, and functions that operate on arrays. NumPy also helped Python 3 come up with a memory buffer interface for communicating data structures between libraries without requiring NumPy, and added an overload system for UFuncs (1.13) and later for array functions (enabled by default in 1.17).
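The buffer interface mentioned above can be sketched with the standard-library array module standing in for "another library"; that choice is an assumption made for this example, not something the course prescribes. np.frombuffer wraps the exported memory without copying it.

```python
import array

import numpy as np

# A standard-library array; it knows nothing about NumPy.
std = array.array("d", [1.0, 2.0, 3.0])

# Wrap the memory exported through the buffer interface (no copy).
view = np.frombuffer(std, dtype=np.float64)
view[0] = 10.0  # writes through to the original storage

print(std[0])

# NumPy arrays export the same interface in the other direction.
mv = memoryview(view)
print(mv.format, mv.shape)
```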
Out of all of that, let's peek at a UFunc:
vals = np.linspace(0, np.pi, 9)
# Ufunc: np.sin
print(np.sin(vals))
np.sin is a UFunc. It can be called on an array of any dimension, and it returns an array of the same dimensionality, with the function (sin, in this case) applied to each element. If it took multiple arguments, each could be N-dimensional, and the output would be the broadcast combination of the inputs (it fails if the shapes are not compatible). There is a set of standard arguments, such as out= (use an existing array for the output), where= (mask items), casting, order, dtype, and subok. You can also call a set of standard methods, such as accumulate, at, outer, reduce, and reduceat, though some do not work on all ufuncs. There are some properties, too.
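A short sketch of the broadcasting rule and a few of the ufunc methods and properties, using np.add and np.multiply as two-argument examples. The arrays col and row are invented for this illustration.

```python
import numpy as np

col = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3])          # shape (3,)

# Two inputs broadcast to a common (3, 3) shape.
grid = np.add(col, row)
print(grid.shape)

# reduce collapses an axis; accumulate keeps the running result.
total = np.add.reduce(row)         # same as row.sum()
running = np.add.accumulate(row)   # same as row.cumsum()
print(total, running)

# outer applies the ufunc to every pair of elements.
table = np.multiply.outer(row, row)
print(table)

# Properties: number of inputs and outputs.
print(np.add.nin, np.add.nout)

# Incompatible shapes raise an error instead of guessing.
try:
    np.add(np.ones(3), np.ones(4))
except ValueError as err:
    print("broadcast failed:", err)
```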
Let's use out= to pre-allocate our own output:
vals = np.linspace(0, np.pi, 9)
out = np.empty_like(vals)
np.sin(vals, out=out)
print(out)
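The where= argument combines naturally with out=. In this sketch (the input values are invented for the example), np.sqrt is only evaluated where the mask is true, and the remaining positions keep whatever the output array already held, so the output is pre-filled with zeros rather than left uninitialized.

```python
import numpy as np

vals = np.array([4.0, -1.0, 9.0, -16.0])

# Pre-fill the output: positions skipped by where= are left untouched.
out = np.zeros_like(vals)
np.sqrt(vals, out=out, where=vals >= 0)
print(out)
```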
The operators on arrays, along with most of the methods on arrays, are actually ufuncs and array functions defined elsewhere in NumPy:
out_simple = vals + vals
out_inplace = np.empty_like(vals)
np.add(vals, vals, out=out_inplace)
np.testing.assert_array_equal(out_simple, out_inplace)
We will treat the simple form, array manipulation with basic operations, as the baseline. There is a "simpler" baseline, or maybe just an older one: explicit loops over arrays. I think most people who have learned Python in the last few years start quite early with array programming, so that is the most familiar form, and we will start there.
# Array looping method, do not use
vals = np.linspace(0, np.pi, 9)
out = []
for val in vals:
    out.append(math.sin(val))
print(out)
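To compare the loop with the ufunc, one option is a quick timeit measurement like the sketch below. The helper loop_sin and the array size are choices made for this example, and the timings depend on the machine, so none are quoted here.

```python
import math
import timeit

import numpy as np

vals = np.linspace(0, np.pi, 100_000)


def loop_sin(values):
    """Element-by-element loop through the Python interpreter."""
    out = []
    for val in values:
        out.append(math.sin(val))
    return out


# Both approaches give the same numbers, up to floating-point rounding.
np.testing.assert_allclose(loop_sin(vals), np.sin(vals), atol=1e-12)

# Time five runs of each approach.
t_loop = timeit.timeit(lambda: loop_sin(vals), number=5)
t_ufunc = timeit.timeit(lambda: np.sin(vals), number=5)
print(f"loop: {t_loop:.4f} s, ufunc: {t_ufunc:.4f} s")
```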
I am part of Scikit-HEP, a project to build tools for High Energy Physicists in Python. Some of the projects are applicable outside of HEP:
Other projects I am a developer on:
C++ 11 14 17 20 • macOS Setup (AS) • Azure DevOps (Python Wheels) • Conda-Forge ROOT • CLI11 • GooFit • cibuildwheel • Hist • Python Bindings • Python 2→3, 3.7, 3.8, 3.9, 3.10, 3.11 • SSH
Modern CMake • CompClass • se-for-sci
CMake Workshop • Python CPU, GPU, Compiled minicourses • Level Up Your Python • Packaging (WIP)
pybind11 (python_example, cmake_example, scikit_build_example) • cibuildwheel • build • scikit-build (core, cmake, ninja, moderncmakedomain) • boost-histogram • Hist • UHI • Scikit-HEP/cookie • Vector • CLI11 • Plumbum • GooFit • Particle • DecayLanguage • Conda-Forge ROOT • POVM • Jekyll-Indico • pytest GHA annotate-failures • uproot-browser • Scikit-HEP-repo-review • meson-python • flake8-errmsg • beautifulhugo
ISciNumPy • IRIS-HEP • Scikit-HEP (Developer pages) • CLARIPHY
Jim taught earlier iterations of this mini-course, and his materials are great: