Machine Learning and Statistics for Physicists¶

Material for a UC Irvine course offered by the Department of Physics and Astronomy.

Content is maintained on GitHub and distributed under a BSD3 license.

This notebook can optionally be viewed as a slide presentation. Click here to view the slides online or, to present the slides locally, use:

jupyter nbconvert Intro.ipynb --to slides --post serve

Introduction¶

ACTIVITY: Discuss these questions:

What is the relationship between machine learning and statistics?
Does your research focus more on data or models?
What is a data scientist?
What is "deep" about deep learning?

What is "Machine Learning"?¶

Using machines to learn how to explain data with models.

The "machines" responsible for most of the progress in ML are:

software algorithms
hardware architectures
human ingenuity

The "learning" consists of passively identifying statistical correlations, which is very different from how we learn with active experimentation and identifying causal relationships.

MLS-triangle1

What is "Machine Learning"?¶

Machine learning uses models to learn from data.

MLS-triangle2

What is Data?¶

Data is (are?) a finite set of measurements:

Usually viewed as a 2D table e.g., spreadsheet, FITS table, Pandas dataframe...
columns = features
rows = samples (observations)
richer data structures (images, ROOT trees, etc.) must be flattened.

data-table
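The table layout described above can be sketched with NumPy. This is a minimal illustration, not part of the original notebook: the image stack, its 8 x 8 size, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical dataset: 100 grayscale images, each 8 x 8 pixels.
images = rng.normal(size=(100, 8, 8))

# Flatten each image so that every pixel becomes one feature (column)
# and every image becomes one sample (row).
table = images.reshape(len(images), -1)

n_samples, n_features = table.shape
print(n_samples, n_features)
```

Flattening discards the explicit 2D neighborhood structure of the pixels, which is one reason the choice of model matters for image data.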

Questions to ask about your data:

Are my features categorical / discrete / continuous?
Is the ordering of my samples significant?
Are my samples statistically independent? drawn from the same distribution?
What are my measurement uncertainties?
Is my data binned / un-binned?
Is there a natural similarity / distance measure on my samples (rows)?
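The last question, about a distance measure on rows, can be made concrete with a small sketch. The table values and the helper `pairwise_distances` are hypothetical; the point is only that a Euclidean distance depends on how each feature is scaled.

```python
import numpy as np

# Hypothetical table: 4 samples (rows) x 2 continuous features (columns)
# recorded in very different units.
X = np.array([
    [1.0, 1000.0],
    [2.0, 1010.0],
    [1.5, 3000.0],
    [9.0, 1005.0],
])

def pairwise_distances(table):
    """Euclidean distance between every pair of rows."""
    diff = table[:, np.newaxis, :] - table[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Raw distances are dominated by the feature with the largest numerical range.
d_raw = pairwise_distances(X)

# Standardize each column to zero mean and unit variance before comparing rows.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std = pairwise_distances(X_std)

print(np.round(d_raw, 1))
print(np.round(d_std, 2))
```

In this toy table, sample 3 is closer to sample 0 than sample 2 is when raw units are used, but farther once the columns are standardized, so "similarity" is a modeling choice rather than a property of the data alone.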

ACTIVITY: Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.

Learn a fast approximation to a slow exact calculation.
Learn to identify Higgs particle decays from LHC event data.
Learn to estimate the distance to a quasar using optical images.

What is a Model?¶

Two important types of models: generative and probabilistic.

All ML algorithms use a model to explain your data.

Models have parameters.
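As a sketch of a model with parameters, consider a Gaussian with mean `mu` and width `sigma`. The example assumes one common reading of the two terms, which is not spelled out in the text: a generative model can produce simulated data, and a probabilistic model can assign a probability density to data. The function names are hypothetical.

```python
import numpy as np

def generate(mu, sigma, n, rng):
    """Generative use: draw n simulated measurements given the parameters."""
    return rng.normal(loc=mu, scale=sigma, size=n)

def log_likelihood(data, mu, sigma):
    """Probabilistic use: log probability density of the data given the parameters."""
    z = (data - mu) / sigma
    return np.sum(-0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi))

rng = np.random.default_rng(seed=1)
data = generate(mu=2.0, sigma=0.5, n=500, rng=rng)

# Parameters close to those used to generate the data should receive a
# higher log-likelihood than parameters far from them.
print(log_likelihood(data, mu=2.0, sigma=0.5))
print(log_likelihood(data, mu=3.0, sigma=0.5))
```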

models1

Models can explain data and parameters.

Models have parameters and hyper-parameters.
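One way to picture the distinction, offered as an illustrative assumption rather than the course's definition: in a polynomial fit, the degree is a hyperparameter fixed before fitting, while the coefficients are parameters learned from the data. The data and the helper `fit_polynomial` below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical data: noisy samples of a quadratic.
x = np.linspace(-1, 1, 50)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.size)

def fit_polynomial(x, y, degree):
    """Fit coefficients (parameters) for a fixed degree (hyperparameter)."""
    coefs = np.polyfit(x, y, deg=degree)
    rms = np.sqrt(np.mean((y - np.polyval(coefs, x)) ** 2))
    return coefs, rms

for degree in (1, 2, 5):
    coefs, rms = fit_polynomial(x, y, degree)
    print(degree, coefs.size, rms)
```

The RMS residual on the fitted data cannot increase as the degree grows, so it is not a useful guide for choosing the degree by itself; that choice needs a separate criterion.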

models2

What is Learning?¶

Three broad types of learning:

Unsupervised: learn to predict new data.
- Given data: what patterns are present? (learn a model).
- Given data and model: how likely is new data to be from same model? (generate new data).
Supervised: learn to predict specific features of new data.
- Classification: predict discrete features (learn a conditional model).
- Regression: predict continuous features (learn a conditional model).
Inference: explain observed data.
- Assuming a model: what parameters (with what uncertainties) best describe my data? (learn a model).
- Given competing models: which best describes my data? (model selection).
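A minimal sketch of supervised classification, using a nearest-centroid rule chosen here only for brevity (it is not a method named in the text). The simulated two-class data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical labeled data: two classes of 2D samples with different means.
n = 200
labels = rng.integers(0, 2, size=n)
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
features = centers[labels] + rng.normal(size=(n, 2))

# Hold out the last 50 samples to play the role of "new data".
X_train, y_train = features[:150], labels[:150]
X_test, y_test = features[150:], labels[150:]

# Learning: estimate one centroid per class from the training samples.
centroids = np.array([X_train[y_train == k].mean(axis=0) for k in (0, 1)])

# Prediction: assign each new sample to the class of the nearest centroid.
dist = np.linalg.norm(X_test[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=-1)
y_pred = dist.argmin(axis=1)

accuracy = np.mean(y_pred == y_test)
print(accuracy)
```

The label column is the "specific feature" being predicted; replacing it with a continuous target would turn the same setup into a regression problem.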

(Also: reinforcement learning.)

What is special about ML in Physics and Astronomy?¶

Scientific applications of ML benefit a lot from advances in industry but we work in a different context:

We are data producers, not data consumers:
- Experiment / survey design.
- Optimization of statistical errors.
- Control of systematic errors.
Our data measures physical processes:
- Measurements often reduce to counting photons, etc., with random errors known a priori.
- Dimensions and units are important.
Our models are usually traceable to an underlying physical theory:
- Models constrained by theory and previous observations.
- Parameter values often intrinsically interesting.
A parameter uncertainty estimate is just as important as its value:
- Prefer methods that handle input data uncertainties (weights) and provide output parameter uncertainty estimates.
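The last point can be sketched with a weighted linear least-squares fit written out in NumPy. This assumes independent Gaussian errors with known `sigma` per measurement; the straight-line model, the simulated data, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Hypothetical measurements y(x) with known, unequal uncertainties sigma.
x = np.linspace(0.0, 10.0, 20)
sigma = rng.uniform(0.2, 1.0, size=x.size)
true_slope, true_intercept = 1.5, -2.0
y = true_slope * x + true_intercept + sigma * rng.normal(size=x.size)

# Weighted linear least squares for y = slope * x + intercept.
A = np.column_stack([x, np.ones_like(x)])   # design matrix
W = np.diag(1.0 / sigma ** 2)               # inverse-variance weights
cov = np.linalg.inv(A.T @ W @ A)            # parameter covariance matrix
params = cov @ A.T @ W @ y                  # best-fit (slope, intercept)
errors = np.sqrt(np.diag(cov))              # 1-sigma parameter uncertainties

for name, value, err in zip(("slope", "intercept"), params, errors):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

The input uncertainties enter as weights, and the output includes a covariance matrix, so both halves of the preference stated above are visible in one short calculation.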

How will this course be different from a CS class?¶

Physics and astronomy students have different preparation:

Strong background and experience with mathematical tools (linear algebra, multivariate calculus) needed for rigorous discussion of statistics.
Weak / varied background in traditional CS core topics such as fundamental algorithms, databases, etc.

Physics and astronomy research also has different needs:

Our data and models are often fundamentally different from those in typical CS contexts.
We ask different types of questions about our data, sometimes requiring new methods.
We have different priorities for judging a "good" method: interpretability, error estimates, etc.

Topics Overview¶

outline

Exercise¶

One of the first tasks when applying machine learning to a new problem is to establish some baselines for the expected performance:

How well does the simplest possible (non-ML) approach work?
What is the current "state of the art"?
If applicable: what is the "human performance" level?
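For a classification task, one possible "simplest non-ML approach" is to always predict the most common training label. The sketch below assumes that choice; the label arrays and the helper `majority_baseline` are hypothetical.

```python
import numpy as np

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    values, counts = np.unique(y_train, return_counts=True)
    guess = values[np.argmax(counts)]
    return guess, np.mean(y_test == guess)

# Hypothetical labels: number of sources in each image.
y_train = np.array([1, 2, 2, 3, 2, 1, 2, 2, 3, 2])
y_test = np.array([2, 1, 2, 3, 2])

guess, accuracy = majority_baseline(y_train, y_test)
print(guess, accuracy)
```

Any learned method should beat this number on the same test set before it is worth further attention.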

Let's get a "human performance" baseline for the following supervised-learning problem: How many "sources" are present in an image?

You are now the machines:

I will show you 36 training images so you can learn how to perform this task.
Next, you must classify 12 test images. Enter your responses at https://goo.gl/qi2CEV.
