#!/usr/bin/env python
# coding: utf-8

# # Machine Learning and Statistics for Physicists

# Material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).
#
# Content is maintained on [GitHub](https://github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).
#
# ##### ► [View table of contents](Contents.ipynb)

# This notebook can optionally be viewed as a [slide presentation](https://medium.com/learning-machine-learning/present-your-data-science-projects-with-jupyter-slides-75f20735eb0f). Click [here](https://nbviewer.jupyter.org/format/slides/github/dkirkby/MachineLearningStatistics/blob/master/notebooks/Intro.ipynb#/) to view the slides online or, to present the slides locally, use:
# ```
# jupyter nbconvert Intro.ipynb --to slides --post serve
# ```

# ## Introduction

# **ACTIVITY:** Discuss these questions:
# 1. What is the relationship between *machine learning* and *statistics*?
# 2. Does your research focus more on *data* or *models*?
# 3. What is a *data scientist*?
# 4. What is "deep" about *deep learning*?

# ### What is "Machine Learning"?
#
# Using **machines** to **learn** how to explain data with models.

# ### What is "Machine Learning"?
#
# Using **machines** to **learn** how to explain data with models.
#
# The "machines" responsible for most of the progress in ML are:
# - software algorithms
# - hardware architectures
# - human ingenuity
#
# The "learning" consists of passively identifying statistical correlations, which is very different from how we learn, through active experimentation and the identification of causal relationships.

# ### What is "Machine Learning"?
#
# Using machines to learn how to explain **data** with **models**.
#
# ![MLS-triangle1](img/Intro/MLS-triangle1.png)

# ### What is "Machine Learning"?
#
# Machine learning uses models to learn from data.
#
# ![MLS-triangle2](img/Intro/MLS-triangle2.png)

# Further reading:
# - [Data mining and statistics: what's the connection?](http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf)
# - [The rise of the "data engineer"](https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603)
# - [Humorous contrasts between ML and Stats](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)
#   - Python $\leftrightarrow$ R
#   - conference talk $\leftrightarrow$ journal article

# ## What is Data?
#
# Data is (are?) a finite set of measurements:
# - Usually viewed as a 2D table, e.g., spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
#   - **columns = features**
#   - **rows = samples** (observations)
# - Richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.
#
# ![data-table](img/Intro/data-table.png)

# ## What is Data?
#
# Data is (are?) a finite set of measurements:
# - Usually viewed as a 2D table, e.g., spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
#   - **columns = features**
#   - **rows = samples** (observations)
# - Richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.
#
# Questions to ask about your data:
# - Are my features categorical / discrete / continuous?
# - Is the ordering of my samples significant?
# - Are my samples statistically independent and drawn from the same distribution?
# - What are my measurement uncertainties?
# - Is my data binned / un-binned?
# - Is there a natural similarity / distance measure on my samples (rows)?
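
# As a minimal sketch of the table view described above, a stack of small images can be flattened so that each image becomes one row (sample) and each pixel one column (feature). The array shapes and random pixel values are invented for illustration, and Euclidean distance is only one possible answer to the question about a distance measure on rows; it is an assumption here, not a recommendation from the course.

```python
import numpy as np

# Hypothetical data: 5 tiny "images" of 4x4 pixels with random values.
rng = np.random.default_rng(seed=123)
images = rng.normal(size=(5, 4, 4))

# Flatten each image into one row: the result has shape (n_samples, n_features).
X = images.reshape(len(images), -1)
n_samples, n_features = X.shape

# One summary value per column (feature), averaged over all rows (samples).
feature_means = X.mean(axis=0)

# One candidate distance measure on rows: the Euclidean distance between
# every pair of samples, computed with broadcasting.
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

print(n_samples, n_features)
print(distances.shape)
```

# Flattening discards the 2D pixel layout, so whether a row-wise distance like this is "natural" depends on the problem.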

# **ACTIVITY:** Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.
# 1. Learn a fast approximation to a slow exact calculation.
# 2. Learn to identify Higgs particle decays from LHC event data.
# 3. Learn to estimate the distance to a quasar using optical images.

# ## What is a Model?
#
# Two important types of models: generative, probabilistic.
#
# All ML algorithms use a model to explain your data.
#
# Models have parameters.
#
# ![models1](img/Intro/models1.png)

# ## What is a Model?
#
# Two important types of models: generative, probabilistic.
#
# Models can explain data **and parameters**.
#
# Models have parameters **and hyper-parameters.**
#
# ![models2](img/Intro/models2.png)

# ## What is Learning?
#
# Three broad types of learning:
# - **Unsupervised: learn to predict new data.**
#   - Given data: what patterns are present? (learn a model).
#   - Given data and model: how likely is new data to be from same model? (generate new data).
# - **Supervised: learn to predict specific features of new data.**
#   - Classification: predict discrete features (learn a conditional model).
#   - Regression: predict continuous features (learn a conditional model).
# - **Inference: explain observed data.**
#   - Assuming a model: what parameters (with what uncertainties) best describe my data? (learn a model).
#   - Given competing models: which best describes my data? (model selection).
#
# (Also: reinforcement learning.)

# ## What is special about ML in Physics and Astronomy?
#
# Scientific applications of ML benefit greatly from advances in industry, but we work in a different context:
# - **We are data producers, not data consumers:**
#   - Experiment / survey design.
#   - Optimization of statistical errors.
#   - Control of systematic errors.
# - **Our data measures physical processes:**
#   - Measurements often reduce to counting photons, etc., with random errors that are known a priori.
#   - Dimensions and units are important.
# - **Our models are usually traceable to an underlying physical theory:**
#   - Models constrained by theory and previous observations.
#   - Parameter values often intrinsically interesting.
# - **A parameter uncertainty estimate is just as important as its value:**
#   - Prefer methods that handle input data uncertainties (weights) and provide output parameter uncertainty estimates.

# ## How will this course be different from a CS class?
#
# Physics and astronomy students have different preparation:
# - Strong background and experience with mathematical tools (linear algebra, multivariate calculus) needed for rigorous discussion of statistics.
# - Weak / varied background in traditional CS core topics of fundamental algorithms, databases, etc.
#
# Physics and astronomy research also has different needs:
# - Our data and models are often fundamentally different from those in typical CS contexts.
# - We ask different types of questions about our data, sometimes requiring new methods.
# - We have different priorities for judging a "good" method: interpretability, error estimates, etc.

# ## Topics Overview
#
# ![outline](img/Intro/outline.png)

# ## Exercise
#
# One of the first tasks when applying machine learning to a new problem is to establish some baselines for the expected performance:
# - How well does the simplest possible (non-ML) approach work?
# - What is the current "state of the art"?
# - If applicable: what is the "human performance" level?
#
# Let's get a "human performance" baseline for the following supervised-learning problem:

# **How many "sources" are present in an image?**
#
# You are now the machines:
# - I will show you 36 training images so you can **learn** how to perform this task.
# - Next, you must **classify** 12 test images. Enter your responses at https://goo.gl/qi2CEV.
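
# For the first baseline question above, one common choice of "simplest possible (non-ML) approach" to a classification task is to always guess the most frequent training label. The sketch below assumes that choice; the helper names `majority_baseline` and `accuracy` are hypothetical, and the source counts are invented for illustration rather than taken from the exercise images.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most common label seen in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Return the fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Hypothetical numbers of sources per image (not the real exercise data).
train_labels = [1, 2, 1, 3, 1, 2, 1, 1, 2, 3, 1, 2]
test_labels = [1, 1, 2, 3, 1, 2]

# The baseline ignores the images entirely and always predicts the same count.
guess = majority_baseline(train_labels)
predictions = [guess] * len(test_labels)
baseline_accuracy = accuracy(predictions, test_labels)

print(guess, baseline_accuracy)
```

# Any learned method, human or machine, should beat this number before its extra complexity is considered worthwhile.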