#!/usr/bin/env python
# coding: utf-8

# # Machine Learning and Statistics for Physicists

# Material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).
# 
# Content is maintained on [GitHub](https://github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).
# 
# ##### &#9658; [View table of contents](Contents.ipynb)

# This notebook can optionally be viewed as a [slide presentation](https://medium.com/learning-machine-learning/present-your-data-science-projects-with-jupyter-slides-75f20735eb0f). Click [here](https://nbviewer.jupyter.org/format/slides/github/dkirkby/MachineLearningStatistics/blob/master/notebooks/Intro.ipynb#/) to view the slides online or, to present the slides locally, use:
# ```
# jupyter nbconvert Intro.ipynb --to slides --post serve
# ```

# ## Introduction

# **ACTIVITY:** Discuss these questions:
# 1. What is the relationship between *machine learning* and *statistics*?
# 2. Does your research focus more on *data* or *models*?
# 3. What is a *data scientist*?
# 4. What is "deep" about *deep learning*?

# ### What is "Machine Learning"?
# 
# Using **machines** to **learn** how to explain data with models.

# ### What is "Machine Learning"?
# 
# Using **machines** to **learn** how to explain data with models.
# 
# The "machines" responsible for most of the progress in ML are:
#  - software algorithms
#  - hardware architectures
#  - human ingenuity
#  
# The "learning" consists of passively identifying statistical correlations, which is very different from how we learn: through active experimentation and by identifying causal relationships.

# ### What is "Machine Learning"?
# 
# Using machines to learn how to explain **data** with **models**.
# 
# ![MLS-triangle1](img/Intro/MLS-triangle1.png)

# ### What is "Machine Learning"?
# 
# Machine learning uses models to learn from data.
# 
# ![MLS-triangle2](img/Intro/MLS-triangle2.png)

# Further reading:
# - [Data mining and statistics: what's the connection?](http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf)
# - [The rise of the "data engineer"](https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603)
# - [Humorous contrasts between ML and Stats](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)
#     - Python $\leftrightarrow$ R
#     - conference talk $\leftrightarrow$ journal article

# ## What is Data?
# 
# Data is (are?) a finite set of measurements:
# - Usually viewed as a 2D table, e.g., a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
# - **columns = features**
# - **rows = samples** (observations)
# - Richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.
# 
# ![data-table](img/Intro/data-table.png)

# ## What is Data?
# 
# Data is (are?) a finite set of measurements:
# - Usually viewed as a 2D table, e.g., a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
# - **columns = features**
# - **rows = samples** (observations)
# - Richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.
# 
# Questions to ask about your data:
# - Are my features categorical / discrete / continuous?
# - Is the ordering of my samples significant?
# - Are my samples statistically independent? Are they drawn from the same distribution?
# - What are my measurement uncertainties?
# - Is my data binned / un-binned?
# - Is there a natural similarity / distance measure on my samples (rows)?
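
# A minimal sketch of the table layout described above, assuming NumPy is available. The array names, sizes, and random values are illustrative only and are not taken from the course material. It builds a small table with rows = samples and columns = features, then flattens a stack of images into the same layout.

```python
import numpy as np

rng = np.random.default_rng(seed=123)

# A table of 5 samples (rows) with 3 features (columns).
table = rng.normal(size=(5, 3))
n_samples, n_features = table.shape

# A stack of 4 hypothetical 8x8 pixel images: richer than a 2D table.
images = rng.uniform(size=(4, 8, 8))

# Flatten each image into one row of 64 pixel features.
flat = images.reshape(len(images), -1)

print(table.shape, flat.shape)
```

# Flattening discards the explicit neighbor relationships between pixels, so a method that relies on them needs to be given the original image shape.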

# **ACTIVITY:** Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.
# 1. Learn a fast approximation to a slow exact calculation.
# 2. Learn to identify Higgs particle decays from LHC event data.
# 3. Learn to estimate the distance to a quasar using optical images.

# ## What is a Model?
# 
# Two important types of models: generative and probabilistic.
# 
# All ML algorithms use a model to explain your data.
# 
# Models have parameters.
# 
# ![models1](img/Intro/models1.png)

# ## What is a Model?
# 
# Two important types of models: generative and probabilistic.
# 
# Models can explain data **and parameters**.
# 
# Models have parameters **and hyper-parameters.**
# 
# ![models2](img/Intro/models2.png)
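
# One way to make the distinction between parameters and hyper-parameters concrete is a polynomial fit: the coefficients are parameters learned from the data, while the polynomial degree is a hyper-parameter chosen before fitting. The sketch below assumes NumPy and uses made-up data generated from a hypothetical straight-line model with Gaussian noise. It is an illustration, not the course's own example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Generative step: draw data from a hypothetical "true" model y = 2x + 1 plus noise.
true_slope, true_intercept, noise_sigma = 2.0, 1.0, 0.1
x = np.linspace(0.0, 1.0, 50)
y = true_slope * x + true_intercept + rng.normal(scale=noise_sigma, size=x.size)

# Hyper-parameter: the polynomial degree, fixed before fitting.
degree = 1

# Parameters: the coefficients learned from the data (highest power first).
slope, intercept = np.polyfit(x, y, deg=degree)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

# Changing `degree` changes how many parameters the fit has to learn, which is why it has to be chosen separately from the fit itself.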

# ## What is Learning?
# 
# Three broad types of learning:
#  - **Unsupervised: Learn to predict new data.**
#    - Given data: what patterns are present? (learn a model).
#    - Given data and model: how likely is new data to be from the same model? (generate new data).
#  - **Supervised: Learn to predict specific features of new data.**
#    - Classification: predict discrete features (learn a conditional model).
#    - Regression: predict continuous features (learn a conditional model).
#  - **Inference: explain observed data.**
#    - Assuming a model: what parameters (with what uncertainties) best describe my data? (learn a model).
#    - Given competing models: which best describes my data? (model selection).
#  
# (Also: reinforcement learning.)
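
# As a small illustration of inference, the sketch below assumes a Gaussian model with an unknown mean and estimates that parameter, with an uncertainty, from simulated data. The true values, sample size, and seed are arbitrary choices made for this example. Only the Python standard library is used.

```python
import math
import random
import statistics

random.seed(42)

# Simulated measurements from a hypothetical Gaussian model.
true_mu, true_sigma, n = 5.0, 2.0, 1000
data = [random.gauss(true_mu, true_sigma) for _ in range(n)]

# Parameter estimate: the sample mean estimates mu.
mu_hat = statistics.mean(data)

# Uncertainty estimate: the standard error of the mean, s / sqrt(n).
sigma_hat = statistics.stdev(data)
mu_err = sigma_hat / math.sqrt(n)

print(f"mu = {mu_hat:.3f} +/- {mu_err:.3f}")
```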

# ## What is special about ML in Physics and Astronomy?
# 
# Scientific applications of ML benefit a lot from advances in industry, but we work in a different context:
# - **We are data producers, not data consumers:**
#   - Experiment / survey design.
#   - Optimization of statistical errors.
#   - Control of systematic errors.
# - **Our data measures physical processes:**
#   - Measurements often reduce to counting photons, etc., with random errors that are known a priori.
#   - Dimensions and units are important.
# - **Our models are usually traceable to an underlying physical theory:**
#   - Models constrained by theory and previous observations.
#   - Parameter values often intrinsically interesting.
# - **A parameter uncertainty estimate is just as important as its value:**
#   - Prefer methods that handle input data uncertainties (weights) and provide output parameter uncertainty estimates.
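
# The preference for methods that take input uncertainties and return an output uncertainty can be illustrated with an inverse-variance weighted mean, a standard way to combine independent measurements of the same quantity. The function name and the measurement values below are made up for this sketch.

```python
import math

def weighted_mean(values, sigmas):
    """Combine independent measurements using inverse-variance weights.

    Returns the weighted mean and its one-sigma uncertainty.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, 1.0 / math.sqrt(total)

# Three hypothetical measurements of one quantity, with different errors.
values = [10.2, 9.8, 10.5]
sigmas = [0.2, 0.1, 0.4]
mean, err = weighted_mean(values, sigmas)
print(f"{mean:.3f} +/- {err:.3f}")
```

# The combined uncertainty is smaller than the smallest input uncertainty, and the most precise measurement dominates the result.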

# ## How will this course be different from a CS class?
# 
# Physics and astronomy students have different preparation:
# - Strong background and experience with mathematical tools (linear algebra, multivariate calculus) needed for rigorous discussion of statistics.
# - Weak / varied background in traditional CS core topics such as fundamental algorithms, databases, etc.
# 
# Physics and astronomy research also has different needs:
# - Our data and models are often fundamentally different from those in typical CS contexts.
# - We ask different types of questions about our data, sometimes requiring new methods.
# - We have different priorities for judging a "good" method: interpretability, error estimates, etc.

# ## Topics Overview
# 
# ![outline](img/Intro/outline.png)

# ## Exercise
# 
# One of the first tasks when applying machine learning to a new problem is to establish some baselines for the expected performance:
#  - How well does the simplest possible (non-ML) approach work?
#  - What is the current "state of the art"?
#  - If applicable: what is the "human performance" level?
#  
# Let's get a "human performance" baseline for the following supervised-learning problem:
# **How many "sources" are present in an image?**
# 
# You are now the machines:
#  - I will show you 36 training images so you can **learn** how to perform this task.
#  - Next, you must **classify** 12 test images. Enter your responses at https://goo.gl/qi2CEV.
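
# For a classification task like this one, the simplest non-ML baseline is often to always guess the most common training label. The sketch below uses made-up label lists, not the actual exercise images or responses, and the function name is hypothetical.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    guess, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == guess)
    return guess, hits / len(test_labels)

# Made-up source counts per image, for illustration only.
train = [1, 2, 2, 3, 2, 1, 2, 3, 2, 1]
test = [2, 1, 2, 3]
guess, accuracy = majority_baseline(train, test)
print(guess, accuracy)
```

# A learned classifier, or a human, is only doing useful work if it scores above this number.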