Requires HyperSpy v1.3 or above.
This tutorial shows how to perform matrix factorization and blind source separation on spectra using HyperSpy. The same procedure can be used to analyse objects of higher dimensionality.
This is a powerful method for reducing noise in hyperspectral datasets. Here we begin with a few lines of matrix algebra outlining the principles of the technique, but the demo can be followed easily without a full understanding of the mathematics. As with the "getting started" demo you can run this one interactively by downloading it and saving it as a .ipynb file.
Let's start by supposing a line spectrum $D$ that can be described by a linear model.
$D=\left[d_{i,j}\right]_{m\times n}={\displaystyle \left[{\color{green}p_{i,j}}\right]_{m\times{\color{red}l}}\times}\left[{\color{blue}s_{i,j}}\right]_{{\color{red}l}\times n}$ where $m$ is the number of pixels in the line scan, $n$ the number of channels in the spectrum and $l$ the number of components, e.g. spectra of individual compounds.
Normally, what we actually measure is a noisy version of $D$, $D'$,
$D'={\displaystyle \left[d'_{i,j}\right]_{m\times n}=\left[{\color{green}p_{i,j}}\right]_{m\times{\color{red}l}}\times}\left[{\color{red}{\color{blue}s_{i,j}}}\right]_{{\color{red}l}\times n}+\mathrm{Noise}$
$D'$ could be factorized as follows:
$D'={\displaystyle \left[{\tilde{\color{green}p}}_{{i,j}}\right]_{m\times{\color{red}k}}\times}\left[\tilde{s}_{i,j}\right]_{{\color{red}k}\times n} $ where $k\leq\min\left(m,n\right)$.
Extra constraints are needed to fully determine the matrix factorization. When we add the orthogonality constraint we refer to this decomposition as singular value decomposition (SVD).
Under the assumption of a linear model:
$D'={\displaystyle \left[{\tilde{\color{green}p}}_{{i,j}}\right]_{m\times{\color{red}l}}\times}\left[\tilde{s}_{i,j}\right]_{{\color{red}l}\times n}+ {\displaystyle \left[{\tilde{\color{green}p}}_{{i,j}}\right]_{m\times{\color{red}{k-l}}}\times}\left[\tilde{s}_{i,j}\right]_{{\color{red}{k-l}}\times n}$
With
$D\approx{\displaystyle \left[{\tilde{\color{green}p}}_{{i,j}}\right]_{m\times{\color{red}l}}\times}\left[\tilde{s}_{i,j}\right]_{{\color{red}l}\times n}$
$\mathrm{Noise}\approx{\displaystyle \left[{\tilde{\color{green}p}}_{{i,j}}\right]_{m\times{\color{red}{k-l}}}\times}\left[\tilde{s}_{i,j}\right]_{{\color{red}{k-l}}\times n}$
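The algebra above can be sketched in a few lines of NumPy. This toy example is not part of the demo dataset; the matrix sizes, random seed and noise level are arbitrary choices for illustration.
import numpy as np

# Toy example (not from the demo data): build a noisy linear mixture and factorize it
m, n, l = 100, 512, 3                                 # pixels, channels, true components
rng = np.random.default_rng(0)
P = rng.random((m, l))                                # mixing (loadings) matrix, m x l
S = rng.random((l, n))                                # component spectra, l x n
D_noisy = P @ S + 0.01 * rng.standard_normal((m, n))

# SVD factorizes the noisy matrix into components sorted by decreasing variance;
# keeping only the first l of them recovers an approximation of the noise-free D
U, sv, Vt = np.linalg.svd(D_noisy, full_matrices=False)
D_denoised = (U[:, :l] * sv[:l]) @ Vt[:l, :]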
We start by downloading the data for this demo, activating the matplotlib backend, importing HyperSpy and loading a dataset.
# Download and extract the demo datasets
from urllib.request import urlretrieve
from zipfile import ZipFile
files = urlretrieve("https://www.dropbox.com/s/dt6bc3dtg373ahw/machine_learning.zip?raw=1", "./machine_learning.zip")
with ZipFile("machine_learning.zip") as z:
z.extractall()
NOTE: In the online version of this document we use the widget backend, which displays interactive figures inside the Jupyter Notebook. For interactive data analysis purposes, the qt backend is another common choice.
%matplotlib widget
import hyperspy.api as hs
s = hs.load("machine_learning/CL1.hdf5")
This is a synthetic electron energy-loss spectroscopy dataset. The procedure, although not fully general, can easily be used as is or with minor adaptation to analyse other kinds of data including images and higher dimensional signals.
s.plot()
To perform SVD in HyperSpy we use the decomposition method, which performs SVD by default.
s.decomposition()
The result of the decomposition is stored in the learning_results attribute.
s.learning_results.summary()
Decomposition parameters:
-------------------------
Decomposition algorithm : svd
Poissonian noise normalization : False
Output dimension : None
Centre : None
SVD decomposes the data in so-called "components" and sorts them in order of decreasing relevance. It is often useful to estimate the dimensionality of the data by plotting the explained variance against the component index on a logarithmic y-scale. This plot is sometimes called a scree-plot and it should drop quickly, eventually becoming a slowly descending line. The point at which it becomes linear is often a good estimation of the dimensionality of the data (or equivalently, the number of components that should be retained).
To plot the scree plot, run the plot_explained_variance_ratio method, e.g.:
s.plot_explained_variance_ratio()
From the scree plot we estimate that there are 4 principal components.
We can store the scree plot as a Spectrum instance using the following method:
scree_plot = s.get_explained_variance_ratio()
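Because the scree plot is now an ordinary HyperSpy signal, its values can be inspected directly through the underlying numpy array. As a quick, illustrative check (not part of the original demo), the fraction of the total variance captured by the first four components is simply:
# The explained variance ratios live in the signal's data array
print("Variance explained by the first 4 components:", scree_plot.data[:4].sum())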
PCA assumes Gaussian noise; however, the noise in EELS spectra is approximately Poissonian (shot noise). It is possible to approximately "normalise" the noise by using a linear transformation, which should result in a better decomposition of data with shot noise. This is done in HyperSpy as follows:
s.decomposition(normalize_poissonian_noise=True)
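Conceptually, the normalization rescales the data by the square roots of the mean spectrum and the mean image before the SVD. The following NumPy lines are a rough sketch of that weighting (an illustration only, not HyperSpy's exact implementation):
import numpy as np

# Rough sketch of Poissonian noise normalization (illustrative, not HyperSpy's exact code)
X = s.data.reshape(-1, s.data.shape[-1]).astype(float)   # pixels x channels
a = X.mean(axis=1, keepdims=True)                         # mean counts per pixel
b = X.mean(axis=0, keepdims=True)                         # mean counts per channel
X_weighted = X / np.sqrt(np.maximum(a * b, 1e-12))        # the SVD is then run on this weighted matrix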
Let's plot the scree plots of this and the previous decomposition in the same figure:
ax = hs.plot.plot_spectra([scree_plot.isig[:20],
s.get_explained_variance_ratio().isig[:20]],
legend=("Std", "Normalized"))
Let's improve the plot using some matplotlib commands:
ax.set_yscale('log')
ax.lines[0].set_marker("o")
ax.lines[1].set_marker("o")
ax.figure #for details of how these commands work see the matplotlib webpage
As we can see, the explained variance of the first four components of the normalized decomposition are "further away" from the noise than those of the original data, indicating a better decomposition.
The following commands can be used to plot the first $n$ principal components.
(Note: the _ = part at the beginning is for webpage rendering purposes and is not strictly needed in the command.)
_ = s.plot_decomposition_loadings(4)
_ = s.plot_decomposition_factors(4, comp_label="")
Alternatively (and usually more conveniently) we can use the following plotting method. Use the slider or the left/right arrow keys to change the index of the components in interactive plotting mode.
s.plot_decomposition_results()
A common application of PCA is noise reduction, which is achieved by dimensionality reduction. We can create a "denoised" version of the dataset by inverting the decomposition using only the number of principal components that we want to retain (in this case the first 4 components). This is done with the get_decomposition_model method.
sc = s.get_decomposition_model(4)
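Under the hood, the model is essentially the truncated product of the loadings and factors stored by the decomposition. A hedged sketch of the equivalent matrix operation (it ignores details such as the noise normalization, which get_decomposition_model handles for us):
import numpy as np

# Sketch of the reconstruction: keep only the first 4 components
loadings = s.learning_results.loadings[:, :4]     # pixels x 4
factors = s.learning_results.factors[:, :4]       # channels x 4
reconstruction = (loadings @ factors.T).reshape(s.data.shape)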
Let's plot the spectra at coordinates (30, 30) from the original and PCA-denoised datasets:
(s + sc * 1j).plot()
Calculating and plotting the residuals at a given position can be done in a single line:
(s - sc).inav[30,30].plot()
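As a rough sanity check (not part of the original demo), the spread of the residual can be compared with that of the data itself; for a good low-rank model the residual should be dominated by noise and therefore much smaller:
residual = (s - sc).data
print("Residual std:", residual.std(), "Data std:", s.data.std())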
As we have seen in the previous section, the principal components are a linear mixture of EELS elemental maps and spectra, but the mixing matrix is unknown. We can use blind source separation (BSS) to estimate the mixing matrix. In this case we will use independent component analysis. BSS is performed in HyperSpy using the blind_source_separation method, which by default uses FastICA, a well-tested ICA routine.
s.blind_source_separation(4, diff_order=1)
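Roughly speaking, HyperSpy estimates an unmixing matrix with FastICA on the (optionally differentiated) decomposition factors and then applies it to both factors and loadings. The lines below are a loose sketch of that idea using scikit-learn directly; they assume the learning_results.factors array and gloss over HyperSpy's exact pre-processing, so the result will not match blind_source_separation exactly.
import numpy as np
from sklearn.decomposition import FastICA

# Loose sketch of the BSS step (illustrative only, not the exact implementation)
factors = s.learning_results.factors[:, :4]      # first four SVD factors, channels x 4
diff_factors = np.diff(factors, n=1, axis=0)     # first derivative along the energy axis (diff_order=1)
ica = FastICA(n_components=4, random_state=0)
ica.fit(diff_factors)                            # estimate the unmixing matrix from the derivatives
ica_factors = factors @ ica.components_.T        # apply it to the undifferentiated factors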
The results are also stored in the learning_results attribute:
s.learning_results.summary()
Decomposition parameters:
-------------------------
Decomposition algorithm : svd
Poissonian noise normalization : True
Output dimension : None
Centre : None
Demixing parameters:
------------------------
BSS algorithm : sklearn_fastica
Number of components : 4
And can be visualised with the following commands:
_ = s.plot_bss_loadings()
_ = s.plot_bss_factors()
Or usually more conveniently:
s.plot_bss_results()
By using a different matrix factorization method, non-negative matrix factorization (NMF), we can decompose the data into "elemental" components directly. NMF replaces the orthogonality constraint of SVD with a positivity constraint.
s.decomposition(True, algorithm="nmf", output_dimension=4)
_ = s.plot_decomposition_loadings()
_ = s.plot_decomposition_factors()
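HyperSpy delegates NMF to scikit-learn. For reference, a rough equivalent of the call above using scikit-learn directly is sketched below; the init and max_iter choices are illustrative, and the data is clipped to keep it non-negative as NMF requires.
import numpy as np
from sklearn.decomposition import NMF

# Rough equivalent of the NMF decomposition above (illustrative parameter choices)
X = np.clip(s.data.reshape(-1, s.data.shape[-1]), 0, None)   # pixels x channels, non-negative
nmf = NMF(n_components=4, init="nndsvd", max_iter=500)
nmf_loadings = nmf.fit_transform(X)      # non-negative maps, pixels x 4
nmf_factors = nmf.components_            # non-negative spectra, 4 x channels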
ICA and NMF both do a good job, with NMF being slightly better in this case. However, some mixing remains in both cases. For ICA the main cause is noise, as we will show by analysing an identical dataset with less noise in the same way.
s2 = hs.load("machine_learning/CL2.hdf5")
s2.decomposition(True)
s2.blind_source_separation(4, diff_order=2)
_ = s2.plot_bss_loadings()
_ = s2.plot_bss_factors()
Note that the better SNR permits using a second numerical derivative which enhances the independence of the sources, improving the performance of ICA in this case.
s2.decomposition(True, algorithm="nmf", output_dimension=4)
_ = s2.plot_decomposition_loadings()
_ = s2.plot_decomposition_factors()
For this dataset, with its better SNR, ICA outperforms NMF.
EELS datasets sometimes suffer from energy instability. This introduces a non-linearity in the data, resulting in a higher number of components than there are genuine spectral signatures coming from the object. As an example, let's have a look at the CL3 dataset.
s3 = hs.load("machine_learning/CL3.hdf5")
s3.plot()
s3.decomposition(True)
s3.plot_explained_variance_ratio()
The scree plot now suggests that we need more components to accurately explain this dataset.
In this case we can partially correct the energy instability using the low-loss spectrum image contained in the LL3.hdf5 file. (To do this with real data you would need low-loss spectra acquired simultaneously, e.g. using a dual EELS camera.)
ll3 = hs.load("machine_learning/LL3.hdf5")
ll3.plot()
ll3
<EELSSpectrum, title: , dimensions: (64, 64|1024)>
ll3.align_zero_loss_peak(also_align=[s3])
Initial ZLP position statistics
-------------------------------
Summary statistics
------------------
mean: 0.006
std: 1.000
min: -3.500
Q1: -0.500
median: 0.000
Q3: 0.500
max: 3.500
s3.decomposition(True)
s3.plot_explained_variance_ratio()
The scree plot is now more similar to that of the CL1 dataset. However, the alignment is not perfect and the scree plot still shows the effect of the remaining instability. Nevertheless, performing BSS on the first four components yields components very similar to those of CL1.
s3.blind_source_separation(4, diff_order=2)
_ = s3.plot_bss_loadings()
_ = s3.plot_bss_factors()
Another common complication in real data is spikes. As an example, let's load the CL4 dataset.
s4 = hs.load("machine_learning/CL4.hdf5")
s4.decomposition(True)
s4.plot_explained_variance_ratio()
In this case the increase in the number of components required to explain the dataset is due to a large number of spikes and the same energy instability as in the CL3 dataset. We can remove the spikes using the spikes_removal_tool.
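Before launching the tool, a crude count of the spectra containing obvious spikes can give a feel for the scale of the problem. This check is only an illustration and is unrelated to how the tool itself detects spikes; the threshold is arbitrary.
import numpy as np

# Crude, illustrative spike check: flag spectra whose largest channel-to-channel jump
# is far above the typical jump across the dataset (arbitrary threshold)
jumps = np.abs(np.diff(s4.data, axis=-1)).max(axis=-1)
threshold = jumps.mean() + 10 * jumps.std()
print("Spectra likely containing spikes:", int((jumps > threshold).sum()))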
s4.spikes_removal_tool()
And correct the instability as above:
ll4 = hs.load("machine_learning/LL3.hdf5")
ll4.align_zero_loss_peak(also_align=[s4])  # after this, re-run the decomposition and re-plot the scree plot using the lines above
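Following the comment above, the final steps simply repeat the earlier commands on the corrected dataset:
s4.decomposition(True)
s4.plot_explained_variance_ratio()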