In this part of the talk we will show how to combine Engineering expertise and Machine Learning for signal analysis and Virtual Sensors. We will not be able to go very deep into the subject, but please refer to this notebook if you want more details. To use the notebook, please install the latest versions of NumPy, scikit-learn, pandas, TensorFlow, and neon.
import numpy as np
import pandas as pd
from neon import NervanaObject
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from IPython.display import Image
from bokeh.io import vplot, gridplot
import bokeh.plotting as bk
bk.output_notebook()
data = pd.read_csv('example_data/data.csv', parse_dates=['X0'])
Building a model usually comprises a few standard steps. In the course of this talk we will outline two of the main stages of a modeling pipeline: data preprocessing and model building (in this case a time series regression model for forecasting).
Data preprocessing refers to all the steps necessary to transform raw data into an understandable and usable format. This procedure is common to many Machine Learning (ML) and Data Mining tasks. In our work, however, we integrate into the process the Engineering knowledge of the system we are studying. In the following sections we will briefly describe the steps we usually perform on data, with particular attention to the engineering point of view of signal processing. Other steps are possible: for example, feature elimination is a common approach to reduce the dimensionality of a problem in an uninformed way. In our case the approach to feature engineering is instead an engineering one: we select the most informative features based on experience and problem-specific know-how.
The steps covered will be data cleaning and filtering.
During the data cleaning step we check for outliers and missing values, and we try to resolve inconsistencies in the data.
As an example, in the next plot we can see that the quantity we are measuring presents some "holes" over time. In this case they correspond to NaN values in the data, meaning some values were not measured by the system. There are several ways to deal with this situation, but they depend on the particular problem at hand. In our case we often deal with time series data, which has its own peculiarities.
If the missing data comprise many values, as in this case, it is better to ignore them altogether: we simply cut away the NaNs. For time series data this simple approach may be the best, since time series values are correlated with preceding ones, and introducing synthetic values could cause the model to diverge or behave erratically. If, however, the missing data consist of a few points that are not contiguous, it is possible to interpolate and reconstruct the data. Even in this case care must be taken: if the dynamics of the system vary at a frequency close to the sampling frequency, losing a single data point may mean losing a lot of information; on the contrary, if the dynamics of the system are slow with respect to the sampling frequency, losing a few contiguous data points is not harmful.
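Both strategies are straightforward in pandas. A minimal sketch on a synthetic series (the series, the gap positions, and the `limit=1` choice are all illustrative assumptions, not values from our data):

```python
import numpy as np
import pandas as pd

# Hypothetical time series with two kinds of gaps: a long contiguous
# block of NaNs and a single isolated missing point.
rng = pd.date_range('2016-01-01', periods=12, freq='H')
s = pd.Series(np.sin(np.linspace(0, 3, 12)), index=rng)
s.iloc[3:8] = np.nan   # long contiguous gap
s.iloc[10] = np.nan    # isolated missing point

# Long contiguous gaps: cut the NaNs away rather than invent values.
cut = s.dropna()

# Isolated points: linear interpolation reconstructs the data;
# limit=1 prevents filling more than one consecutive NaN, so the
# long gap is left mostly untouched.
filled = s.interpolate(limit=1)
```

Note that `interpolate` would happily fill the long gap too if we let it; the `limit` argument is how we encode the "few non-contiguous points only" rule from above.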
If the signals come from a sensor and represent a physical quantity, it is possible to inspect the data to find faulty points or outliers and manually remove them. For example, some sensors have an automated diagnostic tool that outputs a full-scale value when they sense that they are faulty. Another sanity check is to compare the output of a sensor with a simple physical model that describes its behavior: if the two are roughly similar, the signal is considered good. Engineering experts know the physics of the system and integrate this information into the process of distinguishing between real outliers (values that are not physically possible) and anomalous working conditions of the dynamic system that we still want to analyse.
For example, suppose you are receiving acceleration and speed signals from a car (on a flat road); from physics we know they are correlated, so if the two disagree, for example the speed is increasing while the acceleration signal says the car is decelerating, the engineer can conclude that the measurement is obviously an outlier.
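This kind of physics-based consistency check can be sketched in a few lines. Everything here is illustrative: the synthetic speed profile, the injected fault, and the tolerance threshold are assumptions for the example, not part of our pipeline.

```python
import numpy as np

fs = 10.0                        # assumed sampling frequency [Hz]
t = np.arange(0, 10, 1.0 / fs)
speed = 2.0 * t                  # car accelerating at a constant 2 m/s^2
accel = np.full_like(t, 2.0)     # acceleration reported by the sensor

# Inject a faulty reading: the accelerometer briefly reports braking
# while the speed keeps increasing.
accel[50] = -5.0

# Engineering check: the acceleration signal should match the
# numerical derivative of the speed signal.
speed_deriv = np.gradient(speed, 1.0 / fs)
outliers = np.abs(accel - speed_deriv) > 1.0   # tolerance is an assumption

print(np.where(outliers)[0])   # -> [50]
```

The threshold encodes how much disagreement we tolerate between the two sensors; in practice it would come from the known noise level of the accelerometer.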
tools = 'pan,wheel_zoom,box_zoom,reset'
fig_a = bk.figure(plot_width=800, plot_height=350,
                  x_axis_label='time', y_axis_label='value',
                  x_axis_type='datetime', tools=tools)
fig_a.line(data['X0'], data['y'])
bk.show(fig_a)
The preceding graph depicts a physical quantity that varies over time, and it is possible to see where data is missing. It is worth pointing out that in real-life scenarios missing data is the rule rather than the exception.
Filtering is applied to time series to remove unwanted components from a signal: noise, or other components that are not meaningful from an engineering standpoint. To clarify: if we know that a signal represents a physical quantity that varies within a certain band of frequencies, removing the frequencies outside that band is important; moreover, it is harmless, because we know that any content outside the band is certainly noise. Filtering is applied if and only if we have prior knowledge of the system; otherwise, cutting frequencies at random may remove information from the signal.
Let's look at an example. First we create two sine waves, one with a frequency of 2 Hz and the other with a frequency of 40 Hz. The third wave is the sum of the first two.
from scipy.signal import butter
from scipy.signal import filtfilt, lfilter
fs = 2000
end = 4
t = np.arange(0, end, 1.0/fs)
x = []
for f in [2, 40]:
    x.append(np.sin(2 * np.pi * f * t))
fig1 = bk.figure(plot_width=800, plot_height=200,
                 x_axis_label='time', y_axis_label='amplitude',
                 tools=tools)
fig1.line(t, x[0], legend='signal of interest')
fig2 = bk.figure(plot_width=800, plot_height=200,
                 x_axis_label='time', y_axis_label='amplitude',
                 x_range=fig1.x_range, tools=tools)
fig2.line(t, x[1], legend='noise')
fig3 = bk.figure(plot_width=800, plot_height=200,
                 x_axis_label='time', y_axis_label='amplitude',
                 x_range=fig1.x_range, tools=tools)
fig3.line(t, x[0] + x[1], legend='signal+noise')
bk.show(gridplot([[fig1], [fig2], [fig3]]))
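With the prior knowledge that the signal of interest lives at 2 Hz, we can treat the 40 Hz component as noise and remove it with a low-pass filter. A minimal self-contained sketch using scipy's `butter` and `filtfilt`; the filter order (4) and the 10 Hz cutoff are our choices for illustration, not the only reasonable ones:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 2000                             # sampling frequency [Hz]
t = np.arange(0, 4, 1.0 / fs)
signal = np.sin(2 * np.pi * 2 * t)    # 2 Hz component of interest
noise = np.sin(2 * np.pi * 40 * t)    # 40 Hz "noise" component

# 4th-order low-pass Butterworth; the cutoff frequency is
# normalized to the Nyquist frequency fs/2.
b, a = butter(4, 10.0 / (fs / 2.0), btype='low')

# filtfilt applies the filter forward and backward, giving zero
# phase distortion -- useful when the filtered signal feeds a model.
filtered = filtfilt(b, a, signal + noise)

# Away from the edges, the filtered signal closely matches the
# clean 2 Hz wave.
err = np.max(np.abs(filtered[1000:-1000] - signal[1000:-1000]))
```

`lfilter` would work too, but it is causal and introduces a phase lag; `filtfilt` avoids that at the cost of not being usable in real time.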