import addutils.toc ; addutils.toc.js(ipy_notebook=True)
from addutils import css_notebook
css_notebook()
Today we see an explosion of applications that are wide and connected, with an emphasis on storage and processing. Most companies store a lot of data but have not solved the problem of what to do with it. Most of this information is kept in raw form: there is a huge amount of information locked up in databases that is potentially important but has not yet been discovered. The objective of these tutorials is to show the fundamental techniques to Discover Meaningful Information in Data and to use state-of-the-art algorithms for Building Models from Data.
Machine Learning is a technology that is currently having a huge impact on business and society. Many big tech companies, such as Google, Facebook, Twitter and Amazon, employ Machine Learning algorithms for ranking web pages, tagging photos, filtering spam, recommending products and many other use cases.
Traditionally, computers are programmed with specific algorithms to perform well-defined tasks, for example finding the shortest path from A to B. For many of the most important tasks, such as driving a car, we are not able to write such a program explicitly. The only way is to program the machine to learn by itself.
To do so, scientists founded the field of Artificial Intelligence (AI) in the 1950s. The field grew rapidly and now encompasses many subfields, ranging from the general, such as learning, to the specific, such as playing Go. AI has many approaches, but historically its main goals were to act and behave like humans and to think rationally.
In recent years, Machine Learning (ML) and Deep Learning (DL) have emerged as subfields of AI. In contrast with the general aim of AI (namely building sentient machines), the goal of ML is to learn, that is, to program computers to improve automatically with experience.
ML is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data. DL is about modeling high-level abstractions in data by using model architectures composed of multiple non-linear transformations.
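As a minimal sketch of "tunable parameters adjusted automatically from previously seen data", the snippet below fits a straight line with plain gradient descent. The data, the learning rate and the number of iterations are made-up values chosen for illustration, and the helper names (`predict`, `mean_squared_error`) are hypothetical.

```python
import numpy as np

# Made-up "previously seen data": points lying exactly on y = 3x + 2
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0

# Tunable parameters: an array of floating point values [slope, intercept]
params = np.zeros(2)


def predict(params, x):
    """Linear model: slope * x + intercept."""
    return params[0] * x + params[1]


def mean_squared_error(params, x, y):
    """Average squared difference between predictions and targets."""
    return float(np.mean((predict(params, x) - y) ** 2))


initial_error = mean_squared_error(params, x, y)

learning_rate = 0.5  # assumed value, small enough for this problem to converge
for _ in range(2000):
    residual = predict(params, x) - y
    # Gradient of the mean squared error with respect to [slope, intercept]
    gradient = np.array([2.0 * np.mean(residual * x), 2.0 * np.mean(residual)])
    # Adjust the parameters in the direction that reduces the error
    params -= learning_rate * gradient

final_error = mean_squared_error(params, x, y)
print(params, initial_error, final_error)
```

The program is never told the slope or the intercept; it recovers them only by repeatedly reducing its error on the data it has seen.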
Other fields of computer science also deal with information and data. For example, Data Mining is the extraction of implicit, previously unknown and potentially useful information from unstructured data.
In recent years the boundaries between all these disciplines and fields have become blurred, as all of them borrow techniques from one another. A simple diagram of the interrelation of these fields can be seen in the picture below.
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
Machine learning algorithms:
In SUPERVISED LEARNING, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features.
In general, a learning problem uses a set of n data samples to predict properties of unknown data. Data are usually organized in tables where rows (first axis) represent the samples (or instances) and columns represent attributes (or features). For Supervised Learning, another array of classes or target variables (the "right answers") is also provided.
We can separate learning problems into a few large categories:
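The data layout described above can be sketched with NumPy arrays: a feature table `X` with one row per sample, and a label array `y` with one entry per row. The toy values and the very simple nearest-neighbor estimator below are assumptions chosen for illustration, not a method prescribed by the text.

```python
import numpy as np

# Made-up toy data: 6 samples (rows, first axis) and 2 features (columns)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [4.0, 4.2], [3.8, 4.1], [4.1, 3.9]])
# Labels (the "right answers"), one per sample
y = np.array([0, 0, 0, 1, 1, 1])

n_samples, n_features = X.shape


def predict_nearest_neighbor(X_train, y_train, X_new):
    """Predict the label of each new sample from its closest training sample."""
    # Pairwise squared Euclidean distances, shape (n_new, n_train)
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.sum(diffs ** 2, axis=2)
    # For every new sample, take the label of the nearest training sample
    return y_train[np.argmin(distances, axis=1)]


# Unknown data: features only, no labels
X_unknown = np.array([[1.0, 0.9], [4.0, 4.0]])
predicted = predict_nearest_neighbor(X, y, X_unknown)
print(n_samples, n_features, predicted)
```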
In UNSUPERVISED LEARNING the data has no labels, and we are interested in finding similarities between the samples.
Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. Some unsupervised learning problems are:
CLUSTERING is the task of grouping similar items together
DENSITY ESTIMATION is a task where we want to find statistical values that describe the data
DIMENSIONALITY REDUCTION is the task of reducing the number of features while keeping most of the information
UNSUPERVISED / SUPERVISED LEARNING: in DL the two approaches are usually combined. The DL layers (Restricted Boltzmann Machines, Autoencoders, Convolutional Neural Networks) are used to learn the most significant features of the data. Those features are then used with standard ML regressors or classifiers.
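To make the dimensionality reduction item above concrete, here is a minimal sketch using principal component analysis computed with a singular value decomposition. PCA is one common technique chosen here for illustration; the text does not name a specific method, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 100 samples and 3 strongly correlated features plus small noise
t = rng.normal(size=100)
X = np.column_stack([t, 2.0 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

# PCA works on centered data (every feature has zero mean)
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Keep only the first direction: 3 features are reduced to 1
n_components = 1
X_reduced = X_centered @ Vt[:n_components].T

print(X_reduced.shape, explained_variance_ratio)
```

No labels are used anywhere: the reduced representation is found from the structure of the samples alone.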
2) Data Collection
Data collection may require the use of specialized hardware such as a sensor network, manual labor such as the collection of user surveys, or software tools such as a Web document crawling engine to collect documents. This stage is highly application-specific and it is critically important because good choices at this stage may significantly impact future stages of the process. After the collection phase, the data are often stored in a database or in a variety of file formats, for later processing.
3) Data Preparation
The data is often not in a form that is suitable for processing. For example, the data may be encoded in complex logs or documents without a structure. In many cases, different types of data may be arbitrarily mixed together. To make the data suitable for processing, it is essential to transform it into a format that is friendly to ML algorithms, such as a multidimensional, time series, or semistructured format.
The multidimensional format is the most common one, in which different fields of the data correspond to the different measured properties that are referred to as features, attributes, or dimensions. It is crucial to extract relevant features.
The feature extraction phase is often performed in parallel with data cleaning, where missing and erroneous parts of the data are either estimated or corrected. In many cases, the data may be extracted from multiple sources and need to be integrated into a unified format for processing. The final result of this procedure is a tidy data set, which can be effectively used by a computer program.
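The cleaning and integration steps above can be sketched as follows. The two sources, the field names and the choice of filling missing values with the column mean are all assumptions made for illustration; other estimation strategies are equally possible.

```python
import numpy as np

# Hypothetical raw records from two sources; None marks a missing measurement
source_a = [{"height_cm": 170.0, "weight_kg": 65.0},
            {"height_cm": None, "weight_kg": 80.0}]
source_b = [{"height_cm": 180.0, "weight_kg": None},
            {"height_cm": 160.0, "weight_kg": 55.0}]

feature_names = ["height_cm", "weight_kg"]

# Integrate both sources into one table: rows are samples, columns are features
records = source_a + source_b
table = np.array(
    [[np.nan if record[name] is None else record[name] for name in feature_names]
     for record in records],
    dtype=float,
)

# Estimate each missing value with the mean of the observed values in its column
column_means = np.nanmean(table, axis=0)
tidy = np.where(np.isnan(table), column_means, table)

print(tidy)
```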
4) Model Training and Evaluation
The goal of Model Training and Evaluation is to find which combinations of algorithms and datasets are good at picking out the structure of the problem, so that they can be studied in more detail with focused experiments.
More focused experiments with well-performing families of algorithms may be performed in this step, but algorithm tuning is left for the next step.
In this phase we must answer the question:
How can I improve the solution? More data (if possible!)? Fewer features? More features? A simpler learning algorithm? A more complex learning algorithm?
This is known as Model Selection: any modeling technique can be used to construct a continuum of models, from simple to complex. One of the key issues in modeling is model selection, which involves picking the appropriate level of complexity for a model given a data set. Although model selection methods can be automated to some degree, model selection cannot be avoided. If someone claims otherwise, or does not emphasize their expertise in model selection, one should be suspicious of their abilities.
Here is a list of things to consider carefully while developing ML systems:
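A minimal sketch of model selection, assuming polynomial regression as the continuum of models (the degree controls complexity) and a held-out validation set as the selection criterion. The synthetic data, the candidate degrees and the split sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: a noisy sine wave
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=60)

# Fit on one part of the data, compare models on the other part
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]


def mse(coefficients, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coefficients, x) - y) ** 2))


# A continuum of models from simple (degree 1) to more complex (degree 6)
candidate_degrees = [1, 2, 3, 4, 5, 6]
train_errors, val_errors = {}, {}
for degree in candidate_degrees:
    coefficients = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mse(coefficients, x_train, y_train)
    val_errors[degree] = mse(coefficients, x_val, y_val)

# Pick the complexity with the lowest error on data not used for fitting
best_degree = min(candidate_degrees, key=lambda d: val_errors[d])
print(best_degree, train_errors, val_errors)
```

The training error can only go down as the degree grows, so it cannot be used to choose the complexity; the validation error is what separates a model that is too simple from one that is too complex.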
Model Evaluation: Once a model has been built, the natural question to ask is how accurate it is. Here we describe common sorts of deception that can occur in assessing and evaluating a model:
Failing to use an independent test set: To obtain a fair estimate of performance, the model must be evaluated on examples that were not contained in the training set. The available data must be split into nonoverlapping subsets, with the test set reserved only for evaluation.
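A nonoverlapping split can be sketched by shuffling the sample indices once and cutting them in two. The 80/20 proportion and the sample count are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100      # assumed size of the available data
test_fraction = 0.2  # assumed proportion reserved for evaluation

# Shuffle the indices once, then cut them into two disjoint groups
indices = rng.permutation(n_samples)
n_test = round(n_samples * test_fraction)
test_indices = indices[:n_test]    # reserved only for evaluation
train_indices = indices[n_test:]   # used to build the model

print(len(train_indices), len(test_indices))
```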
Assuming stationarity of the test environment: For many difficult problems, a model built based on historical data will become a poorer and poorer predictor as time goes on, because the environment is nonstationary--the rules and behaviors of individuals change over time. Consequently, the best measure of a model's true performance will be obtained if it is tested on data from a different point in time relative to the training data.
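When the records carry a timestamp, testing on a different point in time can be sketched by splitting on a cutoff date instead of at random. The records and the cutoff below are hypothetical.

```python
from datetime import date

# Hypothetical records: (observation date, feature value, label)
records = [
    (date(2022, 5, 10), 0.2, 0),
    (date(2023, 2, 3), 0.7, 1),
    (date(2022, 9, 21), 0.5, 0),
    (date(2023, 6, 30), 0.9, 1),
    (date(2022, 12, 31), 0.4, 1),
]

# Assumed cutoff: train on everything before it, test on everything after
cutoff = date(2023, 1, 1)
train = [record for record in records if record[0] < cutoff]
test = [record for record in records if record[0] >= cutoff]

print(len(train), len(test))
```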
Incomplete reports of results: An accurate model will correctly discriminate examples of one output class from examples of another output class. Discrimination performance is best reported with an ROC curve, a lift curve, or a precision-recall curve. Any report of accuracy using only a single number is suspect.
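As a sketch of why a curve says more than a single number, the snippet below computes the points of an ROC curve by sweeping the decision threshold over made-up scores, and compares them with the accuracy at one fixed threshold. The labels, the scores and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

# Made-up true classes and model scores for six test examples
labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])

n_positive = np.sum(labels == 1)
n_negative = np.sum(labels == 0)

# Sweep the threshold from the highest score to the lowest
fpr, tpr = [0.0], [0.0]
for threshold in np.sort(np.unique(scores))[::-1]:
    predicted_positive = scores >= threshold
    true_positives = np.sum(predicted_positive & (labels == 1))
    false_positives = np.sum(predicted_positive & (labels == 0))
    tpr.append(true_positives / n_positive)
    fpr.append(false_positives / n_negative)

fpr, tpr = np.array(fpr), np.array(tpr)

# Area under the ROC curve using the trapezoidal rule
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# A single-number summary at one arbitrary threshold, for comparison
accuracy_at_half = float(np.mean((scores >= 0.5) == (labels == 1)))

print(list(zip(fpr, tpr)), auc, accuracy_at_half)
```

The accuracy describes one operating point only, while the curve shows how the true positive rate trades off against the false positive rate at every threshold.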
Filtering data to bias results: In a large data set, one segment of the population may be easier to predict than another. If a model is trained and tested just on this segment of the population, it will be more accurate than a model that must handle the entire population. Selective filtering can turn a hard problem into an easier problem.
Selective sampling of test cases: A fair evaluation of a model will utilize a test set that is drawn from the same population that the model will eventually encounter in actual usage.
Failing to assess statistical reliability: When comparing the accuracy of two models, it is not sufficient to report that one model performed better than the other, because the difference might not be statistically reliable. "Statistical reliability" means, among other things, that if the comparison were repeated using a different sample of the population, the same result would be achieved.
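One way to probe statistical reliability is a bootstrap: resample the test set many times and look at how much the accuracy difference varies. This is a sketch of one possible approach, not the only valid test; the per-example results are randomly generated and the 95% interval is an assumed convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 200

# Made-up per-example correctness of two models on the same test set
correct_a = rng.random(n_test) < 0.85
correct_b = rng.random(n_test) < 0.80

observed_difference = correct_a.mean() - correct_b.mean()

# Repeat the comparison on resampled test sets (drawn with replacement)
n_resamples = 2000
differences = np.empty(n_resamples)
for i in range(n_resamples):
    sample = rng.integers(0, n_test, size=n_test)
    differences[i] = correct_a[sample].mean() - correct_b[sample].mean()

# 95% percentile interval for the accuracy difference
lower, upper = np.percentile(differences, [2.5, 97.5])

# If the interval contains zero, a different sample could reverse the ranking
difference_is_reliable = not (lower <= 0.0 <= upper)
print(observed_difference, lower, upper, difference_is_reliable)
```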