#!/usr/bin/env python
# coding: utf-8

# # Detect and Delete outliers with Optimus

# An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.
#
# You have to be careful when studying outliers: how do you know whether an unusual value is the result of a data glitch or a real data point, and so perhaps not an outlier at all?

# In[1]:


get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# In[2]:


import sys
sys.path.append("..")


# In[3]:


from optimus import Optimus


# In[4]:


# Create optimus
op = Optimus()


# In[5]:


df = op.load.excel("data/titanic3.xls")


# In[6]:


df.table()


# From a quick inspection of the dataframe we can guess that the 1000 in the column `num` may be an outlier. You can perform a much more thorough search to see whether it actually is an outlier; if you need something like that, please check out [these articles and tutorials](http://www.datasciencecentral.com/profiles/blogs/11-articles-and-tutorials-about-outliers).

# With Optimus you can also perform several analyses to check whether a value is an outlier. First, let's run some visual analysis. Remember to check the [Main Example](https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb) for more.

# ## Outlier detection

# One of the most common ways of finding outliers in one-dimensional data is to mark as a potential outlier any point that is more than two standard deviations, say, from the mean (I am referring to sample means and standard deviations here and in what follows). But the presence of outliers is likely to have a strong effect on the mean and the standard deviation, making this technique unreliable.
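#
# The effect of an outlier on the mean and the standard deviation can be seen on a small made-up sample. This is a minimal sketch using only the Python standard library, not Optimus; the values and the names `clean`, `contaminated` and `z_scores` are illustrative assumptions.

```python
import statistics

# Hypothetical sample: four ordinary values, then the same values plus one extreme point.
clean = [10.0, 12.0, 11.0, 13.0]
contaminated = clean + [1000.0]


def z_scores(values):
    """Return (x - sample mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / std for x in values]


# The single extreme value drags the mean up and inflates the standard deviation.
print("mean  :", statistics.mean(clean), "->", statistics.mean(contaminated))
print("stdev :", statistics.stdev(clean), "->", statistics.stdev(contaminated))

# The median, by contrast, barely moves.
print("median:", statistics.median(clean), "->", statistics.median(contaminated))

# z-score of the extreme point, computed from the contaminated mean and stdev.
# With only five points it stays below 2, so a "two standard deviations" rule misses it.
extreme_z = z_scores(contaminated)[-1]
print("z-score of 1000.0:", extreme_z)
```

# In a sample this small the extreme point inflates the standard deviation so much that its own z-score cannot reach 2, however large the value is.
#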
# That's why Optimus implements the median absolute deviation from the median, commonly shortened to the median absolute deviation (MAD). It is the median of the set comprising the absolute values of the differences between the median and each data point. If you want more information on the subject, please read the amazing article by Leys et al. about detecting outliers [here](http://www.sciencedirect.com/science/article/pii/S0022103113000668).

# ### Zscore

# In[21]:


df.outliers.z_score("fare", threshold=2).select().table()


# In[22]:


df.outliers.z_score("fare", threshold=1).drop().table()


# ### Tukey

# In[16]:


df.outliers.tukey("fare").select().table()


# In[17]:


df.outliers.tukey("fare").drop().table()


# ### MAD

# In[27]:


df.outliers.mad("fare", threshold=2).select().table()


# In[28]:


df.outliers.mad("fare", threshold=1).drop().table()


# ### Modified Zscore

# In[34]:


df.outliers.modified_z_score("fare", threshold=1).select().table()


# In[35]:


df.outliers.modified_z_score("fare", threshold=1).drop().table()


# In[ ]:
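

# For reference, the rules used above can be sketched in plain Python. This is a minimal sketch of the textbook definitions, not the Optimus implementation: the `fares` values are made up, the function names are hypothetical, and the constants and default thresholds are the conventional ones, which Optimus may or may not use.

```python
import statistics

# Hypothetical fares, not taken from the Titanic data set.
fares = [7.25, 8.05, 13.0, 26.0, 26.55, 30.0, 512.33]


def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])


def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k * IQR, Q3 + k * IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def mad_outliers(values, threshold=3.0):
    """Values whose distance from the median exceeds threshold * scaled MAD."""
    med = statistics.median(values)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    scale = 1.4826 * mad(values)  # assumes MAD > 0
    return [x for x in values if abs(x - med) / scale > threshold]


def modified_z_outliers(values, threshold=3.5):
    """Values whose modified z-score 0.6745 * (x - median) / MAD exceeds threshold."""
    med = statistics.median(values)
    m = mad(values)  # assumes MAD > 0
    return [x for x in values if abs(0.6745 * (x - med) / m) > threshold]


print("MAD:", mad(fares))
print("Tukey:", tukey_outliers(fares))
print("MAD rule:", mad_outliers(fares))
print("Modified z-score:", modified_z_outliers(fares))
```

# Each function returns the flagged values, which corresponds to `.select()` in the cells above; filtering them out of the list would correspond to `.drop()`.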