#!/usr/bin/env python
# coding: utf-8

# # In Depth: Naive Bayes Classification

# *This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
#
# *The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# < [Feature Engineering](05.04-Feature-Engineering.ipynb) | [Contents](Index.ipynb) | [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) >

# ``Naive Bayes models`` are a group of **extremely fast and simple** classification algorithms that are often suitable for **very high-dimensional** datasets.
#
# They make a good quick-and-dirty baseline for a classification problem:
# - very fast to train and to predict
# - very few tunable parameters
# - very useful as a first pass
#
# This section will focus on
# - an intuitive explanation of how naive Bayes classifiers work,
# - followed by a couple of examples of them in action on some datasets.

# ## Bayesian Classification
#
# Naive Bayes classifiers are built on Bayesian classification methods.
#
# Bayes's theorem is an equation that relates the conditional probabilities of statistical quantities.
#
# - We are interested in finding the probability of a label given some observed features,
# - which we can write as $P(L~|~{\rm features})$.
#
# Bayes's theorem:
#
# $$
# P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}
# $$
#
# $$
# \mbox{posterior} = \frac{\mbox{likelihood}\times \mbox{prior}}{\mbox{evidence}}
# $$
#
# See also: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
#
# In practice, only the numerator of that fraction is of interest,
# - because the denominator does not depend on $L$ and the values of the features $x_i$ are given,
# - so the denominator is effectively constant.
#
# The numerator is equivalent to the ``joint probability`` model
#
# $$p(x_1, \dots, x_n \mid L_k) \, p(L_k) = p(L_k, x_1, \dots, x_n)$$
#
# $$
# p(L_k \mid x_1, \dots, x_n) \varpropto p(L_k, x_1, \dots, x_n)
# $$
#
# Using the chain rule of probability, the joint distribution can be factored as
#
# \begin{align}
# p(L_k, x_1, \dots, x_n) & = p(x_1, \dots, x_n, L_k) \\
# & = p(x_1 \mid x_2, \dots, x_n, L_k) \, p(x_2, \dots, x_n, L_k) \\
# & = p(x_1 \mid x_2, \dots, x_n, L_k) \, p(x_2 \mid x_3, \dots, x_n, L_k) \, p(x_3, \dots, x_n, L_k) \\
# & = \dots \\
# & = p(x_1 \mid x_2, \dots, x_n, L_k) \, p(x_2 \mid x_3, \dots, x_n, L_k) \cdots p(x_{n-1} \mid x_n, L_k) \, p(x_n \mid L_k) \, p(L_k)
# \end{align}
#
# Now the "naive" conditional independence assumption comes into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ (for $j \neq i$), given the category $L_k$. This means that
#
# $$p(x_i \mid x_{i+1}, \dots, x_n, L_k) = p(x_i \mid L_k)$$
#
# Thus, the joint model can be expressed as
#
# \begin{align}
# p(L_k \mid x_1, \dots, x_n) & \varpropto p(L_k, x_1, \dots, x_n) \\
# & = p(L_k) \ p(x_1 \mid L_k) \ p(x_2 \mid L_k) \ p(x_3 \mid L_k) \cdots \\
# & = p(L_k) \prod_{i=1}^n p(x_i \mid L_k)\,.
# \end{align}
#
# The maximum a posteriori (MAP) decision rule then picks the most probable label:
#
# $$\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \ p(L_k) \displaystyle\prod_{i=1}^n p(x_i \mid L_k)$$
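# To make the factored form and the MAP rule concrete, here is a minimal numerical sketch
# (added for illustration; it is not part of the original notebook, and the priors and
# per-class likelihood tables are made-up numbers) for a toy problem with two labels and
# two binary features:

# In[ ]:


import numpy as np

# Toy problem (made-up numbers): two labels L_0, L_1 and two binary features x_1, x_2.
# priors[k] = p(L_k); likelihood[k, i] = p(x_i = 1 | L_k)
priors = np.array([0.6, 0.4])
likelihood = np.array([[0.8, 0.1],   # p(x_i = 1 | L_0)
                       [0.3, 0.7]])  # p(x_i = 1 | L_1)

x = np.array([1, 0])  # observed feature values

# p(L_k) * prod_i p(x_i | L_k), using p(x_i = 0 | L_k) = 1 - p(x_i = 1 | L_k)
per_feature = np.where(x == 1, likelihood, 1 - likelihood)
unnormalized = priors * per_feature.prod(axis=1)
posterior = unnormalized / unnormalized.sum()

print("posterior:", posterior.round(3))
print("MAP label:", unnormalized.argmax())


# Normalizing by the sum of the unnormalized scores recovers the posterior probabilities;
# the MAP label is simply the one with the largest score, so the normalization is not
# needed for classification itself.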
# ## Bayesian Classification
#
# If we are trying to decide between two labels $L_1$ and $L_2$,
# - one way to decide is to compute the ratio of the posterior probabilities for each label:
#
# $$
# \frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)}
# $$
#
# All we need is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
# - Such a model is called a *generative model*,
# - because it specifies the hypothetical random process that generates the data.

# ## Bayesian Classification
#
# ``Specifying this generative model`` for each label is the main piece of the training of such a Bayesian classifier.
# - The general version of such a training step is a very difficult task,
# - but we can make it simpler through the use of some simplifying assumptions about the form of this model.
#
# This is where the "naive" in "naive Bayes" comes in:
# - if we make very naive assumptions about the generative model for each label,
# - we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
#
# Different types of naive Bayes classifiers rest on different naive assumptions about the data, and we will examine a few of these in the following sections.

# ## Gaussian Naive Bayes
#
# Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes.
# - In this classifier, the assumption is that *data from each label is drawn from a simple Gaussian distribution*.
#
# When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a normal (Gaussian) distribution.
#
# The probability density of the normal distribution is
#
# $$
# f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{ -\frac{(x-\mu)^2}{2\sigma^2} }
# $$
#
# For example, suppose the training data contains a continuous attribute, $x$.
#
# We first segment the data by class, and then compute the mean and variance of $x$ in each class.
# - Let $\mu_k$ be the mean of the values of $x$ associated with class $L_k$.
# - Let $\sigma^2_k$ be the variance of the values of $x$ associated with class $L_k$.
#
# Suppose we have collected some observation value $v$.
# Then the probability density of $v$ given a class $L_k$,
# $p(x=v \mid L_k)$, can be computed by plugging $v$ into the equation for a normal distribution parameterized by $\mu_k$ and $\sigma^2_k$. That is,
#
# \begin{align}
# p(x=v \mid L_k)=\frac{1}{\sqrt{2\pi\sigma^2_k}}\,e^{ -\frac{(v-\mu_k)^2}{2\sigma^2_k} }
# \end{align}
#
# Imagine that you have the following data:

# In[14]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()


# In[15]:


from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');


# One extremely fast way to create a simple model is to assume that
# - the data is described by a Gaussian distribution with no covariance between dimensions.
#
# This model can be fit by
# - simply finding the mean and standard deviation of the points within each label,
# - which is all you need to define such a distribution (a minimal sketch follows below).
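# As a sketch of what this fit amounts to (this cell is an added illustration, not part of
# the original notebook), we can compute the per-label means and standard deviations of
# ``X`` directly and evaluate the resulting Gaussian posteriors by hand. The result should
# closely match what Scikit-Learn's ``GaussianNB`` estimator, used below, computes, up to
# small numerical details:

# In[ ]:


# Per-label mean and standard deviation in each dimension (no covariance)
means = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
stds = np.array([X[y == k].std(axis=0) for k in (0, 1)])
priors = np.array([np.mean(y == k) for k in (0, 1)])

def gaussian_nb_posterior(point):
    # p(features | L_k) under the axis-aligned Gaussian assumption,
    # multiplied across dimensions and weighted by the class prior
    likelihoods = np.prod(np.exp(-0.5 * ((point - means) / stds) ** 2)
                          / (np.sqrt(2 * np.pi) * stds), axis=1)
    unnorm = priors * likelihoods
    return unnorm / unnorm.sum()

gaussian_nb_posterior(X[0]), y[0]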
# The result of this naive Gaussian assumption is shown in the following figure:

# ![(run code in Appendix to generate image)](img/figures/05.05-gaussian-NB.png)
# [figure source in Appendix](06.00-Figure-Code.ipynb#Gaussian-Naive-Bayes)

# The ellipses here represent the Gaussian generative model for each label,
# - with larger probability toward the center of the ellipses.
#
# With this generative model in place for each class, we have a simple recipe to compute the likelihood $P({\rm features}~|~L_i)$ for any data point,
# - and thus we can quickly compute the posterior ratio and determine which label is the most probable for a given point.
#
# This procedure is implemented in Scikit-Learn's ``sklearn.naive_bayes.GaussianNB`` estimator:

# In[16]:


from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);


# Now let's generate some new data and predict the label:

# In[17]:


rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)


# Now we can plot this new data to get an idea of where the decision boundary is:

# In[18]:


plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);


# We see a slightly curved boundary in the classifications; in general, the boundary in Gaussian naive Bayes is quadratic.
#
# A nice piece of this Bayesian formalism is that it naturally allows for probabilistic classification, which we can compute using the ``predict_proba`` method:

# In[6]:


yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)


# The columns give the posterior probabilities of the first and second label, respectively.
# If you are looking for estimates of uncertainty in your classification, Bayesian approaches like this can be useful.
#
# The final classification will only be as good as the model assumptions that lead to it:
# - this is why Gaussian naive Bayes often does not produce very good results.

# ## Multinomial Naive Bayes
#
# - The Gaussian assumption is only one simple assumption that could be used to specify the generative distribution for each label.
# - Multinomial naive Bayes assumes that the features are generated from a simple multinomial distribution:
#
# $$
# p(\mathbf{x} \mid L_k) = \frac{(\sum_i x_i)!}{\prod_i x_i !} \prod_i {p_{ki}}^{x_i}
# $$
#
# The multinomial distribution describes the probability of observing counts among a number of categories,
# - so multinomial naive Bayes is most appropriate for features that represent counts or count rates.
# - The idea is precisely the same as before, but we model the data distribution with a best-fit multinomial distribution (a sketch of this fit follows below).
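# To see what "fitting a multinomial" amounts to, here is a minimal sketch (added for
# illustration; the word counts are made up) of estimating the per-class category
# probabilities $p_{ki}$ from count features, with Laplace (add-one) smoothing similar
# in spirit to the default behavior of Scikit-Learn's ``MultinomialNB``:

# In[ ]:


import numpy as np

# Toy document-term counts (made-up numbers): rows are documents, columns are words
counts = np.array([[3, 0, 1],   # class 0
                   [2, 1, 0],   # class 0
                   [0, 4, 2]])  # class 1
labels = np.array([0, 0, 1])

alpha = 1.0  # Laplace smoothing: avoids zero probabilities for unseen words
p_ki = np.array([(counts[labels == k].sum(axis=0) + alpha) /
                 (counts[labels == k].sum() + alpha * counts.shape[1])
                 for k in (0, 1)])

print(p_ki.round(3))     # each row is the multinomial parameter vector for one class
print(p_ki.sum(axis=1))  # each row sums to 1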
# ### Example: Classifying Text
#
# One place where multinomial naive Bayes is often used is in ``text classification``,
# - where the features are related to word counts or frequencies within the documents to be classified.
# - We discussed the extraction of such features from text in [Feature Engineering](05.04-Feature-Engineering.ipynb).
#
# Here we will use the sparse word count features from the 20 Newsgroups corpus to classify these documents.
# - Let's download the data and take a look at the target names:

# In[2]:


from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names


# For simplicity here, we will select just a few of these categories, and download the training and testing set:

# In[3]:


categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)


# Here is a representative entry from the data:

# In[4]:


print(train.data[5])


# In order to use this data for machine learning, we need to convert the content of each string into a vector of numbers.
# - Use the TF-IDF vectorizer (discussed in [Feature Engineering](05.04-Feature-Engineering.ipynb)),
# - and create a pipeline that attaches it to a multinomial naive Bayes classifier:

# In[5]:


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())


# With this pipeline, we can apply the model to the training data, and predict labels for the test data:

# In[7]:


model.fit(train.data, train.target)
labels = model.predict(test.data)


# Now we can evaluate the performance of the estimator:
# - here is the confusion matrix between the true and predicted labels for the test data.

# In[9]:


from sklearn.metrics import confusion_matrix
sns.set_context("notebook", font_scale=1.7)
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');


# Evidently, even this very simple classifier can successfully separate space talk from computer talk,
# - but it gets confused between talk about religion and talk about Christianity.
# - This is perhaps an expected area of confusion!

# ### The very cool thing here
#
# We now have the tools to determine the category for *any* string,
# - using the ``predict()`` method of this pipeline.
#
# Here's a quick utility function that will return the prediction for a single string:

# In[10]:


def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]


# Let's try it out:

# In[11]:


predict_category('sending a payload to the ISS')


# In[12]:


predict_category('discussing islam vs atheism')


# In[13]:


predict_category('determining the screen resolution')
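# As with the Gaussian example, the pipeline also exposes posterior probabilities.
# The following cell is an added illustration (not part of the original notebook):
# it reports the overall test accuracy and shows, for one of the example strings,
# how confident the model is in each category via ``predict_proba``:

# In[ ]:


from sklearn.metrics import accuracy_score

print("test accuracy:", accuracy_score(test.target, labels).round(3))

probs = model.predict_proba(['sending a payload to the ISS'])[0]
for name, p in zip(train.target_names, probs):
    print(f"{name:25s} {p:.3f}")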
# Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string;
# - nevertheless, the result is striking.
#
# > ## Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.

# ## When to Use Naive Bayes
#
# Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model.
#
# Advantages:
#
# - They are extremely fast for both training and prediction
# - They provide straightforward probabilistic prediction
# - They are often very easily interpretable
# - They have very few (if any) tunable parameters

# ## When to Use Naive Bayes
#
# A naive Bayes classifier is a good choice as an initial baseline classifier.
# - If it performs suitably, then congratulations: you have a very fast, very interpretable classifier for your problem.
# - If it does not perform well, then you can begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.
#
# Naive Bayes classifiers tend to perform especially well in one of the following situations:
#
# - When the naive assumptions actually match the data (very rare in practice)
# - For very well-separated categories, when model complexity is less important
# - For very high-dimensional data, when model complexity is less important

# ## When to Use Naive Bayes
#
# The last two points seem distinct, but they actually are related:
# - as the dimension of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in *every single dimension* to be close overall).
# - clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information.
#
# > ## Simplistic classifiers like naive Bayes tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.

# < [Feature Engineering](05.04-Feature-Engineering.ipynb) | [Contents](Index.ipynb) | [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) >