Created using: PyCaret 2.0
Date Updated: August 24, 2020
Welcome to Natural Language Processing Tutorial (NLP102). This tutorial assumes that you have completed Natural Language Processing Tutorial (NLP101) - Level Beginner. If you haven't, we strongly recommend going back and working through the beginner's tutorial, as several key concepts covered here build on it.
Building on the previous tutorial, we will learn the following in this tutorial:
Read Time : Approx. 30 Minutes
If you haven't installed PyCaret yet, please follow the link to the Beginner's Tutorial for installation instructions.
If you are running this notebook on Google Colab, the cell below must be run at the top of the notebook to display interactive visuals.
from pycaret.utils import enable_colab
enable_colab()
Building on Natural Language Processing Tutorial (NLP101) - Level Beginner, we will create a topic model in this tutorial after passing custom_stopwords. We will then compare the results with the topic model we created in the last tutorial. Next, we will see how to evaluate topic models using the coherence value, and how to use PyCaret's implementation of the tune_model() function, which lets you optimize the number of topics against a supervised learning target (the status column in this example). Finally, we will finish the tutorial by using the output of the topic model in the pycaret.classification module to find the best classifier that can predict loan default using the extracted topic information.
Let's get started!
For this tutorial we will be using the same dataset that was used in Natural Language Processing Tutorial (NLP101) - Level Beginner. You can download the dataset from our GitHub repository (Click here to Download) or use PyCaret's get_data() function to import it (this requires an internet connection).
Kiva Microfunds https://www.kiva.org/
from pycaret.datasets import get_data
data = get_data('kiva')
| | country | en | gender | loan_amount | nonpayment | sector | status |
|---|---|---|---|---|---|---|---|
| 0 | Dominican Republic | "Banco Esperanza" is a group of 10 women looki... | F | 1225 | partner | Retail | 0 |
| 1 | Dominican Republic | "Caminemos Hacia Adelante" or "Walking Forward... | F | 1975 | lender | Clothing | 0 |
| 2 | Dominican Republic | "Creciendo Por La Union" is a group of 10 peop... | F | 2175 | partner | Clothing | 0 |
| 3 | Dominican Republic | "Cristo Vive" ("Christ lives" is a group of 10... | F | 1425 | partner | Clothing | 0 |
| 4 | Dominican Republic | "Cristo Vive" is a large group of 35 people, 2... | F | 4025 | partner | Food | 0 |
# check the shape of data
data.shape
# sample the data to select only 1000 documents
data = data.sample(1000, random_state=786).reset_index(drop=True)
data.shape
The setup() function initializes the environment in pycaret and performs several text pre-processing steps that are imperative when working with NLP problems. In the last tutorial we did not pass any custom stopwords; in this tutorial we will, using the custom_stopwords parameter. All the custom stopwords passed below were obtained through the analysis we performed in Natural Language Processing Tutorial (NLP101) - Level Beginner (refer to section 9.1). These are words with very high frequency in the documents, so they add more noise than information. Deciding on a list of custom stopwords is subjective and mostly stems from your understanding of the dataset. For example, in this dataset words like 'loan', 'income', 'business', 'usd' etc. are obvious candidates since we are working with customer loans. The rest of the parameters passed in setup() below are the same as in the last tutorial.
from pycaret.nlp import *
exp_nlp102 = setup(data = data, target = 'en', session_id = 123, custom_stopwords = ['loan', 'income', 'usd', 'many', 'also', 'make', 'business', 'buy', 'sell', 'purchase','year', 'people', 'able', 'enable', 'old', 'woman', 'child', 'school'], log_experiment = True, experiment_name = 'kiva1')
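To make the idea of frequency-based stopword candidates concrete, here is a minimal standard-library sketch. It is not how PyCaret processes text internally; the helper names `top_terms` and `remove_stopwords` and the three-sentence corpus are made up purely for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, n=5):
    """Return the n most frequent tokens across all documents (hypothetical helper)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(n)


def remove_stopwords(documents, stopwords):
    """Tokenize each document and drop every token found in the stopword list."""
    stop = set(stopwords)
    return [[tok for tok in tokenize(doc) if tok not in stop] for doc in documents]


# Made-up mini corpus for illustration only (not taken from the kiva dataset).
docs = [
    "She requests a loan to buy seeds for her farm business",
    "The loan will help his retail business grow",
    "This loan will purchase fabric for her clothing business",
]

frequent = top_terms(docs, n=2)                         # candidates for custom stopwords
cleaned = remove_stopwords(docs, ["loan", "business"])  # corpus without those words
print(frequent)
print(cleaned[0])
```

Words that dominate the counts but appear in nearly every document, like 'loan' and 'business' here, carry little information for separating topics, which is the reasoning behind the custom_stopwords list above.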
lda = create_model('lda')
plot_model(lda, plot = 'topic_distribution')
If you compare the output above with the one in section 9.4 of the last tutorial, you will notice that the distribution of topics has changed. More importantly, when you hover over the bars, the keywords give a better idea of each topic's theme than in the last experiment, because we have removed some noise by removing custom stopwords. For example:
Topic 3 seems to be about customers seeking trade loans, as it includes keywords like 'hair', 'salon', 'wood', 'machine'.
Topic 0 is about farming/agricultural loans.
Topic 1 is mostly about retail loans.
Topic 2 seems to be about loans for domestic reasons.
Topic modeling is a very iterative machine learning task; finding the right list of custom stopwords is only possible after several iterations. We encourage you to repeat the experiments to gain actionable insights that ultimately lead you to the best working and implementable model. So far we have learned how to create and analyze a topic model using the pycaret.nlp module. In the next section we will go a few steps deeper to understand how to evaluate a topic model.
Many topic models, including Latent Dirichlet Allocation, are probabilistic models that provide both a predictive and a latent topic representation. It is generally assumed that the results generated by these models are meaningful and useful, but because training is unsupervised, those assumptions are hard to evaluate. Nevertheless, it is important to identify whether a trained model is objectively good or bad, as well as to be able to compare different models/methods. Doing so requires an objective measure of quality. Traditionally, and still for many practical applications, implicit knowledge and "eyeballing" approaches are used to judge whether "the correct thing" has been learned about the corpus. Ideally, we'd like to capture this information in a single metric that can be maximized and compared. The approaches that are commonly used today:
In this section we will learn how to evaluate the coherence value of a topic model using the tune_model() function. We will follow this with an extrinsic evaluation of the number of topics, optimizing a classifier that predicts default using the status column in the dataset.
What is Topic Coherence? Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.
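As a rough illustration of what a coherence measure computes, the sketch below implements the UMass variant, which scores a topic by how often its top words co-occur in the same documents. UMass is chosen here only because it is simple to write down; it is an assumption for illustration and not necessarily the measure reported by tune_model(). The function name `umass_coherence` and the tiny tokenized corpus are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: topic keywords ordered from most to least probable.
    documents: list of token lists.
    Sums log((D(w_i, w_j) + 1) / D(w_i)) over all pairs where w_i is ranked
    above w_j; D counts the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]  # i < j, so `higher` ranks first
        df_higher = doc_freq(higher)
        if df_higher == 0:
            continue  # word never occurs in the corpus; skip to avoid dividing by zero
        score += math.log((doc_freq(higher, lower) + 1) / df_higher)
    return score


# Made-up tokenized corpus for illustration only.
docs = [
    ["farm", "seed", "crop"],
    ["farm", "crop", "rice"],
    ["salon", "hair", "machine"],
    ["farm", "seed", "rice"],
]

coherent = umass_coherence(["farm", "crop", "seed"], docs)   # words that co-occur
mixed = umass_coherence(["farm", "salon", "hair"], docs)     # words from unrelated themes
print(coherent, mixed)
```

A topic whose top words appear together in the same documents scores higher (closer to zero) than one that mixes unrelated words, which is the intuition behind "semantically interpretable" topics.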
The tune_model() function iterates over a pre-defined grid with different numbers of topics and creates a model for each value. Topic coherence is then evaluated for each model and presented visually in a graph that has Coherence Score on the y-axis as a function of # Topics on the x-axis. See example below:
tuned_unsupervised = tune_model(model = 'lda', multi_core = True)
Best Model: Latent Dirichlet Allocation | # Topics: 16 | Coherence: 0.4741
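Conceptually, the search described above amounts to scoring one model per candidate number of topics and keeping the best one. The sketch below shows only that selection logic, not PyCaret's internals: `pick_best_num_topics`, the candidate grid, and `toy_score` (a stand-in for "train a model and compute its coherence") are all hypothetical.

```python
def pick_best_num_topics(grid, score_fn):
    """Score every candidate number of topics and return the best one plus all scores."""
    scores = {num_topics: score_fn(num_topics) for num_topics in grid}
    best = max(scores, key=scores.get)
    return best, scores


def toy_score(num_topics):
    # Stand-in for fitting a topic model and measuring coherence.
    # By construction this made-up function peaks at 8 topics.
    return 1.0 / (1.0 + abs(num_topics - 8))


candidate_grid = [2, 4, 8, 16, 32]  # hypothetical grid, not PyCaret's actual one
best_k, all_scores = pick_best_num_topics(candidate_grid, toy_score)
print(best_k, all_scores)
```

Swapping the scoring function for a supervised metric such as accuracy gives the extrinsic variant of the same search.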
The model with the highest coherence score is the best model based on intrinsic evaluation criteria. As appealing as it may sound that the performance of a topic model can be captured in one number, i.e. the Coherence Score, it does not come without downsides. We encourage you to read more about the Coherence Score to understand it better. [Read More]
We have only covered Coherence in this tutorial. The other popular measure is perplexity. It captures how surprised a model is by new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set. Focusing on the log-likelihood part, you can think of the perplexity metric as measuring how probable some new, unseen data is given the model that was learned earlier; that is, how well the model represents or reproduces the statistics of the held-out data. However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and sometimes even slightly anti-correlated. As such, optimizing for perplexity may not yield human-interpretable topics. [Reference]
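For readers who want to see the arithmetic, the sketch below uses one common definition of perplexity: the exponential of the negative average per-token log-likelihood on held-out data. The function name and the input probabilities are made up for illustration, and individual libraries may normalize differently or use a different log base.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities a model assigns to held-out tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_likelihood = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_likelihood)


# Made-up example: a model that gives every held-out token probability 0.25
# is as "surprised" as if it were choosing uniformly among 4 options.
uniform_model = [math.log(0.25)] * 8
sharper_model = [math.log(0.5)] * 8

print(perplexity(uniform_model))
print(perplexity(sharper_model))
```

Lower perplexity means the held-out data is more probable under the model, which, as noted above, does not guarantee topics that humans find interpretable.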
The dataset we are using is labelled using the status column (1 means loan default, 0 means no default). We will now use the tune_model() function to determine the best number of topics. Best in this case is defined by the measure of interest in supervised machine learning, which here is Accuracy since this is a classification problem. See example below:
tuned_classification = tune_model(model = 'lda', multi_core = True, supervised_target = 'status')
Best Model: Latent Dirichlet Allocation | # Topics: 2 | Accuracy : 0.867
In this example, Accuracy is optimized when num_topics is set to 2. It is very likely that the number of topics that optimizes a supervised metric, such as accuracy in this case, will differ from the number of topics in the model with the best coherence value. At the end of the day, which one to use depends entirely on the use case of the topic model. Evaluating topic models is a complex subject, and it is unlikely that you will understand it fully if this is your first time doing topic modeling. We recommend watching this YouTube Video if you are interested in learning more.
Notice that when you used load_experiment(), it loaded the entire experiment and all the intermediate outputs into the variable saved_experiment. You can access specific items in the same way you would access list elements in Python. See the example below, in which we access our final stacking ensembler and store it in
PyCaret 2.0 embeds the MLflow Tracking component as a backend API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for visualizing the results later. To log your experiments in pycaret, simply use the log_experiment and experiment_name parameters in the setup function, as we did in this example.
You can start the UI on localhost:5000. Simply start the MLflow server from the command line or from the notebook. See example below:
# to start the MLflow server from notebook:
!mlflow ui
We have covered several key concepts in this tutorial, such as model evaluation using intrinsic and extrinsic techniques. We performed several text pre-processing steps, including removal of custom stopwords using setup(), then created a topic model and compared the results with the one we created in the last tutorial. We also talked about different ways to evaluate a topic model and used tune_model() to evaluate the coherence value of an LDA model. We also used tune_model() to evaluate the number of topics in a supervised setting (in this case, to build a classifier that predicts loan status).
In the next tutorial we will use pycaret.nlp together with pycaret.classification and focus on using the supervised and unsupervised modules of pycaret together.
See you at the next tutorial. Follow the link to Natural Language Processing (NLP103) - Level Expert