Natural Language Processing Tutorial (NLP102) - Level Intermediate

Created using: PyCaret 2.0
Date Updated: August 24, 2020

1.0 Objective of Tutorial

Welcome to the Natural Language Processing Tutorial (NLP102). This tutorial assumes that you have completed Natural Language Processing Tutorial (NLP101) - Level Beginner. If you haven't, we strongly recommend that you go back and work through the beginner's tutorial, as several key concepts covered in this tutorial build on it.

Building on the previous tutorial, we will learn the following in this tutorial:

  • Custom Stopwords: How to define custom stopwords?
  • Evaluate Topic Model: How to evaluate the performance of a topic model?
  • Hyperparameter Tuning: How to tune the hyperparameter (# of topics) of a topic model?
  • Experiment Logging: How to log experiments in PyCaret using the MLflow backend?

Read Time : Approx. 30 Minutes

1.1 Installing PyCaret

If you haven't installed PyCaret yet, please follow the link to the Beginner's Tutorial for instructions on how to install PyCaret.

1.2 Pre-Requisites

  • Python 3.6 or greater
  • PyCaret 2.0 or greater
  • Internet connection to load data from pycaret's repository
  • Completion of Natural Language Processing Tutorial (NLP101) - Level Beginner
  • Completion of Binary Classification Tutorial (CLF101) - Level Beginner

1.3 For Google Colab users:

If you are running this notebook on Google Colab, the following code cell must be run at the top of the notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

2.0 Brief Overview of Tutorial

Building on the previous tutorial, Natural Language Processing Tutorial (NLP101) - Level Beginner, we will create a topic model in this tutorial after passing custom_stopwords. We will then compare the results of this topic model with the one we created in the last tutorial. Next, we will see how to evaluate topic models using the coherence value, and how to use PyCaret's tune_model() function, which can also select the number of topics that optimizes a supervised learning target (the status column in this example). We will finish off the tutorial by using the output of the topic model in the pycaret.classification module to find the best classifier that can predict loan default using topic information extracted from the pycaret.nlp module.

Let's get started!

3.0 Dataset for the Tutorial

For this tutorial we will use the same dataset that was used in Natural Language Processing Tutorial (NLP101) - Level Beginner. You can download the dataset from our GitHub repository (Click here to Download) or use PyCaret's get_data() function to import it (this requires an internet connection).

Dataset Acknowledgement:

Kiva Microfunds

4.0 Getting the Data

In [1]:
from pycaret.datasets import get_data
data = get_data('kiva')
   country             en                                                 gender  loan_amount  nonpayment  sector    status
0  Dominican Republic  "Banco Esperanza" is a group of 10 women looki...  F       1225         partner     Retail    0
1  Dominican Republic  "Caminemos Hacia Adelante" or "Walking Forward...  F       1975         lender      Clothing  0
2  Dominican Republic  "Creciendo Por La Union" is a group of 10 peop...  F       2175         partner     Clothing  0
3  Dominican Republic  "Cristo Vive" ("Christ lives" is a group of 10...  F       1425         partner     Clothing  0
4  Dominican Republic  "Cristo Vive" is a large group of 35 people, 2...  F       4025         partner     Food      0
In [2]:
# check the shape of data
data.shape
(6818, 7)
In [3]:
# sampling the data to select only 1000 documents
data = data.sample(1000, random_state=786).reset_index(drop=True)
data.shape
(1000, 7)

5.0 Setting up Environment in PyCaret

The setup() function initializes the environment in PyCaret and performs several text pre-processing steps that are essential when working with NLP problems. In the last tutorial we did not pass any custom stopwords; in this tutorial we will do so using the custom_stopwords parameter. All the custom stopwords passed below were obtained through the analysis we performed in Natural Language Processing Tutorial (NLP101) - Level Beginner (refer to section 9.1). These are words with very high frequency in the documents and, as such, they add more noise than information. Deciding on a list of custom stopwords is subjective and mostly stems from your understanding of the dataset. For example, in this dataset words like 'loan', 'income', 'business', 'usd' etc. are obvious choices since we are working with a dataset of customer loans. The rest of the parameters passed to setup() below are the same as in the last tutorial.

In [4]:
from pycaret.nlp import *
In [5]:
exp_nlp102 = setup(data = data, target = 'en', session_id = 123,
                   custom_stopwords = ['loan', 'income', 'usd', 'many', 'also', 'make', 'business', 'buy', 
                                       'sell', 'purchase','year', 'people', 'able', 'enable', 'old', 'woman',
                                       'child', 'school'],
                   log_experiment = True, experiment_name = 'kiva1')
Description Value
session_id 123
Documents 1000
Vocab Size 4552
Custom Stopwords True
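
To make the idea of frequency-driven custom stopwords concrete, here is a minimal sketch in plain Python. It is not PyCaret's implementation: it simply counts how many documents each token appears in, flags tokens above a threshold as stopword candidates, and filters them out. The toy documents, the helper names candidate_stopwords and remove_stopwords, and the 0.6 threshold are all assumptions made for illustration.

```python
from collections import Counter

def candidate_stopwords(documents, min_doc_share=0.6):
    """Return tokens that occur in at least `min_doc_share` of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    cutoff = min_doc_share * len(documents)
    return sorted(tok for tok, n in doc_freq.items() if n >= cutoff)

def remove_stopwords(documents, stopwords):
    """Drop every token that appears in `stopwords`."""
    stop = set(stopwords)
    return [[tok for tok in tokens if tok not in stop] for tokens in documents]

# Toy, already-tokenized documents (hypothetical, not taken from the kiva data)
docs = [
    ["loan", "business", "farm", "seed"],
    ["loan", "business", "salon", "hair"],
    ["loan", "shop", "clothing", "business"],
    ["loan", "farm", "fertilizer"],
]

stopwords = candidate_stopwords(docs)
cleaned = remove_stopwords(docs, stopwords)
print(stopwords)
print(cleaned[0])
```

In practice a frequency threshold only produces candidates; as noted above, the final list remains a judgment call based on your understanding of the dataset.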

6.0 Create a Topic Model

In [6]:
lda = create_model('lda')
In [7]:
plot_model(lda, plot = 'topic_distribution')

If you compare the output above with the one in section 9.4 of the last tutorial, you will notice that the distribution of topics has changed. What is more important to observe is that, when you hover over the bars, the keywords give you a better idea of each topic's theme than in the last experiment, because we have removed some noise by adding custom stopwords. For example, Topic 3 seems to be about customers seeking trade loans, as it includes keywords like 'hair', 'salon', 'wood', 'machine'. Topic 0 is about farming/agricultural loans, Topic 1 is mostly about retail loans, and Topic 2 seems to be about loans for domestic reasons.
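
As a rough illustration of what a topic distribution chart summarises, the sketch below assigns each document to its highest-weighted topic and counts the documents per topic. It is plain Python with made-up topic weights; the helper name dominant_topic is hypothetical and this is not PyCaret's plotting code.

```python
from collections import Counter

def dominant_topic(weights):
    """Return the index of the topic with the largest weight."""
    return max(range(len(weights)), key=lambda k: weights[k])

# Hypothetical per-document topic weights (each row sums to 1)
doc_topic_weights = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.10, 0.20, 0.60],
    [0.55, 0.15, 0.15, 0.15],
]

# Number of documents whose dominant topic is k
counts = Counter(dominant_topic(w) for w in doc_topic_weights)
for topic in sorted(counts):
    print(f"Topic {topic}: {counts[topic]} document(s)")
```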

Topic modeling is a very iterative machine learning task; finding the right list of custom stopwords is only possible after several iterations. We encourage you to repeat the experiments to gain actionable insights that finally lead you to the best working and implementable model. So far we have learned how to create and analyze a topic model using the pycaret.nlp module. In the next section we will go a few steps deeper to understand how to evaluate a topic model.

7.0 Evaluating Topic Model

Many topic models, including Latent Dirichlet Allocation, are probabilistic models that provide both a predictive and a latent topic representation. It is generally assumed that the results generated by these models are meaningful and useful, but because training is unsupervised it is hard to verify those assumptions. Nevertheless, it is important to be able to tell whether a trained model is objectively good or bad, as well as to compare different models/methods. To do so, one requires an objective measure of quality. Traditionally, and still in many practical applications, implicit knowledge and "eyeballing" approaches are used to evaluate whether "the correct thing" has been learned about the corpus. Ideally, we'd like to capture this information in a single metric that can be maximized and compared. The approaches commonly used today are:

  • Eyeballing Models: Look at top N words, topics / documents etc.
  • Intrinsic Evaluation Metrics: Interpretability and semantics of the model
  • Extrinsic Evaluation Metrics: Is the model good at performing predefined tasks, such as classification? (Later in this tutorial we will use our topic model to build a classifier to predict loan default.)
  • Human Judgements: Does the topic model improve your understanding of the problem?

In this section we will learn how to evaluate the coherence value of a topic model using the tune_model() function. This is followed by an extrinsic evaluation, in which the number of topics is chosen to optimize a classifier that predicts default using the status column in the dataset.

Read More about Model Evaluation

7.1 Intrinsic Evaluation using Coherence Value

What is Topic Coherence? Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference. The tune_model() function iterates over a pre-defined grid with different numbers of topics and creates a model for each value. Topic coherence is then evaluated for each model and presented in a graph that has Coherence Score on the y-axis as a function of # Topics on the x-axis. See example below:

In [8]:
tuned_unsupervised = tune_model(model = 'lda', multi_core = True)
Best Model: Latent Dirichlet Allocation | # Topics: 16 | Coherence: 0.4741

The model with the highest coherence score is the best model based on intrinsic evaluation criteria. As appealing as it may sound that the performance of a topic model can be captured in a single number, i.e. the Coherence Score, it does not come without its downsides. We encourage you to read more about the Coherence Score to understand it better. [Read More]
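
To give a feel for how a coherence measure rewards words that co-occur, here is a minimal sketch of one well-known variant, UMass coherence, in plain Python. This is an illustration under stated assumptions: the coherence score that tune_model() reports may be defined differently, the toy documents are invented, and the function name umass_coherence is hypothetical. The sketch also assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs.

    `top_words` is ordered from most to least important; D(...) counts the
    documents that contain all the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

# Toy tokenized corpus (hypothetical)
docs = [
    ["farm", "seed", "fertilizer", "crop"],
    ["farm", "crop", "harvest", "seed"],
    ["salon", "hair", "dryer", "chair"],
    ["hair", "salon", "shampoo"],
    ["farm", "fertilizer", "crop"],
]

coherent_score = umass_coherence(["farm", "crop", "seed"], docs)
mixed_score = umass_coherence(["farm", "hair", "seed"], docs)
print(coherent_score, mixed_score)
```

The topic whose top words appear together in the same documents receives the higher score, while the topic mixing farming and salon words is penalised.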

We have only covered coherence in this tutorial. The other popular measure is perplexity. It captures how surprised a model is by new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set. Focusing on the log-likelihood part, you can think of the perplexity metric as measuring how probable some new unseen data is given the model that was learned earlier. That is to say, how well does the model represent or reproduce the statistics of the held-out data? However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and are sometimes even slightly anti-correlated; as such, optimizing for perplexity may not yield human-interpretable topics. [Reference]
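
The relationship between log-likelihood and perplexity can be sketched in a few lines of plain Python. This assumes one common definition, the exponential of the negative mean per-token log-likelihood using natural logarithms; libraries may use a different base or normalization. The function name and the toy probabilities are illustrative assumptions.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean per-token log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over 4 words
uniform = [math.log(0.25)] * 8
# A model that assigns higher probability to each held-out token
confident = [math.log(0.5)] * 8

print(perplexity(uniform))
print(perplexity(confident))
```

Lower perplexity means the held-out tokens were less surprising to the model. Under this definition, a model that is uniform over 4 words has a perplexity of about 4.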

7.2 Extrinsic Evaluation using Classifier

The dataset we are using is labelled by the status column (1 means loan default, 0 means no default). We will now use the tune_model() function to determine the best number of topics. Best in this case is defined by the measure of interest in supervised machine learning, which here is Accuracy since this is a classification problem. See example below:

In [9]:
tuned_classification = tune_model(model = 'lda', multi_core = True, supervised_target = 'status')
Best Model: Latent Dirichlet Allocation | # Topics: 2 | Accuracy : 0.867

In this example Accuracy is optimized when num_topics is set to 2, as reported in the output above. It is very likely that the number of topics that optimizes a supervised metric such as accuracy will differ from that of the model with the best coherence value. At the end of the day, which one to use depends entirely on the use case of the topic model. Evaluating topic models is a complex subject, and it is unlikely that you will understand it fully if this is your first time doing topic modeling. We recommend that you watch this YouTube Video if you are interested in learning more.
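
The selection logic behind both evaluations can be sketched as a simple grid search: score every candidate number of topics and keep the best one. The sketch below is plain Python and only an assumption about the general pattern, not PyCaret's code. The grid, the helper pick_best, and the two scoring functions are synthetic stand-ins for "train a model with k topics and measure it"; they are shaped to peak at 16 and 2 only to mirror the outputs shown in this tutorial.

```python
def pick_best(candidates, score_fn):
    """Score every candidate and return (best_candidate, best_score)."""
    scored = [(score_fn(k), k) for k in candidates]
    best_score, best_k = max(scored)
    return best_k, best_score

# Hypothetical grid of topic counts
num_topics_grid = [2, 4, 8, 16, 32]

# Synthetic scoring functions (placeholders, not real measurements)
def toy_coherence(k):
    return 1.0 / (1 + abs(k - 16))

def toy_accuracy(k):
    return 1.0 / (1 + abs(k - 2))

best_intrinsic = pick_best(num_topics_grid, toy_coherence)
best_extrinsic = pick_best(num_topics_grid, toy_accuracy)
print("Best by coherence:", best_intrinsic[0])
print("Best by accuracy:", best_extrinsic[0])
```

Because the two criteria measure different things, the same grid can yield different winners, which is why the choice of metric should follow the use case.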

8.0 Experiment Logging

PyCaret 2.0 embeds the MLflow Tracking component as a backend API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results. To log your experiments in PyCaret, simply use the log_experiment and experiment_name parameters in the setup function, as we did in this example.

You can start the UI on localhost:5000. Simply start the MLflow server from the command line or from the notebook. See example below:

In [ ]:
# to start the MLflow server from notebook:
!mlflow ui 

Open localhost:5000 in your browser to view the UI.


9.0 Wrap-up / Next Steps?

We have covered several key concepts in this tutorial, such as model evaluation using intrinsic and extrinsic techniques. We performed several text pre-processing steps, including removal of custom_stopwords using setup(); then we created a topic model and compared the results with the one we created in the last tutorial. We also talked about different ways to evaluate a topic model and used tune_model() to evaluate the coherence value of an LDA model. Finally, we used tune_model() to choose the number of topics in a supervised setting (in this case, to build a classifier that predicts loan status).

In the next tutorial we will use pycaret.nlp together with pycaret.classification and focus on using the supervised and unsupervised modules of PyCaret together.

See you at the next tutorial. Follow the link to Natural Language Processing (NLP103) - Level Expert