Created using: PyCaret 2.0
Date Updated: August 24, 2020
Welcome to the Natural Language Processing Tutorial (NLP101). This tutorial assumes that you are new to PyCaret and looking to get started with Natural Language Processing using the pycaret.nlp module.
In this tutorial we will learn:

- Getting Data: how to import data from PyCaret's repository
- Setting up the Environment: how to set up an NLP experiment in PyCaret using setup()
- Create Model: how to create a topic model
- Assign Model: how to assign documents to topics using a trained model
- Plot Model: how to analyze the corpus and the model using various plots
- Evaluate Model: how to analyze the model through an interactive user interface
- Save / Load Model: how to save a model for later use and load it back
Read Time : Approx. 30 Minutes
The first step to get started with PyCaret is to install pycaret. Installation is easy and takes only a few minutes. Follow the instructions below:
# in a terminal / local environment
pip install pycaret

# in a notebook cell (e.g. Jupyter or Google Colab)
!pip install pycaret
If you are running this notebook on Google Colab, the code cell below must be run at the top of the notebook to display interactive visuals.
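In PyCaret 2.0 this is done with the enable_colab() helper from pycaret.utils:

# enable interactive plot rendering on Google Colab
from pycaret.utils import enable_colab
enable_colab()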
Natural Language Processing (NLP for short) is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally, so that humans can interface with computers in both written and spoken contexts using natural human language instead of computer languages. Common use cases of NLP in machine learning include topic modeling, sentiment analysis, document classification, and machine translation.
PyCaret's NLP module (pycaret.nlp) is an unsupervised machine learning module that can be used to analyze text data by creating topic models to find hidden semantic structure in documents. PyCaret's NLP module comes built-in with a wide range of text pre-processing techniques, which is the fundamental step in any NLP problem: it transforms raw text into a format that machine learning algorithms can learn from. As of the first release, PyCaret's NLP module only supports the English language and provides several popular implementations of topic models, from Latent Dirichlet Allocation to Non-Negative Matrix Factorization. It has over 5 ready-to-use algorithms and over 10 plots to analyze the text. PyCaret's NLP module also implements a unique tune_model() function that allows you to tune the hyperparameters of a topic model to optimize a supervised learning objective, such as AUC for classification or R2 for regression.
For this tutorial we will be using data from Kiva Microfunds (https://www.kiva.org/). Kiva Microfunds is a non-profit that allows individuals to lend money to low-income entrepreneurs and students around the world. Since starting in 2005, Kiva has crowd-funded millions of loans with a repayment rate of around 98%. At Kiva, each loan request includes traditional demographic information on the borrower, such as gender and location, as well as a personal story. In this tutorial we will use the text of the personal story to gain insight into the dataset and understand the hidden semantic structure in the text. The dataset contains 6,818 samples. A short description of the features is below:

- country: country of the borrower
- en: the borrower's personal story (text in English)
- gender: gender of the borrower (M = male, F = female)
- loan_amount: amount of the loan
- nonpayment: who bears the loss in case of default (lender or field partner)
- sector: sector of the borrower's business
- status: status of the loan (1 = defaulted, 0 = not defaulted)
In this tutorial we will only use the en column to create the topic model. In the next tutorial, Natural Language Processing (NLP102) - Level Intermediate, we will use the topic model to build a classifier that predicts the status of a loan, i.e. whether the applicant will default or not.
You can download the data from PyCaret's git repository (Click Here to Download) or you can load it using the get_data() function (this requires an internet connection).
from pycaret.datasets import get_data
data = get_data('kiva')
|   | country | en | gender | loan_amount | nonpayment | sector | status |
|---|---------|----|--------|-------------|------------|--------|--------|
| 0 | Dominican Republic | "Banco Esperanza" is a group of 10 women looki... | F | 1225 | partner | Retail | 0 |
| 1 | Dominican Republic | "Caminemos Hacia Adelante" or "Walking Forward... | F | 1975 | lender | Clothing | 0 |
| 2 | Dominican Republic | "Creciendo Por La Union" is a group of 10 peop... | F | 2175 | partner | Clothing | 0 |
| 3 | Dominican Republic | "Cristo Vive" ("Christ lives" is a group of 10... | F | 1425 | partner | Clothing | 0 |
| 4 | Dominican Republic | "Cristo Vive" is a large group of 35 people, 2... | F | 4025 | partner | Food | 0 |
# check the shape of the data
data.shape
(6818, 7)
# sampling the data to select only 1000 documents
data = data.sample(1000, random_state=786).reset_index(drop=True)
data.shape
(1000, 7)
The setup() function initializes the environment in pycaret and performs several text pre-processing steps that are imperative for working with NLP problems. setup() must be called before executing any other function in pycaret. It takes two parameters: a pandas dataframe and the name of the text column, passed as the target parameter. You can also pass a list containing text, in which case you don't need to pass the target parameter. When setup() is executed, the following pre-processing steps are applied automatically:

- Removing Numeric Characters
- Removing Special Characters
- Word Tokenization
- Stopword Removal
- Bigram Extraction
- Trigram Extraction
- Lemmatizing
- Custom Stopwords (if provided)
Note: Some functionalities in pycaret.nlp require English language models. These models are not downloaded automatically when you install pycaret. You will have to download them using a Python command line interface such as Anaconda Prompt. To download the models, type the following in your command line:
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
from pycaret.nlp import *
exp_nlp101 = setup(data = data, target = 'en', session_id = 123)
| Description | Value |
|-------------|-------|
| session_id | 123 |
| Documents | 1000 |
| Vocab Size | 4573 |
| Custom Stopwords | False |
Once setup is successfully executed, it prints an information grid with the following information:

- session_id: a pseudo-random number distributed as a seed to all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, session_id is set to 123 for later reproducibility.
- Documents: the number of documents (or samples in the dataset, if a dataframe is passed).
- Vocab Size: the size of the vocabulary in the corpus after all text pre-processing is applied.
- Custom Stopwords: whether a list of custom stopwords was passed to setup(). In this experiment, none was.

Notice that all text pre-processing steps are performed automatically when you execute setup(). These steps are imperative for any NLP experiment. The setup() function prepares the corpus and dictionary so that they are ready to use with the topic models you can create using the create_model() function. Another way to pass the text is as a list, in which case no target parameter is needed, as shown in the sketch below.
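As a minimal sketch (text_list and exp_nlp_list are illustrative names), passing a list instead of a dataframe looks like this:

# setup() also accepts a plain list of strings; no target parameter is needed
text_list = data['en'].tolist()
exp_nlp_list = setup(data = text_list, session_id = 123)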
What is Topic Model? In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. Read More
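To make this intuition concrete, here is a toy sketch (with entirely illustrative numbers) of a document's topic distribution and how a dominant topic would be picked from it:

# a document is a mixture of topics; the proportions sum to 1
doc_topics = {'Topic 0 (cats)': 0.10, 'Topic 1 (dogs)': 0.90}

# the dominant topic is simply the one with the largest proportion
dominant_topic = max(doc_topics, key=doc_topics.get)
print(dominant_topic)  # Topic 1 (dogs)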
Creating a topic model in PyCaret is simple and similar to how you would create a model in the supervised modules of pycaret. A topic model is created using the create_model() function, which takes one mandatory parameter: the name of the model as a string. This function returns a trained model object. There are 5 topic models available in PyCaret; see the docstring of create_model() for the complete list. See the example below, where we create a Latent Dirichlet Allocation (LDA) model:
lda = create_model('lda')
print(lda)
LdaModel(num_terms=4573, num_topics=4, decay=0.5, chunksize=100)
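Since the returned object is a gensim LdaModel, you can also inspect the top words per topic directly with gensim's own API (a quick sketch, assuming gensim's standard interface):

# print the top 5 words for each of the 4 topics via gensim
for topic_id, words in lda.print_topics(num_topics=4, num_words=5):
    print(topic_id, words)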
We have created a Latent Dirichlet Allocation (LDA) model with just one word, i.e. create_model(). Notice that the num_topics parameter is set to 4, which is the default value used when you do not pass num_topics to create_model(). In the example below, we will create an LDA model with 6 topics and also set the multi_core parameter to True. When multi_core is set to True, Latent Dirichlet Allocation (LDA) uses all CPU cores to parallelize and speed up model training.
lda2 = create_model('lda', num_topics = 6, multi_core = True)
print(lda2)
LdaModel(num_terms=4573, num_topics=6, decay=0.5, chunksize=100)
Now that we have created a topic model, we would like to assign the topic proportions to our dataset (the 1,000 sampled documents) to analyze the results. We will achieve this using the assign_model() function. See the example below:
lda_results = assign_model(lda)
lda_results.head()
|   | country | en | gender | loan_amount | nonpayment | sector | status | Topic_0 | Topic_1 | Topic_2 | Topic_3 | Dominant_Topic | Perc_Dominant_Topic |
|---|---------|----|--------|-------------|------------|--------|--------|---------|---------|---------|---------|----------------|---------------------|
| 0 | Kenya | praxide marry child primary school train tailo... | F | 75 | partner | Services | 0 | 0.001872 | 0.006235 | 0.990371 | 0.001521 | Topic 2 | 0.99 |
| 1 | Kenya | practitioner run year old life wife child biol... | M | 1200 | partner | Health | 0 | 0.262538 | 0.129088 | 0.606908 | 0.001466 | Topic 2 | 0.61 |
| 2 | Dominican Republic | live child boy girl range year sell use clothi... | F | 150 | partner | Clothing | 0 | 0.002032 | 0.218908 | 0.777409 | 0.001651 | Topic 2 | 0.78 |
| 3 | Kenya | phanice marry child daughter secondary school ... | F | 150 | lender | Services | 1 | 0.002075 | 0.070739 | 0.925500 | 0.001686 | Topic 2 | 0.93 |
| 4 | Kenya | year old hotel kaptembwa operating hotel last ... | F | 300 | lender | Food | 1 | 0.001838 | 0.097333 | 0.899336 | 0.001493 | Topic 2 | 0.90 |
Notice how 6 additional columns are now added to the dataframe. en is the text after all pre-processing. Topic_0 ... Topic_3 are the topic proportions and represent the distribution of topics for each document. Dominant_Topic is the topic with the highest proportion, and Perc_Dominant_Topic is the proportion of the dominant topic out of 1 (only shown when models are stochastic, i.e. when all proportions sum to 1).
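As a quick sanity check (plain pandas, not a PyCaret function), you can count how many documents fall under each dominant topic:

# number of documents assigned to each dominant topic
lda_results['Dominant_Topic'].value_counts()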
The plot_model() function can be used to analyze the overall corpus or only specific topics extracted by the topic model. Hence, plot_model() can also work without passing any trained model object. See the examples below:
# word frequency distribution of the entire corpus (default plot)
plot_model()

# bigram frequency distribution of the entire corpus
plot_model(plot = 'bigram')
plot_model() can also be used to generate the same plots for specific topics. To generate plots at the topic level, the function requires a trained model object to be passed. In the example below, we will generate the frequency distribution for Topic 1 only, as defined by the topic_num parameter.
# word frequency distribution for Topic 1 only
plot_model(lda, plot = 'frequency', topic_num = 'Topic 1')
# distribution of documents across dominant topics
plot_model(lda, plot = 'topic_distribution')
Each document is a distribution over topics, not a single topic. However, if the task is to categorize documents into specific topics, it wouldn't be wrong to use the topic proportion with the highest value to assign each document to a topic. In the plot above, each document is categorized into one topic using the largest topic proportion. We can see that most of the documents fall under Topic 3, with only a few in Topic 1. If you hover over these bars, you will get a basic idea of the themes in each topic by looking at the keywords. For example, if you evaluate Topic 2, you will see keywords like 'farmer', 'rice', and 'land', which probably means that the loan applicants in this category applied for agricultural/farming loans. However, if you hover over Topic 0 and Topic 3, you will observe a lot of repetition, with keywords such as 'loan' and 'business' overlapping across topics. In the next tutorial, Natural Language Processing Tutorial (NLP102) - Level Intermediate, we will demonstrate the use of custom_stopwords, at which point we will re-analyze this plot.
plot_model(lda, plot = 'tsne')
t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
plot_model(lda, plot = 'umap')
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimensionality reduction. It is similar in purpose to t-SNE and PCA, as all of them are techniques to reduce dimensionality for 2D/3D projections. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology.
Another way to analyze model performance is to use the evaluate_model() function, which displays a user interface with all of the available plots for a given model. It internally uses the plot_model() function. See the example below, where we have generated the Sentiment Polarity plot for Topic 3 using the LDA model stored in the lda variable.
evaluate_model(lda)
As you get deeper into Natural Language Processing, you will learn that the training time of topic models increases exponentially with the size of the corpus. As such, if you would like to continue your experiment or analysis at a later point, you don't need to repeat the entire experiment and re-train your model. PyCaret's built-in save_model() function allows you to save the model for later use.
save_model(lda, 'Final LDA Model 08Feb2020')
Model Succesfully Saved
To load a saved model at a future date in the same or a different environment, we would use PyCaret's load_model() function.
saved_lda = load_model('Final LDA Model 08Feb2020')
Model Sucessfully Loaded
print(saved_lda)
LdaModel(num_terms=4573, num_topics=4, decay=0.5, chunksize=100)
What we have covered in this tutorial is the entire workflow for a Natural Language Processing experiment. Our task today was to create and analyze a topic model. We performed several text pre-processing steps using setup(), then created a topic model using create_model(), assigned topics to the dataset using assign_model(), and analyzed the results using plot_model(). All of this was completed in fewer than 10 commands that are naturally constructed and very intuitive to remember. Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code.
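As a compact recap, the core commands from this tutorial (using the same data, names, and parameters as above) condense to:

from pycaret.nlp import *
from pycaret.datasets import get_data

data = get_data('kiva')                                             # load the data
data = data.sample(1000, random_state=786).reset_index(drop=True)   # sample 1,000 documents
exp_nlp101 = setup(data = data, target = 'en', session_id = 123)    # pre-process the text
lda = create_model('lda')                                           # train an LDA topic model
lda_results = assign_model(lda)                                     # assign topics to documents
plot_model(lda, plot = 'topic_distribution')                        # analyze the results
save_model(lda, 'Final LDA Model 08Feb2020')                        # save for later use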
In this tutorial, we have only covered the basics of pycaret.nlp. In the next tutorial we will demonstrate the use of tune_model() to automatically select the number of topics for a topic model. We will also go deeper into a few concepts and techniques, such as custom_stopwords, to improve the results of a topic model.
See you at the next tutorial. Follow the link to Natural Language Processing (NLP102) - Level Intermediate