Created using: PyCaret 2.0
Date Updated: August 24, 2020
Welcome to the Association Rule Mining Tutorial (ARUL101). This tutorial assumes that you are new to PyCaret and looking to get started with association rule mining using the pycaret.arules module.
In this tutorial we will learn:
Read Time : Approx. 15 Minutes
The first step is to install PyCaret. Installation is easy and takes only a few minutes. Follow the instructions below:
pip install pycaret
!pip install pycaret
If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. For example, the rule {onions, potatoes} --> {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are also likely to buy burgers. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement.
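The "measures of interestingness" mentioned above are usually support, confidence, and lift. The following is a minimal sketch of how they could be computed for the {onions, potatoes} --> {burger} rule. The five baskets are made up for illustration and the helper names are hypothetical; nothing here comes from the tutorial's dataset or from PyCaret itself.

```python
# Toy baskets, invented for illustration only (not from the Online Retail data).
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline frequency (1.0 means no association)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


antecedent = {"onions", "potatoes"}
consequent = {"burger"}

rule_support = support(antecedent | consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
rule_lift = lift(antecedent, consequent, transactions)

print(f"support={rule_support:.4f} confidence={rule_confidence:.4f} lift={rule_lift:.4f}")
```

A lift above 1 suggests the antecedent and consequent appear together more often than they would if they were independent.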
PyCaret's association rule module (pycaret.arules) is an unsupervised machine learning module used for discovering interesting relations between variables in a dataset. This module automatically transforms any transactional database into a shape that is acceptable for the Apriori algorithm. Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases.
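To make the idea of frequent itemset mining concrete, here is a small teaching sketch of the classic level-wise Apriori search in plain Python. It is not PyCaret's implementation, which this tutorial does not show; the function name, the baskets, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting `min_support`.

    Level-wise search: frequent k-itemsets are joined into (k+1)-candidates,
    and a candidate is kept only if all of its k-item subsets are frequent.
    """
    n = len(transactions)
    items = sorted({item for basket in transactions for item in basket})
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        # Count the support of each candidate and keep the frequent ones.
        level = {}
        for candidate in candidates:
            s = sum(candidate <= basket for basket in transactions) / n
            if s >= min_support:
                level[candidate] = s
        result.update(level)

        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}

        # Prune step: drop candidates that have an infrequent (k-1)-item subset.
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return result


# Hypothetical baskets for illustration.
baskets = [
    {"cups", "plates", "napkins"},
    {"cups", "plates"},
    {"cups", "napkins"},
    {"plates"},
]

freq = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if every one of its subsets is frequent, so most larger candidates never need to be counted.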
For this tutorial we will use a small sample from the UCI dataset called the Online Retail Dataset. This is a transactional dataset which contains transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. A short description of the attributes is as follows:
Dr Daqing Chen, Director: Public Analytics group. chend@lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
The original dataset and data dictionary can be found here.
You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load the data using the get_data() function (this requires an internet connection).
from pycaret.datasets import get_data
data = get_data('france')
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|---|
0 | 536370 | 22728 | ALARM CLOCK BAKELIKE PINK | 24 | 12/1/2010 8:45 | 3.75 | 12583.0 | France |
1 | 536370 | 22727 | ALARM CLOCK BAKELIKE RED | 24 | 12/1/2010 8:45 | 3.75 | 12583.0 | France |
2 | 536370 | 22726 | ALARM CLOCK BAKELIKE GREEN | 12 | 12/1/2010 8:45 | 3.75 | 12583.0 | France |
3 | 536370 | 21724 | PANDA AND BUNNIES STICKER SHEET | 12 | 12/1/2010 8:45 | 0.85 | 12583.0 | France |
4 | 536370 | 21883 | STARS GIFT TAPE | 24 | 12/1/2010 8:45 | 0.65 | 12583.0 | France |
Note: If you are downloading the data from the original source, you have to filter Country for 'France' only if you wish to reproduce the results in this experiment.
#check the shape of data
data.shape
(8557, 8)
The setup() function initializes the environment in PyCaret and transforms the transactional dataset into a shape that is acceptable to the Apriori algorithm. It requires three mandatory parameters: a pandas dataframe; transaction_id, which is the name of the column representing the transaction id and is used to pivot the matrix; and item_id, which is the name of the column used for the creation of rules. Normally, this will be the variable of interest. You can also pass an optional parameter, ignore_items, to exclude certain values from rule creation.
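The pivot described above can be pictured as turning long-format (transaction, item) rows into one boolean row per transaction. The sketch below shows that reshaping in plain Python under stated assumptions: the invoice ids "INV-1" and "INV-2", the rows, and the helper `to_basket_matrix` are hypothetical, and this is not how PyCaret implements setup() internally.

```python
from collections import defaultdict

# Hypothetical long-format rows: (transaction id, item), mirroring the
# InvoiceNo and Description columns. The invoice ids are invented.
rows = [
    ("INV-1", "ALARM CLOCK BAKELIKE PINK"),
    ("INV-1", "POSTAGE"),
    ("INV-2", "STARS GIFT TAPE"),
    ("INV-2", "ALARM CLOCK BAKELIKE PINK"),
    ("INV-2", "POSTAGE"),
]


def to_basket_matrix(rows, ignore_items=()):
    """Pivot (transaction_id, item) rows into a one-hot matrix.

    Returns the sorted item names (the columns) and a dict mapping each
    transaction id to a list of booleans, one per column.
    """
    ignored = set(ignore_items)
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        basket = baskets[transaction_id]  # keeps the transaction even if all its items are ignored
        if item not in ignored:
            basket.add(item)

    items = sorted({item for basket in baskets.values() for item in basket})
    matrix = {
        transaction_id: [item in basket for item in items]
        for transaction_id, basket in baskets.items()
    }
    return items, matrix


items, matrix = to_basket_matrix(rows, ignore_items=["POSTAGE"])
print(items)
for transaction_id, flags in matrix.items():
    print(transaction_id, flags)
```

Each row of the resulting matrix answers "does this transaction contain this item?", which is the form a frequent itemset search can count over directly.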
from pycaret.arules import *
exp_arul101 = setup(data = data,
                    transaction_id = 'InvoiceNo',
                    item_id = 'Description')
Description | Value |
---|---|
session_id | 7256 |
# Transactions | 461 |
# Items | 1565 |
Ignore Items | None |
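Once setup() executes successfully, it prints an information grid with a few important details, including # Transactions, the number of unique transactions in the dataset (here, unique values of InvoiceNo), and # Items, the number of unique items (here, unique values of Description).
Creating an association rule model is simple. create_model() requires no mandatory parameters and has 4 optional parameters.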
Let's create an association rule model with all default values.
model1 = create_model() #model created and stored in model1 variable.
print(model1.shape) #141 rules created.
(141, 9)
model1.head() #see the rules
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
---|---|---|---|---|---|---|---|---|---|
0 | (JUMBO BAG WOODLAND ANIMALS) | (POSTAGE) | 0.0651 | 0.6746 | 0.0651 | 1.0000 | 1.4823 | 0.0212 | inf |
1 | (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... | (SET/6 RED SPOTTY PAPER CUPS) | 0.0868 | 0.1171 | 0.0846 | 0.9750 | 8.3236 | 0.0744 | 35.3145 |
2 | (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... | (SET/6 RED SPOTTY PAPER PLATES) | 0.0868 | 0.1085 | 0.0846 | 0.9750 | 8.9895 | 0.0752 | 35.6616 |
3 | (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... | (SET/6 RED SPOTTY PAPER CUPS) | 0.0716 | 0.1171 | 0.0694 | 0.9697 | 8.2783 | 0.0610 | 29.1345 |
4 | (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... | (SET/6 RED SPOTTY PAPER PLATES) | 0.0716 | 0.1085 | 0.0694 | 0.9697 | 8.9406 | 0.0617 | 29.4208 |
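The metric columns in the table above are related to one another. As a rough check, the sketch below recomputes confidence, lift, leverage, and conviction for rows 0 and 1 from their three support columns, assuming the standard textbook definitions of these metrics. The helper name `rule_metrics` is hypothetical, and because the table values are rounded to four decimals, the recomputed numbers only match approximately (conviction in particular is sensitive to rounding when confidence is close to 1).

```python
import math


def rule_metrics(antecedent_support, consequent_support, support):
    """Derive rule metrics from supports, using the standard definitions."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence of 1).
    if confidence >= 1.0:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Inputs copied from the rounded table values for rows 0 and 1.
row0 = rule_metrics(antecedent_support=0.0651, consequent_support=0.6746, support=0.0651)
row1 = rule_metrics(antecedent_support=0.0868, consequent_support=0.1171, support=0.0846)

for name, row in (("row 0", row0), ("row 1", row1)):
    print(name, {key: round(value, 4) for key, value in row.items()})
```

Row 0 illustrates why lift is worth reading alongside confidence: its confidence is 1.0, yet its lift is modest because the consequent already appears in most transactions.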
ignore_items
In model1 created above, notice that the top rule pairs JUMBO BAG WOODLAND ANIMALS with POSTAGE, which is unsurprising because POSTAGE appears in roughly 67% of transactions (consequent support 0.6746). In the example below, we use the ignore_items parameter in setup() to exclude POSTAGE from the dataset and re-create the association rule model.
exp_arul101 = setup(data = data,
                    transaction_id = 'InvoiceNo',
                    item_id = 'Description',
                    ignore_items = ['POSTAGE'])
Description | Value |
---|---|
session_id | 6114 |
# Transactions | 461 |
# Items | 1565 |
Ignore Items | ['POSTAGE'] |
model2 = create_model()
print(model2.shape) #notice how only 45 rules are created vs. 141 above.
(45, 9)
model2.head()
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
---|---|---|---|---|---|---|---|---|---|
0 | (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... | (SET/6 RED SPOTTY PAPER CUPS) | 0.0868 | 0.1171 | 0.0846 | 0.9750 | 8.3236 | 0.0744 | 35.3145 |
1 | (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... | (SET/6 RED SPOTTY PAPER PLATES) | 0.0868 | 0.1085 | 0.0846 | 0.9750 | 8.9895 | 0.0752 | 35.6616 |
2 | (SET/6 RED SPOTTY PAPER PLATES) | (SET/6 RED SPOTTY PAPER CUPS) | 0.1085 | 0.1171 | 0.1041 | 0.9600 | 8.1956 | 0.0914 | 22.0716 |
3 | (CHILDRENS CUTLERY SPACEBOY ) | (CHILDRENS CUTLERY DOLLY GIRL ) | 0.0586 | 0.0629 | 0.0542 | 0.9259 | 14.7190 | 0.0505 | 12.6508 |
4 | (SET/6 RED SPOTTY PAPER CUPS) | (SET/6 RED SPOTTY PAPER PLATES) | 0.1171 | 0.1085 | 0.1041 | 0.8889 | 8.1956 | 0.0914 | 8.0239 |
plot_model(model2)
plot_model(model2, plot = '3d')