Association Rule Mining Tutorial (ARUL101)

Created using: PyCaret 2.0
Date Updated: August 24, 2020

1.0 Objective of Tutorial

Welcome to the Association Rule Mining Tutorial (ARUL101). This tutorial assumes that you are new to PyCaret and want to get started with association rule mining using the pycaret.arules module.

In this tutorial we will learn:

  • Getting Data: How to import data from the PyCaret repository?
  • Setting up Environment: How to set up an experiment in PyCaret to get started with association rule mining?
  • Create Model: How to create a model and evaluate its results?
  • Plot Model: How to analyze a model using various plots?

Read Time : Approx. 15 Minutes

1.1 Installing PyCaret

The first step is to install PyCaret. Installation is easy and takes only a few minutes. Follow the instructions below:

Installing PyCaret in Local Jupyter Notebook

pip install pycaret

Installing PyCaret on Google Colab or Azure Notebooks

!pip install pycaret

1.2 Pre-Requisites

  • Python 3.6 or greater
  • PyCaret 2.0 or greater
  • Internet connection to load data from PyCaret's repository
  • Basic Knowledge of Association Rule Mining

1.3 For Google Colab users:

If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

2.0 What is Association Rule Mining?

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. For example, the rule {onions, potatoes} --> {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy a burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
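To make the {onions, potatoes} --> {burger} rule concrete, here is a minimal sketch that counts how often the rule holds in a handful of baskets. The transactions are made up for illustration and the support() helper is hypothetical; it is not part of PyCaret.

```python
# Toy transactions, invented for illustration (not from the tutorial dataset).
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burger"},
]

antecedent = {"onions", "potatoes"}
consequent = {"burger"}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


# Support of the rule: share of baskets containing antecedent AND consequent.
rule_support = support(antecedent | consequent, transactions)
# Confidence: of the baskets with the antecedent, the share that also
# contain the consequent.
confidence = rule_support / support(antecedent, transactions)

print(f"support    = {rule_support:.2f}")
print(f"confidence = {confidence:.2f}")
```

In this toy data, three of the five baskets contain onions and potatoes, and two of those three also contain a burger, so the rule is supported by 40% of all baskets and holds about two times out of three when its left-hand side is present.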

Learn More about Association Rule Mining

3.0 Overview of Association Rule Module in PyCaret

PyCaret's association rule module (pycaret.arules) is an unsupervised machine learning module used for discovering interesting relations between variables in a dataset. This module automatically transforms any transactional database into a shape that is acceptable for the Apriori algorithm. Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases.
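The level-wise idea behind Apriori can be sketched in a few lines of plain Python. This is an illustrative toy version under simplifying assumptions (small in-memory baskets, no performance tuning); it is not the implementation PyCaret uses internally, and the function name and sample baskets are hypothetical.

```python
from itertools import combinations


def apriori_frequent_itemsets(baskets, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(baskets)
    # Level 1: the candidates are all single items.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # Keep the size-k candidates whose support reaches the minimum.
        level = {}
        for cand in candidates:
            sup = sum(cand <= basket for basket in baskets) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Made-up baskets for illustration.
baskets = [
    {"cups", "plates", "napkins"},
    {"cups", "plates"},
    {"cups", "napkins"},
    {"plates"},
]
itemsets = apriori_frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(
    itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), sup)
```

The pruning step is what keeps the search tractable: {plates, napkins} appears in only one of four baskets, so the three-item candidate that contains it is discarded without being counted.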

4.0 Dataset for the Tutorial

For this tutorial we will use a small sample from the UCI Online Retail dataset. This is a transactional dataset which contains transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. Short descriptions of the attributes are as follows:

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.

Dataset Acknowledgement:

Dr Daqing Chen, Director: Public Analytics group. School of Engineering, London South Bank University, London SE1 0AA, UK.

The original dataset and data dictionary can be found here.

5.0 Getting the Data

You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load the data using the get_data() function (this requires an internet connection).

In [1]:
from pycaret.datasets import get_data
data = get_data('france')
   InvoiceNo  StockCode  Description                      Quantity  InvoiceDate     UnitPrice  CustomerID  Country
0  536370     22728      ALARM CLOCK BAKELIKE PINK        24        12/1/2010 8:45  3.75       12583.0     France
1  536370     22727      ALARM CLOCK BAKELIKE RED         24        12/1/2010 8:45  3.75       12583.0     France
2  536370     22726      ALARM CLOCK BAKELIKE GREEN       12        12/1/2010 8:45  3.75       12583.0     France
3  536370     21724      PANDA AND BUNNIES STICKER SHEET  12        12/1/2010 8:45  0.85       12583.0     France
4  536370     21883      STARS GIFT TAPE                  24        12/1/2010 8:45  0.65       12583.0     France

Note: If you download the data from the original source, you must filter Country to 'France' to reproduce the results in this experiment.

In [2]:
#check the shape of data
data.shape
Out[2]:
(8557, 8)

6.0 Setting up Environment in PyCaret

The setup() function initializes the environment in PyCaret and transforms the transactional dataset into a shape that is acceptable to the Apriori algorithm. It requires three mandatory parameters: data, a pandas DataFrame; transaction_id, the name of the column representing the transaction id, which is used to pivot the matrix; and item_id, the name of the column used for the creation of rules. Normally, this will be the variable of interest. You can also pass an optional parameter, ignore_items, to exclude certain values from rule creation.
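To give a feel for what "pivoting" a transactional dataset means, the sketch below turns long-format (transaction id, item) rows into one boolean row per transaction. It is an illustration only: the to_basket_matrix() helper, the sample rows, and the way ignored items are handled are assumptions, not PyCaret's actual code.

```python
# Hypothetical long-format rows: one (transaction_id, item_id) pair per line.
rows = [
    ("T1", "A"),
    ("T1", "B"),
    ("T1", "POSTAGE"),
    ("T2", "B"),
    ("T2", "POSTAGE"),
]


def to_basket_matrix(rows, ignore_items=()):
    """Pivot (transaction_id, item_id) rows into one boolean row per transaction."""
    ignored = set(ignore_items)
    baskets = {}
    for transaction_id, item_id in rows:
        # Register the transaction even if all of its items are ignored.
        basket = baskets.setdefault(transaction_id, set())
        if item_id not in ignored:
            basket.add(item_id)
    # One column per remaining distinct item, in a stable order.
    items = sorted({item for basket in baskets.values() for item in basket})
    matrix = {
        tid: [item in basket for item in items]
        for tid, basket in baskets.items()
    }
    return items, matrix


items, matrix = to_basket_matrix(rows, ignore_items=["POSTAGE"])
print(items)
for tid, flags in matrix.items():
    print(tid, flags)
```

Each row of the resulting matrix answers "does this transaction contain this item?", which is the shape frequent-itemset algorithms such as Apriori typically expect.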

In [3]:
from pycaret.arules import *
In [4]:
exp_arul101 = setup(data = data, 
                    transaction_id = 'InvoiceNo',
                    item_id = 'Description') 
Description Value
session_id 7256
# Transactions 461
# Items 1565
Ignore Items None

Once setup() has executed successfully, it prints an information grid that contains a few important pieces of information:

  • # Transactions : Number of unique transactions in the dataset. In this case, unique values of InvoiceNo.

  • # Items : Number of unique items in the dataset. In this case, unique values of Description.

  • Ignore Items : Items to be ignored in rule mining. Some relations are too obvious to be useful, and you may want to exclude them from the analysis. For example, many transactional datasets contain a shipping cost item, which produces obvious relationships that can be ignored in setup() using the ignore_items parameter. In this tutorial, we will run setup() twice, first without ignoring any items and later with ignored items.

7.0 Create a Model

Creating an association rule model is simple. create_model() requires no mandatory parameters. It has 4 optional parameters which are as follows:

  • metric: Metric to evaluate if a rule is of interest. Default is set to confidence. Other available metrics include 'support', 'lift', 'leverage', 'conviction'.

  • threshold: Minimal threshold for the evaluation metric, via the metric parameter, to decide whether a candidate rule is of interest. Default is set to 0.5.

  • min_support: A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions. Default is set to 0.05.

  • round: Number of decimal places to which the metrics in the score grid are rounded.
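As a rough sketch of how metric and threshold work together, the snippet below keeps only the candidate rules whose chosen metric reaches the threshold. The rules, their values, and the filter_rules() helper are hypothetical, and the use of "greater than or equal to" is an assumption about how the threshold is applied.

```python
# Hypothetical candidate rules with precomputed metrics (values are made up).
candidate_rules = [
    {"rule": "A -> B", "confidence": 0.90, "lift": 3.0},
    {"rule": "B -> A", "confidence": 0.45, "lift": 3.0},
    {"rule": "A -> C", "confidence": 0.60, "lift": 0.9},
]


def filter_rules(rules, metric="confidence", threshold=0.5):
    """Keep rules whose chosen metric is at least `threshold`."""
    return [r for r in rules if r[metric] >= threshold]


# Defaults from the text: metric='confidence', threshold=0.5.
by_confidence = filter_rules(candidate_rules)
# Same candidates, judged by lift instead.
by_lift = filter_rules(candidate_rules, metric="lift", threshold=1.0)

print([r["rule"] for r in by_confidence])
print([r["rule"] for r in by_lift])
```

The two calls keep different rules from the same candidates, which is why it is worth choosing the metric deliberately rather than relying on the default.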

Let's create an association rule model with all default values.

In [5]:
model1 = create_model() #model created and stored in model1 variable.
In [6]:
print(model1.shape) #141 rules created.
(141, 9)
In [7]:
model1.head() #see the rules
Out[7]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (JUMBO BAG WOODLAND ANIMALS) (POSTAGE) 0.0651 0.6746 0.0651 1.0000 1.4823 0.0212 inf
1 (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... (SET/6 RED SPOTTY PAPER CUPS) 0.0868 0.1171 0.0846 0.9750 8.3236 0.0744 35.3145
2 (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... (SET/6 RED SPOTTY PAPER PLATES) 0.0868 0.1085 0.0846 0.9750 8.9895 0.0752 35.6616
3 (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... (SET/6 RED SPOTTY PAPER CUPS) 0.0716 0.1171 0.0694 0.9697 8.2783 0.0610 29.1345
4 (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... (SET/6 RED SPOTTY PAPER PLATES) 0.0716 0.1085 0.0694 0.9697 8.9406 0.0617 29.4208
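The metric columns can be derived from the three support values using the standard textbook definitions. The sketch below recomputes them for rule 0 in the table above. The rule_metrics() helper is hypothetical, and because it starts from the rounded supports shown in the grid, the results match the table only approximately.

```python
import math


def rule_metrics(antecedent_support, consequent_support, support):
    """Derive rule metrics from supports using the standard definitions."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction divides by (1 - confidence), so a rule that always holds
    # is reported as infinite.
    if confidence >= 1.0:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded supports of rule 0 (JUMBO BAG WOODLAND ANIMALS -> POSTAGE).
metrics = rule_metrics(
    antecedent_support=0.0651,
    consequent_support=0.6746,
    support=0.0651,
)
for name, value in metrics.items():
    print(f"{name:<10} {value:.4f}")
```

A lift above 1 means the items co-occur more often than they would if they were independent. For this rule the lift is only about 1.48, because POSTAGE already appears in roughly two thirds of all transactions.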

8.0 Setup with ignore_items

In model1 created above, notice that the first rule pairs JUMBO BAG WOODLAND ANIMALS with POSTAGE, which is an obvious relationship. In the example below, we use the ignore_items parameter in setup() to exclude POSTAGE from the dataset and re-create the association rule model.

In [8]:
exp_arul101 = setup(data = data, 
                    transaction_id = 'InvoiceNo',
                    item_id = 'Description',
                    ignore_items = ['POSTAGE']) 
Description Value
session_id 6114
# Transactions 461
# Items 1565
Ignore Items ['POSTAGE']
In [9]:
model2 = create_model()
In [10]:
print(model2.shape) #notice how only 45 rules are created vs. 141 above.
(45, 9)
In [11]:
model2.head()
Out[11]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... (SET/6 RED SPOTTY PAPER CUPS) 0.0868 0.1171 0.0846 0.9750 8.3236 0.0744 35.3145
1 (SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE... (SET/6 RED SPOTTY PAPER PLATES) 0.0868 0.1085 0.0846 0.9750 8.9895 0.0752 35.6616
2 (SET/6 RED SPOTTY PAPER PLATES) (SET/6 RED SPOTTY PAPER CUPS) 0.1085 0.1171 0.1041 0.9600 8.1956 0.0914 22.0716
3 (CHILDRENS CUTLERY SPACEBOY ) (CHILDRENS CUTLERY DOLLY GIRL ) 0.0586 0.0629 0.0542 0.9259 14.7190 0.0505 12.6508
4 (SET/6 RED SPOTTY PAPER CUPS) (SET/6 RED SPOTTY PAPER PLATES) 0.1171 0.1085 0.1041 0.8889 8.1956 0.0914 8.0239

9.0 Plot Model

In [12]:
plot_model(model2)
In [13]:
plot_model(model2, plot = '3d')