#!/usr/bin/env python
# coding: utf-8

# ## Frequent Itemsets via Apriori Algorithm

# Apriori function to extract frequent itemsets for association rule mining

# > from mlxtend.frequent_patterns import apriori

# ## Overview

# Apriori is a popular algorithm [1] for extracting frequent itemsets, with applications in association rule learning. The Apriori algorithm is designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.

# ## References
#
# [1] Agrawal, Rakesh, and Ramakrishnan Srikant. "[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf)." Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. 1994.

# ## Example 1

# The `apriori` function expects data in a one-hot encoded pandas DataFrame.
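# To make "one-hot encoded" concrete, here is a minimal sketch in plain Python that does not use mlxtend: each distinct item becomes a column, and each transaction becomes a row of booleans marking which items it contains. The helper name `one_hot_encode` and the toy transactions are hypothetical and serve only as an illustration.

```python
def one_hot_encode(transactions):
    """Return (columns, rows) for a list of transactions (hypothetical helper)."""
    # One column per distinct item, sorted for a stable column order.
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        # A set collapses duplicate items within a single transaction.
        present = set(transaction)
        rows.append([item in present for item in columns])
    return columns, rows


# Toy data, assumed for illustration only.
toy_transactions = [['Milk', 'Eggs'], ['Eggs', 'Onion', 'Onion'], ['Milk']]
columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

# Wrapping the result as `pd.DataFrame(rows, columns=columns)` should give a table with one boolean column per item, which is the general shape described above.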
# Suppose we have the following transaction data:

# In[1]:


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


# We can transform it into the right format via the `OnehotTransactions` encoder as follows:

# In[2]:


import pandas as pd
from mlxtend.preprocessing import OnehotTransactions

oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df


# Now, let us return the items and itemsets with at least 60% support:

# In[3]:


from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.6)


# By default, `apriori` returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set `use_colnames=True` to convert these integer values into the respective item names:

# In[4]:


apriori(df, min_support=0.6, use_colnames=True)


# ## Example 2

# The advantage of working with pandas `DataFrames` is that we can use its convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via `apriori` and add a new column that stores the length of each itemset:

# In[5]:


frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets


# Then, we can select the results that satisfy our desired criteria as follows:

# In[6]:


frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.8) ]


# ## API

# In[7]:


with open('../../api_modules/mlxtend.frequent_patterns/apriori.md', 'r') as f:
    print(f.read())
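
# As a cross-check on the support values in the examples, the level-wise idea behind Apriori can be sketched in plain Python. This is a minimal, unoptimized sketch and not mlxtend's implementation. The helper name `apriori_sketch` is hypothetical, and the transactions from Example 1 are repeated so that the block runs on its own.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Map each frequent itemset (a frozenset) to its support (hypothetical helper)."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    level = keep_frequent({frozenset([item]) for basket in baskets for item in basket})
    frequent = dict(level)
    k = 2
    while level:
        previous = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Same transactions as in Example 1.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

result = apriori_sketch(dataset, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)

# Same filter as in Example 2: itemsets of length 2 with support of at least 0.8.
strong_pairs = {c for c, s in result.items() if len(c) == 2 and s >= 0.8}
```

# With a threshold of 0.6, this sketch finds 11 frequent itemsets on the five transactions, and the only pair with a support of at least 0.8 is Eggs together with Kidney Beans. The prune step is what distinguishes the approach from brute-force enumeration: a candidate is only counted if all of its smaller subsets were already frequent.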