#!/usr/bin/env python
# coding: utf-8

# ## Frequent Itemsets via Apriori Algorithm

# Apriori function to extract frequent itemsets for association rule mining

# > from mlxtend.frequent_patterns import apriori

# ## Overview

# Apriori is a popular algorithm [1] for extracting frequent itemsets with applications in association rule learning. The Apriori algorithm is designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
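
# The support definition can be checked by hand. The following is a minimal sketch in plain Python, independent of mlxtend; the `toy_transactions` data and the `support` helper are illustrative names and not part of the library:

# In[ ]:


# Hypothetical toy data: each transaction is a set of items.
toy_transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Butter'},
    {'Bread', 'Milk', 'Butter'},
    {'Milk'},
]


def support(itemset, transactions):
    """Return the fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# {'Bread', 'Milk'} occurs together in 2 of 4 transactions, so its support
# is 0.5: it is "frequent" at a threshold of 0.5 but not at 0.6.
toy_support = support({'Bread', 'Milk'}, toy_transactions)
print(toy_support, toy_support >= 0.5, toy_support >= 0.6)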

# ## References
# 
# [1] Agrawal, Rakesh, and Ramakrishnan Srikant. "[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf)." Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). Vol. 1215. 1994.

# ## Example 1

# The `apriori` function expects data in a one-hot encoded pandas DataFrame.
# Suppose we have the following transaction data:

# In[1]:


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


# We can transform it into the right format via the `OnehotTransactions` encoder as follows:

# In[2]:


import pandas as pd
from mlxtend.preprocessing import OnehotTransactions

oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df
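

# To make the encoding step concrete, here is a minimal plain-Python sketch of what a one-hot transaction encoding amounts to. It illustrates the idea only and is not mlxtend's implementation; `toy_baskets` and `one_hot_encode` are hypothetical names:

# In[ ]:


def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in t for item in columns] for t in transactions]
    return columns, rows


# Hypothetical toy data; the repeated 'Onion' collapses into a single column.
toy_baskets = [['Milk', 'Eggs'], ['Eggs', 'Onion', 'Onion'], ['Milk']]
toy_columns, toy_rows = one_hot_encode(toy_baskets)
print(toy_columns)  # one column per distinct item
for row in toy_rows:
    print(row)      # True where the transaction contains the item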


# Now, let us return the items and itemsets with at least 60% support:

# In[3]:


from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.6)


# By default, `apriori` returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set `use_colnames=True` to convert these integer values into the respective item names: 

# In[4]:


apriori(df, min_support=0.6, use_colnames=True)
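

# To see where these itemsets come from, the level-wise search behind Apriori can be sketched in plain Python. This is a simplified illustration of the join and prune steps, not mlxtend's implementation; `frequent_itemsets_sketch`, `demo_transactions`, and `demo_result` are hypothetical names, and the Example 1 transactions are repeated so that the cell runs on its own. With a threshold of 0.6 it should yield the same itemsets and support values as the `apriori` call above:

# In[ ]:


from itertools import combinations


def frequent_itemsets_sketch(transactions, min_support):
    """Level-wise search: grow candidates only from itemsets already found frequent."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: single items that meet the threshold.
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Same transactions as in Example 1, repeated here for a self-contained cell.
demo_transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]

demo_result = frequent_itemsets_sketch(demo_transactions, min_support=0.6)
# Print shorter itemsets first, then alphabetically within each length.
for itemset, sup in sorted(demo_result.items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sup, sorted(itemset))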


# ## Example 2

# The advantage of working with pandas `DataFrames` is that we can use their convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via `apriori` and add a new column that stores the length of each itemset:

# In[5]:


frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets


# Then, we can select the results that satisfy our desired criteria as follows:

# In[6]:


frequent_itemsets[(frequent_itemsets['length'] == 2) &
                  (frequent_itemsets['support'] >= 0.8)]


# ## API

# In[7]:


with open('../../api_modules/mlxtend.frequent_patterns/apriori.md', 'r') as f:
    print(f.read())