Correlational analysis¶

This notebook contains correlational analyses of the recommendation list inputs and outputs, for comparison purposes and also to analyze the control bias.

Setup¶

In [1]:

import pickle

In [2]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.special import logit, expit

In [3]:

with open('data/profile-data.pkl', 'rb') as f:
    profiles = pickle.load(f)
profiles.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 25000 entries, ('AZ', 529) to ('GR-I', 875516)
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   count       25000 non-null  int32  
 1   linked      25000 non-null  int32  
 2   ambiguous   25000 non-null  int32  
 3   male        25000 non-null  int32  
 4   female      25000 non-null  int32  
 5   dcknown     25000 non-null  int32  
 6   dcyes       25000 non-null  int32  
 7   PropDC      25000 non-null  float64
 8   Known       25000 non-null  int32  
 9   PropFemale  25000 non-null  float64
 10  PropKnown   25000 non-null  float64
dtypes: float64(3), int32(8)
memory usage: 2.2+ MB

In [4]:

with open('data/rec-data.pkl', 'rb') as f:
    recs = pickle.load(f)
recs.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 94146 entries, ('AZ', 'als', 529) to ('GR-I', 'wrls', 875516)
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ambiguous   94146 non-null  int64  
 1   female      94146 non-null  int64  
 2   male        94146 non-null  int64  
 3   unknown     94146 non-null  int64  
 4   Total       94146 non-null  int64  
 5   Known       94146 non-null  int64  
 6   PropKnown   94146 non-null  float64
 7   PropFemale  93926 non-null  float64
 8   dcknown     94146 non-null  int64  
 9   dcyes       94146 non-null  int32  
 10  PropDC      94115 non-null  float64
dtypes: float64(3), int32(1), int64(7)
memory usage: 8.7+ MB

Helpers from RecModel Analysis:

In [5]:

def select_implicit(data, reset=True):
    if reset:
        data = data.reset_index()
    implicit = data['Set'].str.endswith('-I')
    if 'Algorithm' in data.columns:
        implicit |= data['Algorithm'].str.endswith('-imp')
    else:
        implicit |= data['Set'] == 'AZ'
    data = data.loc[implicit].assign(Set=data['Set'].str.replace('-I', ''))
    if 'Algorithm' in data.columns:
        data['Algorithm'] = data['Algorithm'].str.replace('-imp', '').str.replace('wrls', 'als')
    return data

In [6]:

def select_explicit(data, reset=True):
    if reset:
        data = data.reset_index()
    implicit = data['Set'].str.endswith('-I') 
    if 'Algorithm' in data.columns:
        implicit |= data['Algorithm'].str.endswith('-imp')
    data = data[~implicit].assign(Set=data['Set'].str.replace('-E', ''))
    return data

Observed Profile Parameters¶

What do proportions look like, for both female authors and dummy codes?

In [7]:

profiles.groupby('Set')[['PropFemale', 'PropDC']].agg(['mean', 'std']).T

Out[7]:

	Set	AZ	BX-E	BX-I	GR-E	GR-I
PropFemale	mean	0.414445	0.418886	0.407030	0.446998	0.450201
PropFemale	std	0.329436	0.267354	0.254304	0.276480	0.269127
PropDC	mean	0.498460	0.494217	0.499506	0.494855	0.497091
PropDC	std	0.160820	0.151552	0.144868	0.082914	0.069235

What does the distribution look like?

In [8]:

props = pd.melt(profiles.reset_index(), ['Set', 'user'], ['PropFemale', 'PropDC'])
props.head()

Out[8]:

	Set	user	variable	value
0	AZ	529	PropFemale	0.800000
1	AZ	1723	PropFemale	0.285714
2	AZ	1810	PropFemale	0.000000
3	AZ	2781	PropFemale	0.166667
4	AZ	2863	PropFemale	0.000000

In [9]:

grid = sns.FacetGrid(props, col='variable', row='Set', sharey=False)
grid.map(sns.distplot, 'value')

Out[9]:

<seaborn.axisgrid.FacetGrid at 0x1bed0aa62c8>

Rec list distributions¶

In [10]:

rec_props = pd.melt(recs.reset_index(), ['Set', 'Algorithm', 'user'], ['PropFemale', 'PropDC'])
rec_props.head()

Out[10]:

	Set	Algorithm	user	variable	value
0	AZ	als	529	PropFemale	0.296296
1	AZ	als	1723	PropFemale	0.571429
2	AZ	als	1810	PropFemale	0.400000
3	AZ	als	2781	PropFemale	0.320000
4	AZ	als	2863	PropFemale	0.137931

Implicit algorithm distributions:

In [11]:

grid = sns.FacetGrid(select_implicit(rec_props), row='Set', col='Algorithm', hue='variable')
grid.map(sns.kdeplot, 'value')
grid.add_legend()

Out[11]:

<seaborn.axisgrid.FacetGrid at 0x1bed2231cc8>

Explicit algorithm distributions:

In [12]:

grid = sns.FacetGrid(select_explicit(rec_props), row='Set', col='Algorithm', hue='variable')
grid.map(sns.kdeplot, 'value')
grid.add_legend()

Out[12]:

<seaborn.axisgrid.FacetGrid at 0x1bed2ba9508>

Let's look at correlations!¶

Implicit feedback mode¶

First, extract the recommendation data:

In [13]:

imp_rd = select_implicit(recs)
imp_rd = imp_rd.melt(['Set', 'user', 'Algorithm'], ['PropFemale', 'PropDC'], value_name='recs')
imp_rd.set_index(['Set', 'user', 'Algorithm', 'variable'], inplace=True)
imp_rd.sort_index(inplace=True)
imp_rd

Out[13]:

				recs
Set	user	Algorithm	variable
AZ	529	als	PropDC	0.489796
		als	PropFemale	0.800000
		bpr	PropDC	0.466667
		bpr	PropFemale	0.464286
		item-item	PropDC	0.600000
...	...	...	...	...
GR	875516	bpr	PropFemale	0.093023
		item-item	PropDC	0.400000
		item-item	PropFemale	0.361111
		user-user	PropDC	0.520000
		user-user	PropFemale	0.083333

119950 rows × 1 columns

Now the profile data:

In [14]:

imp_prof = profiles.reset_index()
imp_prof = imp_prof[~imp_prof['Set'].str.endswith('-E')]
imp_prof['Set'] = imp_prof['Set'].str.replace('-I$', '')
imp_prof = imp_prof.melt(['Set', 'user'], ['PropFemale', 'PropDC'], value_name='profile')
imp_prof.set_index(['Set', 'user', 'variable'], inplace=True)
imp_prof.sort_index(inplace=True)
imp_prof

Out[14]:

			profile
Set	user	variable
AZ	529	PropDC	0.375000
	529	PropFemale	0.800000
	1723	PropDC	0.560000
	1723	PropFemale	0.285714
	1810	PropDC	0.125000
...	...	...	...
GR	875408	PropFemale	0.284211
	875441	PropDC	0.485714
	875441	PropFemale	0.821990
	875516	PropDC	0.523148
	875516	PropFemale	0.115044

30000 rows × 1 columns

And now merge it:

In [15]:

joint = imp_prof.join(imp_rd)
joint

Out[15]:

				profile	recs
Set	user	variable	Algorithm
AZ	529	PropDC	als	0.375000	0.489796
			bpr	0.375000	0.466667
			item-item	0.375000	0.600000
			user-user	0.375000	0.489796
		PropFemale	als	0.800000	0.800000
...	...	...	...	...	...
GR	875516	PropDC	user-user	0.523148	0.520000
		PropFemale	als	0.115044	0.040000
			bpr	0.115044	0.093023
			item-item	0.115044	0.361111
			user-user	0.115044	0.083333

119950 rows × 2 columns

And compute profile-rec correlations:

In [16]:

joint.groupby(['Set', 'Algorithm', 'variable']).apply(lambda df: df['profile'].corr(df['recs'])).unstack()

Out[16]:

	variable	PropDC	PropFemale
Set	Algorithm
AZ	als	-0.024951	0.822392
	bpr	-0.021407	0.630675
	item-item	0.012452	0.705908
	user-user	0.005395	0.803753
BX	als	-0.002632	0.735219
	bpr	-0.001114	0.774075
	item-item	-0.023173	0.560200
	user-user	-0.045251	0.595326
GR	als	-0.008759	0.902928
	bpr	0.001138	0.892903
	item-item	-0.064934	0.773523
	user-user	-0.062454	0.813079

Explicit-feedback mode¶

First, extract the recommendation data:

In [17]:

exp_rd = select_explicit(recs)
exp_rd = exp_rd.melt(['Set', 'user', 'Algorithm'], ['PropFemale', 'PropDC'], value_name='recs')
exp_rd.set_index(['Set', 'user', 'Algorithm', 'variable'], inplace=True)
exp_rd.sort_index(inplace=True)
exp_rd

Out[17]:

				recs
Set	user	Algorithm	variable
AZ	529	als	PropDC	0.522727
		als	PropFemale	0.296296
		item-item	PropDC	0.512195
		item-item	PropFemale	0.684211
		user-user	PropDC	0.613636
...	...	...	...	...
GR	875920	user-user	PropFemale	0.192308
	875927	item-item	PropDC	0.560000
		item-item	PropFemale	0.500000
		user-user	PropDC	0.540000
		user-user	PropFemale	0.304348

68342 rows × 1 columns

Now the profile data:

In [18]:

exp_prof = profiles.reset_index()
exp_prof = exp_prof[~exp_prof['Set'].str.endswith('-I')]
exp_prof['Set'] = exp_prof['Set'].str.replace('-E$', '')
exp_prof = exp_prof.melt(['Set', 'user'], ['PropFemale', 'PropDC'], value_name='profile')
exp_prof.set_index(['Set', 'user', 'variable'], inplace=True)
exp_prof.sort_index(inplace=True)
exp_prof

Out[18]:

			profile
Set	user	variable
AZ	529	PropDC	0.375000
	529	PropFemale	0.800000
	1723	PropDC	0.560000
	1723	PropFemale	0.285714
	1810	PropDC	0.125000
...	...	...	...
GR	875863	PropFemale	0.964602
	875920	PropDC	0.451613
	875920	PropFemale	0.133333
	875927	PropDC	0.608696
	875927	PropFemale	0.588235

30000 rows × 1 columns

And now merge it:

In [19]:

joint = exp_prof.join(exp_rd)
joint

Out[19]:

				profile	recs
Set	user	variable	Algorithm
AZ	529	PropDC	als	0.375000	0.522727
			item-item	0.375000	0.512195
			user-user	0.375000	0.613636
		PropFemale	als	0.800000	0.296296
		PropFemale	item-item	0.800000	0.684211
...	...	...	...	...	...
GR	875920	PropFemale	user-user	0.133333	0.192308
	875927	PropDC	item-item	0.608696	0.560000
		PropDC	user-user	0.608696	0.540000
		PropFemale	item-item	0.588235	0.500000
		PropFemale	user-user	0.588235	0.304348

68364 rows × 2 columns

And compute profile-rec correlations:

In [20]:

joint.groupby(['Set', 'Algorithm', 'variable']).apply(lambda df: df['profile'].corr(df['recs'])).unstack()

Out[20]:

	variable	PropDC	PropFemale
Set	Algorithm
AZ	als	-0.030883	0.220817
	item-item	0.011453	0.619091
	user-user	-0.026146	0.535963
BX	item-item	-0.034358	0.296007
BX	user-user	0.001537	0.379183
GR	item-item	-0.035269	0.622149
GR	user-user	0.006073	0.487080

In [ ]: