This notebook prepares the user-profile data used for Bayesian inference in the model analysis.
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from plotnine import *
import ujson
# Project-local configuration and notebook helpers.
from bookgender.config import data_dir
from bookgender.nbutils import *
# Set up this notebook's figure output directory (prints 'using figure dir ...').
fig_dir = init_figs('ProfileData')
using figure dir figures/ProfileData
We need to load the author-gender information:
# Load the item -> author-gender mapping as a Series indexed by item.
book_gender = (
    pd.read_csv(data_dir / 'author-gender.csv.gz')
      .set_index('item')['gender']
)
book_gender.describe()
count 12234574 unique 6 top male freq 3645216 Name: gender, dtype: object
# Collapse the three distinct "no link found" codes into a single 'unlinked' value.
for _code in ('no-viaf-author', 'no-loc-author', 'no-loc-book'):
    book_gender[book_gender == _code] = 'unlinked'
book_gender = book_gender.astype('category')
book_gender.unique()
[male, unlinked, female, unknown, ambiguous] Categories (5, object): [male, unlinked, female, unknown, ambiguous]
We also load the book hashes, which we use to construct a dummy (coin-flip) bias variable:
# Load the book-hash table; rename the cluster column to match the ratings data.
book_hash = pd.read_parquet(data_dir / 'book-hash.parquet').rename(columns={'cluster': 'item'})
# Dummy "coin-flip" variable: parity of the final hex digit of each MD5 hash.
book_hash['dcode'] = book_hash['md5'].str[-1].apply(lambda h: int(h, 16) % 2)
book_hash = book_hash.set_index('item')
book_hash.head()
nisbns | md5 | dcode | |
---|---|---|---|
item | |||
0 | 17 | 3781b82fabd530590c70cac955b52bb0 | 0 |
1 | 2 | 4c6606ab43bfbe946a436c0ce7633a7a | 0 |
2 | 38 | e16249d40bf94b35d8a784d73d0511c5 | 1 |
3 | 2 | 289071ab1041c090ac252616a76fe079 | 1 |
4 | 4 | 7308735b39347b616ee6be0ab093541e | 0 |
Load the sample user ratings for each data set:
# Load the sampled study ratings; the rating value itself is not needed here.
user_ratings = (
    pd.read_csv(data_dir / 'study-ratings.csv')
      .drop(columns=['rating'])
      .rename(columns={'dataset': 'Set'})
)
user_ratings.head()
Set | user | item | timestamp | nactions | first_time | last_time | |
---|---|---|---|---|---|---|---|
0 | AZ | 4975592 | 462114 | 1.233101e+09 | NaN | NaN | NaN |
1 | AZ | 4975592 | 1662785 | 1.233101e+09 | NaN | NaN | NaN |
2 | AZ | 4975592 | 7287509 | 1.233101e+09 | NaN | NaN | NaN |
3 | AZ | 4975592 | 8866889 | 1.233101e+09 | NaN | NaN | NaN |
4 | AZ | 4975592 | 10031188 | 1.233101e+09 | NaN | NaN | NaN |
# Attach author gender to each rating; items without a gender record are 'unlinked'.
user_ratings = user_ratings.join(book_gender, on='item', how='left')
# Explicit column assignment instead of inplace fillna on a column selection —
# chained-assignment inplace fillna is deprecated in pandas >= 2.1 and an error
# under copy-on-write in pandas 3.
user_ratings['gender'] = user_ratings['gender'].fillna('unlinked')
# Attach the dummy-bias code for each item.
user_ratings = user_ratings.join(book_hash['dcode'], on='item', how='left')
user_ratings.head(15)
Set | user | item | timestamp | nactions | first_time | last_time | gender | dcode | |
---|---|---|---|---|---|---|---|---|---|
0 | AZ | 4975592 | 462114 | 1.233101e+09 | NaN | NaN | NaN | male | 0.0 |
1 | AZ | 4975592 | 1662785 | 1.233101e+09 | NaN | NaN | NaN | male | 1.0 |
2 | AZ | 4975592 | 7287509 | 1.233101e+09 | NaN | NaN | NaN | male | 1.0 |
3 | AZ | 4975592 | 8866889 | 1.233101e+09 | NaN | NaN | NaN | unknown | 0.0 |
4 | AZ | 4975592 | 10031188 | 1.233101e+09 | NaN | NaN | NaN | male | 0.0 |
5 | AZ | 4975592 | 43930 | 1.233274e+09 | NaN | NaN | NaN | male | 1.0 |
6 | AZ | 4975592 | 11005013 | 1.233274e+09 | NaN | NaN | NaN | unknown | 1.0 |
7 | AZ | 4975592 | 3649898 | 1.233619e+09 | NaN | NaN | NaN | male | 1.0 |
8 | AZ | 4975592 | 8462465 | 1.233619e+09 | NaN | NaN | NaN | male | 0.0 |
9 | AZ | 4975592 | 3727790 | 1.234656e+09 | NaN | NaN | NaN | male | 0.0 |
10 | AZ | 4975592 | 856425 | 1.234829e+09 | NaN | NaN | NaN | male | 0.0 |
11 | AZ | 4975592 | 4457778 | 1.234829e+09 | NaN | NaN | NaN | unknown | 1.0 |
12 | AZ | 4975592 | 1551960 | 1.235002e+09 | NaN | NaN | NaN | male | 0.0 |
13 | AZ | 4975592 | 2835913 | 1.238026e+09 | NaN | NaN | NaN | unknown | 0.0 |
14 | AZ | 4975592 | 8845927 | 1.238544e+09 | NaN | NaN | NaN | unknown | 1.0 |
Now we will summarize user profiles:
def summarize_profile(df):
    """Summarize one user's rating profile as a Series of gender/dcode statistics.

    Expects a frame with 'gender' (category incl. 'unlinked'/'ambiguous') and
    'dcode' (0/1 dummy code, NaN when unknown) columns.
    """
    gender = df['gender']
    dcode = df['dcode']
    n_male = np.sum(gender == 'male')
    n_female = np.sum(gender == 'female')
    known = n_male + n_female
    n_items = len(df)
    return pd.Series({
        'count': n_items,
        'linked': np.sum(gender != 'unlinked'),
        'ambiguous': np.sum(gender == 'ambiguous'),
        'male': n_male,
        'female': n_female,
        'dcknown': dcode.count(),         # non-missing dummy codes
        'dcyes': dcode.sum(skipna=True),
        'PropDC': dcode.mean(),
        'Known': known,
        'PropFemale': n_female / known,
        'PropKnown': known / n_items,
    })
# Summarize every (Set, user) profile.
profiles = user_ratings.groupby(['Set', 'user']).apply(summarize_profile)
# Prop* columns stay float; store all the count columns as 32-bit integers.
int_cols = [c for c in profiles.columns if not c.startswith('Prop')]
profiles[int_cols] = profiles[int_cols].astype('i4')
profiles.head()
count | linked | ambiguous | male | female | dcknown | dcyes | PropDC | Known | PropFemale | PropKnown | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Set | user | |||||||||||
AZ | 529 | 8 | 8 | 2 | 1 | 4 | 8 | 3 | 0.375000 | 5 | 0.800000 | 0.625000 |
1723 | 25 | 24 | 3 | 15 | 6 | 25 | 14 | 0.560000 | 21 | 0.285714 | 0.840000 | |
1810 | 14 | 6 | 0 | 6 | 0 | 8 | 1 | 0.125000 | 6 | 0.000000 | 0.428571 | |
2781 | 8 | 8 | 1 | 5 | 1 | 8 | 5 | 0.625000 | 6 | 0.166667 | 0.750000 | |
2863 | 6 | 6 | 0 | 6 | 0 | 6 | 4 | 0.666667 | 6 | 0.000000 | 1.000000 |
How are profile sizes distributed?
# Histograms of profile sizes (all items, and known-gender items) per data set.
pdata = profiles.reset_index()
grid = sns.FacetGrid(pdata, col='Set', sharex=False, sharey=False, height=2)
grid.map(sns.distplot, 'count')
plt.savefig(fig_dir / 'profile-size-all.pdf')
grid = sns.FacetGrid(pdata, col='Set', sharex=False, sharey=False, height=2)
grid.map(sns.distplot, 'Known')
plt.savefig(fig_dir / 'profile-size-known.pdf')
For the paper, we want to present this differently: we will show the profile-size distributions as a scatter plot.
# Reshape to long form: one row per (Set, user, size-type) for plotting.
up_sizes = (
    profiles[['count', 'Known']]
    .reset_index()
    .melt(id_vars=['Set', 'user'], var_name='Type', value_name='Size')
)
_type_labels = {'count': 'All', 'Known': 'Known-Gender'}
up_sizes['Type'] = up_sizes['Type'].astype('category').cat.rename_categories(_type_labels)
up_sizes.head()
Set | user | Type | Size | |
---|---|---|---|---|
0 | AZ | 529 | All | 8 |
1 | AZ | 1723 | All | 25 |
2 | AZ | 1810 | All | 14 |
3 | AZ | 2781 | All | 8 |
4 | AZ | 2863 | All | 6 |
# Count how many users have each profile size, per set and size type.
size_counts = (
    up_sizes.groupby(['Set', 'Type', 'Size'])['user']
    .count()
    .reset_index(name='Users')
)
size_counts = size_counts.query('Users > 0')
size_counts.head()
Set | Type | Size | Users | |
---|---|---|---|---|
0 | AZ | Known-Gender | 5 | 1300 |
1 | AZ | Known-Gender | 6 | 829 |
2 | AZ | Known-Gender | 7 | 572 |
3 | AZ | Known-Gender | 8 | 416 |
4 | AZ | Known-Gender | 9 | 284 |
# Sanity-check the spread of user counts before plotting on log scales.
size_counts['Users'].describe()
count 4202.000000 mean 11.899096 std 46.498697 min 1.000000 25% 1.000000 50% 2.000000 75% 7.000000 max 1300.000000 Name: Users, dtype: float64
# Log-log scatter of profile size vs. number of users with that size,
# faceted by size type (rows) and data set (columns).
# make_plot comes from bookgender.nbutils — presumably it assembles the
# plotnine ggplot and saves it under fig_dir; confirm in nbutils.
make_plot(size_counts, aes(x='Size', y='Users'),
          geom_point(),
          scale_x_log10(),
          scale_y_log10(),
          facet_grid('Type ~ Set', scales='free'),
          xlab('# of Consumed Items'),
          ylab('# of Users'),
          panel_grid=element_blank(),
          file='profile-size.pdf', width=7, height=3.2)
/home/MICHAELEKSTRAND/anaconda3/envs/bookfair/lib/python3.7/site-packages/plotnine/ggplot.py:729: PlotnineWarning: Saving 7 x 3.2 in image. from_inches(height, units), units), PlotnineWarning) /home/MICHAELEKSTRAND/anaconda3/envs/bookfair/lib/python3.7/site-packages/plotnine/ggplot.py:730: PlotnineWarning: Filename: figures/ProfileData/profile-size.pdf warn('Filename: {}'.format(filename), PlotnineWarning)
<ggplot: (8782675280353)>
And % known?
# Distribution of the fraction of each profile with known author gender.
grid = sns.FacetGrid(profiles.reset_index(), col='Set')
grid.map(sns.distplot, 'PropKnown')
<seaborn.axisgrid.FacetGrid at 0x7fce06753bd0>
Quick empirical inspection:
# Distribution of the female-author proportion in each profile.
grid = sns.FacetGrid(profiles.reset_index(), col='Set', sharey=False)
grid.map(sns.distplot, 'PropFemale')
<seaborn.axisgrid.FacetGrid at 0x7fce067169d0>
# Mean proportion of female authors per profile, by data set ('Set' is an index level).
profiles.groupby(level='Set')['PropFemale'].mean()
Set AZ 0.414445 BX-E 0.418886 BX-I 0.407030 GR-E 0.446998 GR-I 0.450201 Name: PropFemale, dtype: float64
# Standard deviation of the female-author proportion (std == sqrt of the
# sample variance, both with ddof=1 by default).
profiles.groupby(level='Set')['PropFemale'].std()
Set AZ 0.329436 BX-E 0.267354 BX-I 0.254304 GR-E 0.276480 GR-I 0.269127 Name: PropFemale, dtype: float64
Quick empirical inspection - this should be noise:
# Distribution of the dummy-code proportion — should look like pure noise.
grid = sns.FacetGrid(profiles.reset_index(), row='Set', sharey=False)
grid.map(sns.distplot, 'PropDC')
<seaborn.axisgrid.FacetGrid at 0x7fcdd3aebc10>
We save the profile data frame so it can be reloaded in the Bayesian analysis.
# Persist the profile frame for the Bayesian analysis notebook.
# NOTE(review): this uses a hard-coded 'data/' path rather than data_dir —
# presumably they point to the same directory; confirm before changing.
profiles.to_pickle('data/profile-data.pkl')
We also want to save the data for STAN.
def stan_inputs(data, kc, pc):
    """Assemble the STAN input dictionary for a binomial profile model.

    J is the number of users, n the per-user trial counts (column *kc*),
    and y the per-user success counts (column *pc*).
    """
    trials = data[kc]
    successes = data[pc]
    return {'J': len(data), 'n': trials, 'y': successes}
def inf_dir(sname):
    """Return the inference output directory for data set *sname*."""
    # data_dir comes from bookgender.config; it behaves as a pathlib.Path
    # throughout this notebook.
    return data_dir.joinpath(sname, 'inference')
# Write per-data-set STAN input files: the gender model and the
# dummy-code control model.
for sname, frame in profiles.groupby('Set'):
    print('preparing STAN input for', sname)
    # Renamed from `dir` — that name shadows the `dir` builtin.
    out_dir = inf_dir(sname)
    out_dir.mkdir(exist_ok=True)
    in_fn = out_dir / 'profile-inputs.json'
    in_fn.write_text(ujson.dumps(stan_inputs(frame, 'Known', 'female')))
    in_fn = out_dir / 'profile-dcode-inputs.json'
    in_fn.write_text(ujson.dumps(stan_inputs(frame, 'dcknown', 'dcyes')))
preparing STAN input for AZ preparing STAN input for BX-E preparing STAN input for BX-I preparing STAN input for GR-E preparing STAN input for GR-I