Analyse public tags added to Trove

This notebook loads the public tags that users have added to records in Trove from a CSV file created by the accompanying tag-harvesting notebook. It then attempts some analysis of the tags.

The complete CSV is too large to store on GitHub. You can download it from CloudStor or Zenodo.

User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.

In [2]:
import pandas as pd
import altair as alt
from wordcloud import WordCloud
from IPython.display import display, Image
In [28]:
# You will need to download the CSV file first from CloudStor or Zenodo
df = pd.read_csv('trove_tags_20210710.csv')

Tags by zone

In [8]:
df['zone'].value_counts()
Out[8]:
newspaper     7898606
book           573195
picture         95231
gazette         67312
music           50202
article         21559
list             6456
map              6220
collection       3511
Name: zone, dtype: int64

How many duplicates across zones?

A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. We can quantify this by finding out how many tags there are in the overlapping 'book', 'article', 'picture', 'music', 'map', and 'collection' zones, then dropping duplicates based on the tag, date, and record_id fields.

In [29]:
# Total tags across overlapping zones
df.loc[df['zone'].isin(['book', 'article', 'picture', 'music', 'map', 'collection'])].shape
Out[29]:
(749918, 4)

Now let's remove the duplicates and see how many are left.

In [30]:
df.loc[df['zone'].isin(['book', 'article', 'picture', 'music', 'map', 'collection'])].drop_duplicates(subset=['tag', 'date', 'record_id']).shape
Out[30]:
(695255, 4)

So there are about 55,000 'duplicates'. This doesn't really matter if you want to examine tagging behaviour within zones, but if you're aggregating tags across zones you might want to remove them, as demonstrated below.

Top tags!

If we're going to look at the most common tags across all zones, then we should probably remove the duplicates mentioned above first.

In [39]:
# Dedupe overlapping zones
deduped_works = df.loc[df['zone'].isin(['book', 'article', 'picture', 'music', 'map', 'collection'])].drop_duplicates(subset=['tag', 'date', 'record_id'])

# Non overlapping zones
other_zones = df.loc[df['zone'].isin(['newspaper', 'gazette', 'list'])]

# Combine the two to create a new deduped df
deduped = pd.concat([deduped_works, other_zones])
In [40]:
deduped.shape
Out[40]:
(8667629, 4)

Now let's view the 50 most common tags.

In [41]:
deduped['tag'].value_counts()[:50]
Out[41]:
north shore                    41915
lrrsa                          30549
tccc                           28525
poem                           21135
australian colonial music      20412
l1                             18948
gag cartoon                    17875
melbourne football club        17591
tbd                            15667
political cartoon              15078
fiction                        14651
crossword puzzle               14258
crossword puzzle solution      12686
cammeray golf club             12179
rowing & sculling              11871
slvfix                         11139
corrected in full              10640
serials                        10509
australian laureates           10183
advertising                     9939
illustration type cartoon       9752
t a reynolds                    9449
cricket                         9263
captain e t miles               8863
family notices                  8630
horse destroyed                 8536
animation                       8334
second edition                  8274
cane                            7182
phoenix foundry, ballarat       7129
nature notes                    6861
blondie animation               6749
short story                     6195
dmb                             6039
dora animation                  6009
text corrections complete       5812
rwm                             5790
death                           5642
locomotive                      5628
st. leonards school of arts     5609
yelsel2                         5474
william tunks                   5460
little iodine animation         5434
firewood taxa                   5415
illustration by jimmy hatlo     5361
china                           5275
peanuts animation               5205
william henry ogilvie           5162
australian singers              5139
john hesketh                    5106
Name: tag, dtype: int64

Let's convert the complete tag counts into a new dataframe, and save it as a CSV file.

In [34]:
tag_counts = deduped['tag'].value_counts().to_frame().reset_index()
tag_counts.columns = ['tag', 'count']
In [35]:
tag_counts.to_csv('trove_tag_counts_20210710.csv', index=False)

Let's display the top 200 tags as a word cloud.

In [36]:
# Get the top 200 tags
top_200 = tag_counts[:200].to_dict(orient='records')
In [37]:
# Reshape into a tag:count dictionary. 
top_200 = {tag['tag']: tag['count'] for tag in top_200}
In [38]:
WordCloud(width=800, height=500).fit_words(top_200).to_image()
Out[38]:

Tags on pictures

Most of the tags are on newspaper articles, but we can filter the results to look at the top tags in other zones.

In [15]:
df.loc[df['zone'] == 'picture']['tag'].value_counts()[:20]
Out[15]:
c1                           3235
c3                           2370
sun pic                      1917
hillgrove ww1                1277
morgan harry                 1277
politicians                  1099
photos                        968
aviators and aviation         837
1931                          813
1932                          718
daily telegraph pic           685
1930                          669
1928                          662
ship passengers               612
sydney harbour bridge         600
australian colonial music     588
nsw mlas                      562
1927                          528
sydney harbour                480
1925                          479
Name: tag, dtype: int64
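
The same filtering pattern works for any of the other zones. Here's a minimal sketch of a reusable helper (the get_top_tags name is just for illustration):

# A small helper (illustrative only) to list the most common tags in a given zone
def get_top_tags(df, zone, n=20):
    return df.loc[df['zone'] == zone]['tag'].value_counts()[:n]

# For example, the 20 most common tags in the 'map' zone
get_top_tags(df, 'map')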

View tags by year

We can use the date field to examine when tags were added.

In [16]:
# Convert date to datetime data type
df['date'] = pd.to_datetime(df['date'])
In [17]:
# Create a new column with the year
df['year'] = df['date'].dt.year
In [19]:
# Get counts of tags by year
year_counts = df.value_counts(['year', 'zone']).to_frame().reset_index()
year_counts.columns = ['year', 'zone', 'count']
In [20]:
# Chart tags by year
alt.Chart(year_counts).mark_bar(size=18).encode(
    x=alt.X('year:Q', axis=alt.Axis(format='c')),
    y=alt.Y('count:Q', stack=True),
    color='zone:N',
    tooltip=['year:Q', 'count:Q', 'zone:N']
)
Out[20]:

An obvious feature in the chart above is the large number of tags in zones other than 'newspaper' that were added in 2009. From memory, I believe these 'tags' were automatically ingested from related Wikipedia pages. Unlike the bulk of the tags, they were not added by individual users, so if your interest is user activity you might want to exclude them by filtering on date or zone.
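
For example, here's a minimal sketch (not part of the original analysis) of one way to set the ingested tags aside, using the year column created above and assuming the ingested tags are the non-newspaper tags from 2009:

# A rough sketch: keep newspaper tags from 2009, but drop tags added to
# other zones in that year, on the assumption that they were machine ingested
user_tags = df.loc[~((df['year'] == 2009) & (df['zone'] != 'newspaper'))]
user_tags.shape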

View tags by month

In [25]:
# This creates a column with the date of the first day of the month in which the tag was added
# We can use this to aggregate by month
df['year_month'] = df['date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
In [26]:
# Get tag counts by month
month_counts = df.value_counts(['year_month', 'zone']).to_frame().reset_index()
month_counts.columns = ['year_month', 'zone', 'count']
In [27]:
alt.Chart(month_counts).mark_bar().encode(
    x='yearmonth(year_month):T',
    y='count:Q',
    color='zone:N',
    tooltip=['yearmonth(year_month):T', 'count', 'zone']
).properties(width=700).interactive()
Out[27]:

So we can see that the machine-generated tags were added in November 2009. We can even zoom in further to see on which days most of the automatically generated tags were ingested.

In [46]:
df.loc[df['year_month'] == '2009-11-01']['date'].dt.floor('d').value_counts()
Out[46]:
2009-11-01 00:00:00+00:00    519462
2009-11-02 00:00:00+00:00     67297
2009-11-28 00:00:00+00:00      1683
2009-11-13 00:00:00+00:00      1645
2009-11-06 00:00:00+00:00      1107
2009-11-08 00:00:00+00:00      1009
2009-11-29 00:00:00+00:00       934
2009-11-16 00:00:00+00:00       891
2009-11-24 00:00:00+00:00       871
2009-11-20 00:00:00+00:00       869
2009-11-03 00:00:00+00:00       867
2009-11-10 00:00:00+00:00       858
2009-11-15 00:00:00+00:00       846
2009-11-21 00:00:00+00:00       837
2009-11-18 00:00:00+00:00       832
2009-11-22 00:00:00+00:00       812
2009-11-14 00:00:00+00:00       786
2009-11-26 00:00:00+00:00       780
2009-11-19 00:00:00+00:00       740
2009-11-05 00:00:00+00:00       704
2009-11-27 00:00:00+00:00       697
2009-11-07 00:00:00+00:00       696
2009-11-17 00:00:00+00:00       666
2009-11-09 00:00:00+00:00       585
2009-11-25 00:00:00+00:00       582
2009-11-11 00:00:00+00:00       553
2009-11-12 00:00:00+00:00       512
2009-11-23 00:00:00+00:00       428
2009-11-04 00:00:00+00:00       411
2009-11-30 00:00:00+00:00       405
Name: date, dtype: int64

View tags by month in newspapers and gazettes

In [49]:
alt.Chart(month_counts.loc[month_counts['zone'].isin(['newspaper', 'gazette'])]).mark_bar().encode(
    x='yearmonth(year_month):T',
    y='count:Q',
    color='zone:N',
    tooltip=['yearmonth(year_month):T', 'count', 'zone']
).properties(width=700)
Out[49]:

What's the trend in newspaper tagging? There seems to have been a drop since the Trove interface was changed, but the month-to-month differences are quite large, so there might be other factors at play.

In [78]:
base = alt.Chart(month_counts.loc[(month_counts['zone'].isin(['newspaper'])) & (month_counts['year_month'] < '2021-07-01')]).mark_point().encode(
    x='yearmonth(year_month):T',
    y='count:Q',
    tooltip=['yearmonth(year_month):T', 'count', 'zone']
).properties(width=700)

polynomial_fit = base.transform_regression(
        'year_month', 'count', method="poly", order=4
    ).mark_line(color="red")


alt.layer(base, polynomial_fit)
Out[78]:

Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.