This notebook shows a few ways you can start to explore the ABC Radio National metadata harvested using this notebook.
For an earlier experiment playing with this data, see In a word...: Currents in Australian affairs, 2003–2013.
import re
from pathlib import Path
import altair as alt
import nltk
import pandas as pd
from nltk.corpus import stopwords
from wordcloud import WordCloud
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to /home/tim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
Load the harvested data.
# Download the most recently harvested data file and convert to a dataframe
df = pd.read_csv("https://cloudstor.aarnet.edu.au/plus/s/ry50ZpoSjOFbb8b/download")
How many records are there?
df.shape[0]
421277
How many programs are there records for?
df["isPartOf"].nunique()
163
Which programs have the most records?
df["isPartOf"].value_counts()[:25]
ABC Radio National. RN Breakfast            63667
ABC Radio. AM                               55936
ABC Radio. The World Today                  51612
ABC Radio. PM                               51213
ABC Radio. RN Breakfast                     19877
ABC Radio National. RN Drive                13779
ABC Radio. RN Drive                         12758
ABC Radio National. Late Night Live         10680
ABC Radio National. Life Matters            10657
ABC Radio. AM Archive                        9825
ABC Radio. PM Archive                        8430
ABC Radio. The World Today Archive           7902
ABC Radio National. The Science Show         6505
ABC Radio National. Saturday Extra           5613
ABC Radio                                    4612
ABC Radio National. Counterpoint             4049
ABC Radio National. Sunday Extra             4005
ABC Radio. Correspondents Report             3927
ABC Radio National. Health Report            3845
ABC Radio National. AWAYE!                   3443
ABC Radio National. Big Ideas                3442
ABC Radio National. The Book Show            3106
ABC Radio National. Books and Arts Daily     2242
ABC Radio National. The Drawing Room         2028
ABC Radio National. Ockham's Razor           1957
Name: isPartOf, dtype: int64
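Note that the same program can appear under more than one name — RN Breakfast, for example, is listed under both 'ABC Radio' and 'ABC Radio National'. If you wanted to combine these variants, one option is to strip the network prefix. This is just a sketch: it assumes the prefix is always followed by a full stop.
# Remove the 'ABC Radio.' / 'ABC Radio National.' prefix so that variant
# forms of the same program are counted together
programs = df["isPartOf"].str.replace(r"^ABC Radio( National)?\.\s*", "", regex=True)
programs.value_counts()[:10]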
To look at the number of records by year, we need to make sure the date field is being recognised as a datetime. Then we can extract the year into a new column.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
df["year"] = df["date"].dt.year.astype("Int64")
Find the number of times each year appears.
year_counts = df["year"].value_counts().to_frame().reset_index()
year_counts.columns = ["year", "count"]
Chart the results.
alt.Chart(year_counts).mark_bar().encode(
x="year:O", y="count:Q", tooltip=["year", alt.Tooltip("count:Q", format=",")]
)
The early records look a bit suspect, and I should probably check them manually. I'm also wondering why there's been such a large decline in the number of records added since 2017.
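One way to start checking those suspect records is simply to display the oldest of them (the column selection here is just for readability, assuming the fields used elsewhere in this notebook):
# Display the ten earliest-dated records for manual inspection
# (NaT dates are sorted to the end by default)
df.sort_values("date").head(10)[["date", "isPartOf", "title"]]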
The contributor field includes the names of hosts, reporters, and guests. It's stored as a pipe-delimited string, so we have to split the string, then explode the resulting list to create one row per name.
people = df["contributor"].str.split("|").explode().dropna()
Then we can calculate how often people appear in the records.
people.value_counts()[:25]
Fran Kelly            56806
Mark Colvin           31871
Eleanor Hall          27027
Robyn Williams        13576
Phillip Adams         13418
Patricia Karvelas     13154
Natasha Mitchell      10826
Tony Eastley          10102
Elizabeth Jackson      7680
Geraldine Doogue       7298
Richard Aedy           7110
Linda Mottram          6615
Peter Cave             5421
Alexandra Kirk         4797
Brendan Trembath       4496
Michael Cathcart       4481
Kim Landers            4477
Sabra Lane             4240
Michael Brissenden     4236
David Fisher           4230
Dr Norman Swan         4150
Peter Ryan             4000
Paul Barclay           3927
Jonathan Green         3800
Amanda Smith           3595
Name: contributor, dtype: int64
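Program hosts and regular reporters inevitably dominate these counts. To peek past them you could drop the most frequent names, though any cutoff is arbitrary — this is just a rough sketch:
# Treat anyone with 2,000 or more appearances as a host/reporter and drop them
counts = people.value_counts()
counts[counts < 2000][:25]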
# Build a word cloud weighted by how often each person appears
wc_people = WordCloud(width=1000, height=500).fit_words(people.value_counts().to_dict())
wc_people.to_image()
There are three text fields that could yield some interesting analysis. The title field is obvious enough, though some regular segments do have duplicate titles. The abstract field is a brief summary of the segment or program. The description field seems to be the beginning of the transcript, but is often much the same as the abstract.
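Before choosing between them, it might be worth measuring how often the description really does just repeat the abstract. A rough check, which assumes both fields are plain strings and ignores leading/trailing whitespace:
# Count records where the description is identical to the abstract
(df["abstract"].str.strip() == df["description"].str.strip()).sum()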
Let's try aggregating the titles for a program.
# Select 2020 records from RN Breakfast (listed under two program names),
# drop any title that appears more than once, and keep the unique titles
breakfast_titles = list(
df.loc[
(
df["isPartOf"].isin(
["ABC Radio National. RN Breakfast", "ABC Radio. RN Breakfast"]
)
)
& (df["year"] == 2020)
]
.drop_duplicates(subset=["title"], keep=False)["title"]
.unique()
)
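A quick check of how many titles are left to feed into the word cloud:
len(breakfast_titles)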
# Generate the word cloud from the aggregated titles, filtering out standard
# English stopwords plus a few extra words that would otherwise dominate
wordcloud = WordCloud(
width=1000,
height=500,
stopwords=stopwords.words("english")
+ [
"Australia",
"Australian",
"Australians",
"New",
"News",
"Matt",
"Bevan",
"World",
],
collocations=False,
).generate(" ".join(breakfast_titles))
wordcloud.to_image()
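If you want to keep a copy of the image rather than just display it, WordCloud can save straight to a file (the filename here is just an example):
# Save the word cloud as a PNG
wordcloud.to_file("breakfast-titles-2020.png")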