Progress report 2

Asura Enkhbayar, 15.06.2020

This report contains a brief summary of the methodology and a few descriptive statistics to assess it. Furthermore, some statistics and plots are generated for the citations and altmetrics gathered for the current data.

In [63]:
from pathlib import Path

from IPython.display import Markdown as md

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib_venn import venn3
import pandas as pd
import seaborn as sns

import numpy as np

from tracking_grants import references_f, articles_f, wos_f, altmetric_f
In [64]:
# Seaborn styles
sns.set_style("darkgrid")

# Matplotlib figure configuration fonts and figsizes
plt.rcParams.update({
    'font.family':'sans-serif',
    'font.size': 16.0,
    'text.usetex': False,
    'figure.figsize': (11.69,8.27)
})

# Color palette
cm = "Paired"
cp3 = sns.color_palette(cm, 3)
cp10 = sns.color_palette(cm, 10)

Methodology

1. Input data

The input data comes from the CDMRP database (https://cdmrp.army.mil/search.aspx). It contains unstructured references and grant IDs for each article in the selected research programs.
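For reference, this is the structure the loading code below expects. A minimal, hypothetical sample (column names inferred from how the data is used in this notebook; all values made up):

# Hypothetical sample mirroring the expected input schema
sample = pd.DataFrame({
    "reference_id": [0, 1],
    "reference": [
        "Smith J, et al. An article title. Some Journal. 2015;12(3):45-67.",
        "Doe A, et al. Another article title. Other Journal. 2017;8(1):1-10.",
    ],
    "grant_id": ["PC150001", "NF160002"],  # CDMRP grant IDs (made up)
    "program": ["PCRP", "NFRP"],           # research program keywords
}).set_index("reference_id")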

In [65]:
# Load references
refs = pd.read_csv(references_f, index_col="reference_id")
In [66]:
keywords = refs.program.unique().tolist()
mds = f"""
**Data from CDMRP database**

- Selected research programs: { ", ".join(keywords) }
- Total number of references found: { len(refs) }
- Unique references: { refs.reference.nunique() } => several references appear in multiple research programs and are affiliated with multiple grants
    - Unique grant IDs: { refs.grant_id.nunique() }
"""
md(mds)
Out[66]:

Data from CDMRP database

  • Selected research programs: TSCRP, NFRP, PCRP, PRORP
  • Total number of references found: 9078
  • Unique references: 8620 => several references appear in multiple research programs and are affiliated with multiple grants
    • Unique grant IDs: 2227

2. Matching references to articles

From here on, we use a reference matcher developed by Dominika Tkaczyk at Crossref. The algorithm is described in some detail in a blog series: https://www.crossref.org/blog/matchmaker-matchmaker-make-me-a-match/

The algorithm is implemented in Java and is no longer actively maintained; however, it seems to work fine. Dominika recommends running with up to 20 threads (already implemented). With some workarounds, we could also add progress reporting for larger datasets, which is not supported at the moment.

In our case, I ran the 8620 unique references from our dataset through the matcher and obtained 6711 matched articles. Each matched article comes with a matching score; the implementation applies its own thresholds and simply returns nulls for references that do not match well enough.
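For illustration, the core idea of the matcher can be sketched in a few lines of Python against the public Crossref REST API. This is not the Java implementation; the score threshold below is an arbitrary placeholder, whereas the real matcher applies its own validated thresholds:

import requests
from concurrent.futures import ThreadPoolExecutor

def match_reference(ref, threshold=75.0):
    # Query Crossref's bibliographic search for the single best candidate.
    r = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref, "rows": 1},
    )
    items = r.json()["message"]["items"]
    # Return the DOI only if the relevance score clears the (placeholder) threshold.
    if items and items[0]["score"] >= threshold:
        return items[0]["DOI"]
    return None

# Up to 20 parallel requests, mirroring the recommendation above.
with ThreadPoolExecutor(max_workers=20) as pool:
    matched_dois = list(pool.map(match_reference, refs.reference.unique()))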

In [67]:
# Load matched articles
articles = pd.read_csv(articles_f)

articles.DOI = articles.DOI.str.lower()
In [68]:
mds = f"""
**Matching references with Crossref**

- Matched articles: { articles.DOI.nunique() }
- Articles in multiple research programs: { (articles.groupby("DOI").program.nunique()>1).sum() }
- Articles funded by multiple grants: { (articles.groupby("DOI").grant_id.nunique()>1).sum() }
"""
md(mds)
Out[68]:

Matching references with Crossref

  • Matched articles: 6711
  • Articles in multiple research programs: 198
  • Articles funded by multiple grants: 829

3. Retrieving metrics

Once the references have been matched to articles in Crossref, we use the DOIs to retrieve metrics from the Web of Science and Altmetric.
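The Web of Science side goes through a licensed API, so only the Altmetric.com lookup is sketched here, using the free public endpoint (one request per DOI; the free tier is rate limited, which this sketch ignores):

import requests

def fetch_altmetrics(doi):
    # Altmetric.com answers 404 for articles without any recorded events.
    r = requests.get(f"https://api.altmetric.com/v1/doi/{doi}")
    return r.json() if r.status_code == 200 else None

records = {doi: fetch_altmetrics(doi) for doi in articles.DOI.unique()}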

In [69]:
# Load metrics from WoS
wos = pd.read_csv(wos_f,  low_memory=False, index_col="DOI")
wos.columns = [x.lower() for x in wos.columns.tolist()]
wos.index = wos.index.str.lower()

wos = wos.rename(columns={'relative citation score':'citation_score'})
In [70]:
# Load metrics from Altmetric
altmetrics = pd.read_json(altmetric_f).T

# Filter out all articles without any altmetrics
altmetrics = altmetrics[altmetrics.altmetric_id.notna()]
In [71]:
dates = ["last_updated", "published_on", "added_on"]
for d in dates:
    altmetrics[d] = pd.to_datetime(altmetrics[d], unit="s")

str_cols = ["pmid", "pmc", "altmetric_id", "doi", 'hollis_id', "arxiv_id"]
for col in str_cols:
    altmetrics[col] = altmetrics[col].astype(str)
    
metric_cols = {
    'cited_by_posts_count': 'posts_count',
    'cited_by_rh_count': 'research_highlight',
    'cited_by_tweeters_count': 'twitter_accounts',
    'cited_by_patents_count': 'patents',
    'cited_by_msm_count': 'news_outlets',
    'cited_by_feeds_count': 'blogs',
    'cited_by_fbwalls_count': 'fb_pages',
    'cited_by_qna_count': 'qna_count',
    'cited_by_videos_count': 'videos',
    'cited_by_peer_review_sites_count': 'peer_reviews',
    'cited_by_weibo_count': 'weibo',
    'cited_by_gplus_count': 'gplus',
    'cited_by_rdts_count': 'reddit_threads',
    'cited_by_policies_count': 'policies',
    'cited_by_syllabi_count': 'syllabi',
    'cited_by_linkedin_count': 'linkedin',
    'cited_by_wikipedia_count': 'wikipedia',
}
altmetrics = altmetrics.rename(columns=metric_cols)
metric_cols = list(metric_cols.values())

altmetrics[metric_cols] = altmetrics[metric_cols].astype(float)

cols_to_keep = metric_cols + dates + str_cols + ['subjects', 'scopus_subjects']
altmetrics = altmetrics[cols_to_keep]

# Transform all DOIs to lowercase
altmetrics.index = altmetrics.index.str.lower()

Results in detail

In [122]:
metrics = articles.drop_duplicates().merge(altmetrics[metric_cols], left_on="DOI", right_index=True, how="left")
metrics = metrics.merge(wos[["citations", "citation_score"]], left_on="DOI", right_index=True, how="left")
In [123]:
# Replace zeros with NaN across all metric columns
metrics = metrics.replace(0.0, np.nan)
In [124]:
mds = f"""
**Metrics from the Web of Science and Altmetric.com**

- Articles found in WoS: { len(wos) }
    - Articles with at least 1 citation: { metrics.citations.count() } (counted over article/grant rows, hence can exceed the WoS article count)
- Articles found in Altmetric.com: { len(altmetrics) }
    - Articles with tweets: { altmetrics.twitter_accounts.notna().sum() }
    - Articles with FB mentions: { altmetrics.fb_pages.notna().sum() }
"""
md(mds)
Out[124]:

Metrics from the Web of Science and Altmetric.com

  • Articles found in WoS: 2688
    • Articles with at least 1 citation: 3298 (counted over article/grant rows, hence can exceed the WoS article count)
  • Articles found in Altmetric.com: 3372
    • Articles with tweets: 1488
    • Articles with FB mentions: 432

Coverage

In [76]:
all_articles = set(articles.DOI.unique().tolist())
articles_w_altm = set(altmetrics.index.tolist())
articles_w_cit = set(wos.index.tolist())

total = len(all_articles)

v = venn3([all_articles, articles_w_altm, articles_w_cit],
      set_labels=('', '', ''),
      subset_label_formatter=lambda x: "{:,} ({:.1f}%)".format(x, 100*x/total));

v.get_patch_by_id('100').set_color(cp3[0])
v.get_patch_by_id('110').set_color(np.add(cp3[0], cp3[1])/2)
v.get_patch_by_id('101').set_color(np.add(cp3[0], cp3[2])/2)
v.get_patch_by_id('111').set_color(np.add(np.add(cp3[1], cp3[0]), cp3[2]) / 3)

for text in v.set_labels:
    text.set_fontsize(10)

handles = []
labels=["All articles", "Altmetric", "WoS"]
for l, c in zip(labels, [cp3[0], np.add(cp3[0], cp3[1])/2, np.add(cp3[0], cp3[2])/2]):
    handles.append(mpatches.Patch(color=c, label=l))
plt.legend(handles=handles);


Disciplines

In [77]:
wos.groupby("discipline").size().div(len(wos)/100).sort_values().plot(kind="barh", color=cp3[0])
plt.title(f"Relative number of articles in WoS disciplines (n={len(wos)})")
sns.despine()
In [78]:
pdf = pd.get_dummies(altmetrics['scopus_subjects'].apply(pd.Series).stack()).sum(level=0).melt()
pdf.groupby("variable")['value'].sum().div(len(altmetrics)/100).sort_values().plot(kind="barh", color=cp3[0])
plt.title(f"Relative number of articles in Scopus subjects (n={len(altmetrics)})")
sns.despine()

Counts

In [79]:
# Coverage: share of rows with a non-null value for each metric column
pdf = metrics.select_dtypes("number").drop(columns="citation_score").count().div(len(metrics)/100).sort_values()
pdf.plot(kind="barh", color=cp3[0])
plt.xlabel("Coverage [%]")
count_order = pdf.index.tolist()
In [80]:
pdf = metrics.melt(id_vars="DOI", value_vars=count_order)
sns.boxenplot(y="variable", x="value", data=pdf, order=count_order[::-1], k_depth="trustworthy", color=cp3[0])
plt.ylabel("")
plt.xlabel("Count")
plt.xscale("log")
ticks = [1,2,3,5,10,100,1000]
plt.xticks(ticks, ticks);
In [81]:
wos.groupby("discipline").citation_score.mean().to_frame("avg_citation_score")
Out[81]:
discipline                  avg_citation_score
Biology                               2.366737
Biomedical Research                   2.236923
Chemistry                             1.434047
Clinical Medicine                     2.030200
Earth and Space                       0.982333
Engineering and Technology            2.124703
Health                                0.787700
Mathematics                           1.490500
Physics                               1.435974
Professional Fields                   1.083000
Psychology                            0.115000
In [82]:
pdf = wos.melt(id_vars="discipline", value_vars="citation_score")
sns.boxenplot(y="discipline", x="value", data=pdf)
plt.xlim(0,15);

Web of Science in detail

In [83]:
pdf = pd.DataFrame(index=range(2000,2020))
pdf['count'] = wos.groupby("year").size()
pdf['count'].plot(kind="bar", color=cp3[0]);
sns.despine()
plt.title("Articles indexed in WoS by years");
In [84]:
pdf = wos.replace(0, np.nan).melt(id_vars="year", value_vars="citations")
# Fill in missing years so the x-axis stays continuous
for y in range(int(pdf.year.min()), int(pdf.year.max()) + 1):
    if y not in pdf.year.values:
        pdf.loc[len(pdf) + 1] = [y, "citations", np.nan]
sns.boxenplot(x="year", y="value", data=pdf, color=cp3[0])
plt.title("Letter-value plot of citation counts by year")
plt.ylim(1,2000)
plt.yscale("log")
plt.ylabel("")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
In [85]:
pdf = pd.DataFrame(index=range(2000,2020))
pdf['mean'] = wos.groupby("year").citations.mean()
pdf['mean'].plot(kind="bar", color=cp3[0])
sns.despine()
plt.title("Average citations for articles by year");

Altmetrics in detail

In [86]:
altmetrics['year'] = altmetrics.published_on.map(lambda x: str(x.year))
In [87]:
pdf = altmetrics.groupby("year").size().to_frame("count")
pdf.index = pdf.index.astype(float)
for y in range(int(pdf.index.min()), int(pdf.index.max()+1)):
    if y not in pdf.index:
        pdf.loc[y] = [np.nan]
In [88]:
# Reuse the year-filled frame from the previous cell, restricted to recent years
pdf = pdf[pdf.index > 1996]
pdf.plot(kind="bar", color=cp3[0]);
plt.title("Articles indexed in Altmetric.com by year");
In [89]:
plt_metrics = ['twitter_accounts', 'fb_pages']
pdf = altmetrics.melt(id_vars="year", value_vars=plt_metrics, value_name="counts")
pdf = pdf[pdf.year.astype(float)>1996]
pdf = pdf.dropna()

fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxenplot(x="year", y="counts", hue="variable", data=pdf, ax=ax, palette="Paired")
plt.yscale("log")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
plt.title("Altmetrics counts by year for select metrics");
In [90]:
plt_metrics = ['news_outlets', 'wikipedia']
pdf = altmetrics.melt(id_vars="year", value_vars=plt_metrics, value_name="counts")
pdf = pdf[pdf.year.astype(float)>1996]
pdf = pdf.dropna()

fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxenplot(x="year", y="counts", hue="variable", data=pdf, ax=ax, palette="Paired")
plt.yscale("log")
ticks = [1,2,3,5,10,20,30,50,100,200,300,500,1000]
plt.yticks(ticks, ticks);
plt.title("Altmetrics counts by year for select metrics");

By Research Programs

In [271]:
df = refs.groupby("program").size().to_frame("total references")
df['found in crossref'] = articles.groupby("program").DOI.nunique()
df["found (%)"] = 100 * df['found in crossref'] / refs.groupby("program").size()
df.columns = ["References", "Found DOI", "Found (%)"]
df = df.sort_values("References")
df.round(2)
Out[271]:
program    References    Found DOI    Found (%)
NFRP              208          197        94.71
TSCRP             208          197        94.71
PRORP             354          271        76.55
PCRP             8308         6246        75.18
In [272]:
cs = ['coci_citations', 'citations', 'posts_count']

x = metrics.groupby(["DOI", "program"])[cs].mean().reset_index().groupby("program").count()[cs]
x[cs] = x[cs].apply(lambda col: 100 * col / articles.groupby("program").DOI.nunique())
x.columns = ["COCI (Cov in %)", "WoS (Cov in %)", "Altmetric (Cov in %)"]
x.reindex(df.index).round(2)
Out[272]:
program    COCI (Cov in %)    WoS (Cov in %)    Altmetric (Cov in %)
NFRP                 97.97             66.50                   59.39
TSCRP                97.97             66.50                   59.39
PRORP                96.68             71.96                   56.09
PCRP                 96.13             36.60                   49.70
In [215]:
df = metrics.groupby(["grant_id", "program"])[['coci_citations', 'citations', 'twitter_accounts', 'citation_score']].mean()
df['count'] = metrics.groupby(["grant_id", "program"]).size()
df = df.sort_values("count").reset_index()
df
Out[215]:
       grant_id    program  coci_citations   citations  twitter_accounts  citation_score  count
0      MP980015    PCRP           1.000000         NaN               NaN             NaN      1
1      PC081249P1  PCRP         176.000000  163.000000               NaN        4.770000      1
2      PC081249    PCRP         176.000000  163.000000               NaN        4.770000      1
3      PC081246    PCRP          36.000000   30.000000               NaN        0.575000      1
4      PC081176    PCRP          19.000000   12.000000          1.000000        0.672000      1
...         ...     ...              ...         ...               ...             ...    ...
2239   PC100473    PCRP          89.487805   71.633333          2.153846        3.444767     41
2240   PC010267    PCRP          23.527273   20.500000               NaN        0.779500     57
2241   PC051369    PCRP          43.629630   30.780000         11.272727        1.319580     83
2242   PC081610    PCRP          73.276316  101.909091          4.750000        3.421500     97
2243   PC021004    PCRP         133.000000   83.000000          1.200000        1.388000    136

2244 rows × 7 columns

In [219]:
sns.scatterplot(x="citation_score", y="twitter_accounts", hue="program", size="count", data=df, sizes=(50,300), alpha=.5);

To-Do

New data to get

  • Get metadata from Crossref
  • Get OA status from Unpaywall (see the sketch after this list)
  • clinicaltrials.gov
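The Unpaywall lookup is a single REST call per DOI. A minimal sketch; the email parameter is required by the API, and the address below is a placeholder:

import requests

def fetch_oa_status(doi, email="name@example.com"):
    # Unpaywall requires an email address for identification (placeholder here).
    r = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": email})
    return r.json().get("oa_status") if r.status_code == 200 else None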

Available data from DoD

  • Patents
  • Drugs to market

Notes

  • Do analysis for grants:
    • award size
    • some further grant-level measures

Next thing to do:

  • Put together some plots that present the gist of these results