FREYA WP2 User Story 8

As a longitudinal study, I want to be able to deduplicate the metrics/impact for our data, so that I can see the impact of our study’s data as a whole.

Scientific datasets may be composed of individual components, where the parent and each component are identified by different DOIs and can therefore be cited, viewed and downloaded individually. To assess the reuse of such datasets, their authors must be able to aggregate view, download and citation metrics across all the dataset components.

This notebook uses the DataCite GraphQL API to retrieve all parts of the 2014 TCCON Data Release dataset, so that its overall impact can be quantified.

Goal: By the end of this notebook, for a given dataset with constituent parts, you should be able to display:

  • Citation, view and download counts, aggregated across the parent dataset and all its parts;
  • An interactive stacked bar plot showing how the metric counts of each part contribute to the corresponding aggregated metric counts.

Install libraries and prepare GraphQL client

In [3]:
%%capture
# Install required Python packages (pandas is needed by the plotting cell)
!pip install gql requests numpy pandas plotly
In [4]:
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)

Define and run GraphQL query

Define the GraphQL query to retrieve the 2014 TCCON Data Release dataset and its parts.

In [5]:
# Generate the GraphQL query to retrieve all parts of the 2014 TCCON Data Release dataset.
query_params = {
    "datasetId" : "https://doi.org/10.14291/tccon.ggg2014"
}

query = gql("""query getDataset($datasetId: ID!)
{
  dataset(id: $datasetId) {
    id
    titles {
      title
    }
    publicationYear
    descriptions {
      description
      descriptionType
    }
    citationCount
    viewCount
    downloadCount
    partCount
    parts {
      nodes {
        id
        titles {
          title
        }
        publicationYear
        descriptions {
          description
          descriptionType
        }
        citationCount
        viewCount
        downloadCount
      }
    }
  }
}
""")

Run the above query via the GraphQL client

In [6]:
# Pass the variables as a dict; the transport serialises them to JSON itself
data = client.execute(query, variable_values=query_params)
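
For readers curious about what travels over the wire, the following is a minimal sketch of the request body that GraphQL servers conventionally accept over HTTP: one JSON object carrying the query text and a `variables` object. The helper name `build_graphql_payload` is hypothetical and not part of gql, the query is shortened for illustration, and the sketch only builds and inspects the body without contacting the API.

```python
import json

def build_graphql_payload(query_text, variables):
    """Hypothetical helper: serialise a query and its variables into one JSON body."""
    return json.dumps({"query": query_text, "variables": variables})

# Shortened query, for illustration only
query_text = "query getDataset($datasetId: ID!) { dataset(id: $datasetId) { id } }"
query_params = {"datasetId": "https://doi.org/10.14291/tccon.ggg2014"}

body = build_graphql_payload(query_text, query_params)
decoded = json.loads(body)

# The variables sit in the body as a nested object keyed by variable name
print(sorted(decoded.keys()))
print(decoded["variables"]["datasetId"])
```

Keeping the variables separate from the query text means the same query can be reused for another dataset by changing only `datasetId`.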

Display total dataset metrics

Display the total number of citations, views and downloads of the 2014 TCCON Data Release dataset, aggregated across the parent dataset and all its parts.

In [7]:
# Get the total count per metric, aggregated for the parent dataset and all its parts
dataset = data['dataset']
# Initialise metric counts for the parent dataset
metricCounts = {}
for metric in ['citationCount', 'viewCount', 'downloadCount']:
    metricCounts[metric] = dataset[metric]
    
# Add the metric counts of each part to the parent's counts
for node in dataset['parts']['nodes']:
    for metric in metricCounts:
        metricCounts[metric] += node[metric]

# Display the aggregated metric counts as a Markdown table
tableBody = ""
for metric in metricCounts:
    tableBody += "%s | **%s**\n" % (metric, str(metricCounts[metric]))
if tableBody:
    display(Markdown("Aggregated metric counts for [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) and its %d parts:" % dataset['partCount']))
    display(Markdown("|Metric | Aggregated Count|\n|---|---|\n%s" % tableBody))

Aggregated metric counts for 2014 TCCON Data Release dataset and its 2 parts:

| Metric | Aggregated Count |
|---|---|
| citationCount | 6 |
| viewCount | 210 |
| downloadCount | 54 |
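
The aggregation step can also be wrapped in a reusable function. The sketch below is one possible shape, not the notebook's own code: `aggregate_metrics` is a hypothetical name, the sample counts are made up for illustration (they are not the TCCON figures), and treating a missing or null count as zero is an assumption. It also compares `partCount` with the number of part nodes actually returned, since a mismatch would mean the totals cover only some of the parts.

```python
METRICS = ['citationCount', 'viewCount', 'downloadCount']

def aggregate_metrics(dataset, metrics=METRICS):
    """Sum each metric over the parent dataset and all returned parts.

    Hypothetical helper; missing or None counts are treated as 0 (an assumption).
    Returns (totals, complete), where complete is False if the number of
    returned part nodes differs from partCount.
    """
    nodes = dataset.get('parts', {}).get('nodes', [])
    totals = {}
    for metric in metrics:
        total = dataset.get(metric) or 0
        for node in nodes:
            total += node.get(metric) or 0
        totals[metric] = total
    complete = dataset.get('partCount', len(nodes)) == len(nodes)
    return totals, complete

# Made-up response fragment with the same shape as data['dataset']
sample = {
    'citationCount': 1, 'viewCount': 10, 'downloadCount': 2, 'partCount': 2,
    'parts': {'nodes': [
        {'citationCount': 2, 'viewCount': 30, 'downloadCount': None},
        {'citationCount': 0, 'viewCount': 5, 'downloadCount': 4},
    ]},
}

totals, complete = aggregate_metrics(sample)
print(totals, complete)
```

With the real response, the call would be `aggregate_metrics(data['dataset'])`; if `complete` comes back `False`, the aggregated counts should be read as a lower bound.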

Plot metric counts per part

Draw a stacked bar plot showing how the individual parts of the 2014 TCCON Data Release dataset contribute their metric counts to the corresponding aggregated totals.

In [8]:
import plotly.io as pio
import plotly.express as px
from IPython.display import IFrame
import pandas as pd

# Adapted from: https://stackoverflow.com/questions/58766305/is-there-any-way-to-implement-stacked-or-grouped-bar-charts-in-plotly-express
def px_stacked_bar(df, color_name='Metric', y_name='Metrics', **pxargs):
    idx_col = df.index.name
    m = pd.melt(df.reset_index(), id_vars=idx_col, var_name=color_name, value_name=y_name)
    # For Plotly colour sequences see: https://plotly.com/python/discrete-color/     
    return px.bar(m, x=idx_col, y=y_name, color=color_name, **pxargs, 
                  color_discrete_sequence=px.colors.qualitative.Pastel1)

# Collect metric counts
dataset = data['dataset']
numParts = dataset['partCount']

# Initialise dicts for the stacked bar plot
labels = {0: 'Dataset and Parts', 1: 'Dataset (%s)' % dataset['publicationYear']}
citationCounts = {}
viewCounts = {}
downloadCounts = {}

# Collect dataset/part labels
partCnt = 0
for node in dataset['parts']['nodes']:
    labels[2 + partCnt] = 'Part %d (%s)' % ((partCnt + 1), node['publicationYear'])
    partCnt += 1
    
# Seed both the aggregated counts (key: 0) and the parent dataset counts (key: 1) with the parent's metrics
for i in range(2):
    citationCounts[i] = dataset['citationCount']
    viewCounts[i] = dataset['viewCount']
    downloadCounts[i] = dataset['downloadCount']
    
# Populate metric counts for individual parts (key: 2 + partCnt) and add them to the aggregated counts (key: 0)
partCnt = 0
for node in dataset['parts']['nodes']:
    citationCounts[0] += node['citationCount']
    viewCounts[0] += node['viewCount']
    downloadCounts[0] += node['downloadCount']
    citationCounts[2 + partCnt] = node['citationCount']
    viewCounts[2 + partCnt] = node['viewCount']
    downloadCounts[2 + partCnt] = node['downloadCount']
    partCnt += 1

# Create stacked bar plot
df = pd.DataFrame({'Dataset/Parts': labels,
                   'Citations': citationCounts,
                   'Views': viewCounts,
                   'Downloads': downloadCounts})
fig = px_stacked_bar(df.set_index('Dataset/Parts'), y_name = "Counts")

# Set plot background to transparent
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})

# Write interactive plot out to html file
pio.write_html(fig, file='out.html')

# Display plot from the saved html file
display(Markdown("Citations, views and downloads counts for [2014 TCCON Data Release dataset](https://doi.org/10.14291/tccon.ggg2014) and individual parts, shown as stacked bar plot:"))
IFrame(src="./out.html", width=500, height=500)

Citations, views and downloads counts for 2014 TCCON Data Release dataset and individual parts, shown as stacked bar plot:

Out[8]: [Interactive stacked bar plot: citation, view and download counts per dataset/part]
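
The index-keyed dictionaries in the plotting cell can also be expressed as a list of row records, which some readers may find easier to follow. This is a sketch under assumptions rather than the notebook's method: `build_plot_rows` is a hypothetical helper, and the sample counts and years are invented. Each row holds a label plus the three counts, and the first row carries the aggregated totals.

```python
def build_plot_rows(dataset):
    """Hypothetical helper: one row per bar (aggregate, parent, then each part)."""
    def row(label, source):
        return {'Dataset/Parts': label,
                'Citations': source['citationCount'],
                'Views': source['viewCount'],
                'Downloads': source['downloadCount']}

    # Parent dataset first, then the parts numbered from 1
    rows = [row('Dataset (%s)' % dataset['publicationYear'], dataset)]
    for number, node in enumerate(dataset['parts']['nodes'], start=1):
        rows.append(row('Part %d (%s)' % (number, node['publicationYear']), node))

    # Aggregated bar: sum every count column over the parent and all parts
    total = {'Dataset/Parts': 'Dataset and Parts'}
    for column in ('Citations', 'Views', 'Downloads'):
        total[column] = sum(r[column] for r in rows)
    return [total] + rows

# Made-up response fragment with the same shape as data['dataset']
sample = {
    'publicationYear': 2014,
    'citationCount': 1, 'viewCount': 10, 'downloadCount': 2,
    'parts': {'nodes': [
        {'publicationYear': 2014, 'citationCount': 2, 'viewCount': 30, 'downloadCount': 1},
        {'publicationYear': 2017, 'citationCount': 0, 'viewCount': 5, 'downloadCount': 4},
    ]},
}

rows = build_plot_rows(sample)
for r in rows:
    print(r)
```

Passing such a list to `pd.DataFrame(rows).set_index('Dataset/Parts')` should give a frame with the same columns as the one handed to `px_stacked_bar`.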