FREYA WP2 User Story 7: As a data center, I want to see the citations of publications that use my repository for the underlying data, so that I can demonstrate the impact of our repository.

It is important for repositories of scientific data to monitor and report on the impact of the data they store. One useful proxy for that impact is the number of secondary citations, i.e. citations of the publications that use the deposited data. This notebook visualises these citations as a force-directed graph.

This notebook uses the DataCite GraphQL API to retrieve the citations of a list of datasets, identified by the DOIs passed to the query defined below.

Goal: By the end of this notebook, for a given list of datasets, you should be able to display:

  • Total citation count for each retrieved dataset;
  • An interactive force-directed graph of the datasets and their citations, in which:
    • Pink nodes at the centre of each radial shape correspond to datasets;
    • Blue nodes correspond to citations (note that some citations may be shared by more than one dataset);
    • Larger node size represents more citations of the dataset or citation represented by that node. Note that to increase node visibility, node sizes between datasets and citations are not comparable to each other.

Install libraries and prepare GraphQL client

In [123]:
%%capture
# Install required Python packages
!pip install gql requests pyvis jsonpickle
In [124]:
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)
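
For readers curious about what the client sends, a GraphQL request over HTTP is typically a POST whose JSON body holds a `query` string and a `variables` object. The sketch below builds such a body using only the standard library. The helper names `build_graphql_payload` and `post_graphql` are hypothetical (they are not part of gql), and the network call is defined but not invoked.

```python
import json
import urllib.request

DATACITE_GRAPHQL_URL = "https://api.datacite.org/graphql"  # same endpoint as above


def build_graphql_payload(query, variables=None):
    """Return the JSON-serialisable body of a GraphQL POST request."""
    return {"query": query, "variables": variables or {}}


def post_graphql(query, variables=None, url=DATACITE_GRAPHQL_URL, timeout=30):
    """Send the query and return the decoded JSON response (needs network access)."""
    body = json.dumps(build_graphql_payload(query, variables)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not send) the body for a trivial query
payload = build_graphql_payload("{ __typename }")
print(json.dumps(payload))
```

In practice the gql client also handles schema fetching and error reporting, so this is only meant to illustrate the shape of the request.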

Define and run GraphQL query

Define the GraphQL query to retrieve the citation count and the citations of each dataset in a given list of DOIs:

In [119]:
# Generate the GraphQL query to retrieve the citations of the datasets identified by the DOIs below
query_params = {
    "ids" : ["10.5061/dryad.234","10.15468/n6ftyd","10.1594/pangaea.314690"]
}

query = gql("""query getDatasetCitations($ids: [String!]) {
  datasets(ids: $ids) {
    nodes {
      id
      titles {
        title
      }
      citationCount
      citations {
        nodes {
          id
          publisher
          titles {
            title
          }
          citationCount
        }
      }
    }
  }
}


""")

Run the above query via the GraphQL client

In [120]:
# Variables are passed as a plain dict; the client serialises them to JSON
data = client.execute(query, variable_values=query_params)
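
The result of `client.execute` is a plain Python dictionary that mirrors the structure of the query. As a rough illustration, the sketch below walks a hand-written stand-in for that dictionary. `sample_data` and `summarise` are hypothetical names, and the DOIs, titles and counts are placeholders rather than real API output.

```python
# Hand-written stand-in for the dictionary returned by client.execute();
# all identifiers, titles and counts are placeholders, not real API output.
sample_data = {
    "datasets": {
        "nodes": [
            {
                "id": "https://doi.org/10.1234/example-dataset",
                "titles": [{"title": "Example dataset"}],
                "citationCount": 2,
                "citations": {
                    "nodes": [
                        {
                            "id": "https://doi.org/10.1234/paper-a",
                            "publisher": "Example Press",
                            "titles": [{"title": "Paper A"}],
                            "citationCount": 5,
                        },
                        {
                            "id": "https://doi.org/10.1234/paper-b",
                            "publisher": "Example Press",
                            "titles": [{"title": "Paper B"}],
                            "citationCount": 0,
                        },
                    ]
                },
            }
        ]
    }
}


def summarise(data):
    """Return (dataset id, citation count, list of citing ids) for each dataset."""
    summary = []
    for dataset in data["datasets"]["nodes"]:
        citing_ids = [c["id"] for c in dataset["citations"]["nodes"]]
        summary.append((dataset["id"], dataset["citationCount"], citing_ids))
    return summary


for dataset_id, count, citing_ids in summarise(sample_data):
    print(dataset_id, count, citing_ids)
```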

Display total number of citations per dataset

In [131]:
# Get total citation counts for each dataset in the query
datasets = data['datasets']
tableBody = ""
for dataset in datasets['nodes']:
    datasetId = dataset['id']
    # Drop the scheme and host parts of the identifier to leave the bare DOI
    doi = "/".join(datasetId.split("/")[3:])
    titles = [title['title'] for title in dataset['titles']]
    citationCount = dataset['citationCount']
    tableBody += "[%s](%s) | [**%s**](%s/%s)\n" % (
        ', '.join(titles), datasetId, citationCount, "https://search.datacite.org/works", doi)
if tableBody:
    display(Markdown("| Dataset | Citation Count |\n|---|---|\n%s" % tableBody))
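
The DOI is obtained above by splitting the identifier on `/` and dropping the first three parts, which implies identifiers of the form `https://doi.org/<prefix>/<suffix>`. An alternative sketch using `urllib.parse` makes that intent more explicit. `doi_from_url` is a hypothetical helper name, and the identifier format is an assumption inferred from the split.

```python
from urllib.parse import urlparse


def doi_from_url(identifier):
    """Return the DOI part of a DOI URL, e.g. https://doi.org/10.5061/dryad.234."""
    # The URL path is "/<prefix>/<suffix>"; strip the leading slash
    return urlparse(identifier).path.lstrip("/")


example_id = "https://doi.org/10.5061/dryad.234"
print(doi_from_url(example_id))
```

For identifiers of this form, both approaches give the same result; the `urlparse` version does not depend on counting path segments.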

Plot a force-directed graph connecting datasets to their publications and citations of those publications

Plot an interactive force-directed graph connecting the datasets to their citations (first-degree). The citations of those citations (second-degree) are not drawn as separate nodes; their number determines the size of each citation node.

  • Pink nodes at the centre of each radial shape correspond to datasets;
  • Blue nodes correspond to citations (note that some citations may be shared by more than one dataset);
  • Larger node size represents more citations of the dataset or citation represented by that node. Note that to increase node visibility, node sizes between datasets and citations are not comparable to each other.
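
To see why sizes are not comparable between the two node types, it may help to evaluate the two logarithmic formulas used in the plotting cell below in isolation. The function names `dataset_node_size` and `citation_node_size` are hypothetical wrappers introduced here for illustration; the formulas themselves are copied from that cell.

```python
import math


def dataset_node_size(n_linked_citations):
    """Size of a dataset node, given how many citation nodes it links to (>= 1)."""
    return 5 * math.log2(n_linked_citations * 5000)


def citation_node_size(citation_count):
    """Size of a citation node; the +1 keeps the logarithm defined for 0 citations."""
    return 10 * math.log2((citation_count + 1) * 10)


# Compare the two scales for a few input values
for n in (1, 10, 100):
    print(n, round(dataset_node_size(n), 1), round(citation_node_size(n), 1))
```

The two formulas use different multipliers and offsets, so equal sizes do not imply equal counts across node types.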
In [135]:
from pyvis.network import Network
from IPython.display import IFrame
import math

# Colour swatch for the network nodes
dataset_node_colour = "#FB8072"
citation_node_colour = "#80B1D3"

got_net = Network(height="750px", width="100%", bgcolor="#ffffff", font_color="black", notebook=True)
got_net.options.edges.inherit_colors(False)

# set the physics layout of the network
got_net.barnes_hut()

# ------------------------------
# Initialise intermediate data structure to store: (src, trg) -> citation count of the target, where:
# src - dataset or citation; trg - citation
srcTrg2Count = {}
# Initialise intermediate data structure to store: src --> set of connected trg's
# Note that the number of connected trg's determines the size of each src node
src2OtherTrgs = {}

datasets = data['datasets']

# Populate srcTrg2Count
allNodes = set()
for node in datasets['nodes']:
    nodeSet = set()
    datasetDOI = "/".join(node['id'].split("/")[3:])
    nodeSet.add(datasetDOI)
    for citation in node['citations']['nodes']:
        citationDOI = "/".join(citation['id'].split("/")[3:])
        citationCount = citation['citationCount']
        nodeSet.add(citationDOI)
        if datasetDOI not in src2OtherTrgs:
            src2OtherTrgs[datasetDOI] = set()
        src2OtherTrgs[datasetDOI].add(citationDOI)
        if citationDOI not in src2OtherTrgs:
            src2OtherTrgs[citationDOI] = set()   
        src2OtherTrgs[citationDOI].add(datasetDOI)        
        srcTrg2Count[(datasetDOI, citationDOI)] = citationCount     
    nodes = sorted(list(nodeSet))
    allNodes.update(nodes)

# Populate data structures needed for the graph
sources, targets, weights = [], [], []
for (src, trg), count in srcTrg2Count.items():
    if count >= 0:
        sources.append(src)
        targets.append(trg)
        weights.append(count)

edge_data = zip(sources, targets, weights)

for e in edge_data:
    src = e[0]
    dst = e[1]
    w = e[2]
    src_node_size = 5 * math.log2(len(src2OtherTrgs[src]) * 5000)
    got_net.add_node(src, src, title="Dataset: %s;" % src, color=dataset_node_colour, size=src_node_size)   
    # Adding 1 below keeps log2 defined for targets with 0 citations, so that they still appear in the force-directed graph
    dst_node_size = 10 * math.log2((w+1) * 10)
    got_net.add_node(dst, dst, title="Citation: %s; Number of citations: %d;" % (dst, w), color=citation_node_colour, size=dst_node_size)
    got_net.add_edge(src, dst, value=1)
    
neighbor_map = got_net.get_adj_list()
# add neighbor data to node hover data
for node in got_net.nodes:
    node["title"] += " Neighbours:<br>" + "<br>".join(neighbor_map[node["id"]])

got_net.show("out.html")
display(Markdown("N.B. Click on the plot, then use down/up mouse scroll to zoom in/out respectively.<br>When zoomed in, you will notice the DOI label against each node.<br>Click on any node to see the list of 'neighbour' citations, and on the citation node to also see the number of its citations."))
IFrame(src="./out.html", width=1000, height=800)

N.B. Click on the plot, then use down/up mouse scroll to zoom in/out respectively.
When zoomed in, you will notice the DOI label against each node.
Click on any node to see the list of 'neighbour' citations, and on the citation node to also see the number of its citations.
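
Since some citations may be shared by more than one dataset, it can be useful to list them explicitly. The sketch below is one possible way to do so from a mapping of datasets to their citations. `shared_citations` is a hypothetical helper and the example mapping uses placeholder labels rather than real DOIs.

```python
from collections import defaultdict


def shared_citations(dataset_to_citations):
    """Map each citation linked to 2+ datasets to the sorted list of those datasets."""
    citation_to_datasets = defaultdict(set)
    for dataset, citations in dataset_to_citations.items():
        for citation in citations:
            citation_to_datasets[citation].add(dataset)
    return {
        citation: sorted(linked)
        for citation, linked in citation_to_datasets.items()
        if len(linked) > 1
    }


# Placeholder labels standing in for dataset and citation DOIs
example = {
    "dataset-1": {"paper-a", "paper-b"},
    "dataset-2": {"paper-b", "paper-c"},
}
print(shared_citations(example))
```

In the graph, such shared citations appear as blue nodes connected to more than one pink node.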
