FREYA Logo FREYA WP2 User Story 6: As a researcher, I am looking for more information about another researcher with a common name, but don’t know his/her ORCID ID.

It is important to be able to locate a researcher of interest even though their ORCID ID is unknown. For example, a reader of a scientific publication may wish to find out more about one of the authors, whereby the publisher has not cross-referenced that author's name to ORCID.

This notebook uses the DataCite GraphQL API to disambiguate a researcher name via a funnel approach:

  • First all researcher records matching query "John AND Smith" and retrieved, and an alphabetically sorted list of affiliations and the corresponding researcher names is displayed;
  • Then the notebook simulates the user selecting one of the affiliations (in our case "University of Arizona"), and then performs a more detailed query: "John AND Smith AND University of Arizona". The second query retrieves and displays a much smaller set of results, now also containing the researcher's publications, thus helping the user pinpoint the researcher of interest more easily.

Goal: By the end of this notebook, you should be able successfully disambiguate a researcher name of interest.

Install libraries and prepare GraphQL client

In [228]:
%%capture
# Install required Python packages
!pip install gql requests
In [229]:
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)

Define and run GraphQL query

Define the GraphQL query to find all publications including co-authors for Dr Sarah Teichmann:

In [231]:
# Generate the GraphQL query to retrieve up to 100 researchers matching query "John and Smith"
query_params = {
    "query" : "John AND Smith",
    "max_researchers" : 100,
    "query_end_cursor" : ""
}

query_str = """query getResearchersByName(
    $query: String!,
    $max_researchers: Int!,
    $query_end_cursor : String!
    )
{
  people(query: $query, first: $max_researchers, after: $query_end_cursor) {
    totalCount
    pageInfo {
      hasNextPage
      endCursor
    }  
    nodes {
      id
      givenName
      familyName
      name
      affiliation {
        name
      }
    }
  }
}
"""

Run the above query via the GraphQL client

In [232]:
import json
found_next_page = True

# Initialise overall data dict that will store results
data = {}

# Keep retrieving results until there are no more results left
while True:
    query = gql("%s" % query_str)
    res = client.execute(query, variable_values=json.dumps(query_params))
    if "people" not in data:
        data = res
    else:
        people = res["people"]
        data["people"]["nodes"].extend(people["nodes"])
        pageInfo = people["pageInfo"]
        if pageInfo["hasNextPage"]:
            if pageInfo["endCursor"] is not None:
                query_params["query_end_cursor"] = pageInfo["endCursor"]            
            else:
                break
        else:
            break

List researcher details

List in tabular format affilitions and the corresponding researcher names. This allows the user to select one of the affiliations to use in a more detailed query (see below) that also retrieves publications.

In [234]:
# Collect names and affiliations for the researchers found
# Test if fieldValue matches (case-insensitively) a Solr-style query (with " AND " representing the logical AND, and " " representing the logical OR)
def testIfPresentCaseInsensitive(solrQuery, fieldValueLowerCase):
    for orTerms in solrQuery.split(" AND "):
        present = False
        for term in orTerms.split(" "):
            if term.lower() in fieldValueLowerCase:
                present = True
                break
        if not present:
            return False
    return True

people = data['people']
af2Names = {}
totalCount = 0
for node in people['nodes']:
    id = node['id']
    name = node['name']
#     TODO: Remove if we manage to search only individual fields
    if not testIfPresentCaseInsensitive(query_params['query'], name.lower()):
        continue
    totalCount += 1
    for af in node['affiliation']:
        affiliation = af['name']
        if affiliation not in af2Names:
            af2Names[affiliation] = set()
        af2Names[affiliation].add(name)

tableBody = ""
for af,names in sorted(af2Names.items()):
    tableBody += af + " | " + ', '.join(names) + "\n"
display(Markdown("Total number of researchers found: **%d**<br>The list of researchers by affiliation is as follows:" % totalCount))
display(Markdown(""))

display(Markdown("| Affiliation | Researcher Names |\n|---|---|\n%s" % tableBody))

Total number of researchers found: 210
The list of researchers by affiliation is as follows:

Affiliation Researcher Names

American Chemical Society | John Smith American Science and Engineering, Inc. | Henry John Peter Smith Bank Street College of Education | John Smith Bedford Institute of Oceanography | John Smith Beecham Pharmaceuticals | John Smith Birkenhead High School Academy | John Arthur Smith Bureau of Ocean Energy Management, Pacific OCS Region | John Smith CU Sports Medicine and Performance | John-Rudolph Smith Charles Sturt University - Wagga Wagga Campus | John Smith Church of Norway | John Arthur Smith City College of New York | John Smith Del Rosario Colorado School of Mines | John Smith Cornell University | John-David Smith CottonInfo | John Smith Drew University | John Smith East Carolina University | John Smith Fairleigh Dickinson University | John Smith Federation of Liberian Youth - FLY | John Solunta Smith Jr Fire Risk Assessment Network | John Smith Flagburn Health Center | John Smith Fluent Technology | John Smith George Washington University | John Smith Georgia State University | John Smith GlaxoSmithKline Plc | John Smith Lipscomb University | John Smith London University | John Smith Louisiana State University | John F. Smith MSG Software (USA), Inc. | Henry John Peter Smith Manhattan College | Henry John Peter Smith Michigan State University | John Smith Millersville University | John Smith NASA Langley Research Center | John Smith New South Wales Department of Primary Industries Agriculture | John Smith Northeastern University | Henry John Peter Smith Northwestern University | John F. Smith Nova Scotia Health Authority South Western Nova Scotia | John Smith OCS Energy Consultant | John Smith Ohio State University | John R. Smith Oxford University Press | John Arthur Smith Peking University | John Solunta Smith Jr Pennsylvania State University | John Smith Proof Read My File | John Smith RMIT University City Campus | John Smith Retired | John Arthur Smith Rutgers New Jersey Medical School | John Smith Del Rosario Rutgers University Camden | John Smith Sample invited position | John Smith Sigma Xi the Scientific Research Society | John Smith TPE Associates Inc | Henry John Peter Smith Technical Support | John Smith Tennessee Technological University | John Smith The New School for Social Research | John Smith The University of St Andrews | Christopher John Smith Tufts University | Henry John Peter Smith Ulster Univeristy | John Smith Ulster University | John Smith University College London | John Smith University at Buffalo | John Smith University of Arizona | Smith, John E. 3rd University of California Davis | John R Smith University of Cambridge | John Arthur Smith University of Central Missouri | John Smith University of Colorado | John Smith University of Colorado Boulder | JOHN SMITH, John Smith University of Liverpool | John Arthur Smith, Quintin-John Smith University of Michigan | John R. Smith University of Missouri Columbia | John Smith University of Ottawa | John Smith University of Oxford | Christopher John Smith University of Pennsylvania | John F. Smith, John Smith University of St Andrews | Christopher John Smith University of Strathclyde | John Smith University of Toledo | John-David Smith University of Toronto | John Smith University of Virginia | Smith, John E. 3rd University of York | John Smith Vanderbilt University | John Smith Virginia Commonwealth University | John Lee Smith Visidyne, Inc. | Henry John Peter Smith Yale University | John Smith

In [235]:
# Generate the GraphQL query to retrieve all researchers matching query "John and Smith" and affiliation "University of Arizona", now with works
name_query = "John AND Smith"
affiliation_query = "\"University of Arizona\""
query_params1 = {
    "query" : name_query + " AND " + affiliation_query,
    "max_researchers" : 10,
    "query_end_cursor" : ""    
}

query_str = """query getResearchersByName(
    $query: String!,
    $max_researchers: Int!,
    $query_end_cursor : String!
    )
{
  people(query: $query, first: $max_researchers, after: $query_end_cursor) {
    totalCount
    pageInfo {
      hasNextPage
      endCursor
    }      
    nodes {
      id
      givenName
      familyName
      name
      affiliation {
        name
      }
      works(first: 3) {
        nodes {
          id
          publicationYear
          publisher
          titles {
            title
          }
          creators {
            id
            name
            affiliation {
              id
              name
            }
          }
          subjects {
            subject
          }
        }
      }
    }
  }
}
"""

Run the above query via the GraphQL client

In [236]:
import json
found_next_page = True

# Initialise overall data dict that will store results
data1 = {}

# Keep retrieving results until there are no more results left
while True:
    query = gql("%s" % query_str)
    res = client.execute(query, variable_values=json.dumps(query_params1))
    if "people" not in data1:
        data1 = res
    else:
        people = res["people"]
        data1["people"]["nodes"].extend(people["nodes"])
        pageInfo = people["pageInfo"]
        if pageInfo["hasNextPage"]:
            if pageInfo["endCursor"] is not None:
                query_params["query_end_cursor"] = pageInfo["endCursor"]            
            else:
                break
        else:
            break
In [237]:
from textwrap import shorten

# Collect all relevant details for the researchers found
tableBody=set()
people = data1['people']
for node in people['nodes']:
    id = node['id']
    firstName = node['givenName']
    surname = node['familyName']
    name = node['name']
#     TODO: Remove if we manage to search only individual fields
    if not testIfPresentCaseInsensitive(name_query, name.lower()):
        continue    
    orcidHref = ""
    if id is not None and id != "":
        orcidHref = "["+ name +"]("+ id +")"    
    affiliations = []
    for affiliation in node['affiliation']:
        affiliations.append(affiliation['name'])
    works = ""
    if 'works' in node:
        for work in node['works']['nodes']:
            titles = []
            for title in work['titles']:
                titles.append(shorten(title['title'], width=50, placeholder="..."))
            creators = []
            cnt = 0
            for creator in work['creators']:
                cnt += 1
                # Restrict display to the first author only                 
                if (cnt > 1):
                    creators[-1] += " et al."
                    break
                if creator['id'] is not None:
                    creators.append("[" + creator['name'] + "](" + creator['id'] + ")")
                else:
                    creators.append(creator['name'])
            
            works += '; '.join(creators) + " (" + str(work['publicationYear']) + ") ["+ ', '.join(titles) +"]("+ work['id'] + ") *" + work['publisher'] + "*<br>" 
        
    tableBody.add(firstName + " | " + surname + " | " + orcidHref + " | " + '<br>'.join(sorted(affiliations)) + " | " + works)
display(Markdown("| First Name | Surname | Link to ORCID | Affiliations | Works | \n|---|---|---|---|---|\n%s" % '\n'.join(tableBody)))
First Name Surname Link to ORCID Affiliations Works

John E | Smith | Smith, John E. 3rd | University of Arizona
University of Virginia | Smith, John Edward (2020) CS_216516.sf3 Harvard Dataverse
Smith, John Edward (2020) human N2Aus PKA phosphorylation Harvard Dataverse
Lostal, William et al. (2019) Titin splicing regulates cardiotoxicity... American Association for the Advancement of Science (AAAS)

In [ ]: