#!/usr/bin/env python
# coding: utf-8

# # Semalytics walkthrough
#
# Welcome to the Semalytics demo!
#
# ## Introduction
#
# This is an extended computational narrative focusing on the platform **Semalytics**, a semantic-based tool for analyzing hierarchical data in translational cancer research. This demo is bundled with the paper:
#
# > _**Semalytics: a semantic analytics platform for the exploration of distributed and heterogeneous cancer data (in translational research)**_
#
# Biological annotations are modeled in a new **Semantic Web** fashion and are connected to **Wikidata** for knowledge expansion. Please note that Semalytics explores annotations that are highly scattered along hierarchical data.
#
# In this notebook, we are going to use Semalytics for analyzing a test dataset in order to investigate gene alteration-drug interactions. In particular, we focus on the response to **Cetuximab**, an epidermal growth factor receptor (EGFR) inhibitor used for the treatment of several cancer types, such as colorectal cancer. Each cancer is a complex and variable system with unique characteristics at the molecular level, which may determine drug performance. In this demo, we match drug response data with annotations related to a set of 4 genes:
#
# * BRAF
# * EGFR
# * HER2 (ERBB2)
# * KRAS
#
# which are known to be relevant to Cetuximab response in colorectal cancer.
#
# * In [Chapter 1](#insights), we show how the platform can be used for getting basic data insights about genomic landscapes and drug responses. In particular, we use Semalytics to identify an investigation set (i.e., data trees with both genomic and pharmacological annotations).
#
# * In [Chapter 2](#inside), we explore the data within the investigation set. First, we get the list of variants for the genes in the panel. Then, we explore the co-occurrence of genomic variants and responses to Cetuximab.
#
# * In [Chapter 3](#wikidata), we use Semalytics for analyzing local data harnessing the extended information of Wikidata, thus gaining new analytical options on our local database. For example, we use federated queries to explore data about drugs other than Cetuximab, which we do not store and maintain locally.
#
# * Finally, in the [Appendix](#appendix) we list computational references to figures used in the proof-of-concept (PoC) of the paper.
#
# See the aforementioned article for further details.
#
#
# ## Table of contents
#
# * [Introduction](#intro)
# * [General settings](#settings)
# * [Chapter 1 - Basic insights](#insights)
#     - [Annotated nodes](#annotated)
#     - [Genomic annotations](#genes)
#     - [Response annotations](#mice)
#     - [Investigation set: genomic and responses annotations](#miceanddrug)
# * [Chapter 2 - Inside the investigation set](#inside)
#     - [Variants of non-responders](#variantsnonresp)
#     - [Getting basic data for co-occurrence analysis](#matchinggetdata)
#     - [Matching annotations in the investigation set](#mset)
#     - [Matching `feature_amplification` only](#mamplsubset)
#     - [Matching `sequence_alteration` only](#mseqsubset)
# * [Chapter 3 - Querying data with knowledge and Wikidata](#wikidata)
#     - [Drugs targeting gene products](#drugsrole)
#     - [Drug information: dabrafenib](#dabrafenib)
#     - [Querying variants](#queryinnvar)
#     - [Positive therapeutic predictors](#pos)
#     - [Negative therapeutic predictors](#neg)
#     - [Drugs predictions for a specific case](#case)
# * [Appendix - PoC figures](#appendix)
#
# ## General settings
#
# General imports and vars

# In[1]:


import utils
import pandas as pd
from IPython.display import SVG, display

# SPARQL endpoints

# Semalytics (i.e., local data)
# 14,281,125 explicit triples
# 2,391,980 inferred triples
SEMALYTICS_ENDPOINT = 'http://semalytics:7200/repositories/annotationDB'

# Remote knowledge base
WIKIDATA_ENDPOINT = 'https://query.wikidata.org/sparql'

# genes panel
PANEL = \
{'BRAF','EGFR','ERBB2','KRAS'}

# investigated variants
VARIANTS = ':sequence_alteration :feature_amplification'

# enable inline plotting
get_ipython().run_line_magic('matplotlib', 'inline')

# do not truncate data in tables
# (None means "no limit"; the former value -1 is rejected by recent pandas versions)
pd.set_option('display.max_colwidth', None)


# ## Chapter 1 - Basic insights
#
# We query Semalytics data for getting basic insights. Semalytics immediately returns analytics on scattered annotations.
#
#
# ### Annotated nodes
#
# We use the following query to retrieve **annotated nodes** (samples for genes or mice for drug responses).

# In[2]:


# the query
my_query = """
PREFIX :
PREFIX onto:

select (count(distinct ?node) as ?nodes)
from onto:disable-sameAs
where {
    ?case a :Case ;
        :hasDescendant ?node .
    ?node a :Bioentity ;
        :has_annotation ?ann .
}
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go!
result = result_table['nodes.value'][0]
print(f'there are {result} annotated nodes in trees')


# ### Genomic annotations
#
# Let $\mathcal{G}$ be the set of data trees with annotations about `:sequence_alteration` or `:feature_amplification` for **genes in the panel**. We build $\mathcal{G}$ with the following query:

# In[3]:


# Cases with annotations in the genes panel
my_query = """
PREFIX :
PREFIX onto:

select (count(distinct ?case) as ?cases)
from onto:disable-sameAs
where {
    ?case a :Case ;
        :hasDescendant ?node .
    ?node :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?gene :has_variant ?ref.
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?ref a ?annotation_Type .
    VALUES ?annotation_Type { """+VARIANTS+""" }
}
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go!
result = result_table['cases.value'][0]
print(f'there are {result} cases annotated with 1+ variant(s) in the panel (KRAS, EGFR, BRAF, ERBB2)')


# ### Response annotations
#
# Let $\mathcal{D}$ be the set of data trees with annotated mice about **pharmacological responses**.
# We build $\mathcal{D}$ with the following query:

# In[4]:


# Cases with pharmacological annotations
my_query = """
PREFIX :
PREFIX onto:

select ?drugName (count(distinct ?case) as ?cases)
from onto:disable-sameAs
where {
    ?case a :Case ;
        :hasDescendant ?mouse .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref a :drug_response .
    ?drug :has_drug_response ?ref;
        :name ?drugName .
}
GROUP BY ?drugName
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go!
drug, cases = result_table['drugName.value'][0], result_table['cases.value'][0]
print(f'there are {cases} cases annotated with 1+ pharmacological response(s) for {drug}')


# ### Investigation set: genomic and responses annotations
#
# Let $\mathcal{S} = (\mathcal{G} \cap \mathcal{D})$ be the investigation set (i.e., data trees with both genomic and pharmacological annotations). We get it through this query:

# In[5]:


# Cases with pharmacological and genomic annotations
my_query = """
PREFIX :
PREFIX onto:

select (count(distinct ?case) as ?cases)
from onto:disable-sameAs
where {
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref a :drug_response .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?ref2 .
    ?gene :has_variant ?ref2.
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?ref2 a ?annotation_Type.
    VALUES ?annotation_Type { """+VARIANTS+""" }
}
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go!
investigation_scope = result_table['cases.value'][0]
print(f'there are {investigation_scope} cases annotated with 1+ pharmacological response(s) AND 1+ variant(s)')


# ## Chapter 2 - Inside the investigation set
#
# In this section we analyze annotation types for cases in the investigation set. Moreover, we exploit Semalytics for matching variants against responses to Cetuximab.
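# As a side illustration of the definitions used so far, the sketch below rebuilds $\mathcal{S} = \mathcal{G} \cap \mathcal{D}$ and a simple gene-versus-response match with plain Python sets. It is not part of Semalytics: the case identifiers, the `demo_*` names and the `co_occurrence` helper are hypothetical, and only the response class names (`DRCl_OR`, `DRCl_PD`) are taken from the queries in this notebook.

```python
# Hypothetical case identifiers, for illustration only.
demo_genomic_cases = {"c1", "c2", "c3"}      # stands in for G
demo_response_cases = {"c2", "c3", "c4"}     # stands in for D

# S = G ∩ D: cases with both genomic and pharmacological annotations.
demo_investigation_set = demo_genomic_cases & demo_response_cases

# Hypothetical per-gene and per-response case sets.
demo_cases_per_gene = {"KRAS": {"c2"}, "EGFR": {"c2", "c3"}}
demo_cases_per_response = {"DRCl_OR": {"c3"}, "DRCl_PD": {"c2"}}


def co_occurrence(scope, cases_by_gene, cases_by_response):
    """Count cases in scope that carry a variant in a gene and show a response."""
    counts = {}
    for gene, gene_cases in cases_by_gene.items():
        for response, response_cases in cases_by_response.items():
            counts[(gene, response)] = len(scope & gene_cases & response_cases)
    return counts


demo_counts = co_occurrence(
    demo_investigation_set, demo_cases_per_gene, demo_cases_per_response
)
for (gene, response), n in sorted(demo_counts.items()):
    print(f"{gene} / {response}: {n}")
```

# Keeping every collection as a set of case identifiers means a case is counted once per gene, regardless of how many variants it harbors in that gene.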
#
#
# ### Variants of non-responders
#
# With the following query, we get the **variants list** of non-responder cases. The column `alt_p.value` represents the type of `point_mutation`.

# In[6]:


my_query = """
PREFIX :
PREFIX sesame:
PREFIX onto:

select distinct ?case ?geneSymbol ?type ?alt_p
from onto:disable-sameAs
where {
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref a :DRCl_PD .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?ref2 .
    ?gene :has_variant ?ref2 ;
        :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?ref2 a ?annotation_Type.
    VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
    ?ref2 sesame:directType ?type.
    OPTIONAL {?ref2 :alt_p ?alt_p }
}
ORDER BY ?case
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# filter URIs prefixes
utils.filter_prefixes(result_table)

# there you go!
result_table[['case.value', 'geneSymbol.value', 'type.value', 'alt_p.value']].fillna("")


# ### Getting basic data for co-occurrence analysis
#
# Creating basic data for further investigations about gene variant - drug matching.

# In[7]:


# data collections
cases_per_gene = dict()
cases_per_variant = dict()
cases_per_variant_per_gene = dict()
cases_per_response = dict()


# We get **cases harboring 1+ variants for each gene in the panel**.
#
# Please note that we are counting distinct cases per gene. Therefore, cases harboring multiple variants in the same gene will be counted only once.

# In[8]:


my_query = """
PREFIX :
PREFIX sesame:
PREFIX onto:

select distinct ?case
from onto:disable-sameAs
where {{
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?ref2 .
    ?gene :has_variant ?ref2 ;
        :symbol ?geneSymbol
    VALUES ?geneSymbol {{'{}'}}
    ?ref2 a ?annotation_Type.
    VALUES ?annotation_Type {{ :sequence_alteration :feature_amplification }}
    ?ref2 sesame:directType ?type.
}}"""

for gene in PANEL:
    print(f'Querying {gene}')
    result_table = utils.query(SEMALYTICS_ENDPOINT, my_query.format(gene))
    cases_per_gene[gene] = set(result_table['case.value'])

print('Cases outline:')
for key in cases_per_gene:
    print(f'{key}: {len(cases_per_gene[key])}')


# Then, we get cases **harboring 1+ `:sequence_alteration` or `:feature_amplification`**
#
# Again, please note that we are counting distinct cases, this time per variant type. Therefore, cases harboring multiple variants of the same type will be counted only once.

# In[9]:


my_query = """
PREFIX :
PREFIX sesame:
PREFIX onto:

select distinct ?case
from onto:disable-sameAs
where {{
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?ref2 .
    ?gene :has_variant ?ref2 ;
        :symbol ?geneSymbol
    VALUES ?geneSymbol {{'KRAS' 'EGFR' 'BRAF' 'ERBB2'}}
    ?ref2 a ?annotation_Type.
    VALUES ?annotation_Type {{ :{} }}
    ?ref2 sesame:directType ?type.
}}"""

for variant in ['sequence_alteration', 'feature_amplification']:
    print(f'Querying {variant}')
    result_table = utils.query(SEMALYTICS_ENDPOINT, my_query.format(variant))
    cases_per_variant[variant] = set(result_table['case.value'])

print('Cases outline:')
for variant in cases_per_variant:
    print(f'\t{variant} {len(cases_per_variant[variant])}')


# Besides, we get cases **harboring 1+ `:sequence_alteration` or `:feature_amplification` for each gene in the panel**.
#

# In[10]:


my_query = """
PREFIX :
PREFIX sesame:
PREFIX onto:

select distinct ?case
from onto:disable-sameAs
where {{
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?ref2 .
    ?gene :has_variant ?ref2 ;
        :symbol ?geneSymbol
    VALUES ?geneSymbol {{'{}'}}
    ?ref2 a ?annotation_Type.
    VALUES ?annotation_Type {{ :{} }}
    ?ref2 sesame:directType ?type.
}}"""

for variant in ['sequence_alteration', 'feature_amplification']:
    print(f'Querying {variant}')
    cases_per_variant_per_gene[variant] = dict()
    for gene in PANEL:
        result_table = utils.query(SEMALYTICS_ENDPOINT, my_query.format(gene, variant))
        try:
            cases_per_variant_per_gene[variant][gene] = set(result_table['case.value'])
        except KeyError:
            print(f'no data for {variant} - {gene}')

print('Cases outline:')
for variant in cases_per_variant_per_gene:
    print(f'{variant}')
    for gene in cases_per_variant_per_gene[variant]:
        print(f'\t{gene} {len(cases_per_variant_per_gene[variant][gene])}')


# Variants summary:
#
#
# | Gene      | All variant types | :feature_amplification | :sequence_alteration |
# |-----------|-------------------|------------------------|----------------------|
# | Annotated | 113               | 33                     | 88                   |
# | BRAF      | 13                | 0                      | 13                   |
# | EGFR      | 29                | 26                     | 4                    |
# | ERBB2     | 11                | 7                      | 5                    |
# | KRAS      | 70                | 0                      | 70                   |

# Finally, we get **cases per response type**.

# In[11]:


my_query = """
PREFIX :
PREFIX sesame:
PREFIX onto:

select distinct ?case
from onto:disable-sameAs
where {{
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref sesame:directType :{} .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?ref2 .
    ?gene :has_variant ?ref2 ;
        :symbol ?geneSymbol
    VALUES ?geneSymbol {{'KRAS' 'EGFR' 'BRAF' 'ERBB2'}}
    ?ref2 a ?annotation_Type.
    VALUES ?annotation_Type {{ :sequence_alteration :feature_amplification }}
    ?ref2 sesame:directType ?type.
}}"""

for response in ['DRCl_OR', 'DRCl_SD', 'DRCl_PD']:
    print(f'Querying {response}')
    result_table = utils.query(SEMALYTICS_ENDPOINT, my_query.format(response))
    cases_per_response[response] = set(result_table['case.value'])

print('Cases outline:')
for key in cases_per_response:
    print(f'{key}: {len(cases_per_response[key])}')


# Since we are also interested in analyzing **variants co-occurrences**, we enumerate all possible combinations (i.e., the power set). We will use these data in the next sections.

# In[12]:


# create variants co-occurrences list (i.e., the power set of {'BRAF','EGFR','ERBB2','KRAS'})
variants_occurrences = list(utils.powerset(PANEL))
variants_occurrences.sort(key=len)

# just combinatorics
variants_occurrences


# ### Matching annotations in the investigation set
#
# We analyze all annotated cases and we match drug information with gene variants data.

# In[13]:


# the investigation set
tot = cases_per_gene['BRAF'] | cases_per_gene['EGFR'] | cases_per_gene['ERBB2'] | cases_per_gene['KRAS']
print(len(tot))


# In[14]:


# create a new Semalytics analysis object
a = utils.Analysis(tot, cases_per_gene, cases_per_response, variants_occurrences)

# gene variants
a.variants


# In[15]:


# plot variants distribution
a.plot_variants()


# In[16]:


# responses
a.responses


# (Note: the following is figure 4b/right in the paper)

# In[17]:


# plot responses
a.plot_responses()


# In[18]:


# variants vs responses
a.matching.fillna("")


# In[19]:


# plot matching
a.plot_matching()


# ### Matching `feature_amplification` only
#
# We analyze cases with only 1+ `feature_amplification` (and with no `sequence_alteration`)

# In[20]:


# create the subset
tot = cases_per_variant['feature_amplification'] - cases_per_variant['sequence_alteration']
print(len(tot))


# In[21]:


# create a new Semalytics analysis object
a = utils.Analysis(tot, cases_per_gene, cases_per_response, variants_occurrences)

# gene variants
a.variants


# In[22]:


# plot variants distribution
a.plot_variants()

# In[23]:


# responses
a.responses


# In[24]:


# plot responses
a.plot_responses()


# In[25]:


# variants vs responses
a.matching.fillna("")


# In[26]:


# plot matching
a.plot_matching()


# ### Matching `sequence_alteration` only
#
# We analyze cases with only 1+ `sequence_alteration` (and with no `feature_amplification`)

# In[27]:


# create the subset
tot = cases_per_variant['sequence_alteration'] - cases_per_variant['feature_amplification']
print(len(tot))


# In[28]:


# create a new Semalytics analysis object
a = utils.Analysis(tot, cases_per_gene, cases_per_response, variants_occurrences)

# gene variants
a.variants


# In[29]:


# plot variants distribution
a.plot_variants()


# In[30]:


# responses
a.responses


# In[31]:


# plot responses
a.plot_responses()


# In[32]:


# variants vs responses
a.matching.fillna("")


# In[33]:


# plot matching
a.plot_matching()


# ## Chapter 3 - Querying data with knowledge and Wikidata
#
# In this section, we are going to query data with extended knowledge. The platform connects to Wikidata by leveraging `owl:sameAs` predicates.
#
# The SPARQL endpoint of Semalytics is federated with the Wikidata one (https://query.wikidata.org/sparql).
#
# See also this [Web page](https://www.wikidata.org/wiki/User:ProteinBoxBot/SPARQL_Examples#Query_Wikidata_with_SPARQL) for other Wikidata examples related to life sciences. Those queries can also be used for querying local data in Semalytics.
#
# ### Drugs targeting gene products
#
# We get chemical compounds (`Q11173`) which physically interact (`P129`), with a specific role (`P2868`), with products encoded by genes in the investigation panel.

# In[34]:


my_query = """
PREFIX :
PREFIX wd:
PREFIX wdt:
PREFIX pq:
PREFIX ps:
PREFIX p:
PREFIX wikibase:
PREFIX bd:
PREFIX rdfs:

select ?geneSymbol ?drugLabel ?roleLabel ?gene_productLabel
where {
    # Wikidata endpoint
    SERVICE {
        ?chem p:P129 [ ps:P129 ?gene_product ;
                       pq:P2868 ?role ] .
        ?chem wdt:P31 wd:Q11173 .
        ?gene_product wdt:P702 ?gene .
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en" .
            ?chem rdfs:label ?drugLabel .
            ?gene_product rdfs:label ?gene_productLabel .
            ?role rdfs:label ?roleLabel .
        }
    }

    #local data
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
}
order by ?geneSymbol
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go!
result_table[['geneSymbol.value', 'drugLabel.value', 'roleLabel.value', 'gene_productLabel.value']]


# ### Drug information: dabrafenib
#
# Now we get from Wikidata the **chemical formula** (`P274`) of one of those drugs, **dabrafenib**...

# In[35]:


my_query = """
PREFIX wd:
PREFIX wdt:

SELECT *
WHERE {
    wd:Q3011604 wdt:P274 ?chem .
}
"""

# get data
result_table = utils.query(WIKIDATA_ENDPOINT, my_query)
result_table


# ...as well as its **chemical structure** (`P117`).

# In[36]:


my_query = """
PREFIX wd:
PREFIX wdt:

SELECT *
WHERE {
    wd:Q3011604 wdt:P117 ?struct .
}
"""

# get data
result_table = utils.query(WIKIDATA_ENDPOINT, my_query)

display(SVG(url=result_table['struct.value'][0]))
print(f'live rendering from Wikidata of {result_table["struct.value"][0]}')


# Finally, we get the **medical conditions treated** (`P2175`), the related data source (1), and the information retrieval date.
#
# _(1) "dataset containing drug indications extracted from the FDA Adverse Event Reporting System"_

# In[37]:


my_query = """
SELECT ?medical_conditionLabel ?referenceLabel ?date
WHERE {
    wd:Q3011604 p:P2175 [ ps:P2175 ?medical_condition ;
                          prov:wasDerivedFrom ?source ].
    ?source pr:P248 ?reference ;
            pr:P813 ?date
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# get data
result_table = utils.query(WIKIDATA_ENDPOINT, my_query)

# there you go
result_table[['medical_conditionLabel.value', 'referenceLabel.value', 'date.value']]


# ### Querying variants
#
# We can also query only cases with variants mapped to Wikidata. Those are entry points for knowledge enrichment.
# The column `alt_p.value` represents the type of `point_mutation`.

# In[38]:


my_query = """
PREFIX :
PREFIX wdt:
PREFIX owl:

select distinct ?case ?variant ?geneSymbol ?alt_p ?annotation_Type
where {
    SERVICE {
        ?variant wdt:P3329 ?id .
    }
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref a :drug_response .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?variant .
    ?gene :has_variant ?variant.
    OPTIONAL {?variant :alt_p ?alt_p }
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?variant a ?annotation_Type.
    VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
}
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# filter URIs prefixes
utils.filter_prefixes(result_table)

# there you go
result_table[['case.value', 'variant.value', 'geneSymbol.value', 'alt_p.value', 'annotation_Type.value']].fillna("")


# ### Positive therapeutic predictors
#
# We can use the **variants occurrences** annotated in the local database for querying **associated positive response predictions** to drugs. We also retrieve the scientific article supporting the evidence and the related medical condition.

# In[39]:


my_query = """
PREFIX :
PREFIX wdt:
PREFIX owl:
PREFIX pq:
PREFIX ps:
PREFIX pr:
PREFIX p:
PREFIX prov:
PREFIX wikibase:
PREFIX bd:
PREFIX rdfs:

select distinct ?geneSymbol ?variantLabel ?treatmentLabel ?diseaseLabel ?referenceLabel
where {
    SERVICE {
        ?variant wdt:P3329 ?id .
        ?variant p:P3354 [ ps:P3354 ?treatment ;
                           pq:P2175 ?disease ;
                           prov:wasDerivedFrom ?source ].
        ?source pr:P248 ?reference
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en" .
            ?variant rdfs:label ?variantLabel .
            ?treatment rdfs:label ?treatmentLabel .
            ?disease rdfs:label ?diseaseLabel .
            ?reference rdfs:label ?referenceLabel
        }
    }
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref a :drug_response .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?variant .
    ?gene :has_variant ?variant.
    OPTIONAL {?variant :alt_p ?alt_p }
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?variant a ?annotation_Type.
    VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
}
order by ?geneSymbol
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go
result_table[['geneSymbol.value', 'variantLabel.value', 'treatmentLabel.value', 'diseaseLabel.value', 'referenceLabel.value']]


# ### Negative therapeutic predictors
#
# We can use the **variants occurrences** annotated in the local database for querying associated **negative response predictions** to drugs. We also retrieve the scientific article supporting the evidence and the related medical condition.

# In[40]:


my_query = """
PREFIX :
PREFIX wdt:
PREFIX owl:
PREFIX pq:
PREFIX ps:
PREFIX pr:
PREFIX p:
PREFIX prov:
PREFIX wikibase:
PREFIX bd:
PREFIX rdfs:

select distinct ?geneSymbol ?variantLabel ?treatmentLabel ?diseaseLabel ?referenceLabel
where {
    SERVICE {
        ?variant wdt:P3329 ?id .
        ?variant p:P3355 [ ps:P3355 ?treatment ;
                           pq:P2175 ?disease ;
                           prov:wasDerivedFrom ?source ].
        ?source pr:P248 ?reference
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en" .
            ?variant rdfs:label ?variantLabel .
            ?treatment rdfs:label ?treatmentLabel .
            ?disease rdfs:label ?diseaseLabel .
            ?reference rdfs:label ?referenceLabel
        }
    }
    ?case a :Case ;
        :hasDescendant ?mouse ;
        :hasDescendant ?node .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref a :drug_response .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?variant .
    ?gene :has_variant ?variant.
    OPTIONAL {?variant :alt_p ?alt_p }
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?variant a ?annotation_Type.
    VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
}
order by ?geneSymbol
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go
result_table[['geneSymbol.value', 'variantLabel.value', 'treatmentLabel.value', 'diseaseLabel.value', 'referenceLabel.value']]


# ### Drugs predictions for a specific case
#
# We can use the connection to Wikidata for querying evidence of drug response predictions (i.e., positive or negative) associated with variants harbored by a specific case (here, _id=CRC0481_).
#
# **Positive response predictions**

# In[41]:


my_query = """
PREFIX :
PREFIX wdt:
PREFIX owl:
PREFIX pq:
PREFIX ps:
PREFIX pr:
PREFIX p:
PREFIX prov:
PREFIX wikibase:
PREFIX bd:
PREFIX rdfs:

select distinct ?geneSymbol ?variantLabel ?treatmentLabel ?diseaseLabel ?referenceLabel
where {
    SERVICE {
        ?variant wdt:P3329 ?id .
        ?variant p:P3354 [ ps:P3354 ?treatment ;
                           pq:P2175 ?disease ;
                           prov:wasDerivedFrom ?source ].
        ?source pr:P248 ?reference
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en" .
            ?variant rdfs:label ?variantLabel .
            ?treatment rdfs:label ?treatmentLabel .
            ?disease rdfs:label ?diseaseLabel .
            ?reference rdfs:label ?referenceLabel .
        }
    }
    :CRC0481 :hasDescendant ?node .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?variant .
    ?gene :has_variant ?variant.
    OPTIONAL {?variant :alt_p ?alt_p }
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?variant a ?annotation_Type.
    VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
}
order by ?geneSymbol
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go
result_table[['geneSymbol.value', 'variantLabel.value', 'treatmentLabel.value', 'diseaseLabel.value', 'referenceLabel.value']]


# **Negative response predictions**

# In[42]:


my_query = """
PREFIX :
PREFIX wdt:
PREFIX owl:
PREFIX pq:
PREFIX ps:
PREFIX pr:
PREFIX p:
PREFIX prov:
PREFIX wikibase:
PREFIX bd:
PREFIX rdfs:

select distinct ?geneSymbol ?variantLabel ?treatmentLabel ?diseaseLabel ?referenceLabel
where {
    SERVICE {
        ?variant wdt:P3329 ?id .
        ?variant p:P3355 [ ps:P3355 ?treatment ;
                           pq:P2175 ?disease ;
                           prov:wasDerivedFrom ?source ].
        ?source pr:P248 ?reference
        SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en" .
            ?variant rdfs:label ?variantLabel .
            ?treatment rdfs:label ?treatmentLabel .
            ?disease rdfs:label ?diseaseLabel .
            ?reference rdfs:label ?referenceLabel .
        }
    }
    :CRC0481 :hasDescendant ?node .
    ?node :has_annotation ?ann2 .
    ?ann2 :has_reference ?variant .
    ?gene :has_variant ?variant.
    OPTIONAL {?variant :alt_p ?alt_p }
    ?gene :symbol ?geneSymbol
    VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
    ?variant a ?annotation_Type.
    VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
}
order by ?geneSymbol
"""

# get data
result_table = utils.query(SEMALYTICS_ENDPOINT, my_query)

# there you go
result_table[['geneSymbol.value', 'variantLabel.value', 'treatmentLabel.value', 'diseaseLabel.value', 'referenceLabel.value']]


# ## Appendix - PoC figures
#
# In this final section we list the links between figures in the paper and the queries or computations in this notebook.
# ### Figure 4b (left)
#
# The **pie chart on the left** (i.e., response fractions in trees with no variants) can be obtained with the following query:

# In[43]:


# response fractions in trees with no variants
my_query = """
PREFIX :
PREFIX onto:
PREFIX sesame:

select (count(distinct ?case) as ?cases) ?type
from onto:disable-sameAs
where {
    ?case a :Case ;
        :hasDescendant ?mouse .
    ?mouse a :Biomouse ;
        :has_annotation ?ann .
    ?ann :has_reference ?ref .
    ?ref sesame:directType ?type .
    filter not exists {
        ?case :hasDescendant ?node .
        ?node :has_annotation ?ann2 .
        ?ann2 :has_reference ?ref2 .
        ?gene :has_variant ?ref2.
        ?gene :symbol ?geneSymbol
        VALUES ?geneSymbol {'KRAS' 'EGFR' 'BRAF' 'ERBB2'}
        ?ref2 a ?annotation_Type.
        VALUES ?annotation_Type { :sequence_alteration :feature_amplification }
    }
}
group by ?type
"""

# get data
result_table_no_var = utils.query(SEMALYTICS_ENDPOINT, my_query)
result_table_no_var[['type.value','cases.value']]


# In[44]:


# one count per response class; .iloc[0] picks the single matching row
results = [
    int(result_table_no_var[result_table_no_var['type.value'].str.contains('_OR')]['cases.value'].iloc[0]),
    int(result_table_no_var[result_table_no_var['type.value'].str.contains('_SD')]['cases.value'].iloc[0]),
    int(result_table_no_var[result_table_no_var['type.value'].str.contains('_PD')]['cases.value'].iloc[0])
]

d = {'response_type': ['response', 'neutral', 'progression'], 'cases': results}
df = pd.DataFrame(data=d)

chart = df.plot.pie(y='cases',
                    rot=0,
                    #labels = df['response_type'], # labels
                    labels=None,
                    legend=False,
                    figsize=(5, 5),
                    colors=utils.response_colors(df['response_type']),
                    title='No variants' # title
                   ).set_ylabel('')


# ### Figure 4b (right)

# We computed the **pie chart on the right** (responses of cases with 1+ variant(s)) in this [cell](#fig4br).

# ### Figure 5a - Data matrix
#
# The data matrix in Figure 5a contains several charts.
#
# * Row _Mutations only_: the three charts of the first row are the ones computed in cells of section [Matching `sequence_alteration` only](#mseqsubset)
#
# * Row _Amplifications only_: charts of this row are computed in cells of section [Matching `feature_amplification` only](#mamplsubset)
#
# * Row _All cases_: these charts are generated in cells of section [Matching annotations in the investigation set](#mset)
#
# ### Figure 5b - Variants occurrences
#
# This figure shows the distribution of variants detected in cases that did not respond to Cetuximab. In particular, data coming from the query presented in section [Variants of non-responders](#variantsnonresp) are arranged in a pivot table to present the distributions of:
#
# * altered genes
# * alteration types per gene (mutations or amplifications)
# * mutations detected per gene
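# The three distributions listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the computation behind Figure 5b: the rows below are made up and merely mimic the columns returned by the non-responders query (case, gene symbol, variant type, `alt_p`), and `distinct_cases` is a hypothetical helper.

```python
from collections import Counter

# Made-up rows shaped like the non-responders result table:
# (case, gene symbol, variant type, protein alteration or None).
demo_rows = [
    ("case_1", "KRAS", "point_mutation", "G12D"),
    ("case_2", "KRAS", "point_mutation", "G13D"),
    ("case_2", "EGFR", "feature_amplification", None),
    ("case_3", "BRAF", "point_mutation", "V600E"),
]


def distinct_cases(rows, key):
    """Count distinct cases for each value of key(row); None keys are skipped."""
    seen = {(key(row), row[0]) for row in rows if key(row) is not None}
    return Counter(value for value, _case in seen)


# altered genes
genes = distinct_cases(demo_rows, lambda r: r[1])
# alteration types per gene
types_per_gene = distinct_cases(demo_rows, lambda r: (r[1], r[2]))
# mutations detected per gene (rows without alt_p are ignored)
mutations_per_gene = distinct_cases(
    demo_rows, lambda r: (r[1], r[3]) if r[3] else None
)

print(genes)
print(types_per_gene)
print(mutations_per_gene)
```

# Counting (value, case) pairs rather than rows keeps the convention used throughout the notebook: a case contributes at most once to each cell of the table.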