(fcb-notebook-2)=

Structuring data

{panels}
:container: container-lg pb-3
:column: col-lg-3 col-md-4 col-sm-6 col-xs-12 p-1
:card: rounded

<i class='fa fa-qrcode fa-2x' style='color:#7e0038;'></i>
^^^
<h4><b>Recipe metadata</b></h4>
 identifier: <a href="http://w3id.org/faircookbook/FCB039">FCB039</a>
 version: <a href="">v1.0</a>

---
<i class='fa fa-fire fa-2x' style='color:#7e0038;'></i>
^^^
<h4><b>Difficulty level</b></h4>
<i class='fa fa-fire fa-lg' style='color:#7e0038;'></i>
<i class='fa fa-fire fa-lg' style='color:#7e0038;'></i>
<i class='fa fa-fire fa-lg' style='color:#7e0038;'></i>
<i class='fa fa-fire fa-lg' style='color:#7e0038;'></i>
<i class='fa fa-fire fa-lg' style='color:lightgrey'></i>

---
<i class='fas fa-clock fa-2x' style='color:#7e0038;'></i>
^^^
<h4><b>Reading Time</b></h4>
<i class='fa fa-clock fa-lg' style='color:#7e0038;'></i> 30 minutes
<h4><b>Recipe Type</b></h4>
<i class='fa fa-globe fa-lg' style='color:#7e0038;'></i> Hands-on
<h4><b>Executable Code</b></h4>
<i class='fa fa-play-circle fa-lg' style='color:#7e0038;'></i> Yes

---
<i class='fa fa-users fa-2x' style='color:#7e0038;'></i>
^^^
<h4><b>Intended Audience</b></h4>
<!-- <p> <i class='fa fa-user-md fa-lg' style='color:#7e0038;'></i> Principal Investigator </p> -->
<p> <i class='fa fa-database fa-lg' style='color:#7e0038;'></i> Data Manager </p>
<p> <i class='fa fa-wrench fa-lg' style='color:#7e0038;'></i> Data Scientist </p>

Background:

Experimental results such as the metabolite profiling data published in [1,2] can be reported straightforwardly using OKFN Data Packages. Such packages are easily parsed into data frames and used for data visualization with libraries that implement the grammar of graphics. Here, we show how to use plotnine, the Python equivalent of ggplot2, R's rich graphics library (https://ggplot2.tidyverse.org/). A few lines of code are enough to query the datasets and explore the information rapidly. Most importantly, this rapid exploration is possible because the independent variables and their levels have been clearly and unambiguously declared in the Tabular Data Package itself.
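
As a rough illustration of why declared variables matter, the sketch below builds a tiny, hypothetical Data Package descriptor in memory and lists the factor levels it declares. The package, resource and field names and the levels are assumptions made for the example. Only the general layout (`resources`, `schema`, `fields`, `constraints.enum`) follows the Table Schema convention.

```python
import json

# Hypothetical descriptor: names and levels are illustrative,
# not taken from the real rose dataset package
descriptor_text = """
{
  "name": "example-package",
  "resources": [
    {
      "name": "example-table",
      "path": "example-table.csv",
      "schema": {
        "fields": [
          {"name": "chemical_name", "type": "string"},
          {"name": "treatment", "type": "string",
           "constraints": {"enum": ["control", "treated"]}},
          {"name": "sample_mean", "type": "number"}
        ]
      }
    }
  ]
}
"""


def declared_levels(descriptor):
    """Map each field that declares an enum constraint to its allowed levels."""
    levels = {}
    for resource in descriptor["resources"]:
        for field in resource["schema"]["fields"]:
            enum = field.get("constraints", {}).get("enum")
            if enum is not None:
                levels[field["name"]] = enum
    return levels


descriptor = json.loads(descriptor_text)
levels = declared_levels(descriptor)
print(levels)
```

Because the levels are read from the descriptor rather than inferred from the values, a plotting script can know in advance which groups to expect.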

  1. Let's begin by importing the Python packages that allow easy access to, and use of, data formatted as a JSON Data Package
In [ ]:
import pandas as pd
import numpy as np
from plotnine import *
  1. Reading the data

We now simply read in the comma-separated values (CSV) file associated with the Tabular Data Package (a "long" table).

In [ ]:
data = pd.read_csv("../data/processed/rose-data/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv")

Alternatively, one could read the relevant data file from the corresponding Zenodo dataset:

In [ ]:
#data = pd.read_csv("https://zenodo.org/api/files/ba3fbc84-14af-4858-a9ed-e6cfe8d4efd2/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv") 
In [ ]:
data.head()
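
Before plotting, it can help to confirm that the table really is in "long" form, with one row per compound and treatment. The sketch below does this on a small made-up table. The column names mirror those used in this recipe, but the values and the level names `A` and `B` are assumptions.

```python
import io
import pandas as pd

# Made-up values; only the column names mirror those used in this recipe
toy_csv = io.StringIO(
    "chemical_name,treatment,sample_mean\n"
    "compound_a,A,1.5\n"
    "compound_a,B,2.5\n"
    "compound_b,A,0.5\n"
    "compound_b,B,1.0\n"
)
toy = pd.read_csv(toy_csv)

# In a long table, each row is one (chemical, treatment) observation
levels = sorted(toy["treatment"].unique())
rows_per_level = toy.groupby("treatment").size().to_dict()

# Count repeated (chemical, treatment) pairs; zero means one value per cell
duplicates = int(toy.duplicated(["chemical_name", "treatment"]).sum())

print(levels, rows_per_level, duplicates)
```

The same three checks could be run on `data` itself, replacing `toy` with the data frame read above.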
  1. Plotting the data: we then generate a bar plot using the Python plotnine library, which offers functionality similar to that of the R ggplot2 package
In [ ]:
# width = figure_size[0]
# height = figure_size[0] * aspect_ratio
gray = '#666666'
orange = '#FF8000'
blue = '#3333FF'

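# Faceted bar chart: one panel per treatment, one bar per chemical
p1 = (ggplot(data)
 + aes('chemical_name', 'sample_mean', fill='factor(treatment)')
 + geom_col()
 + theme(axis_text_x=element_text(rotation=90, hjust=1, fontsize=6, color=blue))
 + theme(axis_text_y=element_text(rotation=90, hjust=2, fontsize=6, color=orange))
 + scale_y_continuous(expand=(0, 0))
 + facet_wrap('~treatment', dir='v', ncol=1)
 + theme(figure_size=(8, 16))
)

# Note: this expression builds a variant with a blue panel background,
# but its result is not assigned, so p1 itself is left unchanged
p1 + theme(panel_background=element_rect(fill=blue))

# Display the plot
p1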
  1. Let's now compare the dataset generated in 2015 and the dataset generated in 2018

Both datasets were generated by the same team, on the same genotype (Rosa chinensis 'Old Blush') and organism part ('sepals'). Both are held in Tabular Data Packages with the same structure. To perform the comparison, we simply created another Tabular Data Package, which retains exactly the same structure and holds only the measurements for the relevant conditions extracted from each dataset (the function used to create this file is omitted).

In [ ]:
ng2018sc2015 = pd.read_csv("../data/processed/rose-data/rose_aroma_compound_science2015_vs_NG2018_data_integration.csv")
# ng2018sc2015 = pd.read_csv("https://zenodo.org/api/files/268f29fc-8ead-4049-bb86-181b72073682/rose_aroma_compound_science2015_vs_NG2018_data_integration.csv")
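
The function that produced the integrated file is not shown in this recipe. As a hedged sketch of how such a table might be assembled, the example below stacks two made-up tables with the same structure and rescales each study to its own total. The input column `concentration` and all values are assumptions, as is the reading of `normalized_to_total_sum_concentration` as a fraction of the per-study total. This is not the authors' code.

```python
import pandas as pd

# Made-up measurements; 'concentration' is a hypothetical column name
study_2015 = pd.DataFrame({
    "chemical_name": ["compound_a", "compound_b"],
    "concentration": [2.0, 6.0],
})
study_2018 = pd.DataFrame({
    "chemical_name": ["compound_a", "compound_b"],
    "concentration": [1.0, 4.0],
})

# Tag each table with its publication year, then stack them into one long table
combined = pd.concat(
    [study_2015.assign(publication_year=2015),
     study_2018.assign(publication_year=2018)],
    ignore_index=True,
)

# Express each concentration as a fraction of its own study's total
totals = combined.groupby("publication_year")["concentration"].transform("sum")
combined["normalized_to_total_sum_concentration"] = combined["concentration"] / totals

print(combined)
```

Stacking works without any column mapping here only because both inputs share the same structure, which is the point of holding both studies in identically structured packages.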
  1. We generate another barplot, which shows the concentration of the chemicals targeted by the GC-MS profiling assay
In [ ]:
# One panel per publication year, one bar per chemical
(ggplot(ng2018sc2015)
 + aes('chemical_name', 'normalized_to_total_sum_concentration', fill='factor(publication_year)')
 + geom_col()
 + facet_wrap('~publication_year', dir='h', ncol=1)
 + theme(axis_text_x=element_text(rotation=90, hjust=1, fontsize=6))
)

What do we see? The figure shows how consistent the chemical profile of the scent is between the two studies, and that prevalent compounds such as X, Y, and Z show roughly similar relative amounts within and across studies.
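
The visual impression of consistency can also be summarised numerically. One possible approach, sketched below on made-up values, is to pivot the long table so that each publication year becomes a column and then correlate the two profiles. The compound names and amounts are assumptions, and the Pearson correlation is just one of several reasonable measures of agreement.

```python
import pandas as pd

# Made-up relative amounts in the same long layout as the integrated table
toy = pd.DataFrame({
    "chemical_name": ["compound_a", "compound_b", "compound_c"] * 2,
    "publication_year": [2015] * 3 + [2018] * 3,
    "normalized_to_total_sum_concentration": [0.6, 0.3, 0.1, 0.5, 0.4, 0.1],
})

# Long to wide: one row per chemical, one column per publication year
wide = toy.pivot(
    index="chemical_name",
    columns="publication_year",
    values="normalized_to_total_sum_concentration",
)

# Pearson correlation between the 2015 and 2018 profiles
agreement = wide[2015].corr(wide[2018])

print(wide)
print(agreement)
```

A value close to 1 would indicate that compounds abundant in one study are also abundant in the other.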