# Using Qurro with Arbitrary Compositional Data

Although Qurro was initially designed for use with microbiome sequencing data, it can totally be used on any sort of compositional data. The main challenge is just getting your data formatted properly.

We're going to demonstrate this by creating a Qurro visualization from "color composition data for 22 abstract paintings." These data were taken from Table 1 of Aitchison and Greenacre (2002).

## Requirements

This notebook relies on Qurro and seaborn being installed.

## 0. Setting up

In this section, we replace the output directory with an empty directory. This just lets us run this notebook multiple times, without any tools complaining about overwriting files.

In [1]:
# Clear the output directory so we can write these files there
!rm -rf output/*
# Since git doesn't keep track of empty directories, create the output/ directory if it doesn't already exist
# (if it does already exist, -p ensures that an error won't be thrown)
!mkdir -p output
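
If you'd rather not rely on shell commands (for example, on a system without `rm` and `mkdir`), the same reset can be sketched with Python's standard library. The `reset_output_dir` helper name is made up for this illustration. Note one difference: `rm -rf output/*` leaves hidden files in place, whereas this version removes the whole directory before recreating it.

```python
import shutil
from pathlib import Path


def reset_output_dir(path):
    """Delete `path` (if it exists) and recreate it as an empty directory.

    Hypothetical helper for illustration; not part of Qurro.
    """
    out = Path(path)
    if out.exists():
        # Remove the directory and everything inside it
        shutil.rmtree(out)
    # parents=True mirrors the behavior of `mkdir -p`
    out.mkdir(parents=True)
    return out
```

Calling `reset_output_dir("output")` at the top of the notebook would then play the same role as the two shell commands above.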


## 1. Getting the input data ready

At minimum, three files are needed to generate a Qurro visualization. This section goes into detail on each of these three files, and what they look like for the color composition data.

### 1.1. Feature Table

This is a table of abundance data detailing the frequencies of features in samples. Qurro expects this table to be in the BIOM format, but fortunately converting TSV files to BIOM isn't too bad.

#### 1.1.1. Wait, hold on, what do you mean by "features" and "samples"?

In the color composition data, we consider each of the 22 paintings as a sample, and each color (e.g. Red) as a feature.

#### 1.1.2. Viewing the example file

We've provided a TSV file input/color-table.tsv containing the color composition data for the 22 paintings. Notice how the columns are samples, and the rows are features.

In [2]:
from qurro._metadata_utils import read_metadata_file
table = read_metadata_file("input/color-table.tsv")
table.head()

Out[2]:
1 2 3 4 5 6 7 8 9 10 ... 13 14 15 16 17 18 19 20 21 22
FeatureID
Black 0.125 0.143 0.147 0.164 0.197 0.157 0.153 0.115 0.178 0.164 ... 0.155 0.126 0.199 0.163 0.136 0.184 0.169 0.146 0.200 0.135
White 0.243 0.224 0.231 0.209 0.151 0.256 0.232 0.249 0.167 0.183 ... 0.251 0.273 0.170 0.196 0.185 0.152 0.207 0.240 0.172 0.225
Blue 0.153 0.111 0.058 0.120 0.132 0.072 0.101 0.176 0.048 0.158 ... 0.091 0.045 0.080 0.107 0.162 0.110 0.111 0.141 0.059 0.217
Red 0.031 0.051 0.129 0.047 0.033 0.116 0.062 0.025 0.143 0.027 ... 0.085 0.156 0.076 0.054 0.020 0.039 0.057 0.038 0.120 0.019
Yellow 0.181 0.159 0.133 0.178 0.188 0.153 0.170 0.176 0.118 0.186 ... 0.161 0.131 0.158 0.144 0.193 0.165 0.156 0.184 0.136 0.187

5 rows × 22 columns
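
As a quick sanity check on what "compositional" means here, the sketch below rebuilds the first two paintings' columns by hand and confirms that each one sums to roughly 1. The first five rows are copied from the table above. The sixth row, `Other`, is an assumption: its name and values are taken from the `proportion_other` column of the sample metadata, since the preview above only shows five rows.

```python
import numpy as np

# Feature names; "Other" is assumed (it is not visible in the 5-row preview)
features = ["Black", "White", "Blue", "Red", "Yellow", "Other"]
samples = ["1", "2"]

# Rows are features and columns are samples, matching the TSV file's layout
table = np.array([
    [0.125, 0.143],  # Black
    [0.243, 0.224],  # White
    [0.153, 0.111],  # Blue
    [0.031, 0.051],  # Red
    [0.181, 0.159],  # Yellow
    [0.266, 0.313],  # Other (assumed row)
])

# Each sample's proportions should add up to about 1 (allowing for rounding)
column_sums = table.sum(axis=0)
is_closed = np.allclose(column_sums, 1.0, atol=0.005)

# Log-ratios are only defined for strictly positive values
all_positive = bool((table > 0).all())

print(dict(zip(samples, column_sums.round(3))), is_closed, all_positive)
```

The published proportions are rounded to three decimal places, so the sums are close to, but not exactly, 1.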

#### 1.1.3. Converting from TSV to BIOM

We need to convert this TSV file to a BIOM file that can be used with Qurro:

In [3]:
!biom convert \
    -i input/color-table.tsv \
    --to-json \
    -o output/color-table.biom


#### 1.1.4. Summarizing the newly created BIOM file

The | head -4 thing below just means "only show the first four lines of the output summary."

In [4]:
!biom summarize-table -i output/color-table.biom | head -4

Num samples: 22
Num observations: 6
Total count: 21
Table density (fraction of non-zero values): 1.000


### 1.2. Sample Metadata

This is a file containing descriptive information about samples, where each sample has a row in the file and each sample metadata field has a column in the file. Qurro expects this to be a TSV file.

#### 1.2.1. What sort of "metadata" do we have for the color composition data?

We don't have much, honestly. Just from Table 1 in Aitchison and Greenacre (2002), all we really know about a given painting is its color composition.

For illustrative purposes (we need some sort of sample metadata to run Qurro), we've added proportion_blue, proportion_black, etc. columns to the sample metadata, as well as a data_source column which is just AitchisonGreenacre2002 for all samples. These columns are obviously a bit silly; if we were super interested in studying why certain paintings seem different, you could imagine us taking the time to investigate and then adding in more useful metadata columns like artist, date painted, canvas height, etc.
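
To make the description above concrete, here is a minimal pandas sketch of how columns like these could be derived from a feature table. This is an illustration only, not necessarily how the provided metadata file was produced. The two-painting table is rebuilt by hand, and the `Other` row is assumed from the `proportion_other` metadata column.

```python
import pandas as pd

# Tiny stand-in for the feature table: rows are features, columns are samples
table = pd.DataFrame(
    {
        "1": [0.125, 0.243, 0.153, 0.031, 0.181, 0.266],
        "2": [0.143, 0.224, 0.111, 0.051, 0.159, 0.313],
    },
    index=pd.Index(
        ["Black", "White", "Blue", "Red", "Yellow", "Other"], name="FeatureID"
    ),
)

# Sample metadata has one row per sample, so transpose the table
metadata = table.T
# Rename e.g. "Red" to "proportion_red"
metadata.columns = ["proportion_" + name.lower() for name in metadata.columns]
metadata.index.name = "SampleID"
# Constant column recording where the data came from
metadata["data_source"] = "AitchisonGreenacre2002"

print(metadata)
```

Writing the result out with `metadata.to_csv(path, sep="\t")` would give a TSV file with the sample IDs in the first column.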

#### 1.2.2. Viewing the example file

We've provided an example TSV file, input/color-sample-metadata.tsv, containing the sample metadata for the color composition data. This file is suitable as-is for use in Qurro as sample metadata.

In [5]:
metadata = read_metadata_file("input/color-sample-metadata.tsv")
metadata.head()

Out[5]:
proportion_black proportion_white proportion_blue proportion_red proportion_yellow proportion_other data_source
SampleID
1 0.125 0.243 0.153 0.031 0.181 0.266 AitchisonGreenacre2002
2 0.143 0.224 0.111 0.051 0.159 0.313 AitchisonGreenacre2002
3 0.147 0.231 0.058 0.129 0.133 0.303 AitchisonGreenacre2002
4 0.164 0.209 0.120 0.047 0.178 0.282 AitchisonGreenacre2002
5 0.197 0.151 0.132 0.033 0.188 0.299 AitchisonGreenacre2002

### 1.3. Feature Rankings

By "feature rankings," we usually mean either the feature loadings in a biplot or "differentials." Please see Qurro's paper (preprint here) for more details on what these terms mean.

In the next section we're going to generate a biplot for the color composition abundance data using Aitchison PCA, and use the feature loadings in that biplot as the feature rankings.

## 2. Generating and visualizing a compositional biplot

### 2.1. Running Aitchison PCA

We generate the biplot using Aitchison PCA, wherein we take the singular value decomposition of the centered log-ratio (CLR) transform of the feature table.

The biplot drawn by the cell below looks pretty similar to the biplot figures for these data shown in Aitchison and Greenacre (2002). Some of the axes are inverted compared to that paper's biplots (i.e. here Red points to the right and Blue points to the left, whereas in the 2002 paper it's the opposite), but the interpretation should be the same.

(One fun tidbit: if you're wondering why painting 20 here seems incorrectly placed compared to the A&G 2002 paper, it's because there's a small error in some of that paper's figures! See here for details.)
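
If "the SVD of the centered log-ratio transform" sounds abstract, the following NumPy sketch shows the basic idea on three paintings. It is a simplified illustration under stated assumptions, not the `apca` helper used in the next cell (which may scale or normalize its results differently). The `clr` function and variable names are made up for this example, and the `Other` column is assumed from the sample metadata.

```python
import numpy as np


def clr(compositions):
    """Centered log-ratio transform; each row is one sample's composition."""
    logs = np.log(compositions)
    # Subtract each sample's mean log-proportion (the log of its geometric mean)
    return logs - logs.mean(axis=1, keepdims=True)


# Paintings 1-3; columns are Black, White, Blue, Red, Yellow, Other (assumed)
x = np.array([
    [0.125, 0.243, 0.153, 0.031, 0.181, 0.266],
    [0.143, 0.224, 0.111, 0.051, 0.159, 0.313],
    [0.147, 0.231, 0.058, 0.129, 0.133, 0.303],
])

z = clr(x)
# Center each feature across samples, as in ordinary PCA
z_centered = z - z.mean(axis=0, keepdims=True)

# Singular value decomposition: z_centered = u @ diag(s) @ vt
u, s, vt = np.linalg.svd(z_centered, full_matrices=False)

sample_scores = u * s      # one row per painting
feature_loadings = vt.T    # one row per color

print(sample_scores[:, :2])
print(feature_loadings[:, :2])
```

Each CLR-transformed row sums to zero, which is what makes differences between features interpretable as log-ratios. Also note that the sign of each axis from an SVD is arbitrary, which is one reason axes can come out inverted between two otherwise equivalent biplots.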

In [6]:
from plotting_helper import apca, draw_painting_biplot

# Perform Aitchison PCA
ordination = apca(table.astype(float))

# Style and draw the biplot, using the first and second principal components
# https://github.com/jupyter/notebook/issues/3523#issuecomment-534379015
%matplotlib inline
draw_painting_biplot(ordination, "Axis 1", "Axis 2")


When we used Aitchison PCA above, we got a scikit-bio OrdinationResults object. This contains the sample and feature loadings underlying the biplot that was generated, as well as some additional information. (If you're interested in more details, we encourage you to check out the plotting_helper.py code provided in this folder.)

In [7]:
ordination.features.head()

Out[7]:
Axis 1 Axis 2
FeatureID
Black 0.064761 -0.544208
White 0.020050 0.724314
Blue -0.541021 0.119259
Red 0.822854 0.130236
Yellow -0.153330 0.028251

In [8]:
ordination.samples.head()

Out[8]:
Axis 1 Axis 2
SampleID
1 -0.201856 0.229034
2 -0.030162 0.070389
3 0.287573 0.124102
4 -0.064628 -0.006208
5 -0.159943 -0.367294

### 2.2. Export the ordination information to a file

This will enable us to use the feature loadings contained in this file as feature rankings in Qurro.

In [9]:
ordination.write("output/apca-ordination.txt")

Out[9]:
'output/apca-ordination.txt'

### 2.3. Adding the sample loadings to the sample metadata

As we mentioned before, we don't really have a lot of information about these paintings. One thing we do have now, though, is a set of loadings in the biplot for each sample. You can imagine visualizing these loadings in relation to a selected log-ratio (for example, as shown in the bottom four sub-figures of Fig. 5 in Martino et al. 2019).

In [10]:
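merged_metadata = metadata.merge(
    ordination.samples,
    how="left",
    left_index=True,
    right_index=True,
    # With no suffixes available, pandas should raise an error on overlapping
    # column names rather than silently renaming them
    suffixes=(False, False),
)
# Save the merged metadata so Qurro can read it (the filename is an assumption)
merged_metadata.to_csv("output/color-sample-metadata-merged.tsv", sep="\t")
merged_metadata.head()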

Out[10]:
proportion_black proportion_white proportion_blue proportion_red proportion_yellow proportion_other data_source Axis 1 Axis 2
SampleID
1 0.125 0.243 0.153 0.031 0.181 0.266 AitchisonGreenacre2002 -0.201856 0.229034
2 0.143 0.224 0.111 0.051 0.159 0.313 AitchisonGreenacre2002 -0.030162 0.070389
3 0.147 0.231 0.058 0.129 0.133 0.303 AitchisonGreenacre2002 0.287573 0.124102
4 0.164 0.209 0.120 0.047 0.178 0.282 AitchisonGreenacre2002 -0.064628 -0.006208
5 0.197 0.151 0.132 0.033 0.188 0.299 AitchisonGreenacre2002 -0.159943 -0.367294

## 3. Running Qurro

Now that we have everything ready, we can finally use Qurro with this data.

### 3.1. Listing the available command-line options

In [11]:
!qurro --help

Usage: qurro [OPTIONS]

Generates a visualization of feature rankings and log-ratios.

The resulting visualization contains two plots. The first plot shows how
features are ranked, and the second plot shows the log-ratio of "selected"
features' abundances within samples.

The visualization is interactive, so which features are "selected" to
construct log-ratios -- as well as various other properties of the
visualization -- can be changed by the user.

Options:
  -r, --ranks TEXT                Either feature differentials (contained in a
                                  TSV file, where each row describes a feature
                                  and each column describes a differential
                                  field) or a scikit-bio OrdinationResults
                                  file for a biplot (containing feature
                                  loadings). When sorted numerically, these
                                  provide 'rankings.'  [required]
  -t, --table TEXT                A BIOM table describing the abundances of
                                  the ranked features in samples. Note that
                                  empty samples and features will be removed
                                  from the Qurro visualization.  [required]
  -sm, --sample-metadata TEXT     Sample metadata, formatted as a TSV file
                                  (where each row describes a sample and each
                                  column describes a 'metadata' field, and the
                                  first column contains sample IDs). In Qurro
                                  visualizations, you can use sample metadata
                                  fields to change the x-axis and colors in
                                  the sample plot.  [required]
  -fm, --feature-metadata TEXT    Feature metadata, formatted as a TSV file
                                  (where each row describes a feature and each
                                  column describes a 'metadata' field, and the
                                  first column contains feature IDs). In Qurro
                                  visualizations, you can use feature metadata
                                  fields to filter features in the rank plot
                                  when selecting log-ratios.
  -o, --output-dir TEXT           Directory to write the HTML/JS/... files
                                  defining a Qurro visualization to. If this
                                  directory already exists, files
                                  already within it will be overwritten if
                                  necessary. Note that you need to keep the
                                  files in this directory together -- moving
                                  the index.html file in this directory to
                                  another location, without also moving the
                                  JS/etc. files, will break the visualization.
                                  [required]
  -x, --extreme-feature-count INTEGER
                                  If specified, Qurro will only use this many
                                  "extreme" features from both ends of all of
                                  the rankings. This is useful when dealing
                                  with huge datasets (e.g. with BIOM tables
                                  exceeding 1 million entries), for which
                                  running Qurro normally might take a long
                                  amount of time or crash due to memory
                                  limits. Note that the automatic removal of
                                  empty samples and features from the table
                                  will be done *after* this filtering step.
  --debug                         If this flag is used, Qurro will output
                                  debug messages.
  --version                       Show the version and exit.
  --help                          Show this message and exit.


### 3.2. Generating a Qurro visualization

Our inputs will be the following three files:

• Feature table: The BIOM table we generated in section 1.1.3 above.
• Sample metadata: The merged metadata file we generated in section 2.3 above.
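• Feature rankings: The feature loadings we exported in section 2.2 above.

In [12]:
# NOTE: the --sample-metadata path below is an assumed filename; point it at
# wherever you saved the merged sample metadata from section 2.3.
!qurro \
    --table output/color-table.biom \
    --ranks output/apca-ordination.txt \
    --sample-metadata output/color-sample-metadata-merged.tsv \
    --output-dir output/qurro-viz/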

/home/marcus/Dropbox/Work/KnightLab/qurro/qurro/_df_utils.py:126: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)
Successfully generated a visualization in the folder output/qurro-viz/.
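
If you want to launch Qurro from a plain Python script rather than a notebook cell, one option is to assemble the command as a list of arguments. The sketch below only builds and prints the command. The `build_qurro_command` helper and the sample metadata filename are assumptions made for illustration, while the option names are the ones Qurro's help text marks as required.

```python
import shlex


def build_qurro_command(ranks, table, sample_metadata, output_dir):
    """Return the argument list for a standalone Qurro run.

    Hypothetical helper for illustration; not part of Qurro itself.
    """
    return [
        "qurro",
        "--ranks", ranks,
        "--table", table,
        "--sample-metadata", sample_metadata,
        "--output-dir", output_dir,
    ]


command = build_qurro_command(
    ranks="output/apca-ordination.txt",
    table="output/color-table.biom",
    # Assumed filename for the merged sample metadata
    sample_metadata="output/color-sample-metadata-merged.tsv",
    output_dir="output/qurro-viz/",
)

# Show the command as it would be typed in a shell
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would then run Qurro and raise an error if it exits unsuccessfully.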


### 3.3. Interacting with the Qurro visualization

The command you just ran will generate a folder containing the Qurro visualization. To view the visualization, you can just open up the index.html file contained within this folder in a web browser. You should see something like this:

The top-left of the screen contains the rank plot: a plot showing the loadings for each feature for a selected axis or principal component. The top-right of the screen contains the sample plot: a plot that will show how a selected log-ratio of features looks for all of the samples.

Things look pretty blank right now, since nothing is selected. Let's fix that!

#### 3.3.1. Selecting a log-ratio

One thing that's clear from looking at the biplot visualization we generated earlier is that Red and Blue seemed to differentiate samples along Axis 1. Looking at the rank plot for Axis 1 confirms this -- check out how the magnitudes of Red and Blue for the Axis 1 feature loadings are relatively larger than the other colors.

So, let's try seeing how the Red:Blue log-ratio looks in Qurro. To select a log-ratio of individual features, you can just click on the rank plot -- the first click sets the new numerator and the second click sets the new denominator. In this case, we're going to click on the rightmost bar (Red), and then the leftmost bar (Blue).
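
The rank plot is essentially the feature loadings for one axis, sorted from lowest to highest. As a rough illustration, the sketch below sorts the five Axis 1 loadings printed in section 2 (the sixth feature's loading isn't shown in that preview, so it is left out here) and picks the two ends as the denominator and numerator.

```python
# Axis 1 feature loadings copied from the ordination preview above
axis_1_loadings = {
    "Black": 0.064761,
    "White": 0.020050,
    "Blue": -0.541021,
    "Red": 0.822854,
    "Yellow": -0.153330,
}

# Sort features from the lowest loading (leftmost bar) to the highest (rightmost)
ranked = sorted(axis_1_loadings.items(), key=lambda item: item[1])

denominator = ranked[0][0]   # lowest-ranked feature
numerator = ranked[-1][0]    # highest-ranked feature

print([name for name, _ in ranked])
print(f"log-ratio to try: {numerator}:{denominator}")
```

Among these five features, Red and Blue sit at the two ends of the ranking, which matches the two bars we clicked on.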

#### 3.3.2. Adjusting the sample plot

The sample plot just looks like a bunch of noise! Mostly, this is because the way the sample plot x-axis is set up doesn't make sense: it's set to proportion_black (which we don't really have a reason to expect would be associated with the Red:Blue log-ratio), and its scale type is set to Categorical (despite the fact that proportion_black is a quantitative field). If we set the sample plot x-axis to proportion_red, and change up some of the other sample plot controls, we get a much more useful visualization:

So we can see from here that the Red:Blue log-ratio is strongly correlated with the proportion of Red in a given painting. Hopefully this makes sense! All the sample plot is showing is that ln(r / b) is correlated with r, where r and b are the proportions of Red and Blue; since r is in the numerator, that shouldn't be too surprising.
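
To see this relationship outside of Qurro, here is a small NumPy sketch that computes ln(r / b) for the five paintings previewed in the sample metadata table and checks how it correlates with r. It uses only five of the 22 paintings, so treat it as a rough illustration rather than a summary of the full dataset.

```python
import numpy as np

# proportion_red and proportion_blue for paintings 1-5 (from the metadata preview)
red = np.array([0.031, 0.051, 0.129, 0.047, 0.033])
blue = np.array([0.153, 0.111, 0.058, 0.120, 0.132])

# The natural log of the Red:Blue ratio for each painting
log_ratio = np.log(red / blue)

# Pearson correlation between the proportion of Red and the log-ratio
correlation = np.corrcoef(red, log_ratio)[0, 1]

print(log_ratio.round(3))
print(round(float(correlation), 3))
```

Painting 3 is the only one of these five with more Red than Blue, so it is the only one with a positive log-ratio.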

But we have some other things we can try out.

Remember the sample loadings we merged into the metadata a while back? We can use those here, and replicate the sorts of figures shown in Fig. 5 in Martino et al. 2019.

We know that Red and Blue differentiate samples along Axis 1, so let's look at how the Axis 1 sample loadings are correlated with the Red:Blue log-ratio. We already have that log-ratio selected, so all we need to do is change the sample plot x-axis field from proportion_red to Axis 1:

That's cool. We can see that the Red:Blue log-ratio is highly correlated with the Axis 1 sample loadings of paintings, which confirms our observations from looking at the biplot visualization.
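
One simple way to check this kind of agreement by hand is to compare orderings: if two quantities are strongly positively associated, sorting the paintings by either one should give a similar order. The sketch below does this for the five paintings previewed earlier, using their Axis 1 sample loadings and Red:Blue log-ratios. Again, this covers only five of the 22 paintings.

```python
import numpy as np

paintings = np.array(["1", "2", "3", "4", "5"])

# Axis 1 sample loadings for paintings 1-5 (from the ordination preview)
axis_1 = np.array([-0.201856, -0.030162, 0.287573, -0.064628, -0.159943])

# Red:Blue log-ratios for the same paintings
red = np.array([0.031, 0.051, 0.129, 0.047, 0.033])
blue = np.array([0.153, 0.111, 0.058, 0.120, 0.132])
log_ratio = np.log(red / blue)

# Painting IDs sorted from lowest to highest value of each quantity
order_by_axis = paintings[np.argsort(axis_1)]
order_by_ratio = paintings[np.argsort(log_ratio)]

same_order = bool((order_by_axis == order_by_ratio).all())

print(order_by_axis, order_by_ratio, same_order)
```

For these five paintings, the two orderings are identical.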

As an exercise for the reader: try switching the rank plot's Feature Loading to Axis 2, then try selecting the log-ratio of White:Black. How does this look when we view samples' Axis 1 sample loadings? How does this look when we view samples' Axis 2 sample loadings? (This is shown below.) What differences do you see, and why do you think these are the case?