import sys
sys.version_info
sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)
This case study illustrates the use of the petl package for doing some simple profiling and comparison of data from two tables.
The files used in this case study can be downloaded from the following link:
Download and unzip the files:
!wget http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
!unzip -o petl-case-study-1-files.zip
--2015-01-19 17:37:39-- http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip Resolving aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)... 54.231.9.241 Connecting to aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)|54.231.9.241|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 3076773 (2.9M) [application/zip] Saving to: ‘petl-case-study-1-files.zip’ 100%[======================================>] 3,076,773 2.34MB/s in 1.3s 2015-01-19 17:37:41 (2.34 MB/s) - ‘petl-case-study-1-files.zip’ saved [3076773/3076773] Archive: petl-case-study-1-files.zip inflating: popdata.csv inflating: snpdata.csv
The first file (snpdata.csv
) contains a list of locations in the
genome of the malaria parasite P. falciparum, along with some basic
data about genetic variations found at those locations.
The second file (popdata.csv
) is supposed to contain the same list
of genome locations, along with some additional data such as allele
frequencies in different populations.
The main point for this case study is that the first file
(snpdata.csv
) contains the canonical list of genome locations, and
the second file (popdata.csv
) contains some additional data about
the same genome locations and therefore should be consistent with the
first file. We want to check whether this second file is in fact
consistent with the first file.
Start by importing the petl package:
import petl as etl
etl.__version__
'1.0.0'
To save some typing, let *a* be the table of data extracted from the
first file (snpdata.csv
), and let *b* be the table of data extracted
from the second file (popdata.csv
), using the fromcsv()
function:
a = etl.fromtsv('snpdata.csv')
b = etl.fromtsv('popdata.csv')
Examine the header from each file:
a.header()
('Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut', 'isTypable', 'GeneId', 'GeneAlias', 'GeneDescr')
b.header()
('Chromosome', 'Coordinates', 'Ref. Allele', 'Non-Ref. Allele', 'Outgroup Allele', 'Ancestral Allele', 'Derived Allele', 'Ref. Aminoacid', 'Non-Ref. Aminoacid', 'Private Allele', 'Private population', 'maf AFR', 'maf PNG', 'maf SEA', 'daf AFR', 'daf PNG', 'daf SEA', 'nraf AFR', 'nraf PNG', 'nraf SEA', 'Mutation type', 'Gene', 'Gene Aliases', 'Gene Description', 'Gene Information')
There is a common set of 9 fields that is present in both tables, and
we would like focus on comparing these common fields, however
different field names have been used in the two files. To simplify
comparison, use rename()
to rename some fields in the
second file:
b_renamed = b.rename({'Chromosome': 'Chr',
'Coordinates': 'Pos',
'Ref. Allele': 'Ref',
'Non-Ref. Allele': 'Nref',
'Derived Allele': 'Der',
'Mutation type': 'Mut',
'Gene': 'GeneId',
'Gene Aliases': 'GeneAlias',
'Gene Description': 'GeneDescr'})
b_renamed.header()
('Chr', 'Pos', 'Ref', 'Nref', 'Outgroup Allele', 'Ancestral Allele', 'Der', 'Ref. Aminoacid', 'Non-Ref. Aminoacid', 'Private Allele', 'Private population', 'maf AFR', 'maf PNG', 'maf SEA', 'daf AFR', 'daf PNG', 'daf SEA', 'nraf AFR', 'nraf PNG', 'nraf SEA', 'Mut', 'GeneId', 'GeneAlias', 'GeneDescr', 'Gene Information')
Use cut()
to extract only the fields we're interested in
from both tables:
common_fields = ['Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut', 'GeneId', 'GeneAlias', 'GeneDescr']
a_common = a.cut(common_fields)
b_common = b_renamed.cut(common_fields)
Inspect the data:
a_common
Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|
MAL1 | 91099 | G | A | - | S | PFA0095c | MAL1P1.10 | rifin |
MAL1 | 91104 | A | T | - | N | PFA0095c | MAL1P1.10 | rifin |
MAL1 | 93363 | T | A | - | N | PFA0100c | MAL1P1.11 | hypothetical protein, conserved in P. falciparum |
MAL1 | 93382 | T | G | - | N | PFA0100c | MAL1P1.11 | hypothetical protein, conserved in P. falciparum |
MAL1 | 93384 | G | A | - | N | PFA0100c | MAL1P1.11 | hypothetical protein, conserved in P. falciparum |
...
b_common
Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|
MAL1 | 91099 | G | A | - | SYN | PFA0095c | MAL1P1.10,RIF | rifin |
MAL1 | 91104 | A | T | - | NON | PFA0095c | MAL1P1.10,RIF | rifin |
MAL1 | 93363 | T | A | - | NON | PFA0100c | MAL1P1.11 | Plasmodium exported protein (PHISTa), unknown function |
MAL1 | 93382 | T | G | - | NON | PFA0100c | MAL1P1.11 | Plasmodium exported protein (PHISTa), unknown function |
MAL1 | 93384 | G | A | - | NON | PFA0100c | MAL1P1.11 | Plasmodium exported protein (PHISTa), unknown function |
...
The fromucsv()
function does not attempt to parse any of the
values from the underlying CSV file, so all values are reported as
strings:
b_common.display(vrepr=repr)
Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|
'MAL1' | '91099' | 'G' | 'A' | '-' | 'SYN' | 'PFA0095c' | 'MAL1P1.10,RIF' | 'rifin' |
'MAL1' | '91104' | 'A' | 'T' | '-' | 'NON' | 'PFA0095c' | 'MAL1P1.10,RIF' | 'rifin' |
'MAL1' | '93363' | 'T' | 'A' | '-' | 'NON' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |
'MAL1' | '93382' | 'T' | 'G' | '-' | 'NON' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |
'MAL1' | '93384' | 'G' | 'A' | '-' | 'NON' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |
...
However, the 'Pos' field should be interpreted as an integer.
Also, the 'Mut' field has a different representation in the two tables, which needs to be converted before the data can be compared:
a_common.valuecounts('Mut')
Mut | count | frequency |
---|---|---|
N | 71162 | 0.6865804123611875 |
S | 31535 | 0.30425386166507473 |
- | 950 | 0.009165725973737783 |
b_common.valuecounts('Mut')
Mut | count | frequency |
---|---|---|
NON | 70880 | 0.6840510336042 |
SYN | 32738 | 0.31594896639579995 |
Use the convert()
function to convert the type of the 'Pos'
field in both tables and the representation of the 'Mut' field in
table *b*:
a_conv = a_common.convert('Pos', int)
b_conv = (
b_common
.convert('Pos', int)
.convert('Mut', {'SYN': 'S', 'NON': 'N'})
)
highlight = 'background-color: yellow'
a_conv.display(caption='a', vrepr=repr, td_styles={'Pos': highlight})
Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|
'MAL1' | 91099 | 'G' | 'A' | '-' | 'S' | 'PFA0095c' | 'MAL1P1.10' | 'rifin' |
'MAL1' | 91104 | 'A' | 'T' | '-' | 'N' | 'PFA0095c' | 'MAL1P1.10' | 'rifin' |
'MAL1' | 93363 | 'T' | 'A' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'hypothetical protein, conserved in P. falciparum' |
'MAL1' | 93382 | 'T' | 'G' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'hypothetical protein, conserved in P. falciparum' |
'MAL1' | 93384 | 'G' | 'A' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'hypothetical protein, conserved in P. falciparum' |
...
b_conv.display(caption='b', vrepr=repr, td_styles={'Pos': highlight, 'Mut': highlight})
Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|
'MAL1' | 91099 | 'G' | 'A' | '-' | 'S' | 'PFA0095c' | 'MAL1P1.10,RIF' | 'rifin' |
'MAL1' | 91104 | 'A' | 'T' | '-' | 'N' | 'PFA0095c' | 'MAL1P1.10,RIF' | 'rifin' |
'MAL1' | 93363 | 'T' | 'A' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |
'MAL1' | 93382 | 'T' | 'G' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |
'MAL1' | 93384 | 'G' | 'A' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |
...
Now the tables are ready for comparison.
Because both tables should contain the same list of genome locations,
they should have the same number of rows. Use nrows()
to
compare:
a_conv.nrows()
103647
b_conv.nrows()
103618
There is some discrepancy. First investigate by comparing just the
genomic locations, defined by the 'Chr' and 'Pos' fields, using
complement()
:
a_locs = a_conv.cut('Chr', 'Pos')
b_locs = b_conv.cut('Chr', 'Pos')
locs_only_in_a = a_locs.complement(b_locs)
locs_only_in_a.nrows()
29
locs_only_in_a.displayall(caption='a only')
Chr | Pos |
---|---|
MAL1 | 216961 |
MAL10 | 538210 |
MAL10 | 548779 |
MAL10 | 1432969 |
MAL11 | 500289 |
MAL11 | 1119809 |
MAL11 | 1278859 |
MAL12 | 51827 |
MAL13 | 183727 |
MAL13 | 398404 |
MAL13 | 627342 |
MAL13 | 1216664 |
MAL13 | 2750149 |
MAL14 | 1991758 |
MAL14 | 2297918 |
MAL14 | 2372268 |
MAL14 | 2994810 |
MAL2 | 38577 |
MAL2 | 64017 |
MAL4 | 1094258 |
MAL5 | 1335335 |
MAL5 | 1338718 |
MAL7 | 670602 |
MAL7 | 690509 |
MAL8 | 489937 |
MAL9 | 416116 |
MAL9 | 868677 |
MAL9 | 1201970 |
MAL9 | 1475245 |
locs_only_in_b = b_locs.complement(a_locs)
locs_only_in_b.nrows()
0
So it appears that 29 locations are missing from table *b*. Export
these missing locations to a CSV file using toucsv()
:
locs_only_in_a.tocsv('missing_locations.csv')
An alternative method for finding rows in one table where some key
value is not present in another table is to use the antijoin()
function:
locs_only_in_a = a_conv.antijoin(b_conv, key=('Chr', 'Pos'))
locs_only_in_a.nrows()
29
We'd also like to compare the values given in the other fields, to find any discrepancies between the two tables.
The simplest way to find conflicts is to merge()
both tables under a given key:
ab_merge = etl.merge(a_conv, b_conv, key=('Chr', 'Pos'))
ab_merge.display(caption='ab_merge',
td_styles=lambda v: highlight if isinstance(v, etl.Conflict) else '')
Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|
MAL1 | 91099 | G | A | - | S | PFA0095c | Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'}) | rifin |
MAL1 | 91104 | A | T | - | N | PFA0095c | Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'}) | rifin |
MAL1 | 93363 | T | A | - | N | PFA0100c | MAL1P1.11 | Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'}) |
MAL1 | 93382 | T | G | - | N | PFA0100c | MAL1P1.11 | Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'}) |
MAL1 | 93384 | G | A | - | N | PFA0100c | MAL1P1.11 | Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'}) |
...
From a glance at the conflicts above, it appears there are discrepancies in the 'GeneAlias' and 'GeneDescr' fields. There may also be conflicts in other fields, so we need to investigate further.
Note that the table *ab_merge* will contain all rows, not only those containing conflicts. To find only conflicting rows, use cat()
then conflicts()
, e.g.:
ab = etl.cat(a_conv.addfield('source', 'a', index=0),
b_conv.addfield('source', 'b', index=0))
ab_conflicts = ab.conflicts(key=('Chr', 'Pos'), exclude='source')
ab_conflicts.display(10)
source | Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|---|
a | MAL1 | 91099 | G | A | - | S | PFA0095c | MAL1P1.10 | rifin |
b | MAL1 | 91099 | G | A | - | S | PFA0095c | MAL1P1.10,RIF | rifin |
a | MAL1 | 91104 | A | T | - | N | PFA0095c | MAL1P1.10 | rifin |
b | MAL1 | 91104 | A | T | - | N | PFA0095c | MAL1P1.10,RIF | rifin |
a | MAL1 | 93363 | T | A | - | N | PFA0100c | MAL1P1.11 | hypothetical protein, conserved in P. falciparum |
b | MAL1 | 93363 | T | A | - | N | PFA0100c | MAL1P1.11 | Plasmodium exported protein (PHISTa), unknown function |
a | MAL1 | 93382 | T | G | - | N | PFA0100c | MAL1P1.11 | hypothetical protein, conserved in P. falciparum |
b | MAL1 | 93382 | T | G | - | N | PFA0100c | MAL1P1.11 | Plasmodium exported protein (PHISTa), unknown function |
a | MAL1 | 93384 | G | A | - | N | PFA0100c | MAL1P1.11 | hypothetical protein, conserved in P. falciparum |
b | MAL1 | 93384 | G | A | - | N | PFA0100c | MAL1P1.11 | Plasmodium exported protein (PHISTa), unknown function |
...
Finally, let's find conflicts in a specific field:
ab_conflicts_mut = ab.conflicts(key=('Chr', 'Pos'), include='Mut')
ab_conflicts_mut.display(10, caption='Mut conflicts', td_styles={'Mut': highlight})
source | Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |
---|---|---|---|---|---|---|---|---|---|
a | MAL1 | 99099 | G | T | - | - | PFA0110w | MAL1P1.13,Pf155 | ring-infected erythrocyte surface antigen |
b | MAL1 | 99099 | G | T | - | N | PFA0110w | MAL1P1.13,Pf155,RESA | ring-infected erythrocyte surface antigen |
a | MAL1 | 99211 | C | T | - | - | PFA0110w | MAL1P1.13,Pf155 | ring-infected erythrocyte surface antigen |
b | MAL1 | 99211 | C | T | - | N | PFA0110w | MAL1P1.13,Pf155,RESA | ring-infected erythrocyte surface antigen |
a | MAL1 | 197903 | C | A | A | S | PFA0220w | MAL1P1.34b | ubiquitin carboxyl-terminal hydrolase, putative |
b | MAL1 | 197903 | C | A | A | N | PFA0220w | PFA0215w,MAL1P1.34b | ubiquitin carboxyl-terminal hydrolase, putative |
a | MAL1 | 384429 | C | T | - | N | PFA0485w | MAL1P2.26 | dolichol kinase |
b | MAL1 | 384429 | C | T | - | S | - | - | - |
a | MAL1 | 513268 | A | G | - | N | PFA0650w | MAL1P3.12,MAL1P3.12a,PFA0655w | surface-associated interspersed gene pseudogene, (SURFIN) pseudogene |
b | MAL1 | 513268 | A | G | - | S | PFA0650w | MAL1P3.12,PFA0655,MAL1P3.12a,3D7surf1.2,PFA0655w,MAL1P12a | surface-associated interspersed gene (SURFIN), pseudogene |
...
ab_conflicts_mut.nrows()
3592
For more information about the petl
package see the petl online documentation.