In [1]:

import sys
sys.version_info

Out[1]:

sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)

petl Case Study 1 - Comparing Tables

This case study illustrates the use of the petl package for doing some simple profiling and comparison of data from two tables.

Introduction

The files used in this case study can be downloaded from the following link:

http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip

Download and unzip the files:

In [2]:

!wget http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
!unzip -o petl-case-study-1-files.zip

--2015-01-19 17:37:39--  http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
Resolving aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)... 54.231.9.241
Connecting to aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)|54.231.9.241|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3076773 (2.9M) [application/zip]
Saving to: ‘petl-case-study-1-files.zip’

100%[======================================>] 3,076,773   2.34MB/s   in 1.3s   

2015-01-19 17:37:41 (2.34 MB/s) - ‘petl-case-study-1-files.zip’ saved [3076773/3076773]

Archive:  petl-case-study-1-files.zip
  inflating: popdata.csv             
  inflating: snpdata.csv

The first file (snpdata.csv) contains a list of locations in the genome of the malaria parasite P. falciparum, along with some basic data about genetic variations found at those locations.

The second file (popdata.csv) is supposed to contain the same list of genome locations, along with some additional data such as allele frequencies in different populations.

The key point is that the first file (snpdata.csv) holds the canonical list of genome locations, while the second file (popdata.csv) adds further data about the same locations and should therefore be consistent with the first. The goal of this case study is to check whether it actually is.

Preparing the data

Start by importing the petl package:

In [3]:

import petl as etl
etl.__version__

Out[3]:

'1.0.0'

To save some typing, let *a* be the table of data extracted from the first file (snpdata.csv), and let *b* be the table of data extracted from the second file (popdata.csv). Despite the .csv extension, the code below reads both files as tab-delimited text, using the fromtsv() function:

In [4]:

a = etl.fromtsv('snpdata.csv')
b = etl.fromtsv('popdata.csv')
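
As a rough illustration of what reading a tab-delimited file involves, here is a minimal sketch using only the standard library csv module. It is not how petl is implemented, and the sample text is a hand-typed subset of columns in the style of snpdata.csv. Every value comes back as a string, including 'Pos'.

```python
import csv
import io

# A few tab-separated lines in the style of snpdata.csv (subset of columns).
sample = "Chr\tPos\tRef\tNref\nMAL1\t91099\tG\tA\nMAL1\t91104\tA\tT\n"

# delimiter="\t" tells the reader to split on tabs rather than commas.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)
data = list(reader)

print(header)   # ['Chr', 'Pos', 'Ref', 'Nref']
print(data[0])  # ['MAL1', '91099', 'G', 'A']
```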

Examine the header from each file:

In [5]:

a.header()

Out[5]:

('Chr',
 'Pos',
 'Ref',
 'Nref',
 'Der',
 'Mut',
 'isTypable',
 'GeneId',
 'GeneAlias',
 'GeneDescr')

In [6]:

b.header()

Out[6]:

('Chromosome',
 'Coordinates',
 'Ref. Allele',
 'Non-Ref. Allele',
 'Outgroup Allele',
 'Ancestral Allele',
 'Derived Allele',
 'Ref. Aminoacid',
 'Non-Ref. Aminoacid',
 'Private Allele',
 'Private population',
 'maf AFR',
 'maf PNG',
 'maf SEA',
 'daf AFR',
 'daf PNG',
 'daf SEA',
 'nraf AFR',
 'nraf PNG',
 'nraf SEA',
 'Mutation type',
 'Gene',
 'Gene Aliases',
 'Gene Description',
 'Gene Information')

A common set of 9 fields is present in both tables, and we would like to focus on comparing them. However, the two files use different field names. To simplify comparison, use rename() to rename some fields in the second file:

In [7]:

b_renamed = b.rename({'Chromosome': 'Chr', 
                      'Coordinates': 'Pos', 
                      'Ref. Allele': 'Ref', 
                      'Non-Ref. Allele': 'Nref', 
                      'Derived Allele': 'Der', 
                      'Mutation type': 'Mut', 
                      'Gene': 'GeneId', 
                      'Gene Aliases': 'GeneAlias', 
                      'Gene Description': 'GeneDescr'})
b_renamed.header()

Out[7]:

('Chr',
 'Pos',
 'Ref',
 'Nref',
 'Outgroup Allele',
 'Ancestral Allele',
 'Der',
 'Ref. Aminoacid',
 'Non-Ref. Aminoacid',
 'Private Allele',
 'Private population',
 'maf AFR',
 'maf PNG',
 'maf SEA',
 'daf AFR',
 'daf PNG',
 'daf SEA',
 'nraf AFR',
 'nraf PNG',
 'nraf SEA',
 'Mut',
 'GeneId',
 'GeneAlias',
 'GeneDescr',
 'Gene Information')
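
Conceptually, renaming with a dictionary is a lookup on each header name that falls back to the original name. The sketch below shows this in plain Python. The helper rename_header is hypothetical and exists only for illustration; it is not part of petl.

```python
def rename_header(header, mapping):
    """Return a new header; names missing from `mapping` are left unchanged."""
    return tuple(mapping.get(name, name) for name in header)

# A shortened version of the popdata.csv header shown above.
header_b = ('Chromosome', 'Coordinates', 'Ref. Allele', 'Outgroup Allele')
mapping = {'Chromosome': 'Chr', 'Coordinates': 'Pos', 'Ref. Allele': 'Ref'}

renamed = rename_header(header_b, mapping)
print(renamed)  # ('Chr', 'Pos', 'Ref', 'Outgroup Allele')
```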

Use cut() to extract only the fields we're interested in from both tables:

In [8]:

common_fields = ['Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut', 'GeneId', 'GeneAlias', 'GeneDescr']
a_common = a.cut(common_fields)
b_common = b_renamed.cut(common_fields)
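
For readers unfamiliar with the idea, selecting columns by name can be sketched as follows. This assumes a table is a list of tuples whose first element is the header; cut_fields is a hypothetical helper, not the petl function.

```python
def cut_fields(table, fields):
    """Keep only the named columns, in the order requested."""
    header, *rows = table
    positions = [header.index(name) for name in fields]
    return [tuple(fields)] + [tuple(row[i] for i in positions) for row in rows]

# Two rows from the second table, using the renamed field names.
table_b = [
    ('Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut'),
    ('MAL1', '91099', 'G', 'A', '-', 'SYN'),
    ('MAL1', '91104', 'A', 'T', '-', 'NON'),
]

subset = cut_fields(table_b, ['Chr', 'Pos', 'Mut'])
for row in subset:
    print(row)
```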

Inspect the data:

In [9]:

a_common

Out[9]:

Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
MAL1	91099	G	A	-	S	PFA0095c	MAL1P1.10	rifin
MAL1	91104	A	T	-	N	PFA0095c	MAL1P1.10	rifin
MAL1	93363	T	A	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
MAL1	93382	T	G	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
MAL1	93384	G	A	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum

...

In [10]:

b_common

Out[10]:

Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
MAL1	91099	G	A	-	SYN	PFA0095c	MAL1P1.10,RIF	rifin
MAL1	91104	A	T	-	NON	PFA0095c	MAL1P1.10,RIF	rifin
MAL1	93363	T	A	-	NON	PFA0100c	MAL1P1.11	Plasmodium exported protein (PHISTa), unknown function
MAL1	93382	T	G	-	NON	PFA0100c	MAL1P1.11	Plasmodium exported protein (PHISTa), unknown function
MAL1	93384	G	A	-	NON	PFA0100c	MAL1P1.11	Plasmodium exported protein (PHISTa), unknown function

...

The fromtsv() function does not attempt to parse any of the values from the underlying file, so all values are reported as strings:

In [11]:

b_common.display(vrepr=repr)

Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
'MAL1'	'91099'	'G'	'A'	'-'	'SYN'	'PFA0095c'	'MAL1P1.10,RIF'	'rifin'
'MAL1'	'91104'	'A'	'T'	'-'	'NON'	'PFA0095c'	'MAL1P1.10,RIF'	'rifin'
'MAL1'	'93363'	'T'	'A'	'-'	'NON'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'
'MAL1'	'93382'	'T'	'G'	'-'	'NON'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'
'MAL1'	'93384'	'G'	'A'	'-'	'NON'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'

...

However, the 'Pos' field should be interpreted as an integer.

Also, the 'Mut' field has a different representation in the two tables, which needs to be converted before the data can be compared:

In [12]:

a_common.valuecounts('Mut')

Out[12]:

Mut	count	frequency
N	71162	0.6865804123611875
S	31535	0.30425386166507473
-	950	0.009165725973737783

In [13]:

b_common.valuecounts('Mut')

Out[13]:

Mut	count	frequency
NON	70880	0.6840510336042
SYN	32738	0.31594896639579995

Use the convert() function to convert the type of the 'Pos' field in both tables and the representation of the 'Mut' field in table *b*:

In [14]:

a_conv = a_common.convert('Pos', int)
b_conv = (
    b_common
    .convert('Pos', int)
    .convert('Mut', {'SYN': 'S', 'NON': 'N'})
)
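
To see why these conversions matter, the plain-Python sketch below applies the same two steps to a couple of rows. Leaving values unchanged when they are absent from the dictionary is a choice made in this sketch. The last lines show that positions compared as strings order differently from positions compared as integers.

```python
rows_b = [('MAL1', '91099', 'SYN'), ('MAL1', '91104', 'NON')]
mut_codes = {'SYN': 'S', 'NON': 'N'}

# Convert 'Pos' to int and translate 'Mut'; unknown codes pass through as-is.
converted = [
    (chrom, int(pos), mut_codes.get(mut, mut))
    for chrom, pos, mut in rows_b
]
print(converted)  # [('MAL1', 91099, 'S'), ('MAL1', 91104, 'N')]

# String comparison is lexicographic, so '216961' sorts before '91099'.
print('216961' < '91099')  # True
print(216961 < 91099)      # False
```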

In [15]:

highlight = 'background-color: yellow'
a_conv.display(caption='a', vrepr=repr, td_styles={'Pos': highlight})

a
Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
'MAL1'	91099	'G'	'A'	'-'	'S'	'PFA0095c'	'MAL1P1.10'	'rifin'
'MAL1'	91104	'A'	'T'	'-'	'N'	'PFA0095c'	'MAL1P1.10'	'rifin'
'MAL1'	93363	'T'	'A'	'-'	'N'	'PFA0100c'	'MAL1P1.11'	'hypothetical protein, conserved in P. falciparum'
'MAL1'	93382	'T'	'G'	'-'	'N'	'PFA0100c'	'MAL1P1.11'	'hypothetical protein, conserved in P. falciparum'
'MAL1'	93384	'G'	'A'	'-'	'N'	'PFA0100c'	'MAL1P1.11'	'hypothetical protein, conserved in P. falciparum'

...

In [16]:

b_conv.display(caption='b', vrepr=repr, td_styles={'Pos': highlight, 'Mut': highlight})

b
Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
'MAL1'	91099	'G'	'A'	'-'	'S'	'PFA0095c'	'MAL1P1.10,RIF'	'rifin'
'MAL1'	91104	'A'	'T'	'-'	'N'	'PFA0095c'	'MAL1P1.10,RIF'	'rifin'
'MAL1'	93363	'T'	'A'	'-'	'N'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'
'MAL1'	93382	'T'	'G'	'-'	'N'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'
'MAL1'	93384	'G'	'A'	'-'	'N'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'

...

Now the tables are ready for comparison.

Looking for missing or unexpected rows

Because both tables should contain the same list of genome locations, they should have the same number of rows. Use nrows() to compare:

In [17]:

a_conv.nrows()

Out[17]:

In [18]:

b_conv.nrows()

Out[18]:

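The row counts differ: the valuecounts() totals shown earlier already sum to 103,647 rows for *a* and 103,618 for *b*, a difference of 29. First investigate by comparing just the genomic locations, defined by the 'Chr' and 'Pos' fields, using complement():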

In [19]:

a_locs = a_conv.cut('Chr', 'Pos')
b_locs = b_conv.cut('Chr', 'Pos')
locs_only_in_a = a_locs.complement(b_locs)
locs_only_in_a.nrows()

Out[19]:

In [20]:

locs_only_in_a.displayall(caption='a only')

a only
Chr	Pos
MAL1	216961
MAL10	538210
MAL10	548779
MAL10	1432969
MAL11	500289
MAL11	1119809
MAL11	1278859
MAL12	51827
MAL13	183727
MAL13	398404
MAL13	627342
MAL13	1216664
MAL13	2750149
MAL14	1991758
MAL14	2297918
MAL14	2372268
MAL14	2994810
MAL2	38577
MAL2	64017
MAL4	1094258
MAL5	1335335
MAL5	1338718
MAL7	670602
MAL7	690509
MAL8	489937
MAL9	416116
MAL9	868677
MAL9	1201970
MAL9	1475245
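
The idea behind this comparison is a set difference on (Chr, Pos) pairs. Here is a minimal sketch on three locations taken from the tables above. It ignores duplicate rows, which the real complement() function may treat differently.

```python
# Locations as (Chr, Pos) tuples; (MAL1, 216961) appears only in table a.
a_locs = [('MAL1', 91099), ('MAL1', 91104), ('MAL1', 216961)]
b_locs = [('MAL1', 91099), ('MAL1', 91104)]

# Build sets once so each membership test is O(1).
a_set, b_set = set(a_locs), set(b_locs)

only_in_a = [loc for loc in a_locs if loc not in b_set]
only_in_b = [loc for loc in b_locs if loc not in a_set]

print(only_in_a)  # [('MAL1', 216961)]
print(only_in_b)  # []
```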

In [21]:

locs_only_in_b = b_locs.complement(a_locs)
locs_only_in_b.nrows()

Out[21]:

So it appears that 29 locations are missing from table *b*. Export these missing locations to a CSV file using tocsv():

In [22]:

locs_only_in_a.tocsv('missing_locations.csv')
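
For comparison, writing the same kind of table as comma-separated text with the standard library looks roughly like this. The sketch writes to an in-memory buffer rather than to missing_locations.csv, and uses the first two locations from the list above.

```python
import csv
import io

missing = [('Chr', 'Pos'), ('MAL1', 216961), ('MAL10', 538210)]

# Write header and rows as comma-separated values into a text buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(missing)

lines = buffer.getvalue().splitlines()
print(lines)  # ['Chr,Pos', 'MAL1,216961', 'MAL10,538210']
```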

An alternative method for finding rows in one table where some key value is not present in another table is to use the antijoin() function:

In [23]:

locs_only_in_a = a_conv.antijoin(b_conv, key=('Chr', 'Pos'))
locs_only_in_a.nrows()

Out[23]:
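
Unlike the location-only comparison, an anti-join keeps every field of the unmatched rows. The sketch below illustrates the idea on rows represented as dictionaries. The function antijoin here is a hypothetical plain-Python stand-in, and the 'Ref' value of the second row is a placeholder, not real data.

```python
def antijoin(left, right, key):
    """Rows of `left` whose key tuple does not occur in `right`."""
    right_keys = {tuple(row[k] for k in key) for row in right}
    return [row for row in left if tuple(row[k] for k in key) not in right_keys]

a_rows = [
    {'Chr': 'MAL1', 'Pos': 91099, 'Ref': 'G'},
    {'Chr': 'MAL1', 'Pos': 216961, 'Ref': '?'},  # placeholder 'Ref' value
]
b_rows = [
    {'Chr': 'MAL1', 'Pos': 91099, 'Ref': 'G'},
]

unmatched = antijoin(a_rows, b_rows, key=('Chr', 'Pos'))
print(unmatched)  # [{'Chr': 'MAL1', 'Pos': 216961, 'Ref': '?'}]
```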

Finding conflicts

We'd also like to compare the values given in the other fields, to find any discrepancies between the two tables.

The simplest way to find conflicts is to merge() both tables under a given key:

In [24]:

ab_merge = etl.merge(a_conv, b_conv, key=('Chr', 'Pos'))
ab_merge.display(caption='ab_merge', 
                 td_styles=lambda v: highlight if isinstance(v, etl.Conflict) else '')

ab_merge
Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
MAL1	91099	G	A	-	S	PFA0095c	Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'})	rifin
MAL1	91104	A	T	-	N	PFA0095c	Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'})	rifin
MAL1	93363	T	A	-	N	PFA0100c	MAL1P1.11	Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
MAL1	93382	T	G	-	N	PFA0100c	MAL1P1.11	Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
MAL1	93384	G	A	-	N	PFA0100c	MAL1P1.11	Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})

...
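
The merge shown above can be thought of as grouping rows by key and then collapsing each field: one distinct value is kept as-is, and several distinct values are flagged. The following sketch mimics that idea with a frozenset as the conflict marker. It is an illustration under simplifying assumptions (all rows share the same fields, no missing values), not petl's implementation.

```python
from collections import defaultdict

def merge_by_key(rows, key):
    """Collapse rows sharing a key; differing values become a frozenset."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    merged = []
    for group in groups.values():
        out = {}
        for field in group[0]:
            values = {row[field] for row in group}
            out[field] = values.pop() if len(values) == 1 else frozenset(values)
        merged.append(out)
    return merged

rows = [
    {'Chr': 'MAL1', 'Pos': 91099, 'GeneAlias': 'MAL1P1.10', 'GeneDescr': 'rifin'},
    {'Chr': 'MAL1', 'Pos': 91099, 'GeneAlias': 'MAL1P1.10,RIF', 'GeneDescr': 'rifin'},
]

merged = merge_by_key(rows, key=('Chr', 'Pos'))
print(merged[0]['GeneDescr'])          # rifin
print(sorted(merged[0]['GeneAlias']))  # ['MAL1P1.10', 'MAL1P1.10,RIF']
```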

From a glance at the conflicts above, it appears there are discrepancies in the 'GeneAlias' and 'GeneDescr' fields. There may also be conflicts in other fields, so we need to investigate further.

Note that the table *ab_merge* will contain all rows, not only those containing conflicts. To find only conflicting rows, use cat() then conflicts(), e.g.:

In [25]:

ab = etl.cat(a_conv.addfield('source', 'a', index=0), 
             b_conv.addfield('source', 'b', index=0))
ab_conflicts = ab.conflicts(key=('Chr', 'Pos'), exclude='source')
ab_conflicts.display(10)

source	Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
a	MAL1	91099	G	A	-	S	PFA0095c	MAL1P1.10	rifin
b	MAL1	91099	G	A	-	S	PFA0095c	MAL1P1.10,RIF	rifin
a	MAL1	91104	A	T	-	N	PFA0095c	MAL1P1.10	rifin
b	MAL1	91104	A	T	-	N	PFA0095c	MAL1P1.10,RIF	rifin
a	MAL1	93363	T	A	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
b	MAL1	93363	T	A	-	N	PFA0100c	MAL1P1.11	Plasmodium exported protein (PHISTa), unknown function
a	MAL1	93382	T	G	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
b	MAL1	93382	T	G	-	N	PFA0100c	MAL1P1.11	Plasmodium exported protein (PHISTa), unknown function
a	MAL1	93384	G	A	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
b	MAL1	93384	G	A	-	N	PFA0100c	MAL1P1.11	Plasmodium exported protein (PHISTa), unknown function

...

Finally, let's find conflicts in a specific field:

In [26]:

ab_conflicts_mut = ab.conflicts(key=('Chr', 'Pos'), include='Mut')
ab_conflicts_mut.display(10, caption='Mut conflicts', td_styles={'Mut': highlight})

Mut conflicts
source	Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
a	MAL1	99099	G	T	-	-	PFA0110w	MAL1P1.13,Pf155	ring-infected erythrocyte surface antigen
b	MAL1	99099	G	T	-	N	PFA0110w	MAL1P1.13,Pf155,RESA	ring-infected erythrocyte surface antigen
a	MAL1	99211	C	T	-	-	PFA0110w	MAL1P1.13,Pf155	ring-infected erythrocyte surface antigen
b	MAL1	99211	C	T	-	N	PFA0110w	MAL1P1.13,Pf155,RESA	ring-infected erythrocyte surface antigen
a	MAL1	197903	C	A	A	S	PFA0220w	MAL1P1.34b	ubiquitin carboxyl-terminal hydrolase, putative
b	MAL1	197903	C	A	A	N	PFA0220w	PFA0215w,MAL1P1.34b	ubiquitin carboxyl-terminal hydrolase, putative
a	MAL1	384429	C	T	-	N	PFA0485w	MAL1P2.26	dolichol kinase
b	MAL1	384429	C	T	-	S	-	-	-
a	MAL1	513268	A	G	-	N	PFA0650w	MAL1P3.12,MAL1P3.12a,PFA0655w	surface-associated interspersed gene pseudogene, (SURFIN) pseudogene
b	MAL1	513268	A	G	-	S	PFA0650w	MAL1P3.12,PFA0655,MAL1P3.12a,3D7surf1.2,PFA0655w,MAL1P12a	surface-associated interspersed gene (SURFIN), pseudogene

...
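
The effect of the include and exclude arguments can be sketched in plain Python: group rows by key, choose which fields to compare, and keep a group only if some compared field has more than one distinct value. The function below is a hypothetical illustration on four rows taken from the outputs above, not the petl implementation.

```python
from collections import defaultdict

def find_conflicts(rows, key, include=None, exclude=()):
    """Return rows whose key group disagrees on at least one compared field."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    result = []
    for group in groups.values():
        if include is not None:
            fields = include
        else:
            fields = [f for f in group[0] if f not in key and f not in exclude]
        if any(len({row[f] for row in group}) > 1 for f in fields):
            result.extend(group)
    return result

rows = [
    {'source': 'a', 'Chr': 'MAL1', 'Pos': 91099, 'Mut': 'S', 'GeneAlias': 'MAL1P1.10'},
    {'source': 'b', 'Chr': 'MAL1', 'Pos': 91099, 'Mut': 'S', 'GeneAlias': 'MAL1P1.10,RIF'},
    {'source': 'a', 'Chr': 'MAL1', 'Pos': 99099, 'Mut': '-', 'GeneAlias': 'MAL1P1.13,Pf155'},
    {'source': 'b', 'Chr': 'MAL1', 'Pos': 99099, 'Mut': 'N', 'GeneAlias': 'MAL1P1.13,Pf155,RESA'},
]

any_conflict = find_conflicts(rows, key=('Chr', 'Pos'), exclude=('source',))
mut_conflict = find_conflicts(rows, key=('Chr', 'Pos'), include=['Mut'])

print(len(any_conflict))                 # 4
print([r['Pos'] for r in mut_conflict])  # [99099, 99099]
```

Excluding 'source' matters here: every pair of rows differs in that field by construction, so comparing it would flag every location as a conflict.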

In [27]:

ab_conflicts_mut.nrows()

Out[27]:

For more information about the petl package see the petl online documentation.