Introduction to pyalign¶

Bernhard Liebl
Computational Humanities Group
Leipzig University

In the following sections, this notebook will introduce you to the core concepts of pyalign. This will enable you to

correctly and efficiently formulate specific alignment problems
choose the right solver for solving them
take care of runtime performance considerations to obtain fast results

Setup¶

In [1]:

import bokeh.io
bokeh.io.output_notebook()

BokehJS 2.3.2 successfully loaded.

Problems, Solvers and Solutions¶

To compute an optimal alignment with pyalign we first need to state an alignment problem. The most common and efficient way to state an alignment problem in pyalign consists of three parts:

An alphabet $\Omega$ of all elements that may occur in a sequence
Two sequences $s, t$ from $\Omega$ [^1] that are going to be aligned
A function $w(x, y)$ which gives similarity (or distance) between $x$ and $y$

[^1] $s=s_1,...,s_n$ and $t=t_1,...,t_m$ , with $\forall i, s_i \in \Omega$ and $\forall j, t_j \in \Omega$

Let us first define $w(x, y)$ as the following binary measure of similarity:

$w(x, y) = \begin{cases} 1 && \text{if} \quad x = y\\ -1 && else \end{cases}$

Through pyalign's problems.alphabetic, the following code cell defines the set of problems that can be posed over the alphabet $\{A, G, C, T\}$ and using $w(x, y)$ as a measure of similarity as just defined. Since the values from $w(x, y)$ are similarities and higher values mean better alignments, we configure direction to maximize in order to find optimal alignments. We write $w(x, y)$ as a lambda.

In [2]:

import pyalign.problems

problem_set = pyalign.problems.alphabetic(
    "AGCT",
    lambda x, y: 1 if x == y else -1,
    direction="maximize")

Using problem_set, we can now pose a problem for two specific sequences with all the conditions just defined above:

In [3]:

problem = problem_set.new_problem("AATCG", "AACG")

We now create a solver to solve the problem just posed. We choose a LocalSolver, which means we want to obtain an optimal local alignment. In later sections, we will see that pyalign also offers analog similar classes for global and semiglobal alignment problems as well as elastic - dynamic time warping - problems.

Later, we will talk about how to specify a specific sort of gap cost here - i.e. the cost incurred when skipping parts of the aligned sequences - but for now we will not specify any gap cost - i.e. we do not incur any such cost.

⚠	For performance reasons, try to avoid recreating Solver objects, and instead try to reuse them for as many Problem instances as possible.

In [4]:

import pyalign.solve

solver = pyalign.solve.LocalSolver()

To solve the problem, we run solve. We obtain a Solution object that contains the optimal alignment.

In [5]:

solution = solver.solve(problem)
type(solution)

?? a (1, 5) (5, 4)

Out[5]:

pyalign.solve.Solution

Here is the optimal alignment associated with the obtained solution:

In [6]:

solution.alignment

Out[6]:

A	A	T	C	G
\|	\|		\|	\|
A	A		C	G

And here is the score that is achieved with this optimal alignment:

In [7]:

solution.score

Out[7]:

4.0

The Solution object contains various other data that allows us to reason about why this solution was obtained. By displaying it in a Jupyter notebook, a Solution object gives the obtained value matrix that has been annotated with traceback paths (see arrows between cells) and the actually obtained path (see shaded cells).

In [8]:

solution

Problems and Alphabets¶

Arbitrary Alphabets¶

One important feature of pyalign is that it does not rely on letters or characters as elements of the sequences that are going to be aligned, but works with literally any (comparable) Python object. The latter only needs to know about equality and hash values - i.e. it needs to have __eq__ and __hash__ methods.

This means that sequences are not restricted to consist of elements that are represented in a fixed encoding with a very limited maximum size, such as 7-bit ASCII.

The following section illustrates this by generating sequences of graphical shape objects that we then align to each other. We use the shapyter utility library to create shapes. Each shape you see in this section is a graphical representation of an shapyter.Shape instance and its attributes - namely form and color.

shapyter allows us to create shape instances that are rendered in Jupyter:

In [9]:

#import importlib
#importlib.reload(shapyter)

import shapyter
shapyter.Triangle()

Out[9]:

Shapes are differentiated by their form and color:

In [10]:

shapyter.List([shapyter.Triangle("yellow"), shapyter.Circle("#F010F0")])

Out[10]:

[ , ]

shapyter also comes with Shapifier that maps letters to shapes. As stated above, our goal here is to stop working with letters and instead use general object instances. The mapping we obtain from one Shapifier is consistent, e.g. "A" is always mapped to a reddish circle.

In [11]:

shapifier = shapyter.Shapifier(30)
shapifier("AGG")

Out[11]:

[ , , ]

We can map arbitrary letter sequences to shapes consistently:

In [12]:

shapifier("AATCG")

Out[12]:

[ , , , , ]

In [13]:

shapifier("AACG")

Out[13]:

[ , , , ]

Remember that - independent of the rendering and the animation effects above - we are dealing with lists of generic object instances on a Python level. The list of shapes above therefore is really the following list of (comparable) instances:

In [14]:

repr(shapifier("AACG"))

Out[14]:

'[Circle("#F8C3B6"), Circle("#F8C3B6"), Triangle("#86AED1"), Square("#F8C3B6")]'

We are now ready to demonstrate that pyalign really does not care about the element type we pass it. Instead of strings - i.e. sequences of letters - we pass in lists of Shape objects.

In [15]:

shape_problem_set = pyalign.problems.alphabetic(
    shapifier("AGCT"),
    lambda x, y: 1 if x == y else -1,
    direction="maximize")

shape_problem = shape_problem_set.new_problem(
    shapifier("AATCG"), shapifier("AACG"))

solver.solve(shape_problem).alignment

Out[15]:


\|	\|	\|	\|

Large Alphabets¶

In [23]:

import shapyter
import importlib
importlib.reload(shapyter)

Out[23]:

<module 'shapyter' from '/Users/arbeit/Projects/pyalign-demo/shapyter/__init__.py'>

In some use cases, the size of the alphabet is too large to specify. For these cases pyalign offers a different way to formulate problems without specifying an explicit alphabet.

For example, let us consider alignments over sequences whose elements are sets $S_1, ..., S_k$ and the similarity measure between two element sets is the Jaccard similarity. If the sets are drawn from $n$ elements, i.e. $S_i \in 2^{\{1, ..., n\}}$ , then the alphabet we draw from has $2^ n$ elements, which is infeasible to specify for larger $n$ . For these cases, we use pyalign.problems.general to specify the problem without an explicit alphabet.

⚠	If you can specify a small alphabet, problems created through `alphabetic` are solved faster than those specified through `general`. Therefore always check that your problem ist not suitable for `alphabetic` first.

Let us illustrate the example from above with $n=9$ . We visualize sets as a matrix of of indices, where black squares indicate that the $i$ -th element is in the set.

In [24]:

sets = shapyter.List([
    shapyter.Set(100, 9),
    shapyter.Set(101, 9),
    shapyter.Set(50, 9),
    shapyter.Set(211, 9)
])
sets

Out[24]:

[ , , , ]

We can compute the Jaccard similarity between these sets:

In [25]:

sets[0].jaccard(sets[1])

Out[25]:

0.3333333333333333

In [26]:

sets[1].jaccard(sets[2])

Out[26]:

0.1111111111111111

Using these sets as elements, we can now use general to pose an alignment problem without an alphabet:

In [27]:

jaccard_problem_set = pyalign.problems.general(
    lambda x, y: x.jaccard(y),
    direction="maximize")

In [28]:

jaccard_problem = jaccard_problem_set.new_problem(
    [sets[i] for i in (0, 2, 1, 0, 2)],
    [sets[i] for i in (1, 1, 2, 2)],
)

In [29]:

solver.solve(jaccard_problem).alignment

Out[29]:


\|	\|	\|	\|

Gaps and Locality¶

Gaps and Gap Costs¶

pyalign offers various configurable standard models of gap costs as well as custom gap costs driven by user functions.

⚠

`pyalign` chooses algorithms automatically. You just pass the desired gap cost function to the `Solver` and `pyalign` will internally pick the optimal implemented algorithm to use. For example, if you pick affine gap costs, you will get Gotoh's algorithm and O(n^2) runtime. If you configure logarithmic gap cost, `pyalign` will internally use a O(n^3) solver.

In [18]:

import pyalign.solve
import pyalign.gaps

solver_g = pyalign.solve.LocalSolver(
    gap_cost=pyalign.gaps.AffineGapCost(1, 2))

solver_g.solve(problem).alignment

Out[18]:

A	A	T			C	G
					\|	\|
			A	A	C	G

Here is a plot of the configured affine gap cost in terms of gap length:

In [19]:

solver_g.gap_cost

Here we configure a constant gap cost independent of length $k$ , given $k \ge 1$ :

In [20]:

pyalign.gaps.ConstantGapCost(0.3)

As a final example here is how to configure a logarithmic gap cost:

In [21]:

pyalign.gaps.LogarithmicGapCost(0.1, 0.3)

Local, Global and Semiglobal Solvers¶

In [9]:

import ipywidgets as widgets
from typing import Iterator

def render_solvers(problem):
    solvers = (
        ("global", pyalign.solve.GlobalSolver),
        ("local", pyalign.solve.LocalSolver),
        ("semiglobal", pyalign.solve.SemiglobalSolver),
        ("elastic", pyalign.solve.ElasticSolver)
    )

    items = []

    for solver_name, construct_solver in solvers:
        solver = construct_solver(
            gap_cost=pyalign.gaps.LinearGapCost(2),
            codomain=Iterator[pyalign.solve.Solution])
        
        items.append(widgets.Label(solver_name))

        for i, solution in enumerate(solver.solve(problem)):
            if i > 0:
                items.append(widgets.Label(""))
            items.append(widgets.HTML(solution.alignment._repr_html_()))
            
    return widgets.GridBox(
        items, layout=widgets.Layout(grid_template_columns="repeat(2, 200px)"))
    
render_solvers(problem_set.new_problem(
    "AATCG",
    "AACG"
))

GridBox(children=(Label(value='global'), HTML(value='\n\t\t\t<table style="border-collapse: collapse; border-s…

Speeding up Computations¶

Codomain, Solutions, Alignments and Scores¶

"only compute what you want to know"

In terms of the constants involved in a $O(n^2)$ or $O(n^3)$ runtime for solving an alignment problem, computing full Solution objects - consisting of the value matrix, traceback information and so on - is more expensive than only computing a single score - even though both share the same quadratic or cubic runtime in $n$ . If you solve many problems with small $n$ these these two tasks can show a considerable difference in runtime.

For common use cases, where you do not actually use the full solution data (e.g. the full traceback matrix), but only the alignment or even only the score, pyalign offers you to restrict the data type you want the Solver to return as result. This type is called codomain in pyalign, since it defines the codomain of the Solver's solve() method.

⚠	pyalign will internally create an optimized solver that calls optimized C++ code for the specific codomain you request. For example, if you request to compute only scores, pyalign will not compute any traceback information in the first place.

Starting with a generic LocalSolver, let us create a new Solver that will not generate full Solution objects, but instead return Alignment instances. We can achieve this by calling to_codomain with the desired target codomain.

In [9]:

solver_1 = pyalign.solve.LocalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(2))

In [10]:

solver_2 = solver_1.to_codomain(pyalign.solve.Alignment)
solver_2.solve(problem)

Out[10]:

A	A	T			C	G
					\|	\|
			A	A	C	G

Similarly, we can create a Solver that only returns scores.

In [11]:

solver_3 = solver_1.to_codomain(pyalign.solve.Score)
solver_3.solve(problem)

Out[11]:

2.0

To illustrate the difference in runtimes, here is a quick benchmark.

Another important facette of codomain control is setting the number of returned results. By default, only one optimal solution or alignment gets computed.

However, using codomain specificationas, you can instruct Solver to return all optimal scores, alignments or solutions. To enable this, set the codomain to an Iterator of those types. With the returned Iterator it is then possible to iterate over all optimal solutions of the problem.

In [12]:

from typing import Iterator

solver_4 = solver_1.to_codomain(Iterator[pyalign.solve.Alignment])
[x.score for x in solver_4.solve(problem)]

Out[12]:

[2.0, 2.0]

It is also possible to obtain Lists of items. Lists might be easier to work with than Iterators. Iterators on the other hand create the results lazily and on demand. This means Iterators consume only $O(1)$ memory, whereas Lists might contain $O(n^2)$ elements for some problems.

In [13]:

from typing import List

solver_4 = solver_1.to_codomain(List[pyalign.solve.Alignment])
solver_4.solve(problem)

Out[13]:

[<pyalign.solve.Alignment at 0x7f89a7a79550>,
 <pyalign.solve.Alignment at 0x7f89a7a79730>]

Batches¶

"compute x for the price of one"

Computation of alignments can be sped up with data level parallelism in modern CPUs (i.e. SIMD instructions). In order to make use of these, problems need to be given to pyalign in lists, i.e. not one at a time.

⚠	Batches incur some overhead and the speed-up is more noticeable with longer sequences and batches that contain sequences of similar length.

In [62]:

import pyalign
import random

def generate_problems(n, a=5, b=15):
    '''
    generate n problems, each with random sequences
    of length between a and b
    '''
    
    alphabet = "ABCD"
    
    def gen_seq(k):
        return "".join(random.choices(alphabet, k=k))

    problem_set = pyalign.problems.alphabetic(
        alphabet,
        lambda x, y: 1 if x == y else -1,
        direction="maximize")

    problems = []
    for _ in range(n):
        problems.append(problem_set.new_problem(
            gen_seq(random.randint(a, b)),
            gen_seq(random.randint(a, b))))
    
    return problems

We generate a list of problems and a solver that returns scores:

In [ ]:

some_problems = generate_problems(10)
score_solver = pyalign.solve.LocalSolver(codomain=pyalign.solve.Score)

Solving the problems through single calls to solve vs. using a single batched call returns the same results:

In [63]:

[score_solver.solve(p) for p in some_problems]

Out[63]:

[4.0, 6.0, 2.0, 4.0, 3.0, 3.0, 8.0, 2.0, 4.0, 6.0]

In [64]:

score_solver.solve(some_problems)

Out[64]:

[4.0, 6.0, 2.0, 4.0, 3.0, 3.0, 8.0, 2.0, 4.0, 6.0]

As demonstrated below, the runtimes are different though:

In [65]:

import time

def benchmark(f, title, n=1000):
    t0 = time.perf_counter_ns()
    for _ in range(n):
        r = f()
    t1 = time.perf_counter_ns()
    print(f"{title}: time per solve: {(t1 - t0) / (n * 1000):.1f} μs")

In [66]:

benchmark(lambda: [score_solver.solve(p) for p in some_problems], "no batch")

no batch: time per solve: 1092.9 μs

In [67]:

benchmark(lambda: score_solver.solve(some_problems), "as batch")

as batch: time per solve: 366.9 μs