#grayscale");filter:gray;-webkit-filter:grayscale(100%);}.bk-root .bk-logo-small{width:20px;height:20px;background-image:url();}.bk-root .bk-logo-notebook{display:inline-block;vertical-align:middle;margin-right:5px;}.rendered_html .bk-root .bk-tooltip table,.rendered_html .bk-root .bk-tooltip tr,.rendered_html .bk-root .bk-tooltip th,.rendered_html .bk-root .bk-tooltip td{border:none;padding:1px;}
Bernhard Liebl
Computational Humanities Group
Leipzig University
In the following sections, this notebook will introduce you to the core concepts of pyalign
. This will enable you to
To compute an optimal alignment with pyalign we first need to state an alignment problem. The most common and efficient way to state an alignment problem in pyalign consists of three parts:
[^1] s=s1,...,sn and t=t1,...,tm, with ∀i,si∈Ω and ∀j,tj∈Ω
Let us first define w(x,y) as the following binary measure of similarity:
w(x,y)={1ifx=y−1elseThrough pyalign's problems.alphabetic
, the following code cell defines the set of problems that can be posed over the alphabet {A,G,C,T} and using w(x,y) as a measure of similarity as just defined. Since the values from w(x,y) are similarities and higher values mean better alignments, we configure direction
to maximize in order to find optimal alignments. We write w(x,y) as a lambda.
import pyalign.problems
problem_set = pyalign.problems.alphabetic(
"AGCT",
lambda x, y: 1 if x == y else -1,
direction="maximize")
Using problem_set
, we can now pose a problem for two specific sequences with all the conditions just defined above:
problem = problem_set.new_problem("AATCG", "AACG")
We now create a solver to solve the problem just posed. We choose a LocalSolver
, which means we want to obtain an optimal local alignment. In later sections, we will see that pyalign
also offers analog similar classes for global and semiglobal alignment problems as well as elastic - dynamic time warping - problems.
Later, we will talk about how to specify a specific sort of gap cost here - i.e. the cost incurred when skipping parts of the aligned sequences - but for now we will not specify any gap cost - i.e. we do not incur any such cost.
⚠ | For performance reasons, try to avoid recreating Solver objects, and instead try to reuse them for as many Problem instances as possible. |
import pyalign.solve
solver = pyalign.solve.LocalSolver()
To solve the problem, we run solve
. We obtain a Solution
object that contains the optimal alignment.
solution = solver.solve(problem)
type(solution)
?? a (1, 5) (5, 4)
pyalign.solve.Solution
Here is the optimal alignment associated with the obtained solution:
solution.alignment
A | A | T | C | G |
| | | | | | | | |
A | A | C | G |
And here is the score that is achieved with this optimal alignment:
solution.score
4.0
The Solution
object contains various other data that allows us to reason about why this solution was obtained. By displaying it in a Jupyter notebook, a Solution
object gives the obtained value matrix that has been annotated with traceback paths (see arrows between cells) and the actually obtained path (see shaded cells).
solution
One important feature of pyalign
is that it does not rely on letters or characters as elements of the sequences that are going to be aligned, but works with literally any (comparable) Python object. The latter only needs to know about equality and hash values - i.e. it needs to have __eq__
and __hash__
methods.
This means that sequences are not restricted to consist of elements that are represented in a fixed encoding with a very limited maximum size, such as 7-bit ASCII.
The following section illustrates this by generating sequences of graphical shape objects that we then align to each other. We use the shapyter
utility library to create shapes. Each shape you see in this section is a graphical representation of an shapyter.Shape
instance and its attributes - namely form and color.
shapyter
allows us to create shape instances that are rendered in Jupyter:
#import importlib
#importlib.reload(shapyter)
import shapyter
shapyter.Triangle()
Shapes are differentiated by their form and color:
shapyter.List([shapyter.Triangle("yellow"), shapyter.Circle("#F010F0")])
shapyter
also comes with Shapifier
that maps letters to shapes. As stated above, our goal here is to stop working with letters and instead use general object instances. The mapping we obtain from one Shapifier
is consistent, e.g. "A" is always mapped to a reddish circle.
shapifier = shapyter.Shapifier(30)
shapifier("AGG")
We can map arbitrary letter sequences to shapes consistently:
shapifier("AATCG")
shapifier("AACG")
Remember that - independent of the rendering and the animation effects above - we are dealing with lists of generic object instances on a Python level. The list of shapes above therefore is really the following list of (comparable) instances:
repr(shapifier("AACG"))
'[Circle("#F8C3B6"), Circle("#F8C3B6"), Triangle("#86AED1"), Square("#F8C3B6")]'
We are now ready to demonstrate that pyalign
really does not care about the element type we pass it. Instead of strings - i.e. sequences of letters - we pass in lists of Shape
objects.
shape_problem_set = pyalign.problems.alphabetic(
shapifier("AGCT"),
lambda x, y: 1 if x == y else -1,
direction="maximize")
shape_problem = shape_problem_set.new_problem(
shapifier("AATCG"), shapifier("AACG"))
solver.solve(shape_problem).alignment
| | | | | | | | |
import shapyter
import importlib
importlib.reload(shapyter)
<module 'shapyter' from '/Users/arbeit/Projects/pyalign-demo/shapyter/__init__.py'>
In some use cases, the size of the alphabet is too large to specify. For these cases pyalign
offers a different way to formulate problems without specifying an explicit alphabet.
For example, let us consider alignments over sequences whose elements are sets S1,...,Sk and the similarity measure between two element sets is the Jaccard similarity. If the sets are drawn from n elements, i.e. Si∈2{1,...,n}, then the alphabet we draw from has 2n elements, which is infeasible to specify for larger n. For these cases, we use pyalign.problems.general
to specify the problem without an explicit alphabet.
⚠ | If you can specify a small alphabet, problems created through `alphabetic` are solved faster than those specified through `general`. Therefore always check that your problem ist not suitable for `alphabetic` first. |
Let us illustrate the example from above with n=9. We visualize sets as a matrix of of indices, where black squares indicate that the i-th element is in the set.
sets = shapyter.List([
shapyter.Set(100, 9),
shapyter.Set(101, 9),
shapyter.Set(50, 9),
shapyter.Set(211, 9)
])
sets
We can compute the Jaccard similarity between these sets:
sets[0].jaccard(sets[1])
0.3333333333333333
sets[1].jaccard(sets[2])
0.1111111111111111
Using these sets as elements, we can now use general
to pose an alignment problem without an alphabet:
jaccard_problem_set = pyalign.problems.general(
lambda x, y: x.jaccard(y),
direction="maximize")
jaccard_problem = jaccard_problem_set.new_problem(
[sets[i] for i in (0, 2, 1, 0, 2)],
[sets[i] for i in (1, 1, 2, 2)],
)
solver.solve(jaccard_problem).alignment
| | | | | | | | |
pyalign
offers various configurable standard models of gap costs as well as custom gap costs driven by user functions.
⚠ | `pyalign` chooses algorithms automatically. You just pass the desired gap cost function to the `Solver` and `pyalign` will internally pick the optimal implemented algorithm to use. For example, if you pick affine gap costs, you will get Gotoh's algorithm and O(n^2) runtime. If you configure logarithmic gap cost, `pyalign` will internally use a O(n^3) solver. |
import pyalign.solve
import pyalign.gaps
solver_g = pyalign.solve.LocalSolver(
gap_cost=pyalign.gaps.AffineGapCost(1, 2))
solver_g.solve(problem).alignment
A | A | T | C | G | ||
| | | | |||||
A | A | C | G |
Here is a plot of the configured affine gap cost in terms of gap length:
solver_g.gap_cost
Here we configure a constant gap cost independent of length k, given k≥1:
pyalign.gaps.ConstantGapCost(0.3)
As a final example here is how to configure a logarithmic gap cost:
pyalign.gaps.LogarithmicGapCost(0.1, 0.3)
import ipywidgets as widgets
from typing import Iterator
def render_solvers(problem):
solvers = (
("global", pyalign.solve.GlobalSolver),
("local", pyalign.solve.LocalSolver),
("semiglobal", pyalign.solve.SemiglobalSolver),
("elastic", pyalign.solve.ElasticSolver)
)
items = []
for solver_name, construct_solver in solvers:
solver = construct_solver(
gap_cost=pyalign.gaps.LinearGapCost(2),
codomain=Iterator[pyalign.solve.Solution])
items.append(widgets.Label(solver_name))
for i, solution in enumerate(solver.solve(problem)):
if i > 0:
items.append(widgets.Label(""))
items.append(widgets.HTML(solution.alignment._repr_html_()))
return widgets.GridBox(
items, layout=widgets.Layout(grid_template_columns="repeat(2, 200px)"))
render_solvers(problem_set.new_problem(
"AATCG",
"AACG"
))
GridBox(children=(Label(value='global'), HTML(value='\n\t\t\t<table style="border-collapse: collapse; border-s…
"only compute what you want to know"
In terms of the constants involved in a O(n2) or O(n3) runtime for solving an alignment problem, computing full Solution
objects - consisting of the value matrix, traceback information and so on - is more expensive than only computing a single score - even though both share the same quadratic or cubic runtime in n. If you solve many problems with small n these these two tasks can show a considerable difference in runtime.
For common use cases, where you do not actually use the full solution data (e.g. the full traceback matrix), but only the alignment or even only the score, pyalign
offers you to restrict the data type you want the Solver
to return as result. This type is called codomain in pyalign
, since it defines the codomain of the Solver
's solve()
method.
⚠ | pyalign will internally create an optimized solver that calls optimized C++ code for the specific codomain you request. For example, if you request to compute only scores, pyalign will not compute any traceback information in the first place. |
Starting with a generic LocalSolver
, let us create a new Solver
that will not generate full Solution
objects, but instead return Alignment
instances. We can achieve this by calling to_codomain
with the desired target codomain.
solver_1 = pyalign.solve.LocalSolver(
gap_cost=pyalign.gaps.LinearGapCost(2))
solver_2 = solver_1.to_codomain(pyalign.solve.Alignment)
solver_2.solve(problem)
A | A | T | C | G | ||
| | | | |||||
A | A | C | G |
Similarly, we can create a Solver
that only returns scores.
solver_3 = solver_1.to_codomain(pyalign.solve.Score)
solver_3.solve(problem)
2.0
To illustrate the difference in runtimes, here is a quick benchmark.
Another important facette of codomain control is setting the number of returned results. By default, only one optimal solution or alignment gets computed.
However, using codomain specificationas, you can instruct Solver
to return all optimal scores, alignments or solutions. To enable this, set the codomain to an Iterator
of those types. With the returned Iterator
it is then possible to iterate over all optimal solutions of the problem.
from typing import Iterator
solver_4 = solver_1.to_codomain(Iterator[pyalign.solve.Alignment])
[x.score for x in solver_4.solve(problem)]
[2.0, 2.0]
It is also possible to obtain List
s of items. List
s might be easier to work with than Iterator
s. Iterator
s on the other hand create the results lazily and on demand. This means Iterator
s consume only O(1) memory, whereas List
s might contain O(n2) elements for some problems.
from typing import List
solver_4 = solver_1.to_codomain(List[pyalign.solve.Alignment])
solver_4.solve(problem)
[<pyalign.solve.Alignment at 0x7f89a7a79550>, <pyalign.solve.Alignment at 0x7f89a7a79730>]
"compute x for the price of one"
Computation of alignments can be sped up with data level parallelism in modern CPUs (i.e. SIMD instructions). In order to make use of these, problems need to be given to pyalign
in lists, i.e. not one at a time.
⚠ | Batches incur some overhead and the speed-up is more noticeable with longer sequences and batches that contain sequences of similar length. |
import pyalign
import random
def generate_problems(n, a=5, b=15):
'''
generate n problems, each with random sequences
of length between a and b
'''
alphabet = "ABCD"
def gen_seq(k):
return "".join(random.choices(alphabet, k=k))
problem_set = pyalign.problems.alphabetic(
alphabet,
lambda x, y: 1 if x == y else -1,
direction="maximize")
problems = []
for _ in range(n):
problems.append(problem_set.new_problem(
gen_seq(random.randint(a, b)),
gen_seq(random.randint(a, b))))
return problems
We generate a list of problems and a solver that returns scores:
some_problems = generate_problems(10)
score_solver = pyalign.solve.LocalSolver(codomain=pyalign.solve.Score)
Solving the problems through single calls to solve
vs. using a single batched call returns the same results:
[score_solver.solve(p) for p in some_problems]
[4.0, 6.0, 2.0, 4.0, 3.0, 3.0, 8.0, 2.0, 4.0, 6.0]
score_solver.solve(some_problems)
[4.0, 6.0, 2.0, 4.0, 3.0, 3.0, 8.0, 2.0, 4.0, 6.0]
As demonstrated below, the runtimes are different though:
import time
def benchmark(f, title, n=1000):
t0 = time.perf_counter_ns()
for _ in range(n):
r = f()
t1 = time.perf_counter_ns()
print(f"{title}: time per solve: {(t1 - t0) / (n * 1000):.1f} μs")
benchmark(lambda: [score_solver.solve(p) for p in some_problems], "no batch")
no batch: time per solve: 1092.9 μs
benchmark(lambda: score_solver.solve(some_problems), "as batch")
as batch: time per solve: 366.9 μs