The Meataxe64 Package, Core Functionality

In [4]:
LoadPackage("meataxe64");; Read("../gap/bench.g"); 
Read("../gap/mtx64utils.g"); LoadPackage("jupyterviz");;

Monster Matrix Multiplication

The Classification of Finite Simple Groups was one of the highlights of 20th-century mathematics.

It included the discovery of the sporadic simple Fischer-Griess Monster $\mathbb{M}$ of order

$808017424794512875886459904961710757005754368000000000$.

$\mathbb{M}$ has many intriguing links to other areas of mathematics and even to theoretical physics, so we'd like to compute in it.

**But**, the most tractable representation of this group is as $196882\times 196882$ matrices over the integers mod 2 -- each matrix occupies about 5GB.
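The 5GB figure is simple bit-packing arithmetic (a back-of-envelope check assuming one bit per matrix entry, not Meataxe64's exact storage format):

```python
# A 196882 x 196882 matrix over GF(2), stored at one bit per entry.
bits = 196882 ** 2
print(bits / 8 / 10**9)   # about 4.85 GB, i.e. roughly 5GB
```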

For many years this was clearly beyond any practical computation. The first actual multiplication of two general elements was done in 1998, using most of the computing resources of a large maths department for **45 hours**.

Before OpenDreamKit, this computation would still have taken days. We have contributed to the development of a massively improved C and assembler library, and a GAP interface to it.

A Monster matrix multiplication now takes about **8 minutes** on a laptop, scaling to 2-3 minutes on a multicore server.

We start with matrices half that size for speed.

In [9]:
size := 196882/2;; f := MTX64_FiniteField(2);;
m := MTX64_RandomMat(f,size,size);
Out[9]:
< matrix 98441x98441 : <MTX64 GF(2)>>

We can multiply this by itself in about 1 minute:

In [11]:
ShowBench(ParMult,m,m);
wall time: 53.7s cpu time: 505s memory allocated: 1.12GB result returned

We'll start the full-sized computation and come back to it after the rest of Clement's talk.

In [13]:
m := MTX64_RandomMat(f,2*size,2*size);; ShowBench(ParMult, m, m);
wall time: 469s cpu time: 4800s memory allocated: 4.51GB result returned

This is not just for the Monster. High-performance linear algebra over these small fields is an essential component of a VRE for users in many areas. As well as multiplication, we need

Gaussian Elimination

Our other key primitive operation, and one that is much more difficult to parallelize in general.

Our basic Gaussian elimination operation, applied to a matrix $A$, computes $M$, $K$, $R$, $\gamma$ and $\rho$ satisfying:

$$\begin{pmatrix}M&0\\K&1\end{pmatrix} \rho A \gamma = \begin{pmatrix}-1&R\\0&0\end{pmatrix}$$

where $\gamma$ and $\rho$ are permutations that effectively select the pivot columns and pivot rows of $A$. This amounts to a full reduced row echelon form, together with the transforming matrices.
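To make the identity concrete, here is a minimal NumPy sketch of the same decomposition over a prime field $GF(p)$. This is an illustration only: `echelize` and `inv_mod` are hypothetical helper names, the code is dense textbook Gaussian elimination rather than Meataxe64's algorithm, and it handles prime fields only.

```python
import numpy as np

def inv_mod(A, p):
    """Invert a square matrix over GF(p) by Gauss-Jordan elimination."""
    n = A.shape[0]
    M = np.concatenate([A % p, np.eye(n, dtype=np.int64)], axis=1)
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r, c])
        M[[c, piv]] = M[[piv, c]]                      # move a nonzero pivot up
        M[c] = (M[c] * pow(int(M[c, c]), -1, p)) % p   # scale pivot row to 1
        for r in range(n):
            if r != c and M[r, c]:
                M[r] = (M[r] - M[r, c] * M[c]) % p     # clear the column
    return M[:, n:]

def echelize(A, p):
    """Find M, K, R, rho, gamma with [[M,0],[K,1]] (rho A gamma) = [[-1,R],[0,0]]."""
    A = A % p
    nrows, ncols = A.shape
    W, prows, pcols = A.copy(), [], []
    for c in range(ncols):                 # plain elimination, to locate the pivots
        r = next((r for r in range(nrows)
                  if r not in prows and W[r, c]), None)
        if r is None:
            continue
        prows.append(r); pcols.append(c)
        W[r] = (W[r] * pow(int(W[r, c]), -1, p)) % p
        for rr in range(nrows):
            if rr != r and W[rr, c]:
                W[rr] = (W[rr] - W[rr, c] * W[r]) % p
    k = len(prows)
    rho = prows + [r for r in range(nrows) if r not in prows]
    gamma = pcols + [c for c in range(ncols) if c not in pcols]
    B = A[np.ix_(rho, gamma)]              # rho A gamma: invertible k x k block top-left
    A11inv = inv_mod(B[:k, :k], p)
    M = (-A11inv) % p                      # M = -A11^-1
    R = (M @ B[:k, k:]) % p                # R = -A11^-1 A12
    K = (-(B[k:, :k] @ A11inv)) % p        # K = -A21 A11^-1
    return M, K, R, rho, gamma

# A random matrix of rank at most 3 over GF(5), as a test case.
p, rng = 5, np.random.default_rng(0)
A = (rng.integers(0, p, (5, 3)) @ rng.integers(0, p, (3, 7))) % p
M, K, R, rho, gamma = echelize(A, p)
k, n = M.shape[0], A.shape[0]
T = np.zeros((n, n), dtype=np.int64)       # T = [[M, 0], [K, 1]]
T[:k, :k], T[k:, :k] = M, K
T[k:, k:] = np.eye(n - k, dtype=np.int64)
lhs = (T @ A[np.ix_(rho, gamma)]) % p
target = np.zeros_like(lhs)                # [[-1, R], [0, 0]]
target[:k, :k] = (-np.eye(k, dtype=np.int64)) % p
target[:k, k:] = R
assert (lhs == target).all()
print("rank =", k)
```

The real `MTX64_Echelize` returns the corresponding pieces as the record fields `multiplier`, `cleaner` and `remnant` used later in this notebook.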

Using this, we can compute inverses, solve systems of equations, determine nullspaces, etc. efficiently.
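To see, for example, why nullspaces come for free: multiplying both sides of the identity above on the right by the block matrix $\begin{pmatrix}R\\1\end{pmatrix}$ gives

$$\begin{pmatrix}M&0\\K&1\end{pmatrix} \rho A \gamma \begin{pmatrix}R\\1\end{pmatrix} = \begin{pmatrix}-1&R\\0&0\end{pmatrix}\begin{pmatrix}R\\1\end{pmatrix} = \begin{pmatrix}-R+R\\0\end{pmatrix} = 0,$$

and since the transforming matrix on the left is invertible, the columns of $\gamma\begin{pmatrix}R\\1\end{pmatrix}$ form a basis of the nullspace of $A$.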

To see what it does properly we need a singular matrix. We take the Kronecker (tensor) product of two rectangular matrices. (If $A$ is $n\times m$ and $B$ is $m\times n$ with $m < n$ then $A\otimes B$ will be $nm\times nm$ of rank at most $m^2$.)
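That rank bound is easy to sanity-check numerically. An illustrative NumPy sketch over the integers follows; rank is multiplicative for Kronecker products over any field, so the same bound holds over $GF(q)$.

```python
import numpy as np

# If A is n x m and B is m x n with m < n, then A (x) B is nm x nm
# with rank(A (x) B) = rank(A) * rank(B) <= m^2.
n, m = 8, 3
rng = np.random.default_rng(1)
A = rng.integers(-5, 6, size=(n, m))
B = rng.integers(-5, 6, size=(m, n))
K = np.kron(A, B)
print(K.shape, np.linalg.matrix_rank(K))   # a 24 x 24 matrix of rank at most 9
```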

We'll try a different small finite field.

In [42]:
f := MTX64_FiniteField(9);;
m1 := MTX64_RandomMat(f, 200,99);;
m2 := MTX64_RandomMat(f, 99,200);;
m := MTX64_KroneckerProduct(m1,m2);
Out[42]:
< matrix 19800x19800 : <MTX64 GF(3^2)>>
In [47]:
ech := fail;; # suppress a warning.
ShowBench(function() ech := MTX64_Echelize(m);end); 
ech.multiplier; ech.cleaner; ech.remnant;  # M, K and R in the above formula
wall time: 11.8s cpu time: 11.0s memory allocated: 747.81MB no result returned
Out[47]:
< matrix 9801x9801 : <MTX64 GF(3^2)>>
Out[47]:
< matrix 9999x9801 : <MTX64 GF(3^2)>>
Out[47]:
< matrix 9801x9999 : <MTX64 GF(3^2)>>

We can also use the multi-threaded version of the Gaussian elimination (although this problem is a little small for it).

In [35]:
MTX64_WriteMatrix(m, "a"); 
ShowBench(MTX64_fEchelize, ".", "a", "gamma", "rho", "m", "k", "r");
wall time: 4.37s cpu time: 44.1s memory allocated: 144B result returned
Out[35]:
true

Run-time of Multiplication versus Matrix Size

We set the field and maximum dimension, and make a set of random matrices of different sizes.

In [18]:
q := 5;; maxdim := 12000;; steps := 10;;
sizes := List([1..steps], i-> i*QuoInt(maxdim, steps));;
mats := List(sizes, i-> MTX64_RandomMat(MTX64_FiniteField(q), i, i));
Out[18]:
[ < matrix 1200x1200 : <MTX64 GF(5)>>, < matrix 2400x2400 : <MTX64 GF(5)>>, < matrix 3600x3600 : <MTX64 GF(5)>>, < matrix 4800x4800 : <MTX64 GF(5)>>, < matrix 6000x6000 : <MTX64 GF(5)>>, < matrix 7200x7200 : <MTX64 GF(5)>>, < matrix 8400x8400 : <MTX64 GF(5)>>, < matrix 9600x9600 : <MTX64 GF(5)>>, < matrix 10800x10800 : <MTX64 GF(5)>>, < matrix 12000x12000 : <MTX64 GF(5)>> ]

And look at the timing for squaring them:

In [21]:
marks1 := List(mats, x-> BenchMark(\*,x,x));;
marksm := List(mats, x-> BenchMark(ParMult,x,x));;
Plot(
[sizes,List(marks1, x-> x.cpu), rec(name := "single-threaded",
title := "Meataxe64 runtimes for matrix multiply", xaxis := "Dimension", yaxis := "ms")],
[sizes,List(marksm, x-> QuoInt(x.wall,10^6)), rec(name := "multi-threaded wall time")]
);
Out[21]:
[Plot: "Meataxe64 runtimes for matrix multiply"; single-threaded CPU time and multi-threaded wall time (ms) against dimension]