In [4]:
CODE_DIR := "../gap/";;
Read(Concatenation(CODE_DIR, "pmatmul.g"));
Read(Concatenation(CODE_DIR, "makepar.g"));
Read(Concatenation(CODE_DIR, "bench.g"));

Simple speedups on multicore workstations with minimal programming

Multiplying matrices

We create a couple of random integer matrices and benchmark GAP's default serial multiply.

In [7]:
m1 := MakeImmutable(RandomMat(2000, 2000, Integers));;
m2 := MakeImmutable(RandomMat(2000, 2000, Integers));;
ShowBench(\*, m1, m2);
wall time: 14.6s cpu time: 14.6s memory allocated: 30.62MB result returned

If we didn't have a parallel matrix multiply already, then in a few lines of code, we could implement a simple blocked matrix multiply based on $$(A B)_{ik} = \sum_j A_{ij} B_{jk},$$ where the $A_{ij}$ and $B_{jk}$ are submatrix blocks.

In [8]:
MatMulWithTasks := function(m1, m2, chop1, chop2, chop3)
    local  A, B, prodtasks, sumtasks, C;
    
    # divide matrices into blocks
    A := block(m1, chop1, chop2);
    B := block(m2, chop2, chop3);

    # Start chop1*chop2*chop3 multiply tasks
    prodtasks := List([1..chop1], i-> List([1..chop2], j-> 
        List([1..chop3], k -> RunTask(\*, A[i][j],B[j][k]))));
    # And chop1 * chop3 tasks to do the summations
    sumtasks := List([1..chop1], i -> List([1..chop3], k-> 
        RunTask(Accumulate,AddMat,ShallowCopyMat, 
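                # the 'column' of tasks computing A[i][j]*B[j][k] for j in [1..chop2]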
                prodtasks[i]{[1..chop2]}[k])));
    # Finally wait for the summations to complete and assemble the result
    C := List(sumtasks, row -> List(row, TaskResult));
    return unblock(C);
end;
Out[8]:
function( m1, m2, chop1, chop2, chop3 ) ... end
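
MatMulWithTasks relies on a few helpers loaded from pmatmul.g whose definitions are not shown here: block and unblock chop a matrix into a grid of immutable submatrices and reassemble it, while AddMat and ShallowCopyMat give Accumulate an in-place matrix addition and a mutable starting copy. A minimal sketch of what they might look like (the versions actually loaded may differ; chopRange is a hypothetical helper):

chopRange := function(n, k)
    # split [1..n] into k contiguous, nearly equal ranges
    return List([1..k], i -> [QuoInt((i-1)*n, k)+1 .. QuoInt(i*n, k)]);
end;;

block := function(m, nr, nc)
    # cut matrix m into an nr x nc grid of immutable submatrix blocks
    local rr, cc;
    rr := chopRange(Length(m), nr);
    cc := chopRange(Length(m[1]), nc);
    return List(rr, r -> List(cc, c ->
        MakeImmutable(List(m{r}, row -> row{c}))));
end;;

unblock := function(C)
    # reassemble a grid of blocks into a single matrix
    return Concatenation(List(C, blockRow ->
        List([1..Length(blockRow[1])], i ->
            Concatenation(List(blockRow, b -> b[i])))));
end;;

ShallowCopyMat := m -> List(m, ShallowCopy);;  # row-by-row mutable copy

AddMat := function(acc, m)
    # add matrix m into acc in place, row by row
    local i;
    for i in [1..Length(acc)] do
        AddRowVector(acc[i], m[i]);
    od;
end;;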
In [9]:
ShowBench(MatMulWithTasks, m1, m2, 4, 4, 4);
wall time: 2.55s cpu time: 27.1s memory allocated: 280.67MB result returned

The Accumulate function used in this example illustrates another mechanism: WaitAnyTask.

It takes a list of tasks and combines their results in whatever order they become available, allowing memory to be recovered quickly.

These two mechanisms, tasks (RunTask and TaskResult) and WaitAnyTask, are the only HPCGAP-specific functions used here.

In [10]:
Display(Accumulate);
function ( op, makebase, tasks )
    local i, acc;
    i := WaitAnyTask( tasks );
    acc := makebase( TaskResult( tasks[i] ) );
    Remove( tasks, i );
    while Length( tasks ) > 0 do
        i := WaitAnyTask( tasks );
        op( acc, TaskResult( tasks[i] ) );
        Remove( tasks, i );
    od;
    return acc;
end
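
As a toy illustration (not from the original demo), Accumulate works with any collection of tasks; here it sums the squares of 1..8 in whatever order the tasks happen to finish, using a one-element list as the mutable accumulator:

squareTasks := List([1..8], i -> RunTask(x -> x^2, i));;
# makebase wraps the first available result in a mutable list;
# op then adds each subsequent result into it, in completion order
Accumulate(function(acc, v) acc[1] := acc[1] + v; end,
           v -> [v], squareTasks);
# yields [ 204 ] regardless of the order of completion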

Searching for association schemes

We search for association schemes preserved by interesting permutation groups. Our initial filter selects, from GAP's primitive groups database, groups for which the problem is non-trivial.

In [7]:
c := cands([136,165],[1..13]);; List(c, x -> x.g); List(c, x-> x.rank);
Out[7]:
[ PSL(2, 17), M_11 ]
Out[7]:
[ 11, 7 ]
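
The cands helper is defined in the files loaded at the top and is not shown. A plausible sketch, assuming it scans GAP's primitive groups library for the given degrees and keeps the groups whose permutation rank lies in the given range (the real filter presumably applies further non-triviality conditions):

cands := function(degrees, ranks)
    # hypothetical reconstruction; the loaded version may differ
    local result, d, i, g, r;
    result := [];
    for d in degrees do
        for i in [1 .. NrPrimitiveGroups(d)] do
            g := PrimitiveGroup(d, i);
            r := RankAction(g, [1..d]);   # rank of the permutation action
            if r in ranks then
                Add(result, rec(g := g, rank := r));
            fi;
        od;
    od;
    return result;
end;;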

We apply, for now, a brute-force search over all partitions of the set $\{2,\ldots,r\}$, where $r$ is the rank of the permutation action. This is a fair-sized search space and it grows very rapidly with $r$: the number of such partitions is the Bell number $B_{r-1}$, so for $r = 11$ it is $B_{10} = 115975$.

In [ ]:
NrPartitionsSet([2..11]);
In [15]:
BruteForceSearch := function ( s )
    return Filtered( PartitionsSet([2..s.rank]), 
    p -> TestPartition( s, p ));        
end;;
ShowBench(BruteForceSearch, c[1]);
wall time: 34.7s cpu time: 29.0s memory allocated: 16.38GB result returned

A very simple approach to parallelising this brute-force search produces a useful speedup.

In [9]:
ParBruteForceSearch := function ( s )
    return ParFiltered( PartitionsSet( [ 2 .. s.rank ] ), 
    p -> TestPartition( s, p ), 10);        
end;
ShowBench(ParBruteForceSearch, c[1]);
wall time: 21.7s cpu time: 40.8s memory allocated: 15.56GB result returned
Out[9]:
function( s ) ... end
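
ParFiltered is loaded from makepar.g. A minimal sketch of a chunked parallel Filtered in the same spirit, assuming the third argument is a chunk size (the loaded version may differ):

ParFiltered := function(list, pred, chunkSize)
    # filter each chunk of the input in its own task,
    # then reassemble the survivors in their original order
    local chunks, tasks;
    chunks := List([0 .. QuoInt(Length(list)-1, chunkSize)],
                   i -> list{[i*chunkSize+1 ..
                              Minimum((i+1)*chunkSize, Length(list))]});
    tasks := List(chunks, c -> RunTask(Filtered, c, pred));
    return Concatenation(List(tasks, TaskResult));
end;;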

Atomic and Locked Data

Shared data is unavoidable in many symbolic computations. HPCGAP offers several forms of support for it. First, let's see what happens if we use the default plain lists (this would normally be forbidden, but we've suppressed some checking).

A simple but irregular calculation:

In [12]:
p := NextPrimeInt(2^100);;
x := NextPrimeInt(2^50);;
ShowBench(PowerModInt, x, x, p );
wall time: 3.00us cpu time: 0ns memory allocated: 96B result returned

We run 10000 instances of this in tasks, and they all attempt to append their results to a normal list l:

In [17]:
l := [];;
process := function(i) Add(l, MakeImmutable([i, PowerModInt(x+i,x+i,p)])); end;;
tasks := List([1..10000], i -> RunTask(process,i));; WaitTasks(tasks);;
Length(l);
Out[17]:
9967

Some are lost. If instead we use an atomic list data structure, all the data is successfully written.

In [21]:
l := AtomicList();;
tasks := List([1..10000], i -> RunTask(process,i));; WaitTasks(tasks);;
Length(l);
Out[21]:
10000

However, atomic lists only protect single operations. If we run tasks which each append twice to the list, we quickly find that some of the pairs do not stay together.

In [29]:
l := AtomicList();;
process := function(i) Add(l,i); Add(l,i); end;;
tasks := List([1..100000], i -> RunTask(process,i));; WaitTasks(tasks);;
Length(l); 
x := First([1..100000], i-> l[2*i-1] <>l[2*i]); l[2*x-1];l[2*x];
Out[29]:
200000
Out[29]:
10
Out[29]:
10
Out[29]:
11

If we want something more like atomic transactions, we need to use shared objects, which have explicit locking.

In [35]:
l := [];; ShareObj(l);;
process := function(i) atomic readwrite l do 
    Add(l,i); Add(l,i); 
od; end;;
tasks := List([1..100000], i -> RunTask(process,i));; WaitTasks(tasks);;
atomic readonly l do  # we need to acquire a read lock on l even to inspect it
    Print(Length(l),"\n");
    Print(ForAll([1..100000], i-> l[2*i-1] = l[2*i]),"\n");
od;
200000
true