In this notebook we will take a look at how to write some array-oriented APL code that justifies some text. In case you are wondering what "justified text" is, take a look at the example in the Wikipedia.
(The content of this notebook draws heavily from Lesson 42 of the APL Cultivation chat lessons and is a written follow-up to Rodrigo Girão Serrão's presentation at APL Seeds'21.)
While we prototype our solution that justifies some text, let us define some text to play around with:
⎕← text ← 'To APL, or not to APL, that is' 'the question: whether ''tis' 'nobler in the mind to suffer' 'the slings and arrows of the' 'outrageous fortune...'
Now we have a vector with 5 elements, where each box represents one of the lines of the text we want to justify:
≢text
However, instead of dealing with a vector of "strings" – which in APL we call character vectors – it would be more useful to work with the text data in a different format:
↑text
If we work with the text in this format, it looks more like printed text on a page and it also becomes easier to develop our solution.
With that in mind, let us redefine our text
variable to hold the text like this:
text ← ↑text
We can look at ↑
(which is called mix) as a function that is stacking up the vectors that compose text
.
The resulting data we have is now a matrix: a two-dimensional structure composed of characters.
The way you check text
is a matrix is by asking APL to tell you the shape of text
:
⍴text
⍴
is called shape and it tells you that text
has 5
rows and 30
columns...
But that is certainly interesting, because some of the lines are clearly shorter than 30 characters...
When we used ↑
to convert our text
to a matrix, APL padded the shorter lines with spaces so that all lines had the same length.
The easiest way to verify that we actually have trailing whitespace on the shorter lines is by using @
to replace them with something else:
'⎕'@(' '=⊢)text
You can read the line above as
“Put the character
'⎕'
at the positions where the character' '
is equal to the characters in thetext
matrix.”
Working with a matrix of characters is already half of the work done, because it does simplify our code greatly.
Of course this particular data format is only useful because we had an idea for how to solve this problem of justifying some text. Let us take another look at the matrix:
text
In order to justify this text, what we want to do is take the trailing spaces and spread them along the inner spaces of each line. In essence, we want to go from
⍝ To⎕APL,⎕or⎕not⎕to⎕APL,⎕that⎕is
⍝ the⎕question:⎕whether⎕'tis⎕⎕⎕⎕
⍝ nobler⎕in⎕the⎕mind⎕to⎕suffer⎕⎕
⍝ the⎕slings⎕and⎕arrows⎕of⎕the⎕⎕
⍝ outrageous⎕fortune...⎕⎕⎕⎕⎕⎕⎕⎕⎕
to
⍝ To⎕APL,⎕or⎕not⎕to⎕APL,⎕that⎕is
⍝ the⎕⎕⎕question:⎕⎕whether⎕⎕'tis
⍝ nobler⎕⎕in⎕⎕the⎕mind⎕to⎕suffer
⍝ the⎕⎕slings⎕⎕and⎕arrows⎕of⎕the
⍝ outrageous⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕fortune...
If we take a closer look at the second row of these two matrices, we notice that there were four trailing spaces that got divided as evenly as possible among the three inner spaces that existed: the first space (between “the” and “question”) is now 3 spaces wide and the two other spaces are 2 spaces wide:
⍝ the⎕question:⎕whether⎕'tis⎕⎕⎕⎕
⍝ the⎕⎕⎕question:⎕⎕whether⎕⎕'tis
In order to achieve this, what we will do is a build an integer matrix (with a shape that matches the shape of text
) and each integer in that matrix will tell how many copies of each character we want in the final result.
For example, for the second row of text
, we would like this integer row:
chars ← 'the question: whether ''tis '
ints ← 1 1 1 3 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 0 0 0 0
↑chars ints
Each integer below each character tells how many copies of that character we want:
1
, because we want to preserve that character as-is;0
, because we want to remove that character from the line; and3
and 2
, because we want to create copies of those characters.Without further ado, let's start building this matrix. We will do it in three simpler steps.
' '=text
What we want to do now is take the Boolean matrix above and extract another Boolean matrix that contains 1
for all characters in the text
that we care about and that contains 0
in the positions of the trailing spaces.
To achieve this effect, we first use spaces
to create a matrix that identifies the trailing spaces with 1
s, and then we negate that matrix:
⌽∧\⌽' '=text
There are three distinct symbols in the line above, so what do they do?
⌽
is the reverse function and the way it works is really simple:
⌽1 2 3 4
⌽text
The part that may require a deeper explanation is the ∧\
part.
Start by noticing that ∧
is the Boolean and function:
0∧0
0∧1
1∧0
1∧1
(In APL we use 0
for “false” and 1
for “true”.)
Then \
is an operator: an APL glyph that takes a function and modifies it in some way.
To understand what \
does, let us start by looking at the following:
2×3×4×5
×/2 3 4 5
10+45+23+1
+/10 45 23 1
Can you spot the pattern?
/
is the reduce operator: it takes a function on its left and puts it in between the elements of the vector on the right.
\
is the scan operator and works in a similar manner, but it gives you the intermediate results as well:
(2) (2×3) (2×3×4) (2×3×4×5)
×\2 3 4 5
∧
¶f/
can often be read as a single unit:
+/
is "sum"×/
is "product",/
is "join"In particular, ∧/
is read as "all", as it returns 1
only if all elements are 1
:
∧/1 1 1
∧/1 1 0 1
Tied with ∧/
is ∧\
which returns the partial results:
∧\1 1 0 1
Because of the interpretation of ∧/
and the way \
scans the input vector, ∧\
takes a Boolean vector and returns a Boolean vector with the leading 1
s preserved and everything else as 0
s:
∧\1 1 1 1
∧\1 1 0 0 1 1
∧\0 1 1 1 1 1
You can read ∧\
as "were all 1
so far?".
When you use f/
or f\
with a matrix, it is as if you did f/
or f\
on each row separately, so
+/' '=text
tells you how many spaces each line has, for example.
Now you should be able to better grasp what we did above: ⌽spaces
puts the trailing spaces in the beginning and then ∧\
makes it so that only the leading 1
s remain:
∧\⌽' '=text
When we reverse once more we put the 1
s back in their place, therefore creating a Boolean matrix that tells us exactly where the trailing spaces are:
⌽∧\⌽' '=text
We can negate this matrix, therefore creating a matrix that tells us what characters we want to keep and which characters (the trailing spaces) to discard:
⎕← keep ← ~⌽∧\⌽' '=text
The function ~
above is the negate function and it turns 0
s into 1
s and 1
s into 0
s.
Now that we have two Boolean matrices, one that tells the positions that contain spaces, and another that tells the positions that contain characters to keep, we can combine them to create a matrix with the positions of the inner spaces:
⎕← inner ← keep∧' '=text
We will now investigate, for each of the positions marked in this inner
matrix, how many spaces should be added in those positions.
In order to figure that out, we first need to know how many trailing spaces each row has to distribute:
⎕← trail ← +/~keep
Just remember that +/
sums each row and ~keep
has 1
s where the trailing spaces are.
Similarly, we need to know how many positions exist that could receive trailing spaces, but this is just a matter of summing the inner
matrix along the rows:
⎕← spaces ← +/inner
Let us put the trail
and spaces
vectors on top of each other:
↑trail spaces
Here is what this means:
0
trailing spaces that need to be distributed among 7
positions;4
trailing spaces that need to be distributed among 3
positions;9
trailing spaces that need to be distributed among 1
position.If we divide trail
by spaces
we can figure out how many spaces to add to each position in each row:
trail÷spaces
However, we cannot add 1.33333
spaces in a position, nor 0.4
spaces.
We can only add whole characters, so let us round those values down:
⌊trail÷spaces
This shows how many spaces each position in each row can get, so let us encode that information in the inner
matrix that we have:
inner
The inner
matrix just says which positions can receive spaces, but we want a matrix that says how many spaces to add in each position.
One of the simplest ways to do that is by multiplying each row by the corresponding value in ⌊trail÷spaces
.
It would be great if we could write
inner × ⌊trail÷spaces
RANK ERROR: Mismatched left and right argument ranks inner×⌊trail÷spaces ∧
But unfortunately we cannot, so we need to tell APL that the ×
times function should operate on each row of inner
and each element of the vector ⌊trail÷spaces
, and we can do that with the rank operator ⍤
:
⎕← add ← inner (×⍤1 0) ⌊trail÷spaces
First, note that in each row, the 1
s have been replaced by the corresponding number in ⌊trail÷spaces
.
The rank operator ⍤
is a really powerful operator but it might take some time to wrap your head around it.
I can recommend Richard Park's webinars on the rank operator.
You might have noticed that the add
matrix, that tells you how many spaces can be added to each position, does not account for all the trailing spaces that had to be distributed:
+/add
trail
That is because some of the rows were such that the number of inner spaces did not divide into the total number of trailing spaces that had to be distributed. We still need to distribute these spaces.
First of all, how many do we still have to distribute? You can compute it as
trail - +/add
But it can also be
spaces|trail
if you make use of the residue function |
.
Now that we know how many spaces are missing in each row, we need to figure out which of the inner
positions receive those extra spaces.
From the result of spaces|trail
we see that
We wish to create a matrix, similar to inner
, but that tells us what positions get these extra spaces.
In order to get there, we start by scanning inner
with +
:
+\inner
Notice that this creates an interesting integer matrix, where the value in each row increases by 1
whenever we hit a 1
in the original inner
matrix:
inner
If we multiply the two together, we get an enumeration of the marked inner
positions:
inner × +\inner
Now we can see which of these numbered positions need an extra space.
We can do so by comparing the rows of that matrix with the values in spaces|trail
.
Again, we cannot do this directly:
(inner×+\inner) ≤ spaces|trail
RANK ERROR: Mismatched left and right argument ranks (inner×+\inner)≤spaces|trail ∧
We need to use the rank operator ⍤
once more to pair the rows of inner×+\inner
with the values of spaces|trail
:
(inner×+\inner) (≤⍤1 0) spaces|trail
This result looks funny because our comparison also looked at all of the 0
s in the inner
matrix, although we don't care about those.
In order to fix it, we need to multiply by inner
again:
inner × (inner×+\inner) (≤⍤1 0) spaces|trail
Now this shows the positions that will receive one extra space. Let us save it in a variable:
extra ← inner × (inner×+\inner) (≤⍤1 0) spaces|trail
Notice that we could multiply by inner
just once, in the end, instead of twice:
inner × (+\inner) (≤⍤1 0) spaces|trail
We have our three matrices keep
, add
, and extra
, so we can build the matrix we set out to build:
keep+add+extra
We just have to use it to replicate the original characters by the correct amount:
text
Luckily, APL has a replicate function, which is also in the symbol /
.
For example:
1 0 2 0 3 0 / 'DYALOG'
Unfortunately, /
does not work as-is with two matrices:
(keep+add+extra)/text
RANK ERROR (keep+add+extra)/text ∧
We need to flatten the two matrices with ,
, and then use ⍴
to reshape the result back to its original shape:
(⍴text)⍴(,keep+add+extra)/,text
If we take all our code and put it in a (single-line) dfn, here is what we get:
Just ← { keep ← ~⌽∧\⌽' '=⍵ ⋄ inner ← keep∧' '=⍵ ⋄ trail ← +/~keep ⋄ spaces ← +/inner ⋄ add ← inner(×⍤1 0)⌊trail÷spaces ⋄ extra ← inner×(+\inner)(≤⍤1 0)spaces|trail ⋄ (⍴⍵)⍴(,keep+add+extra)/,⍵ }
Just text
(Notice that, for one, this implementation may not be the best solution to this task, but was written in this way because of the context in which it was presented. Secondly, this dfn should probably be defined across multiple lines, but current limitations of the TryAPL site mean we can only define one-line dfns.)
While our code works quite well, it has a very serious shortcoming: it does not work on text with empty lines...
text ← ↑'This is the first line.' '' 'This is another paragraph.'
text
Just text
DOMAIN ERROR: Divide by zero Just[0] Just←{keep←~⌽∧\⌽' '=⍵ ⋄ inner←keep∧' '=⍵ ⋄ trail←+/~keep ⋄ spaces←+/inner ⋄ add←inner(×⍤1 0)⌊trail÷spaces ⋄ extra←inner×(+\inner)(≤⍤1 0)spaces|trail ⋄ (⍴⍵)⍴(,keep+add+extra)/,⍵} ∧
Gladly, we can fix this without too much trouble.
We can just add some pre-processing to ignore blank lines or lines that do not need justifying (lines that have only 1 word or lines that are too short) and reuse our Just
function.
(This was shown in the APL Seeds presentation very briefly, with no explanation, as it was used to justify a text with approximately 12.500 lines (a text file of Jules Verne's “Twenty Thousand Leagues Under the Sea”, downloaded from Project Gutenberg.)
BetterJust ← { s ← ' '=⍵ ⋄ t ← ⌽∧\⌽s ⋄ fewWs ← 0=+/s-t ⋄ shortL ← (+/t)>0.25×⊃⌽⍴⍵ ⋄ use ← ~fewWs∨shortL ⋄ result ← ⍵ ⋄ (use⌿result) ← Just use⌿⍵ ⋄ result }
BetterJust text
We can also take our original text, add a header and a blank line for demonstration purposes, and justify it again:
⎕← text ← ↑'HAMLET' '' 'To APL, or not to APL, that is' 'the question: whether ''tis' 'nobler in the mind to suffer' 'the slings and arrows of the' 'outrageous fortune...'
BetterJust text
Notice that, now, the last line doesn't get justified because it is "too short".
What "too short" means is controlled by a parameter in the BetterJust
function, which is set to 0.25
right now (meaning lines that have 25% or more of trailing whitespace are not justified).