breaking paragraphs into lines
a typesetting system often involves breaking a long sequence of words into lines of appropriate size; this job seems trivial, but actually needs careful thoughts to produce aesthetic results; for example, if we have the following text in one line:
aaaaaaaaa aaaaaa aaaaaaa aaaaaaaaaaaa a aaaa aaaa aaaa aaaaaa aaaaaaaaaaaaa
aaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa aa aaaaa a aaaaaaaaaaaaaaaaaa aaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaa aaa a a aaaaaaaaaaaaaaaaaaa aaaaa aaaaaaa
aaaaa aaaaaaaa aaaaaaa aaaa aaaaaaa a aaaaaaa aaaaaaaaa a aaaaaaaaa aaaaaaa
paste it in vim, set tw=40
and format with gq
, the result is like this:
aaaaaaaaa aaaaaa aaaaaaa aaaaaaaaaaaa a
aaaa aaaa aaaa aaaaaa aaaaaaaaaaaaa
aaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa aa
aaaaa a aaaaaaaaaaaaaaaaaa aaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaa aaa a
a aaaaaaaaaaaaaaaaaaa aaaaa aaaaaaa
aaaaa aaaaaaaa aaaaaaa aaaa aaaaaaa a
aaaaaaa aaaaaaaaa a aaaaaaaaa aaaaaaa
internally, vim uses a greedy algorithm; while its result looks acceptable, there is a more beautiful way of breaking the same words:
aaaaaaaaa aaaaaa aaaaaaa aaaaaaaaaaaa
a aaaa aaaa aaaa aaaaaa aaaaaaaaaaaaa
aaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa
aa aaaaa a aaaaaaaaaaaaaaaaaa aaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaa aaa
a a aaaaaaaaaaaaaaaaaaa aaaaa aaaaaaa
aaaaa aaaaaaaa aaaaaaa aaaa aaaaaaa a
aaaaaaa aaaaaaaaa a aaaaaaaaa aaaaaaa
we will see how this is made shortly; for now, this introduces the main topic of this article: how do we break a sequence of words into beautiful lines?
the tex algorithm
one such algorithm, often referred to as knuth and plass algorithm, or the tex algorithm, is presented in the paper Breaking Paragraphs into Lines;
it is a long paper with rich backgrounds, but the core ideas are simple:
- define box, glue and penalty;
- define adjustment ratio, badness and demerit;
- find a path along feasible breakpoints that minimizes total demerits;
these are actually more than what we need here; below we give a much simplified model to analyze the problem;
our model
our model is really simple: we are given a sequence of words delimited by single spaces; our goal is to divide them into lines with minimum raggedness; we formulate raggedness using sum of squares of spaces at end of lines; this is the approach adopted by the wikipedia article;
borrowing its example, breaking the same input text into linewidth 6:
AAA BB CC DDDDD
greedy result: 0^2+4^2+1^2=17
------ #space
AAA BB 0
CC 4
DDDDD 1
optimal result: 3^2+1^2+1^2=11
------ #space
AAA 3
BB CC 1
DDDDD 1
so what you see? no boxes, glues, penalties; no adjustment ratios; only sum of squared spaces as badness, which is the same as demerit;
our algorithm
to be precise, this is our implementation, not invention, of the algorithm; the algorithm itself has been referenced decades ago as “conventional”;
#!/usr/bin/env python3
## break a paragraph into lines;
##
## ## run
##
## ./linebreak.py [pro] [rev] [alt] [{linewidth}] < {input.txt}
##
## - pro: calc width on the fly (faster);
## - rev: reverse dp walk;
## - alt: solve alternate problem (ignoring last line badness);
import numpy as np
import sys
## read tokens;
tokens = input().rstrip().split(' ')
## number of tokens;
N = len(tokens)
## line width;
W = 80
## const: infinity;
INF = 10000000
## init dp cost function (mininum total badness upto, including, this token);
def init_cost():
F = np.empty(N + 1)
for n in range(N + 1):
F[n] = 0 if n == 0 else INF
return F
## init dp path function (leading token 1-index of each line);
def init_path():
P = np.empty(N + 1, dtype=object)
for n in range(N + 1):
P[n] = []
return P
## badness of a line having tokens [i, j];
def badness(i, j):
w = -1
while i <= j:
w += len(tokens[i - 1]) + 1
i += 1
if w > W: return INF
return (W - w) ** 2
## badness of a line having tokens [i, j]; pro;
def badness_pro(i, j, w):
if w > W: return INF
return (W - w) ** 2
## badness of a line having tokens [i, j]; alt;
def badness_alt(i, j):
w = -1
while i <= j:
w += len(tokens[i - 1]) + 1
i += 1
if w > W: return INF
return (W - w) ** 2 if j != N else 0
## badness of a line having tokens [i, j]; alt; pro;
def badness_alt_pro(i, j, w):
if w > W: return INF
return (W - w) ** 2 if j != N else 0
## dp function;
def dp(F, P, badness):
for i in range(1, N + 1):
for j in range(i, N + 1):
tmp = F[i - 1] + badness(i, j)
if tmp < F[j]:
F[j] = tmp
P[j] = P[i - 1] + [i]
## dp function; pro;
def dp_pro(F, P, badness):
for i in range(1, N + 1):
w = -1
for j in range(i, N + 1):
w += 1 + len(tokens[j - 1])
if w > W: break
tmp = F[i - 1] + badness(i, j, w)
if tmp < F[j]:
F[j] = tmp
P[j] = P[i - 1] + [i]
## dp function; rev;
def dp_rev(F, P, badness):
for j in range(1, N + 1):
for i in range(j, 0, -1):
tmp = F[i - 1] + badness(i, j)
if tmp < F[j]:
F[j] = tmp
P[j] = P[i - 1] + [i]
## dp function; rev; pro;
def dp_rev_pro(F, P, badness):
for j in range(1, N + 1):
w = -1
for i in range(j, 0, -1):
w += len(tokens[i - 1]) + 1
if w > W: break
tmp = F[i - 1] + badness(i, j, w)
if tmp < F[j]:
F[j] = tmp
P[j] = P[i - 1] + [i]
## output result using calculated breakpoints;
def output(F, P):
print('cost = {}'.format(F[N]))
print('breakpoints = {}'.format(P[N]))
if F[N] == INF:
print('IMPOSSIBLE')
else:
ans = ''
for i in range(1, N + 1):
if i == 1:
pass
elif i in P[N]:
ans += '\n'
else:
ans += ' '
ans += tokens[i - 1]
print(ans)
## main function;
def main():
alt = 'alt' in sys.argv
pro = 'pro' in sys.argv
rev = 'rev' in sys.argv
for i in range(1000):
if str(i) in sys.argv:
global W
W = i
if pro:
if alt:
_badness = badness_alt_pro
else:
_badness = badness_pro
if rev:
_dp = dp_rev_pro
else:
_dp = dp_pro
else:
if alt:
_badness = badness_alt
else:
_badness = badness
if rev:
_dp = dp_rev
else:
_dp = dp
F = init_cost()
P = init_path()
_dp(F, P, _badness)
output(F, P)
if __name__ == '__main__':
main()
the algorithm is 1-d dynamic programming; F[j]
stores the minimum badness of
breaking tokens upto (including) token j
, whose value is calculated from the
minimum badness of breaking tokens upto (including) token i - 1
, plus badness
of line made by tokens [i, j]
, for all 1 <= i <= j
;
the code has some interesting suffixes:
-
pro
: calculate line width on the fly; this is faster; another technique to speed up line width calculation is subtracting partial sums; -
rev
: reverse dp walk; this simply uses a different order when walking the dp table, which shouldnt affect the result or the running time; -
alt
: this solves an alternate version of the problem, where the last line badness is ignored;
these suffixes can be mixed together; so you may run the above code (saved as
linebreak.py
) as:
./linebreak.py [pro] [rev] [alt] [{linewidth}] < {input.txt}
vim integration
vim has an option formatprg
, with which you can use an external format program
for gq
; you can use the above code after some minor modifications:
-
remove excess
print
statements; -
use the same linewidth as in vim;
-
enable
pro
,rev
,alt
by default, as appropriate;
to use the above code in vim:
:set formatprg=./linebreak.py
now you can reproduce the example at the beginning of this article in vim, using
a fancier gq
;
linear time solution
we havent come to the end of story yet; many years ago, clever people discovered this linebreaking problem has linear time solution:
ok, now we have;