breaking paragraphs into lines

a typesetting system often involves breaking a long sequence of words into lines of appropriate size; this job seems trivial, but actually needs careful thought to produce aesthetic results; for example, suppose we have the following text on one line:

aaaaaaaaa aaaaaa aaaaaaa aaaaaaaaaaaa a aaaa aaaa aaaa aaaaaa aaaaaaaaaaaaa
aaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa aa aaaaa a aaaaaaaaaaaaaaaaaa aaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaa aaa a a aaaaaaaaaaaaaaaaaaa aaaaa aaaaaaa
aaaaa aaaaaaaa aaaaaaa aaaa aaaaaaa a aaaaaaa aaaaaaaaa a aaaaaaaaa aaaaaaa

paste it into vim, set tw=40 and format it with gq; the result looks like this:

aaaaaaaaa aaaaaa aaaaaaa aaaaaaaaaaaa a
aaaa aaaa aaaa aaaaaa aaaaaaaaaaaaa
aaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa aa
aaaaa a aaaaaaaaaaaaaaaaaa aaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaa aaa a
a aaaaaaaaaaaaaaaaaaa aaaaa aaaaaaa
aaaaa aaaaaaaa aaaaaaa aaaa aaaaaaa a
aaaaaaa aaaaaaaaa a aaaaaaaaa aaaaaaa
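
output of this shape is what a first-fit strategy produces: keep adding words to the current line until the next word would overflow, then start a new line; below is a minimal sketch of that strategy; the name greedy_break is ours, and this is an illustration of the idea, not the code vim actually runs;

```python
def greedy_break(words, width):
    """first-fit: put each word on the current line if it still fits."""
    lines = []
    current = []   # words on the line being built
    used = 0       # width of that line, counting single spaces between words
    for word in words:
        needed = len(word) if not current else used + 1 + len(word)
        if current and needed > width:
            # the word does not fit: close the current line and start a new one
            lines.append(' '.join(current))
            current = [word]
            used = len(word)
        else:
            # note: a word wider than the line still gets a line of its own
            current.append(word)
            used = needed
    if current:
        lines.append(' '.join(current))
    return lines

print('\n'.join(greedy_break('a bb ccc dd e'.split(' '), 5)))
```

each decision is made without looking ahead, so an early line may be packed tightly at the expense of a very ragged line later;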

internally, vim uses a greedy algorithm; while its result looks acceptable, there is a more beautiful way of breaking the same words:

aaaaaaaaa aaaaaa aaaaaaa aaaaaaaaaaaa
a aaaa aaaa aaaa aaaaaa aaaaaaaaaaaaa
aaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa
aa aaaaa a aaaaaaaaaaaaaaaaaa aaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaa aaa
a a aaaaaaaaaaaaaaaaaaa aaaaa aaaaaaa
aaaaa aaaaaaaa aaaaaaa aaaa aaaaaaa a
aaaaaaa aaaaaaaaa a aaaaaaaaa aaaaaaa

we will see how this is made shortly; for now, this introduces the main topic of this article: how do we break a sequence of words into beautiful lines?

the tex algorithm

one such algorithm, often referred to as the knuth-plass algorithm, or the tex algorithm, is presented in the paper Breaking Paragraphs into Lines;

it is a long paper with rich background, but the core ideas are simple:

define box, glue and penalty;
define adjustment ratio, badness and demerit;
find a path along feasible breakpoints that minimizes total demerits;
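
to give a feel for the second item, here is a rough sketch of adjustment ratio and badness as we understand them from the paper: the ratio says how far the glue on a line must stretch (positive) or shrink (negative) relative to its total stretchability or shrinkability, and badness grows with the cube of that ratio; the function and parameter names are ours, and the constant 100 and the cutoff at -1 are our reading of the paper, so treat this as an illustration rather than a reference;

```python
import math

def adjustment_ratio(natural, stretch, shrink, target):
    """how much the glue on a line must stretch (> 0) or shrink (< 0)."""
    if natural == target:
        return 0.0
    if natural < target:
        # line is too short: glue must stretch
        return (target - natural) / stretch if stretch > 0 else math.inf
    # line is too long: glue must shrink
    return (target - natural) / shrink if shrink > 0 else -math.inf

def badness(ratio):
    """cubic badness; shrinking beyond the available shrink is infeasible."""
    if ratio < -1:
        return math.inf
    return 100 * abs(ratio) ** 3

# a line of natural width 30 with 10 units of stretch, set to width 35
r = adjustment_ratio(30, 10, 5, 35)
print(r, badness(r))
```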

these are actually more than what we need here; below we give a much simplified model to analyze the problem;

our model

our model is really simple: we are given a sequence of words delimited by single spaces; our goal is to divide them into lines of at most a given width with minimum raggedness; we define raggedness as the sum of squares of the numbers of trailing spaces at the ends of lines; this is the approach adopted by the wikipedia article;

borrowing its example, we break the following text with line width 6:

AAA BB CC DDDDD

greedy result: 0^2+4^2+1^2=17

------  #space
AAA BB  0
CC      4
DDDDD   1

optimal result: 3^2+1^2+1^2=11

------  #space
AAA     3
BB CC   1
DDDDD   1
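
the two totals above can be checked with a few lines of code; this is a small sketch with a helper name (raggedness) of our own choosing:

```python
def raggedness(lines, width):
    """sum of squared trailing spaces over all lines."""
    return sum((width - len(line)) ** 2 for line in lines)

greedy = ['AAA BB', 'CC', 'DDDDD']
optimal = ['AAA', 'BB CC', 'DDDDD']
print(raggedness(greedy, 6))   # 0 + 16 + 1 = 17
print(raggedness(optimal, 6))  # 9 + 1 + 1 = 11
```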

so what do you see? no boxes, glues, or penalties; no adjustment ratios; only the sum of squared trailing spaces as badness, which here is the same as demerits;

our algorithm

to be precise, this is our implementation, not our invention, of the algorithm; the algorithm itself was already being referred to as “conventional” decades ago;

#!/usr/bin/env python3

##  break a paragraph into lines;
##
##  ##  run
##
##  ./linebreak.py [pro] [rev] [alt] [{linewidth}] < {input.txt}
##
##  -   pro: calc width on the fly (faster);
##  -   rev: reverse dp walk;
##  -   alt: solve alternate problem (ignoring last line badness);

import numpy as np
import sys

##  read tokens;
tokens = input().rstrip().split(' ')

##  number of tokens;
N = len(tokens)
##  line width;
W = 80
##  const: infinity (a float, so that no finite total cost can ever reach it);
INF = float('inf')

##  init dp cost function (minimum total badness up to and including this token);
def init_cost():
    F = np.empty(N + 1)
    for n in range(N + 1):
        F[n] = 0 if n == 0 else INF
    return F

##  init dp path function (leading token 1-index of each line);
def init_path():
    P = np.empty(N + 1, dtype=object)
    for n in range(N + 1):
        P[n] = []
    return P

##  badness of a line having tokens [i, j];
def badness(i, j):
    w = -1
    while i <= j:
        w += len(tokens[i - 1]) + 1
        i += 1
    if w > W: return INF
    return (W - w) ** 2

##  badness of a line having tokens [i, j]; pro;
def badness_pro(i, j, w):
    if w > W: return INF
    return (W - w) ** 2

##  badness of a line having tokens [i, j]; alt;
def badness_alt(i, j):
    w = -1
    while i <= j:
        w += len(tokens[i - 1]) + 1
        i += 1
    if w > W: return INF
    return (W - w) ** 2 if j != N else 0

##  badness of a line having tokens [i, j]; alt; pro;
def badness_alt_pro(i, j, w):
    if w > W: return INF
    return (W - w) ** 2 if j != N else 0

##  dp function;
def dp(F, P, badness):
    for i in range(1, N + 1):
        for j in range(i, N + 1):
            tmp = F[i - 1] + badness(i, j)
            if tmp < F[j]:
                F[j] = tmp
                P[j] = P[i - 1] + [i]

##  dp function; pro;
def dp_pro(F, P, badness):
    for i in range(1, N + 1):
        w = -1
        for j in range(i, N + 1):
            w += 1 + len(tokens[j - 1])
            if w > W: break
            tmp = F[i - 1] + badness(i, j, w)
            if tmp < F[j]:
                F[j] = tmp
                P[j] = P[i - 1] + [i]

##  dp function; rev;
def dp_rev(F, P, badness):
    for j in range(1, N + 1):
        for i in range(j, 0, -1):
            tmp = F[i - 1] + badness(i, j)
            if tmp < F[j]:
                F[j] = tmp
                P[j] = P[i - 1] + [i]

##  dp function; rev; pro;
def dp_rev_pro(F, P, badness):
    for j in range(1, N + 1):
        w = -1
        for i in range(j, 0, -1):
            w += len(tokens[i - 1]) + 1
            if w > W: break
            tmp = F[i - 1] + badness(i, j, w)
            if tmp < F[j]:
                F[j] = tmp
                P[j] = P[i - 1] + [i]

##  output result using calculated breakpoints;
def output(F, P):
    print('cost = {}'.format(F[N]))
    print('breakpoints = {}'.format(P[N]))

    if F[N] == INF:
        print('IMPOSSIBLE')
    else:
        ans = ''
        for i in range(1, N + 1):
            if i == 1:
                pass
            elif i in P[N]:
                ans += '\n'
            else:
                ans += ' '
            ans += tokens[i - 1]
        print(ans)

##  main function;
def main():
    alt = 'alt' in sys.argv
    pro = 'pro' in sys.argv
    rev = 'rev' in sys.argv
    for i in range(1000):
        if str(i) in sys.argv:
            global W
            W = i

    if pro:
        if alt:
            _badness = badness_alt_pro
        else:
            _badness = badness_pro
        if rev:
            _dp = dp_rev_pro
        else:
            _dp = dp_pro
    else:
        if alt:
            _badness = badness_alt
        else:
            _badness = badness
        if rev:
            _dp = dp_rev
        else:
            _dp = dp

    F = init_cost()
    P = init_path()
    _dp(F, P, _badness)
    output(F, P)

if __name__ == '__main__':
    main()

the algorithm is 1-d dynamic programming; F[j] stores the minimum total badness of breaking the tokens up to and including token j; its value is the minimum, over all 1 <= i <= j, of F[i - 1] (the minimum total badness up to and including token i - 1) plus the badness of the line made of tokens [i, j]; P[j] records the leading token of each line in the breaking that achieves F[j];
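
the same recurrence can be written more compactly; the sketch below is a variation, not the listing above: it keeps one back pointer per token instead of a whole list of breakpoints, and rebuilds the lines by walking the pointers backwards; the names break_lines, cost and start are ours;

```python
def break_lines(words, width):
    """return (cost, lines) minimizing the sum of squared trailing spaces."""
    n = len(words)
    inf = float('inf')
    cost = [0] + [inf] * n   # cost[j]: best total for the first j words
    start = [0] * (n + 1)    # start[j]: 1-based first word of the line ending at j
    for i in range(1, n + 1):
        if cost[i - 1] == inf:
            continue         # no feasible breaking ends right before word i
        w = -1
        for j in range(i, n + 1):
            w += 1 + len(words[j - 1])   # width of the line holding words i..j
            if w > width:
                break
            c = cost[i - 1] + (width - w) ** 2
            if c < cost[j]:
                cost[j] = c
                start[j] = i
    if cost[n] == inf:
        return inf, []       # some word is wider than the line
    lines = []
    j = n
    while j > 0:             # follow back pointers from the last word
        i = start[j]
        lines.append(' '.join(words[i - 1:j]))
        j = i - 1
    lines.reverse()
    return cost[n], lines

print(break_lines('AAA BB CC DDDDD'.split(' '), 6))
```

on the example from the previous section this prints (11, ['AAA', 'BB CC', 'DDDDD']), matching the optimal result shown there;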

the code has some interesting suffixes:

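pro: calculate line width on the fly; the plain version recomputes the width of every candidate line from scratch, while this one updates the width as the line grows by one token and stops as soon as the line overflows, which is faster; another technique to speed up line width calculation is subtracting partial sums;
rev: reverse dp walk; this simply uses a different order when walking the dp table, which should not affect the result or the running time;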
alt: this solves an alternate version of the problem, where the last line badness is ignored;
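
the partial sums technique mentioned under pro is not used in the listing above; a minimal sketch of it might look like this, where make_line_width and prefix are names we made up for illustration:

```python
from itertools import accumulate

def make_line_width(tokens):
    """return a function giving the width of tokens [i, j] (1-indexed) in O(1)."""
    # prefix[k] = total length of the first k tokens; prefix[0] = 0
    prefix = [0] + list(accumulate(len(t) for t in tokens))

    def line_width(i, j):
        # letters in tokens i..j, plus one space between each adjacent pair
        return prefix[j] - prefix[i - 1] + (j - i)

    return line_width

line_width = make_line_width('AAA BB CC DDDDD'.split(' '))
print(line_width(1, 2))  # 6, the length of 'AAA BB'
print(line_width(2, 3))  # 5, the length of 'BB CC'
```

the table costs one pass over the tokens to build, after which the width of any candidate line is a subtraction rather than a loop;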

these suffixes can be mixed together; so you may run the above code (saved as linebreak.py) as:

./linebreak.py [pro] [rev] [alt] [{linewidth}] < {input.txt}

vim integration

vim has an option formatprg, with which you can use an external format program for gq; you can use the above code after some minor modifications:

remove excess print statements;
use the same linewidth as in vim;
enable pro, rev, alt by default, as appropriate;

to use the above code in vim:

:set formatprg=./linebreak.py

now you can reproduce the example at the beginning of this article in vim, using a fancier gq;
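
for reference, the modifications listed above might end up looking like the following sketch; it combines the rev, pro and alt variants, prints nothing but the formatted text, and accepts input spread over several lines; the function name format_paragraph is ours, and letting an overlong word sit alone on a line at no cost is our own assumption, made so that a filter never refuses its input;

```python
def format_paragraph(text, width):
    """break text into lines; the last line is not penalized (alt)."""
    words = text.split()     # also joins input that spans several lines
    n = len(words)
    inf = float('inf')
    cost = [0] + [inf] * n   # cost[j]: best total for the first j words
    start = [0] * (n + 1)    # start[j]: 1-based first word of the line ending at j
    for j in range(1, n + 1):
        w = -1
        for i in range(j, 0, -1):
            w += len(words[i - 1]) + 1   # width of the line holding words i..j
            if w > width and i < j:
                break        # line overflows; longer candidates only get worse
            # assumption: a single word wider than the line costs nothing
            bad = 0 if (j == n or w > width) else (width - w) ** 2
            if cost[i - 1] + bad < cost[j]:
                cost[j] = cost[i - 1] + bad
                start[j] = i
    lines = []
    j = n
    while j > 0:             # follow back pointers from the last word
        i = start[j]
        lines.append(' '.join(words[i - 1:j]))
        j = i - 1
    return '\n'.join(reversed(lines))

print(format_paragraph('AAA BB\nCC DDDDD\n', 6))
```

to turn this into a filter, the script could end with print(format_paragraph(sys.stdin.read(), 40)) after an import sys, with 40 replaced by the textwidth you use in vim;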

linear time solution

we have not come to the end of the story yet; many years ago, clever people discovered that this line breaking problem has a linear time solution:

ok, now we have;