calculate spanned column widths

here is another job a typesetting system has to do: calculate spanned column widths in tables when cells span multiple columns;

our input

an example speaks best for this problem; normally we have tables like this:

| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |

we have used the symbol | for tabskips; and to make our explanation easier, we have labeled each cell with 2 numbers: its beginning and ending columns, inclusive; for example, cell44 means a cell beginning at column 4 and ending at column 4; because every cell is 1-column wide, these two numbers are always the same;

they become different when we allow cells to span multiple columns; to do so, we first change some | into *, which are called imaginary tabskips in our terminology; they are called imaginary because they have actually been removed, but could be helpful in the analysis of our algorithm if we think they are still there; now the table looks like this:

| cell00 | cell11 | cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 | cell22 | cell33 * cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 * cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 * cell22 | cell33 * cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 * cell22 * cell33 | cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 | cell22 | cell33 | cell44 * cell55 | cell66 |

then we rename the cell numbers; for example: cell11 * cell22 * cell33 becomes cell13; now the table looks like this:

| cell00 | cell11 | cell22 | cell33 | cell45          | cell66 |
| cell01          | cell22 | cell35                   | cell66 |
| cell00 | cell13                   | cell45          | cell66 |
| cell02                   | cell35                   | cell66 |
| cell00 | cell12          | cell33 | cell45          | cell66 |
| cell03                            | cell45          | cell66 |
| cell00 | cell12          | cell33 | cell45          | cell66 |
| cell01          | cell22 | cell33 | cell45          | cell66 |

can we use this format as input for specifying a column-spanning table? no, because it is ambigious; this format does not say how many columns a cell spans without looking at the numbers; so we need a little more symbols to mark the length of spans; in fact, we could just invite the * symbol back into the table, making it like this:

| cell00 | cell11 | cell22 | cell33 | cell45          | cell66 |
|* cell01         | cell22 |** cell35                 | cell66 |
| cell00 |** cell13                 |* cell45         | cell66 |
|** cell02                 |** cell35                 | cell66 |
| cell00 |* cell12         | cell33 |* cell45         | cell66 |
|*** cell03                         |* cell45         | cell66 |
| cell00 |* cell12         | cell33 |* cell45         | cell66 |
|* cell01         | cell22 | cell33 |* cell45         | cell66 |

what we see?

| leads a cell that spans 1 column;
|* leads a cell that spans 2 columns;
|** leads a cell that spans 3 columns;
similarly, a | with n * leads a cell that spans n + 1 columns;

this makes our input specification unambigious;

our goal

our goal is to calculate a width for each column, so that the total width of the table is minimized and all constraints are satisfied (no cell spills over); to formalize, if there are c columns numbered as 0 to c - 1, the optimum widths d[0], …, d[c - 1] should have their sum minimized, while satisfying all constraints, where each constraint is the sum d[i] + ... + d[j] bounded by the length of text in cell [i, j], for each cell [i, j];

our algorithm

this problem is a linear programming where all coefficients are either 0 or 1; furthermore, if any constraint involves columns i and j where i <= j, then it also involves all k such that i <= k <= j;

this linear programming can be solved iteratively from 0 to c - 1; when we are solving column j, we only consider constraints at or before j:

when j = 0: we consider all cells that span from column 0 to column 0 (denoted by cell[0, 0]), and we know d[0] must be at least the maximum width of all these cells;
we claim that there is an optimum solution with d[0] equal to the maximum width of all these cells; because all other constraints involving column 0 would also involve column 1; if there is an optimum solution whose d[0] is greater than the maximum width of all cell[0, 0], then we take that solution, decrease d[0] and increase d[1] by the same amount, until d[0] is equal to the maximum width of all cell[0, 0];
- we know we are not violating any constraint, because:
  - d[0] is always big enough for constraints involving only column 0;
  - all other constraints have the form d[0] + d[1] + ..., where we have kept d[0] + d[1] constant;
- we know we are not changing the objective function, because:
  - the objective function has the form d[0] + d[1] + ..., where we have kept d[0] + d[1] constant;
so finally we get an optimum solution where d[0] is equal to the maximum width of all cell[0, 0];
we have solved d[0], and the number of free variables is decreased by 1; now we consider d[1], which has three types of constraints:
- those imposed by cell[1, 1] (1-column cell);
- those imposed by cell[0, 1] (2-column cell);
- all other constraints must involve d[2];
because d[0] has been solved, so the first two types of constraints both involve only one free variable d[1]; so we can take the maximum of those constraints as d[1], and by a similar analysis as above, we know there is an optimum solution with such d[1]; so we have solved d[1];
a similar approach applies to d[2], d[3], etc.; and finally we have solved all of them and we know that our solution is optimum;

column reduction

there is a caveat in the above algorithm: when we have solved d[0], …, d[j - 1] and work on d[j], we may find there are no constraints for d[j] because there are no cells ending in column j; that is, any cell that spans column j also spans column j + 1; this can happen only when j is not the last column, that is, j != c - 1;

to see an example, look at cell44 in the table given at the beginning of this article; cell44 are followed by * in all the rows; this is exactly the way to detect such columns;

there are two ways to solve this tricky case:

set d[j] to 0; because all constraints involving d[j] also involve d[j + 1], we know any optimum solution with d[j] > 0 can be converted to an optimum solution where d[j] == 0 by decreasing d[j] and increasing d[j + 1];
we can use a merge process to reduce the columns so that the reduced table does not contain such column j; the merge process simply merges column j and column j + 1 (and column j + 2 if no cell ends in column j + 1, and column j + 3 if no cell ends in column j + 2, etc.) into new column j, and we know there is a cell ending in the new column j;

our implementation

our implementation is a lengthy python program; we implemented both ways to solve the tricky case mentioned above; there should be no need to merge, but we keep it there because the merge is employed by the tex algorithm and seems to simplify the problem a bit;

#!/usr/bin/env python3

##  calculate spanned column widths;
##
##  ##  run
##
##  ./colwidth.py [tight|loose|empty] [left|right] < {input.txt}
##
##  -   tight: print table with tight tabskips;
##  -   loose: print table with loose tabskips;
##  -   empty: print table with empty tabskips;
##  -   left: cell text is left justified;
##  -   right: cell text is right justified;
##
##  ##  input format
##
##  input format is similar to simplified markdown table format:
##
##  -   use `|` as tabskip;
##
##  -   use `*` after `|` to indicate a col span; the number of cols spanned is
##      1 plus the number of `*`;
##
##      -   `|` is a 1-col span (namely, a normal cell);
##      -   `|*` is a 2-col span;
##      -   `|**` is a 3-col span;
##      -   ...;
##
##  -   cell text must not contain `|` or `*`;
##
##  -   within a cell, spaces surrounding text are ignored;
##
##  -   within a cell, spaces within text are kept;
##
##  ##  input example
##
##  these inputs are equivalent:
##
##      |*Ax xZ|AxxxxZ|**Axx xxx   xZ|AxxxxZ|
##      |*  Ax xZ|  AxxxxZ|**Axx xxx   xZ|AxxxxZ|
##      |*Ax xZ   |  AxxxxZ|**   Axx xxx   xZ|   AxxxxZ |

from collections import Counter
from pprint import pprint
import numpy as np
import sys

##  const: inf;
INF = 1000000

class Table:

    '''
    a table is implemented as a list of rows, with some aux data; each row is a
    list of cells; each cell is a list `[beg, end, str]`; `beg` and `end` are
    beginning and ending col indexes of this cell, inclusive; `str` is the cell
    text; rows and cells are both in sorted order;
    '''

    def __init__(self):

        ##  a list of rows;
        self.rows = []
        ##  number of rows;
        self.R = 0
        ##  number of cols;
        self.C = 0
        ##  col span counter; keyed by col index; counts how many times a cell
        ##  beginning at this column spans to or beyond the next column; for
        ##  example, if a cell has col indexes [4, 6], then increase the count
        ##  for cols 4 and 5;
        self.cnt = Counter()

    def add_row(self, line):

        '''
        parse the input line (given as str) into a row and add it to this table;
        all rows must be added before rendering this table;
        '''

        ##  row data;
        row = []

        ##  char indexes: begin and end, inclusive;
        l = INF
        r = -INF

        ##  cell indexes: begin and end, inclusive;
        i = -1
        j = -1

        ##  iter each char in input line;
        for k, c in enumerate(line):
            if c == ' ':        ##  case: ignore surrounding space;
                pass
            elif c == '|':      ##  case: works like `&` in tex;
                if i != -1:
                    ##  trick: this line works even if cell is empty;
                    row.append([i, j, line[l:r + 1]])
                    l = INF
                    r = -INF
                j += 1
                i = j
            elif c == '*':      ##  case: works like `\span` in tex;
                self.cnt[j] += 1
                j += 1
            else:               ##  case: normal char;
                l = min(l, k)
                r = max(r, k)

        ##  add this row to table;
        assert row
        self.rows.append(row)

        ##  adjust number of cols;
        if self.C == 0:
            self.C = j
        else:
            assert self.C == j

        ##  adjust number of rows;
        self.R += 1

    def merge(self):

        '''
        merge cols; a col is merged with its next col iff its col span counter
        equals to `R`, the number of rows, which means this col always spans to
        the next col;
        '''

        ##  col index mapping;
        mapping = []
        ##  col index offset;
        k = 0

        ##  calc col index mapping;
        for j in range(self.C):
            mapping.append(j - k)
            if self.cnt[j] == self.R:
                k += 1

        ##  adjust col indexes in cells;
        for row in self.rows:
            for cell in row:
                cell[0] = mapping[cell[0]]
                cell[1] = mapping[cell[1]]

        ##  adjust number of cols;
        self.C -= k

    def calc_widths(self, sep):

        '''
        calculate col widths;
        '''

        ##  maximum widths between two cols; init to zeros;
        weights = np.zeros((self.C, self.C), dtype=int)

        ##  iter all cells;
        for row in self.rows:
            for cell in row:
                i, j, s = cell[0], cell[1], cell[2]
                weights[i, j] = max(weights[i, j], len(s))

        ##  widths;
        widths = np.zeros(self.C, dtype=int)
        for j in range(self.C):
            tmp = weights[j, j]
            tot = 0
            for i in range(j - 1, -1, -1):
                tot += widths[i] + len(sep)
                tmp = max(tmp, weights[i, j] - tot)
            widths[j] = tmp

        return widths

    def pprint(self, style='loose', justify='left'):

        '''
        pretty print this table;
        '''

        ##  get left/mid/right cell separators;
        if style == 'loose':
            lsep, sep, rsep = '| ', ' | ', ' |'
        elif style == 'tight':
            lsep, sep, rsep = '|', '|', '|'
        elif style == 'empty':
            lsep, sep, rsep = '', '', ''
        else:
            raise

        ##  calculate col widths;
        widths = self.calc_widths(sep)
        print(widths)

        ##  total width;
        tot = sum(widths) + (self.C - 1) * len(sep) + len(lsep) + len(rsep)

        ##  pretty print all rows;
        for row in self.rows:
            print('-' * tot)
            print(lsep, end='')
            for k, cell in enumerate(row):
                i, j, s = cell[0], cell[1], cell[2]
                w = sum(widths[i:j + 1]) + len(sep) * (j - i)
                if justify == 'left':
                    print(s + ' ' * (w - len(s)), end='')
                elif justify == 'right':
                    print(' ' * (w - len(s)) + s, end='')
                else:
                    raise
                if k != len(row) - 1:
                    print(sep, end='')
                else:
                    print(rsep)
        print('-' * tot)

def read_table():

    '''
    read table from input;
    '''

    table = Table()
    while True:
        try:
            table.add_row(input())
        except EOFError:
            break
    return table

def main():

    '''
    main function;
    '''

    if 'empty' in sys.argv:
        style = 'empty'
    elif 'loose' in sys.argv:
        style = 'loose'
    else:
        style = 'tight'

    if 'right' in sys.argv:
        justify = 'right'
    else:
        justify = 'left'

    table = read_table()

#    print('R = {}'.format(table.R))
#    print('C = {}'.format(table.C))
#    pprint(table.rows)

    table.merge()

#    print('R = {}'.format(table.R))
#    print('C = {}'.format(table.C))
#    pprint(table.rows)

    table.pprint(style, justify)

if __name__ == '__main__':
    main()

to get an idea of how it works, we can feed it with this input:

| AxxxxZ | AxxxxZ | AxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|* AxxxxxxxxxxxxxZ | AxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|** AxxxxxxxxxxxxxxxxxxxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|*** AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|* AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |

save the program as colwidth.py and the input as colwidth.in, then run:

./colwidth.py loose < colwidth.in

the output:

[ 6  6  6  6 15  6]
----------------------------------------------------------------
| AxxxxZ | AxxxxZ | AxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------

the first row says the optimum solution has 6 columns and lists their widths; all other rows print the table itself using these optimum widths;

loose is one of the styles supported by the program; other styles are tight and empty; you can also justify text within a cell using left or right;

you can try with other inputs that conform to the input specification introduced at the beginning of this article; the specification is quite simple, right? this actually reminds me of writing a table in markdown, even though the inspiration for this article comes from the tex world; whatever, we now have a great tool to draw tables in terminal windows;

references

TeX Tables: How TeX Calculates Spanned Column Widths