here is another job a typesetting system has to do: calculate spanned column widths in tables when cells span multiple columns;

## our input

an example speaks best for this problem; normally we have tables like this:

| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |


we have used the symbol | for tabskips; and to make our explanation easier, we have labeled each cell with 2 numbers: its beginning and ending columns, inclusive; for example, cell44 means a cell beginning at column 4 and ending at column 4; because every cell is 1-column wide, these two numbers are always the same;

they become different when we allow cells to span multiple columns; to do so, we first change some | into *, which are called imaginary tabskips in our terminology; they are called imaginary because they have actually been removed, but could be helpful in the analysis of our algorithm if we think they are still there; now the table looks like this:

| cell00 | cell11 | cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 | cell22 | cell33 * cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 * cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 * cell22 | cell33 * cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 * cell22 * cell33 | cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 | cell22 | cell33 | cell44 * cell55 | cell66 |


then we rename the cell numbers; for example: cell11 * cell22 * cell33 becomes cell13; now the table looks like this:

| cell00 | cell11 | cell22 | cell33 | cell45          | cell66 |
| cell01          | cell22 | cell35                   | cell66 |
| cell00 | cell13                   | cell45          | cell66 |
| cell02                   | cell35                   | cell66 |
| cell00 | cell12          | cell33 | cell45          | cell66 |
| cell03                            | cell45          | cell66 |
| cell00 | cell12          | cell33 | cell45          | cell66 |
| cell01          | cell22 | cell33 | cell45          | cell66 |


can we use this format as input for specifying a column-spanning table? no, because it is ambigious; this format does not say how many columns a cell spans without looking at the numbers; so we need a little more symbols to mark the length of spans; in fact, we could just invite the * symbol back into the table, making it like this:

| cell00 | cell11 | cell22 | cell33 | cell45          | cell66 |
|* cell01         | cell22 |** cell35                 | cell66 |
| cell00 |** cell13                 |* cell45         | cell66 |
|** cell02                 |** cell35                 | cell66 |
| cell00 |* cell12         | cell33 |* cell45         | cell66 |
|*** cell03                         |* cell45         | cell66 |
| cell00 |* cell12         | cell33 |* cell45         | cell66 |
|* cell01         | cell22 | cell33 |* cell45         | cell66 |


what we see?

• | leads a cell that spans 1 column;

• |* leads a cell that spans 2 columns;

• |** leads a cell that spans 3 columns;

• similarly, a | with n * leads a cell that spans n + 1 columns;

this makes our input specification unambigious;

## our goal

our goal is to calculate a width for each column, so that the total width of the table is minimized and all constraints are satisfied (no cell spills over); to formalize, if there are c columns numbered as 0 to c - 1, the optimum widths d[0], …, d[c - 1] should have their sum minimized, while satisfying all constraints, where each constraint is the sum d[i] + ... + d[j] bounded by the length of text in cell [i, j], for each cell [i, j];

## our algorithm

this problem is a linear programming where all coefficients are either 0 or 1; furthermore, if any constraint involves columns i and j where i <= j, then it also involves all k such that i <= k <= j;

this linear programming can be solved iteratively from 0 to c - 1; when we are solving column j, we only consider constraints at or before j:

• when j = 0: we consider all cells that span from column 0 to column 0 (denoted by cell[0, 0]), and we know d[0] must be at least the maximum width of all these cells;

• we claim that there is an optimum solution with d[0] equal to the maximum width of all these cells; because all other constraints involving column 0 would also involve column 1; if there is an optimum solution whose d[0] is greater than the maximum width of all cell[0, 0], then we take that solution, decrease d[0] and increase d[1] by the same amount, until d[0] is equal to the maximum width of all cell[0, 0];

• we know we are not violating any constraint, because:

• d[0] is always big enough for constraints involving only column 0;

• all other constraints have the form d[0] + d[1] + ..., where we have kept d[0] + d[1] constant;

• we know we are not changing the objective function, because:

• the objective function has the form d[0] + d[1] + ..., where we have kept d[0] + d[1] constant;

so finally we get an optimum solution where d[0] is equal to the maximum width of all cell[0, 0];

• we have solved d[0], and the number of free variables is decreased by 1; now we consider d[1], which has three types of constraints:

• those imposed by cell[1, 1] (1-column cell);

• those imposed by cell[0, 1] (2-column cell);

• all other constraints must involve d[2];

because d[0] has been solved, so the first two types of constraints both involve only one free variable d[1]; so we can take the maximum of those constraints as d[1], and by a similar analysis as above, we know there is an optimum solution with such d[1]; so we have solved d[1];

• a similar approach applies to d[2], d[3], etc.; and finally we have solved all of them and we know that our solution is optimum;

### column reduction

there is a caveat in the above algorithm: when we have solved d[0], …, d[j - 1] and work on d[j], we may find there are no constraints for d[j] because there are no cells ending in column j; that is, any cell that spans column j also spans column j + 1; this can happen only when j is not the last column, that is, j != c - 1;

to see an example, look at cell44 in the table given at the beginning of this article; cell44 are followed by * in all the rows; this is exactly the way to detect such columns;

there are two ways to solve this tricky case:

• set d[j] to 0; because all constraints involving d[j] also involve d[j + 1], we know any optimum solution with d[j] > 0 can be converted to an optimum solution where d[j] == 0 by decreasing d[j] and increasing d[j + 1];

• we can use a merge process to reduce the columns so that the reduced table does not contain such column j; the merge process simply merges column j and column j + 1 (and column j + 2 if no cell ends in column j + 1, and column j + 3 if no cell ends in column j + 2, etc.) into new column j, and we know there is a cell ending in the new column j;

## our implementation

our implementation is a lengthy python program; we implemented both ways to solve the tricky case mentioned above; there should be no need to merge, but we keep it there because the merge is employed by the tex algorithm and seems to simplify the problem a bit;

#!/usr/bin/env python3

##  calculate spanned column widths;
##
##  ##  run
##
##  ./colwidth.py [tight|loose|empty] [left|right] < {input.txt}
##
##  -   tight: print table with tight tabskips;
##  -   loose: print table with loose tabskips;
##  -   empty: print table with empty tabskips;
##  -   left: cell text is left justified;
##  -   right: cell text is right justified;
##
##  ##  input format
##
##  input format is similar to simplified markdown table format:
##
##  -   use | as tabskip;
##
##  -   use * after | to indicate a col span; the number of cols spanned is
##      1 plus the number of *;
##
##      -   | is a 1-col span (namely, a normal cell);
##      -   |* is a 2-col span;
##      -   |** is a 3-col span;
##      -   ...;
##
##  -   cell text must not contain | or *;
##
##  -   within a cell, spaces surrounding text are ignored;
##
##  -   within a cell, spaces within text are kept;
##
##  ##  input example
##
##  these inputs are equivalent:
##
##      |*Ax xZ|AxxxxZ|**Axx xxx   xZ|AxxxxZ|
##      |*  Ax xZ|  AxxxxZ|**Axx xxx   xZ|AxxxxZ|
##      |*Ax xZ   |  AxxxxZ|**   Axx xxx   xZ|   AxxxxZ |

from collections import Counter
from pprint import pprint
import numpy as np
import sys

##  const: inf;
INF = 1000000

class Table:

'''
a table is implemented as a list of rows, with some aux data; each row is a
list of cells; each cell is a list [beg, end, str]; beg and end are
beginning and ending col indexes of this cell, inclusive; str is the cell
text; rows and cells are both in sorted order;
'''

def __init__(self):

##  a list of rows;
self.rows = []
##  number of rows;
self.R = 0
##  number of cols;
self.C = 0
##  col span counter; keyed by col index; counts how many times a cell
##  beginning at this column spans to or beyond the next column; for
##  example, if a cell has col indexes [4, 6], then increase the count
##  for cols 4 and 5;
self.cnt = Counter()

'''
parse the input line (given as str) into a row and add it to this table;
all rows must be added before rendering this table;
'''

##  row data;
row = []

##  char indexes: begin and end, inclusive;
l = INF
r = -INF

##  cell indexes: begin and end, inclusive;
i = -1
j = -1

##  iter each char in input line;
for k, c in enumerate(line):
if c == ' ':        ##  case: ignore surrounding space;
pass
elif c == '|':      ##  case: works like & in tex;
if i != -1:
##  trick: this line works even if cell is empty;
row.append([i, j, line[l:r + 1]])
l = INF
r = -INF
j += 1
i = j
elif c == '*':      ##  case: works like \span in tex;
self.cnt[j] += 1
j += 1
else:               ##  case: normal char;
l = min(l, k)
r = max(r, k)

##  add this row to table;
assert row
self.rows.append(row)

if self.C == 0:
self.C = j
else:
assert self.C == j

self.R += 1

def merge(self):

'''
merge cols; a col is merged with its next col iff its col span counter
equals to R, the number of rows, which means this col always spans to
the next col;
'''

##  col index mapping;
mapping = []
##  col index offset;
k = 0

##  calc col index mapping;
for j in range(self.C):
mapping.append(j - k)
if self.cnt[j] == self.R:
k += 1

##  adjust col indexes in cells;
for row in self.rows:
for cell in row:
cell[0] = mapping[cell[0]]
cell[1] = mapping[cell[1]]

self.C -= k

def calc_widths(self, sep):

'''
calculate col widths;
'''

##  maximum widths between two cols; init to zeros;
weights = np.zeros((self.C, self.C), dtype=int)

##  iter all cells;
for row in self.rows:
for cell in row:
i, j, s = cell[0], cell[1], cell[2]
weights[i, j] = max(weights[i, j], len(s))

##  widths;
widths = np.zeros(self.C, dtype=int)
for j in range(self.C):
tmp = weights[j, j]
tot = 0
for i in range(j - 1, -1, -1):
tot += widths[i] + len(sep)
tmp = max(tmp, weights[i, j] - tot)
widths[j] = tmp

return widths

def pprint(self, style='loose', justify='left'):

'''
pretty print this table;
'''

##  get left/mid/right cell separators;
if style == 'loose':
lsep, sep, rsep = '| ', ' | ', ' |'
elif style == 'tight':
lsep, sep, rsep = '|', '|', '|'
elif style == 'empty':
lsep, sep, rsep = '', '', ''
else:
raise

##  calculate col widths;
widths = self.calc_widths(sep)
print(widths)

##  total width;
tot = sum(widths) + (self.C - 1) * len(sep) + len(lsep) + len(rsep)

##  pretty print all rows;
for row in self.rows:
print('-' * tot)
print(lsep, end='')
for k, cell in enumerate(row):
i, j, s = cell[0], cell[1], cell[2]
w = sum(widths[i:j + 1]) + len(sep) * (j - i)
if justify == 'left':
print(s + ' ' * (w - len(s)), end='')
elif justify == 'right':
print(' ' * (w - len(s)) + s, end='')
else:
raise
if k != len(row) - 1:
print(sep, end='')
else:
print(rsep)
print('-' * tot)

'''
'''

table = Table()
while True:
try:
except EOFError:
break
return table

def main():

'''
main function;
'''

if 'empty' in sys.argv:
style = 'empty'
elif 'loose' in sys.argv:
style = 'loose'
else:
style = 'tight'

if 'right' in sys.argv:
justify = 'right'
else:
justify = 'left'

#    print('R = {}'.format(table.R))
#    print('C = {}'.format(table.C))
#    pprint(table.rows)

table.merge()

#    print('R = {}'.format(table.R))
#    print('C = {}'.format(table.C))
#    pprint(table.rows)

table.pprint(style, justify)

if __name__ == '__main__':
main()


to get an idea of how it works, we can feed it with this input:

| AxxxxZ | AxxxxZ | AxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|* AxxxxxxxxxxxxxZ | AxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|** AxxxxxxxxxxxxxxxxxxxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|*** AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|* AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |


save the program as colwidth.py and the input as colwidth.in, then run:

./colwidth.py loose < colwidth.in


the output:

[ 6  6  6  6 15  6]
----------------------------------------------------------------
| AxxxxZ | AxxxxZ | AxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------


the first row says the optimum solution has 6 columns and lists their widths; all other rows print the table itself using these optimum widths;

loose is one of the styles supported by the program; other styles are tight and empty; you can also justify text within a cell using left or right;

you can try with other inputs that conform to the input specification introduced at the beginning of this article; the specification is quite simple, right? this actually reminds me of writing a table in markdown, even though the inspiration for this article comes from the tex world; whatever, we now have a great tool to draw tables in terminal windows;