calculate spanned column widths
here is another job a typesetting system has to do: calculate spanned column widths in tables when cells span multiple columns;
our input
an example speaks best for this problem; normally we have tables like this:
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
 cell00  cell11  cell22  cell33  cell44  cell55  cell66 
we have used the symbol 
for tabskips; and to make our explanation easier, we
have labeled each cell with 2 numbers: its beginning and ending columns,
inclusive; for example, cell44
means a cell beginning at column 4
and ending
at column 4
; because every cell is 1column wide, these two numbers are always
the same;
they become different when we allow cells to span multiple columns; to do so, we
first change some 
into *
, which are called imaginary tabskips in our
terminology; they are called imaginary because they have actually been removed,
but could be helpful in the analysis of our algorithm if we think they are still
there; now the table looks like this:
 cell00  cell11  cell22  cell33  cell44 * cell55  cell66 
 cell00 * cell11  cell22  cell33 * cell44 * cell55  cell66 
 cell00  cell11 * cell22 * cell33  cell44 * cell55  cell66 
 cell00 * cell11 * cell22  cell33 * cell44 * cell55  cell66 
 cell00  cell11 * cell22  cell33  cell44 * cell55  cell66 
 cell00 * cell11 * cell22 * cell33  cell44 * cell55  cell66 
 cell00  cell11 * cell22  cell33  cell44 * cell55  cell66 
 cell00 * cell11  cell22  cell33  cell44 * cell55  cell66 
then we rename the cell numbers; for example: cell11 * cell22 * cell33
becomes
cell13
; now the table looks like this:
 cell00  cell11  cell22  cell33  cell45  cell66 
 cell01  cell22  cell35  cell66 
 cell00  cell13  cell45  cell66 
 cell02  cell35  cell66 
 cell00  cell12  cell33  cell45  cell66 
 cell03  cell45  cell66 
 cell00  cell12  cell33  cell45  cell66 
 cell01  cell22  cell33  cell45  cell66 
can we use this format as input for specifying a columnspanning table? no,
because it is ambigious; this format does not say how many columns a cell spans
without looking at the numbers; so we need a little more symbols to mark the
length of spans; in fact, we could just invite the *
symbol back into the
table, making it like this:
 cell00  cell11  cell22  cell33  cell45  cell66 
* cell01  cell22 ** cell35  cell66 
 cell00 ** cell13 * cell45  cell66 
** cell02 ** cell35  cell66 
 cell00 * cell12  cell33 * cell45  cell66 
*** cell03 * cell45  cell66 
 cell00 * cell12  cell33 * cell45  cell66 
* cell01  cell22  cell33 * cell45  cell66 
what we see?


leads a cell that spans 1 column; 
*
leads a cell that spans 2 columns; 
**
leads a cell that spans 3 columns; 
similarly, a

withn
*
leads a cell that spansn + 1
columns;
this makes our input specification unambigious;
our goal
our goal is to calculate a width for each column, so that the total width of the
table is minimized and all constraints are satisfied (no cell spills over); to
formalize, if there are c
columns numbered as 0
to c  1
, the optimum
widths d[0]
, …, d[c  1]
should have their sum minimized, while satisfying
all constraints, where each constraint is the sum d[i] + ... + d[j]
bounded by
the length of text in cell [i, j]
, for each cell [i, j]
;
our algorithm
this problem is a linear programming where all coefficients are either 0
or
1
; furthermore, if any constraint involves columns i
and j
where i <= j
,
then it also involves all k
such that i <= k <= j
;
this linear programming can be solved iteratively from 0
to c  1
; when we
are solving column j
, we only consider constraints at or before j
:

when
j = 0
: we consider all cells that span from column0
to column0
(denoted bycell[0, 0]
), and we knowd[0]
must be at least the maximum width of all these cells; 
we claim that there is an optimum solution with
d[0]
equal to the maximum width of all these cells; because all other constraints involving column0
would also involve column1
; if there is an optimum solution whosed[0]
is greater than the maximum width of allcell[0, 0]
, then we take that solution, decreased[0]
and increased[1]
by the same amount, untild[0]
is equal to the maximum width of allcell[0, 0]
;
we know we are not violating any constraint, because:

d[0]
is always big enough for constraints involving only column0
; 
all other constraints have the form
d[0] + d[1] + ...
, where we have keptd[0] + d[1]
constant;


we know we are not changing the objective function, because:
 the objective function has the form
d[0] + d[1] + ...
, where we have keptd[0] + d[1]
constant;
 the objective function has the form
so finally we get an optimum solution where
d[0]
is equal to the maximum width of allcell[0, 0]
; 

we have solved
d[0]
, and the number of free variables is decreased by 1; now we considerd[1]
, which has three types of constraints:
those imposed by
cell[1, 1]
(1column cell); 
those imposed by
cell[0, 1]
(2column cell); 
all other constraints must involve
d[2]
;
because
d[0]
has been solved, so the first two types of constraints both involve only one free variabled[1]
; so we can take the maximum of those constraints asd[1]
, and by a similar analysis as above, we know there is an optimum solution with suchd[1]
; so we have solvedd[1]
; 

a similar approach applies to
d[2]
,d[3]
, etc.; and finally we have solved all of them and we know that our solution is optimum;
column reduction
there is a caveat in the above algorithm: when we have solved d[0]
, …,
d[j  1]
and work on d[j]
, we may find there are no constraints for d[j]
because there are no cells ending in column j
; that is, any cell that spans
column j
also spans column j + 1
; this can happen only when j
is not the
last column, that is, j != c  1
;
to see an example, look at cell44
in the table given at the beginning of this
article; cell44
are followed by *
in all the rows; this is exactly the way
to detect such columns;
there are two ways to solve this tricky case:

set
d[j]
to0
; because all constraints involvingd[j]
also involved[j + 1]
, we know any optimum solution withd[j] > 0
can be converted to an optimum solution whered[j] == 0
by decreasingd[j]
and increasingd[j + 1]
; 
we can use a merge process to reduce the columns so that the reduced table does not contain such column
j
; the merge process simply merges columnj
and columnj + 1
(and columnj + 2
if no cell ends in columnj + 1
, and columnj + 3
if no cell ends in columnj + 2
, etc.) into new columnj
, and we know there is a cell ending in the new columnj
;
our implementation
our implementation is a lengthy python program; we implemented both ways to
solve the tricky case mentioned above; there should be no need to merge
, but
we keep it there because the merge is employed by the tex algorithm and seems to
simplify the problem a bit;
#!/usr/bin/env python3
## calculate spanned column widths;
##
## ## run
##
## ./colwidth.py [tightlooseempty] [leftright] < {input.txt}
##
##  tight: print table with tight tabskips;
##  loose: print table with loose tabskips;
##  empty: print table with empty tabskips;
##  left: cell text is left justified;
##  right: cell text is right justified;
##
## ## input format
##
## input format is similar to simplified markdown table format:
##
##  use `` as tabskip;
##
##  use `*` after `` to indicate a col span; the number of cols spanned is
## 1 plus the number of `*`;
##
##  `` is a 1col span (namely, a normal cell);
##  `*` is a 2col span;
##  `**` is a 3col span;
##  ...;
##
##  cell text must not contain `` or `*`;
##
##  within a cell, spaces surrounding text are ignored;
##
##  within a cell, spaces within text are kept;
##
## ## input example
##
## these inputs are equivalent:
##
## *Ax xZAxxxxZ**Axx xxx xZAxxxxZ
## * Ax xZ AxxxxZ**Axx xxx xZAxxxxZ
## *Ax xZ  AxxxxZ** Axx xxx xZ AxxxxZ 
from collections import Counter
from pprint import pprint
import numpy as np
import sys
## const: inf;
INF = 1000000
class Table:
'''
a table is implemented as a list of rows, with some aux data; each row is a
list of cells; each cell is a list `[beg, end, str]`; `beg` and `end` are
beginning and ending col indexes of this cell, inclusive; `str` is the cell
text; rows and cells are both in sorted order;
'''
def __init__(self):
## a list of rows;
self.rows = []
## number of rows;
self.R = 0
## number of cols;
self.C = 0
## col span counter; keyed by col index; counts how many times a cell
## beginning at this column spans to or beyond the next column; for
## example, if a cell has col indexes [4, 6], then increase the count
## for cols 4 and 5;
self.cnt = Counter()
def add_row(self, line):
'''
parse the input line (given as str) into a row and add it to this table;
all rows must be added before rendering this table;
'''
## row data;
row = []
## char indexes: begin and end, inclusive;
l = INF
r = INF
## cell indexes: begin and end, inclusive;
i = 1
j = 1
## iter each char in input line;
for k, c in enumerate(line):
if c == ' ': ## case: ignore surrounding space;
pass
elif c == '': ## case: works like `&` in tex;
if i != 1:
## trick: this line works even if cell is empty;
row.append([i, j, line[l:r + 1]])
l = INF
r = INF
j += 1
i = j
elif c == '*': ## case: works like `\span` in tex;
self.cnt[j] += 1
j += 1
else: ## case: normal char;
l = min(l, k)
r = max(r, k)
## add this row to table;
assert row
self.rows.append(row)
## adjust number of cols;
if self.C == 0:
self.C = j
else:
assert self.C == j
## adjust number of rows;
self.R += 1
def merge(self):
'''
merge cols; a col is merged with its next col iff its col span counter
equals to `R`, the number of rows, which means this col always spans to
the next col;
'''
## col index mapping;
mapping = []
## col index offset;
k = 0
## calc col index mapping;
for j in range(self.C):
mapping.append(j  k)
if self.cnt[j] == self.R:
k += 1
## adjust col indexes in cells;
for row in self.rows:
for cell in row:
cell[0] = mapping[cell[0]]
cell[1] = mapping[cell[1]]
## adjust number of cols;
self.C = k
def calc_widths(self, sep):
'''
calculate col widths;
'''
## maximum widths between two cols; init to zeros;
weights = np.zeros((self.C, self.C), dtype=int)
## iter all cells;
for row in self.rows:
for cell in row:
i, j, s = cell[0], cell[1], cell[2]
weights[i, j] = max(weights[i, j], len(s))
## widths;
widths = np.zeros(self.C, dtype=int)
for j in range(self.C):
tmp = weights[j, j]
tot = 0
for i in range(j  1, 1, 1):
tot += widths[i] + len(sep)
tmp = max(tmp, weights[i, j]  tot)
widths[j] = tmp
return widths
def pprint(self, style='loose', justify='left'):
'''
pretty print this table;
'''
## get left/mid/right cell separators;
if style == 'loose':
lsep, sep, rsep = ' ', '  ', ' '
elif style == 'tight':
lsep, sep, rsep = '', '', ''
elif style == 'empty':
lsep, sep, rsep = '', '', ''
else:
raise
## calculate col widths;
widths = self.calc_widths(sep)
print(widths)
## total width;
tot = sum(widths) + (self.C  1) * len(sep) + len(lsep) + len(rsep)
## pretty print all rows;
for row in self.rows:
print('' * tot)
print(lsep, end='')
for k, cell in enumerate(row):
i, j, s = cell[0], cell[1], cell[2]
w = sum(widths[i:j + 1]) + len(sep) * (j  i)
if justify == 'left':
print(s + ' ' * (w  len(s)), end='')
elif justify == 'right':
print(' ' * (w  len(s)) + s, end='')
else:
raise
if k != len(row)  1:
print(sep, end='')
else:
print(rsep)
print('' * tot)
def read_table():
'''
read table from input;
'''
table = Table()
while True:
try:
table.add_row(input())
except EOFError:
break
return table
def main():
'''
main function;
'''
if 'empty' in sys.argv:
style = 'empty'
elif 'loose' in sys.argv:
style = 'loose'
else:
style = 'tight'
if 'right' in sys.argv:
justify = 'right'
else:
justify = 'left'
table = read_table()
# print('R = {}'.format(table.R))
# print('C = {}'.format(table.C))
# pprint(table.rows)
table.merge()
# print('R = {}'.format(table.R))
# print('C = {}'.format(table.C))
# pprint(table.rows)
table.pprint(style, justify)
if __name__ == '__main__':
main()
to get an idea of how it works, we can feed it with this input:
 AxxxxZ  AxxxxZ  AxxxxZ  AxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ 
* AxxxxxxxxxxxxxZ  AxxxxZ ** AxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxZ 
 AxxxxZ ** AxxxxxxxxxxxxxxxxxxxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ 
** AxxxxxxxxxxxxxxxxxxxxxxZ ** AxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxZ 
 AxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ 
*** AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ 
 AxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ 
* AxxxxxxxxxxxxxZ  AxxxxZ  AxxxxZ * AxxxxxxxxxxxxxZ  AxxxxZ 
save the program as colwidth.py
and the input as colwidth.in
, then run:
./colwidth.py loose < colwidth.in
the output:
[ 6 6 6 6 15 6]

 AxxxxZ  AxxxxZ  AxxxxZ  AxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxxxxxxxxxxZ  AxxxxZ  AxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxZ  AxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ 

 AxxxxxxxxxxxxxZ  AxxxxZ  AxxxxZ  AxxxxxxxxxxxxxZ  AxxxxZ 

the first row says the optimum solution has 6 columns and lists their widths; all other rows print the table itself using these optimum widths;
loose
is one of the styles supported by the program; other styles are tight
and empty
; you can also justify text within a cell using left
or right
;
you can try with other inputs that conform to the input specification introduced at the beginning of this article; the specification is quite simple, right? this actually reminds me of writing a table in markdown, even though the inspiration for this article comes from the tex world; whatever, we now have a great tool to draw tables in terminal windows;