calculate spanned column widths
here is another job a typesetting system has to do: calculate spanned column widths in tables when cells span multiple columns;
our input
an example speaks best for this problem; normally we have tables like this:
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
| cell00 | cell11 | cell22 | cell33 | cell44 | cell55 | cell66 |
we have used the symbol |
for tabskips; and to make our explanation easier, we
have labeled each cell with 2 numbers: its beginning and ending columns,
inclusive; for example, cell44
means a cell beginning at column 4
and ending
at column 4
; because every cell is 1-column wide, these two numbers are always
the same;
they become different when we allow cells to span multiple columns; to do so, we
first change some |
into *
, which are called imaginary tabskips in our
terminology; they are called imaginary because they have actually been removed,
but could be helpful in the analysis of our algorithm if we think they are still
there; now the table looks like this:
| cell00 | cell11 | cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 | cell22 | cell33 * cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 * cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 * cell22 | cell33 * cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 * cell22 * cell33 | cell44 * cell55 | cell66 |
| cell00 | cell11 * cell22 | cell33 | cell44 * cell55 | cell66 |
| cell00 * cell11 | cell22 | cell33 | cell44 * cell55 | cell66 |
then we rename the cell numbers; for example: cell11 * cell22 * cell33
becomes
cell13
; now the table looks like this:
| cell00 | cell11 | cell22 | cell33 | cell45 | cell66 |
| cell01 | cell22 | cell35 | cell66 |
| cell00 | cell13 | cell45 | cell66 |
| cell02 | cell35 | cell66 |
| cell00 | cell12 | cell33 | cell45 | cell66 |
| cell03 | cell45 | cell66 |
| cell00 | cell12 | cell33 | cell45 | cell66 |
| cell01 | cell22 | cell33 | cell45 | cell66 |
can we use this format as input for specifying a column-spanning table? no,
because it is ambigious; this format does not say how many columns a cell spans
without looking at the numbers; so we need a little more symbols to mark the
length of spans; in fact, we could just invite the *
symbol back into the
table, making it like this:
| cell00 | cell11 | cell22 | cell33 | cell45 | cell66 |
|* cell01 | cell22 |** cell35 | cell66 |
| cell00 |** cell13 |* cell45 | cell66 |
|** cell02 |** cell35 | cell66 |
| cell00 |* cell12 | cell33 |* cell45 | cell66 |
|*** cell03 |* cell45 | cell66 |
| cell00 |* cell12 | cell33 |* cell45 | cell66 |
|* cell01 | cell22 | cell33 |* cell45 | cell66 |
what we see?
-
|
leads a cell that spans 1 column; -
|*
leads a cell that spans 2 columns; -
|**
leads a cell that spans 3 columns; -
similarly, a
|
withn
*
leads a cell that spansn + 1
columns;
this makes our input specification unambigious;
our goal
our goal is to calculate a width for each column, so that the total width of the
table is minimized and all constraints are satisfied (no cell spills over); to
formalize, if there are c
columns numbered as 0
to c - 1
, the optimum
widths d[0]
, …, d[c - 1]
should have their sum minimized, while satisfying
all constraints, where each constraint is the sum d[i] + ... + d[j]
bounded by
the length of text in cell [i, j]
, for each cell [i, j]
;
our algorithm
this problem is a linear programming where all coefficients are either 0
or
1
; furthermore, if any constraint involves columns i
and j
where i <= j
,
then it also involves all k
such that i <= k <= j
;
this linear programming can be solved iteratively from 0
to c - 1
; when we
are solving column j
, we only consider constraints at or before j
:
-
when
j = 0
: we consider all cells that span from column0
to column0
(denoted bycell[0, 0]
), and we knowd[0]
must be at least the maximum width of all these cells; -
we claim that there is an optimum solution with
d[0]
equal to the maximum width of all these cells; because all other constraints involving column0
would also involve column1
; if there is an optimum solution whosed[0]
is greater than the maximum width of allcell[0, 0]
, then we take that solution, decreased[0]
and increased[1]
by the same amount, untild[0]
is equal to the maximum width of allcell[0, 0]
;-
we know we are not violating any constraint, because:
-
d[0]
is always big enough for constraints involving only column0
; -
all other constraints have the form
d[0] + d[1] + ...
, where we have keptd[0] + d[1]
constant;
-
-
we know we are not changing the objective function, because:
- the objective function has the form
d[0] + d[1] + ...
, where we have keptd[0] + d[1]
constant;
- the objective function has the form
so finally we get an optimum solution where
d[0]
is equal to the maximum width of allcell[0, 0]
; -
-
we have solved
d[0]
, and the number of free variables is decreased by 1; now we considerd[1]
, which has three types of constraints:-
those imposed by
cell[1, 1]
(1-column cell); -
those imposed by
cell[0, 1]
(2-column cell); -
all other constraints must involve
d[2]
;
because
d[0]
has been solved, so the first two types of constraints both involve only one free variabled[1]
; so we can take the maximum of those constraints asd[1]
, and by a similar analysis as above, we know there is an optimum solution with suchd[1]
; so we have solvedd[1]
; -
-
a similar approach applies to
d[2]
,d[3]
, etc.; and finally we have solved all of them and we know that our solution is optimum;
column reduction
there is a caveat in the above algorithm: when we have solved d[0]
, …,
d[j - 1]
and work on d[j]
, we may find there are no constraints for d[j]
because there are no cells ending in column j
; that is, any cell that spans
column j
also spans column j + 1
; this can happen only when j
is not the
last column, that is, j != c - 1
;
to see an example, look at cell44
in the table given at the beginning of this
article; cell44
are followed by *
in all the rows; this is exactly the way
to detect such columns;
there are two ways to solve this tricky case:
-
set
d[j]
to0
; because all constraints involvingd[j]
also involved[j + 1]
, we know any optimum solution withd[j] > 0
can be converted to an optimum solution whered[j] == 0
by decreasingd[j]
and increasingd[j + 1]
; -
we can use a merge process to reduce the columns so that the reduced table does not contain such column
j
; the merge process simply merges columnj
and columnj + 1
(and columnj + 2
if no cell ends in columnj + 1
, and columnj + 3
if no cell ends in columnj + 2
, etc.) into new columnj
, and we know there is a cell ending in the new columnj
;
our implementation
our implementation is a lengthy python program; we implemented both ways to
solve the tricky case mentioned above; there should be no need to merge
, but
we keep it there because the merge is employed by the tex algorithm and seems to
simplify the problem a bit;
#!/usr/bin/env python3
## calculate spanned column widths;
##
## ## run
##
## ./colwidth.py [tight|loose|empty] [left|right] < {input.txt}
##
## - tight: print table with tight tabskips;
## - loose: print table with loose tabskips;
## - empty: print table with empty tabskips;
## - left: cell text is left justified;
## - right: cell text is right justified;
##
## ## input format
##
## input format is similar to simplified markdown table format:
##
## - use `|` as tabskip;
##
## - use `*` after `|` to indicate a col span; the number of cols spanned is
## 1 plus the number of `*`;
##
## - `|` is a 1-col span (namely, a normal cell);
## - `|*` is a 2-col span;
## - `|**` is a 3-col span;
## - ...;
##
## - cell text must not contain `|` or `*`;
##
## - within a cell, spaces surrounding text are ignored;
##
## - within a cell, spaces within text are kept;
##
## ## input example
##
## these inputs are equivalent:
##
## |*Ax xZ|AxxxxZ|**Axx xxx xZ|AxxxxZ|
## |* Ax xZ| AxxxxZ|**Axx xxx xZ|AxxxxZ|
## |*Ax xZ | AxxxxZ|** Axx xxx xZ| AxxxxZ |
from collections import Counter
from pprint import pprint
import numpy as np
import sys
## const: inf;
INF = 1000000
class Table:
'''
a table is implemented as a list of rows, with some aux data; each row is a
list of cells; each cell is a list `[beg, end, str]`; `beg` and `end` are
beginning and ending col indexes of this cell, inclusive; `str` is the cell
text; rows and cells are both in sorted order;
'''
def __init__(self):
## a list of rows;
self.rows = []
## number of rows;
self.R = 0
## number of cols;
self.C = 0
## col span counter; keyed by col index; counts how many times a cell
## beginning at this column spans to or beyond the next column; for
## example, if a cell has col indexes [4, 6], then increase the count
## for cols 4 and 5;
self.cnt = Counter()
def add_row(self, line):
'''
parse the input line (given as str) into a row and add it to this table;
all rows must be added before rendering this table;
'''
## row data;
row = []
## char indexes: begin and end, inclusive;
l = INF
r = -INF
## cell indexes: begin and end, inclusive;
i = -1
j = -1
## iter each char in input line;
for k, c in enumerate(line):
if c == ' ': ## case: ignore surrounding space;
pass
elif c == '|': ## case: works like `&` in tex;
if i != -1:
## trick: this line works even if cell is empty;
row.append([i, j, line[l:r + 1]])
l = INF
r = -INF
j += 1
i = j
elif c == '*': ## case: works like `\span` in tex;
self.cnt[j] += 1
j += 1
else: ## case: normal char;
l = min(l, k)
r = max(r, k)
## add this row to table;
assert row
self.rows.append(row)
## adjust number of cols;
if self.C == 0:
self.C = j
else:
assert self.C == j
## adjust number of rows;
self.R += 1
def merge(self):
'''
merge cols; a col is merged with its next col iff its col span counter
equals to `R`, the number of rows, which means this col always spans to
the next col;
'''
## col index mapping;
mapping = []
## col index offset;
k = 0
## calc col index mapping;
for j in range(self.C):
mapping.append(j - k)
if self.cnt[j] == self.R:
k += 1
## adjust col indexes in cells;
for row in self.rows:
for cell in row:
cell[0] = mapping[cell[0]]
cell[1] = mapping[cell[1]]
## adjust number of cols;
self.C -= k
def calc_widths(self, sep):
'''
calculate col widths;
'''
## maximum widths between two cols; init to zeros;
weights = np.zeros((self.C, self.C), dtype=int)
## iter all cells;
for row in self.rows:
for cell in row:
i, j, s = cell[0], cell[1], cell[2]
weights[i, j] = max(weights[i, j], len(s))
## widths;
widths = np.zeros(self.C, dtype=int)
for j in range(self.C):
tmp = weights[j, j]
tot = 0
for i in range(j - 1, -1, -1):
tot += widths[i] + len(sep)
tmp = max(tmp, weights[i, j] - tot)
widths[j] = tmp
return widths
def pprint(self, style='loose', justify='left'):
'''
pretty print this table;
'''
## get left/mid/right cell separators;
if style == 'loose':
lsep, sep, rsep = '| ', ' | ', ' |'
elif style == 'tight':
lsep, sep, rsep = '|', '|', '|'
elif style == 'empty':
lsep, sep, rsep = '', '', ''
else:
raise
## calculate col widths;
widths = self.calc_widths(sep)
print(widths)
## total width;
tot = sum(widths) + (self.C - 1) * len(sep) + len(lsep) + len(rsep)
## pretty print all rows;
for row in self.rows:
print('-' * tot)
print(lsep, end='')
for k, cell in enumerate(row):
i, j, s = cell[0], cell[1], cell[2]
w = sum(widths[i:j + 1]) + len(sep) * (j - i)
if justify == 'left':
print(s + ' ' * (w - len(s)), end='')
elif justify == 'right':
print(' ' * (w - len(s)) + s, end='')
else:
raise
if k != len(row) - 1:
print(sep, end='')
else:
print(rsep)
print('-' * tot)
def read_table():
'''
read table from input;
'''
table = Table()
while True:
try:
table.add_row(input())
except EOFError:
break
return table
def main():
'''
main function;
'''
if 'empty' in sys.argv:
style = 'empty'
elif 'loose' in sys.argv:
style = 'loose'
else:
style = 'tight'
if 'right' in sys.argv:
justify = 'right'
else:
justify = 'left'
table = read_table()
# print('R = {}'.format(table.R))
# print('C = {}'.format(table.C))
# pprint(table.rows)
table.merge()
# print('R = {}'.format(table.R))
# print('C = {}'.format(table.C))
# pprint(table.rows)
table.pprint(style, justify)
if __name__ == '__main__':
main()
to get an idea of how it works, we can feed it with this input:
| AxxxxZ | AxxxxZ | AxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|* AxxxxxxxxxxxxxZ | AxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|** AxxxxxxxxxxxxxxxxxxxxxxZ |** AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|*** AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
| AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
|* AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxZ |* AxxxxxxxxxxxxxZ | AxxxxZ |
save the program as colwidth.py
and the input as colwidth.in
, then run:
./colwidth.py loose < colwidth.in
the output:
[ 6 6 6 6 15 6]
----------------------------------------------------------------
| AxxxxZ | AxxxxZ | AxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
| AxxxxxxxxxxxxxZ | AxxxxZ | AxxxxZ | AxxxxxxxxxxxxxZ | AxxxxZ |
----------------------------------------------------------------
the first row says the optimum solution has 6 columns and lists their widths; all other rows print the table itself using these optimum widths;
loose
is one of the styles supported by the program; other styles are tight
and empty
; you can also justify text within a cell using left
or right
;
you can try with other inputs that conform to the input specification introduced at the beginning of this article; the specification is quite simple, right? this actually reminds me of writing a table in markdown, even though the inspiration for this article comes from the tex world; whatever, we now have a great tool to draw tables in terminal windows;