Pretty print multi-byte characters
Many programs need to pretty print a series of data for readability and easier
post-processing. For example, consider a simple program len.sh which prints
the lengths of its arguments:
#!/bin/bash
for i in "$@"; do
    echo "$i" "${#i}"
done
To pretty print, we align each field on the same column by specifying its width:
#!/bin/bash
for i in "$@"; do
    printf "%16s %02d\n" "$i" "${#i}"
done
Running this program:
# ./len.sh 'a' 'bb' 'ccc'
Gives output (^ indicating column 16):
               ^
               a 01
              bb 02
             ccc 03
The multi-byte problem
While the above program seems to work with ASCII inputs, it turns out to be
faulty on multi-byte inputs. Here are some examples (assuming the locale is
en_US.UTF-8):
Example 1
# ./len.sh 'Pret' 'Prêt'
               ^
            Pret 04
           Prêt 04
Example 2
# ./len.sh 'Pr$t' 'Pr¢t' 'Pr†t' 'Pr𐍈t'
               ^
            Pr$t 04
           Pr¢t 04
          Pr†t 04
         Pr𐍈t 04
Example 3
# ./len.sh '†††' 'あああ'
               ^
       ††† 03
       あああ 03
Example 4
# ./len.sh '†' '††' '†††'
               ^
             † 01
          †† 02
       ††† 03
Example 5
# ./len.sh 'あ' 'ああ' 'あああ'
               ^
             あ 01
          ああ 02
       あああ 03
Even though the program still correctly calculates the lengths, the printed fields are obviously misaligned.
Understand the problem
To understand the problem here, we must first understand how printf works.
POSIX.1-2008 makes provisions for the printf utility and the printf()
function (on which the utility is based). Specifically, it mentions the
field width:
An optional minimum field width. If the converted value has fewer bytes than the field width, it shall be padded with <space> characters by default on the left; it shall be padded on the right if the left-adjustment flag ( '-' ), described below, is given to the field width. The field width takes the form of an <asterisk> ( '*' ), described below, or a decimal integer.
The important point here is that the field width counts bytes, not chars.
Because ê is a 2-byte char and e is a 1-byte char in UTF-8 encoding, Prêt
is padded with one fewer space char than Pret given the same field width. This
explains the 1st example of misalignment.
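To make the byte-counting rule concrete, here is a minimal Python sketch that mimics it. The helper byte_padded is hypothetical (it is not part of any library); it simply assumes that padding is computed from the UTF-8 byte length, as the quoted POSIX text describes.

```python
def byte_padded(s: str, field_width: int = 16) -> str:
    """Right-align s the way a byte-counting printf would.

    Hypothetical helper: the padding is derived from the UTF-8 byte
    length of s, not from its number of characters.
    """
    n_bytes = len(s.encode("utf-8"))
    return " " * max(field_width - n_bytes, 0) + s


for word in ("Pret", "Pr\u00eat"):  # "Prêt", written with an escape
    line = byte_padded(word)
    # Show the padded line and how many spaces were used as padding.
    print(repr(line), len(line) - len(word))
```

Under these assumptions, the second word receives one space fewer than the first, which matches the misalignment seen in Example 1.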
Understanding the 1st example makes it easier to understand the 2nd example.
The four characters $, ¢, †, and 𐍈 are actually represented by 1, 2, 3, and 4
bytes, respectively:
>>> import binascii
>>> [binascii.hexlify(x.encode('utf-8')) for x in ['$', '¢', '†', '𐍈']]
[b'24', b'c2a2', b'e280a0', b'f0908d88']
Hence the staircase in printing.
The 3rd example has something new. Both † and あ are 3-byte characters, but
あああ seems to align better than †††. In fact, both strings are missing
(3 - 1) * 3 = 6 space chars. However, あ is a wide char which has a display
width of 2. This makes it right-align at column 16 - 6 + 3 * (2 - 1) = 13,
instead of 16 - 6 = 10 as for †.
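The column arithmetic above can be checked with a short Python sketch. Both helpers are hypothetical, and display_width is only an approximation: it assumes that East Asian wide and fullwidth characters take 2 columns and everything else takes 1, ignoring combining marks and control characters.

```python
import unicodedata


def display_width(s: str) -> int:
    """Approximate terminal width of s (assumption: W/F take 2 columns)."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)


def end_column(s: str, field_width: int = 16) -> int:
    """Column where s ends when padding is computed from UTF-8 bytes."""
    n_bytes = len(s.encode("utf-8"))
    return max(field_width - n_bytes, 0) + display_width(s)


for s in ("\u2020" * 3, "\u3042" * 3):  # "†††" and "あああ"
    print(s, end_column(s))
```

With a field width of 16, this sketch places the end of ††† at column 10 and the end of あああ at column 13, in line with the analysis of Example 3.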
This example is enlightening. It shows that the essence of pretty printing multi-byte chars is to reconcile the display width with the number of bytes.
The 4th and 5th examples further show the phenomenon. Each † introduces a
difference of 3 - 1 = 2 between number of bytes and display width. Therefore
the text shifts by 2 columns each time the number of chars increases. By similar
analysis, あ shifts the text by 3 - 2 = 1 column each.
Solve the problem
The solution is pretty intuitive: Given a multi-byte string, get its number of
bytes (bytes) and display width (width), then adjust the field width by
bytes - width.
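As a sanity check of this rule, the following Python sketch emulates a byte-counting field width and applies the bytes - width adjustment. All three helpers are hypothetical names introduced for illustration, and display_width is a simplified estimate (2 columns for East Asian wide and fullwidth characters, 1 otherwise).

```python
import unicodedata


def display_width(s: str) -> int:
    """Approximate terminal width of s (assumption: W/F take 2 columns)."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)


def adjusted_field_width(s: str, target: int = 16) -> int:
    """Field width to request so that s ends at display column target."""
    n_bytes = len(s.encode("utf-8"))
    return target + n_bytes - display_width(s)


def byte_padded(s: str, field_width: int) -> str:
    """Emulate a printf whose field width counts UTF-8 bytes."""
    n_bytes = len(s.encode("utf-8"))
    return " " * max(field_width - n_bytes, 0) + s


for s in ("Pret", "Pr\u00eat", "\u2020" * 3, "\u3042" * 3):
    line = byte_padded(s, adjusted_field_width(s))
    # Every line should now occupy the same number of display columns.
    print(line, display_width(line))
```

The design choice is to leave the padding logic untouched and only correct the requested width, which is the same approach a shell script has to take since it cannot change how printf counts.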
Now we just need to find how to calculate these numbers. Calculating bytes is
easy:
bytes() {
    # printf '%s' is safer than echo -n, which misreads arguments such as "-n" or "-e"
    printf '%s' "$1" | wc -c
}
Unfortunately, there doesn't seem to be a shell builtin to calculate width.
However, we can leverage an external Perl module, Text-CharWidth
(Arch, Debian):
width() {
    perl -e 'use Text::CharWidth qw(mbswidth);print mbswidth($ARGV[0]);' "$1"
}
Now update our script len.sh as:
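#!/bin/bash
# Number of bytes in the argument
bytes() {
    printf '%s' "$1" | wc -c
}
# Display width of the argument (requires the Perl module Text::CharWidth)
width() {
    perl -e 'use Text::CharWidth qw(mbswidth);print mbswidth($ARGV[0]);' "$1"
}
for i in "$@"; do
    b="$(bytes "$i")"
    w="$(width "$i")"
    # Widen the field by the gap between byte count and display width
    printf "%*s %02d\n" "$((16+b-w))" "$i" "${#i}"
done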
This script pretty prints multi-byte chars correctly:
# ./len.sh 'a' 'bb' 'ccc'
               ^
               a 01
              bb 02
             ccc 03
# ./len.sh 'Pret' 'Prêt'
               ^
            Pret 04
            Prêt 04
# ./len.sh 'Pr$t' 'Pr¢t' 'Pr†t' 'Pr𐍈t'
               ^
            Pr$t 04
            Pr¢t 04
            Pr†t 04
            Pr𐍈t 04
# ./len.sh '†††' 'あああ'
               ^
             ††† 03
          あああ 03
# ./len.sh '†' '††' '†††'
               ^
               † 01
              †† 02
             ††† 03
# ./len.sh 'あ' 'ああ' 'あああ'
               ^
              あ 01
            ああ 02
          あああ 03
Solve the problem in C
We have solved the problem in Bash. However, the same problem exists in other programming languages as well. Here we also give a solution in C:
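#define _XOPEN_SOURCE
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
int main(int argc, char **argv) {
    setlocale(LC_ALL, "en_US.utf8");
    for (int i = 1; i != argc; i++) {
        wchar_t wcs[1024];
        size_t m = mbstowcs(wcs, argv[i], 1024 - 1);
        if (m == (size_t)-1) {
            /* Invalid multi-byte sequence for the current locale */
            fprintf(stderr, "invalid input: %s\n", argv[i]);
            continue;
        }
        wcs[m] = L'\0';
        int c = (int)wcslen(wcs);  /* number of wide chars */
        int w = wcswidth(wcs, c);  /* display width, -1 if non-printable */
        if (w < 0)
            w = c;
        /* The '*' width and %d arguments must be int, hence the casts above */
        wprintf(L"%*ls %02d\n", 16 + c - w, wcs, c);
    }
    return 0;
}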
Here's a difference from Bash. POSIX.1-2008 makes provisions for the
wprintf() function, where it mentions that the field width counts
wide chars (instead of bytes):
An optional minimum field width. If the converted value has fewer wide characters than the field width, it shall be padded with <space> characters by default on the left; it shall be padded on the right, if the left-adjustment flag ( '-' ), described below, is given to the field width. The field width takes the form of an <asterisk> ( '*' ), described below, or a decimal integer.
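Python's str.rjust also counts characters (code points) rather than bytes, so it can serve as a rough stand-in for the wide-char behaviour: the adjustment becomes chars - width instead of bytes - width. The sketch below assumes that one code point corresponds to one wide char (true where wchar_t is 32 bits wide) and uses a simplified, hypothetical display_width.

```python
import unicodedata


def display_width(s: str) -> int:
    """Approximate terminal width of s (assumption: W/F take 2 columns)."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)


def char_padded(s: str, target: int = 16) -> str:
    """Right-align s to display column target with a char-counting pad."""
    # rjust counts code points, so widen the field by chars - width.
    return s.rjust(target + len(s) - display_width(s))


for s in ("Pret", "Pr\u00eat", "\u2020" * 3, "\u3042" * 3):
    line = char_padded(s)
    print(line, display_width(line))
```

For wide characters the adjustment is negative (3 chars but 6 columns for あああ), so the requested field shrinks rather than grows.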
A special case
There is a special case: If all multi-byte chars have display width 1, then we can simply count the number of continuation bytes using a bitwise operation. This is because:
- A char in UTF-8 has at most 4 bytes.
- Byte 1 never starts with the bits 10.
- Bytes 2-4 all start with the bits 10.
See UTF-8 for charts.
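#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Count UTF-8 continuation bytes (those of the form 10xxxxxx) */
int count(const char *str)
{
    int len = 0;
    while (*str != '\0') {
        if ((*str & 0xc0) == 0x80)
            len++;
        str++;
    }
    return len;
}
int main(int argc, char **argv) {
    for (int i = 1; i != argc; i++) {
        int extra = count(argv[i]);
        /* Number of chars = bytes minus continuation bytes */
        int chars = (int)strlen(argv[i]) - extra;
        printf("%*s %02d\n", 16 + extra, argv[i], chars);
    }
    return 0;
}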
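The bit patterns behind the (byte & 0xc0) == 0x80 test can be inspected with a small Python sketch. The helper names are hypothetical; the sketch only relies on the UTF-8 property listed above, namely that continuation bytes have the form 10xxxxxx.

```python
def continuation_bytes(s: str) -> int:
    """Count UTF-8 bytes of the form 10xxxxxx in s."""
    return sum(1 for b in s.encode("utf-8") if (b & 0xC0) == 0x80)


def bit_pattern(s: str) -> str:
    """Render the UTF-8 encoding of s as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in s.encode("utf-8"))


dagger = "\u2020"  # "†", a 3-byte character
print(bit_pattern(dagger))
print(continuation_bytes(dagger))
```

For any valid string, the count equals the number of bytes minus the number of characters, which is exactly the field-width adjustment needed when every character has display width 1.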