Pretty print multi-byte characters

Many programs need to pretty-print a series of data so that it is easier to read and to post-process. For example, consider a simple program, len.sh, that prints each of its arguments followed by its length:

#!/bin/bash
for i in "$@"; do
    echo "$i" "${#i}"
done

To pretty print, we align each field on the same column by specifying its width:

#!/bin/bash
for i in "$@"; do
    printf "%16s %02d\n" "$i" "${#i}"
done

Running this program:

# ./len.sh 'a' 'bb' 'ccc'

This gives the following output (^ marks column 16):

               ^
               a 01
              bb 02
             ccc 03

The multi-byte problem

While the above program works for ASCII input, it misaligns multi-byte input. Here are some examples (assuming the locale is en_US.UTF-8):

Example 1

# ./len.sh 'Pret' 'Prêt'

               ^
            Pret 04
           Prêt 04

Example 2

# ./len.sh 'Pr$t' 'Pr¢t' 'Pr†t' 'Pr𐍈t'

               ^
            Pr$t 04
           Pr¢t 04
          Pr†t 04
         Pr𐍈t 04

Example 3

# ./len.sh '†††' 'あああ'

               ^
       ††† 03
       あああ 03

Example 4

# ./len.sh '†' '††' '†††'

               ^
             † 01
          †† 02
       ††† 03

Example 5

# ./len.sh 'あ' 'ああ' 'あああ'

               ^
             あ 01
          ああ 02
       あああ 03

Even though the program still correctly calculates the lengths, the printed fields are obviously misaligned.

Understand the problem

To understand the problem here, we must first understand how printf works.

POSIX.1-2008 specifies both the printf utility and the printf() function on which the utility is based. In particular, it defines the field width as follows:

An optional minimum field width. If the converted value has fewer bytes than the field width, it shall be padded with <space> characters by default on the left; it shall be padded on the right if the left-adjustment flag ( '-' ), described below, is given to the field width. The field width takes the form of an <asterisk> ( '*' ), described below, or a decimal integer.

The important point is that the field width counts bytes, not chars. Because ê is a 2-byte char and e is a 1-byte char in UTF-8, Prêt is padded with one fewer space than Pret for the same field width. This explains the misalignment in the 1st example.
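To make the byte counting concrete, here is a minimal Python sketch that imitates the padding rule quoted above. The helper name pad_like_printf is our own invention for illustration, and the sketch only models right-aligned padding of a UTF-8 string; it is not how any printf implementation is actually written.

```python
def pad_like_printf(s: str, field_width: int = 16) -> bytes:
    """Right-align s the way printf "%16s" does: pad by bytes, not chars."""
    raw = s.encode("utf-8")
    padding = max(field_width - len(raw), 0)  # never negative: width is a minimum
    return b" " * padding + raw


# "Pr\u00eat" is Prêt with a precomposed ê (2 bytes in UTF-8).
for word in ("Pret", "Pr\u00eat"):
    raw = word.encode("utf-8")
    spaces = len(pad_like_printf(word)) - len(raw)
    print(f"{word}: {len(raw)} bytes, {spaces} padding spaces")
```

Both results are 16 bytes long, but the second one occupies only 15 columns on screen.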

Understanding the 1st example makes it easier to understand the second example. The four characters: $, ¢, †, 𐍈 are actually represented by 1, 2, 3, 4 bytes, respectively:

>>> import binascii
>>> [binascii.hexlify(x.encode('utf-8')) for x in ['$', '¢', '†', '𐍈']]
[b'24', b'c2a2', b'e280a0', b'f0908d88']

Hence the staircase in printing.

The 3rd example has something new. Both † and あ are 3-byte chars, yet あああ appears better aligned than †††. In fact, both strings are short by (3 - 1) * 3 = 6 padding spaces. However, あ is a wide char with a display width of 2, so あああ ends at column 16 - 6 + 3 * (2 - 1) = 13, whereas ††† ends at column 16 - 6 = 10.

This example is enlightening. It shows that the essence of pretty-printing multi-byte chars is to reconcile the display width with the number of bytes.

The 4th and 5th examples show the same phenomenon. Each † introduces a difference of 3 - 1 = 2 between the number of bytes and the display width, so the output shifts left by 2 columns for every additional char. By the same analysis, each あ shifts the output by 3 - 2 = 1 column.
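The arithmetic in examples 3 to 5 can be checked with a few lines of Python. This is only a sketch of the reasoning above: end_column is a hypothetical helper, and the byte counts and display widths of † and あ are passed in by hand rather than looked up.

```python
def end_column(n_chars: int, bytes_per_char: int, cols_per_char: int,
               field_width: int = 16) -> int:
    """Column where a string ends when the padding is computed from bytes."""
    padding = max(field_width - n_chars * bytes_per_char, 0)
    return padding + n_chars * cols_per_char


# † is 3 bytes and 1 column wide; あ is 3 bytes and 2 columns wide.
for name, cols in (("dagger", 1), ("hiragana a", 2)):
    print(name, [end_column(n, 3, cols) for n in (1, 2, 3)])
```

The dagger strings end 2 columns further left with each extra char, the hiragana strings 1 column further left.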

Solve the problem

The solution is fairly intuitive: given a multi-byte string, get its number of bytes (bytes) and its display width (width), then increase the field width by bytes - width.
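Before turning to the shell, the idea can be sketched in Python. Everything here is an assumption made for illustration: display_width is a rough approximation built on unicodedata (2 columns for East Asian wide and fullwidth chars, 0 for combining marks, 1 otherwise), and pad_by_bytes imitates a printf whose field width counts bytes.

```python
import unicodedata


def display_width(s: str) -> int:
    """Rough column count: 0 for combining marks, 2 for wide chars, else 1."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_by_bytes(s: str, target_cols: int = 16) -> bytes:
    """Pad by byte count, with the field widened by bytes - width."""
    raw = s.encode("utf-8")
    field_width = target_cols + len(raw) - display_width(s)
    return raw.rjust(field_width, b" ")  # bytes.rjust counts bytes


# Pret, Prêt, †††, あああ
for s in ("Pret", "Pr\u00eat", "\u2020" * 3, "\u3042" * 3):
    line = pad_by_bytes(s).decode("utf-8")
    print(f"{line} -> {display_width(line)} columns")
```

Every line now occupies 16 columns, even though the byte lengths differ.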

Now we just need to find how to calculate these numbers. Calculating bytes is easy:

bytes() {
    # printf '%s' is safer than echo -n, which misreads arguments such as -n or -e
    printf '%s' "$1" | wc -c
}

Unfortunately, there doesn't seem to be a shell builtin that calculates the display width. However, we can use the external Perl module Text::CharWidth (packaged for Arch and Debian):

width() {
    perl -e 'use Text::CharWidth qw(mbswidth);print mbswidth($ARGV[0]);' "$1"
}

Now update len.sh as follows:

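#!/bin/bash
# Number of bytes in $1 (printf avoids echo's option parsing).
bytes() {
    printf '%s' "$1" | wc -c
}

# Display width of $1 in terminal columns (requires Text::CharWidth).
width() {
    perl -e 'use Text::CharWidth qw(mbswidth);print mbswidth($ARGV[0]);' "$1"
}

for i in "$@"; do
    b="$(bytes "$i")"
    w="$(width "$i")"
    # printf counts bytes, so widen the field by the bytes that take no column.
    printf "%*s %02d\n" "$((16+b-w))" "$i" "${#i}"
done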

This script pretty prints multi-byte chars correctly:

# ./len.sh 'a' 'bb' 'ccc'

               ^
               a 01
              bb 02
             ccc 03

# ./len.sh 'Pret' 'Prêt'

               ^
            Pret 04
            Prêt 04

# ./len.sh 'Pr$t' 'Pr¢t' 'Pr†t' 'Pr𐍈t'

               ^
            Pr$t 04
            Pr¢t 04
            Pr†t 04
            Pr𐍈t 04

# ./len.sh '†††' 'あああ'

               ^
             ††† 03
          あああ 03

# ./len.sh '†' '††' '†††'

               ^
               † 01
              †† 02
             ††† 03

# ./len.sh 'あ' 'ああ' 'あああ'

               ^
              あ 01
            ああ 02
          あああ 03

Solve the problem in C

We have solved the problem in Bash. However, the same problem exists in other programming languages as well. Here we also give a solution in C:

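#define _XOPEN_SOURCE 700   /* needed for wcswidth() */

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(int argc, char **argv) {
    /* Use the locale from the environment (e.g. en_US.UTF-8). */
    setlocale(LC_ALL, "");

    for (int i = 1; i < argc; i++) {
        wchar_t wcs[1024];
        size_t m = mbstowcs(wcs, argv[i], 1024 - 1);
        if (m == (size_t)-1) {
            /* Invalid multi-byte sequence in the current locale. */
            fwprintf(stderr, L"invalid multi-byte string\n");
            continue;
        }
        wcs[m] = L'\0';

        int c = (int)m;             /* number of wide chars */
        int w = wcswidth(wcs, m);   /* display width, -1 if unprintable */
        if (w < 0)
            w = c;                  /* fall back to one column per char */
        /* wprintf counts wide chars, so adjust the field width by c - w. */
        wprintf(L"%*ls %02d\n", 16 + c - w, wcs, c);
    }

    return 0;
}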

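Note one difference from the Bash version: the field width is adjusted by the number of wide chars minus the display width (c - w) rather than by bytes - width. This is because POSIX.1-2008 specifies that the field width of the wprintf() function counts wide chars (instead of bytes):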

An optional minimum field width. If the converted value has fewer wide characters than the field width, it shall be padded with <space> characters by default on the left; it shall be padded on the right, if the left-adjustment flag ( '-' ), described below, is given to the field width. The field width takes the form of an <asterisk> ( '*' ), described below, or a decimal integer.
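Python's str.rjust also counts chars (code points) rather than bytes, so it makes a convenient stand-in for illustrating the wprintf() rule. The sketch below is an analogy only: display_width is our own rough approximation based on unicodedata, and pad_by_chars is a hypothetical helper, not part of any library.

```python
import unicodedata


def display_width(s: str) -> int:
    """Rough column count: 0 for combining marks, 2 for wide chars, else 1."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_by_chars(s: str, target_cols: int = 16) -> str:
    """Pad by char count, with the field adjusted by chars - width."""
    field_width = target_cols + len(s) - display_width(s)
    return s.rjust(max(field_width, 0))  # str.rjust counts code points


# †††: 3 chars, 3 columns. あああ: 3 chars, 6 columns.
for s in ("\u2020" * 3, "\u3042" * 3):
    line = pad_by_chars(s)
    print(f"{line} -> field width {len(line)}, {display_width(line)} columns")
```

Note that the adjustment chars - width is zero or negative here, while bytes - width in the Bash version is zero or positive.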

A special case

There is a special case: if every multi-byte char has a display width of 1, the display width equals the number of chars, so bytes - width is simply the number of continuation bytes, which we can count with a bitwise operation. This works because:

A char in UTF-8 has at most 4 bytes.
The first byte never starts with the bits 10.
Bytes 2-4 always start with the bits 10.

See UTF-8 for charts.
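The bit patterns are easy to inspect in Python. This is a small sketch with helper names of our own choosing (utf8_bits and count_continuation_bytes); it prints each byte of the four sample chars in binary and counts the bytes whose top two bits are 10.

```python
def utf8_bits(ch: str) -> list:
    """Return each UTF-8 byte of ch as an 8-digit binary string."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def count_continuation_bytes(data: bytes) -> int:
    """Count bytes whose top two bits are 10 (UTF-8 continuation bytes)."""
    return sum(1 for b in data if (b & 0xC0) == 0x80)


# $, ¢, †, 𐍈 take 1, 2, 3 and 4 bytes respectively.
for ch in ("$", "\u00a2", "\u2020", "\U00010348"):
    print(utf8_bits(ch), count_continuation_bytes(ch.encode("utf-8")))
```

Masking with 0xC0 keeps the top two bits, and comparing against 0x80 checks that they are exactly 10.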

#include <stdio.h>
#include <string.h>

/* Count UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx. */
int count(const char *str)
{
    int len = 0;
    while (*str != '\0') {
        if ((*str & 0xc0) == 0x80)
            len++;
        str++;
    }
    return len;
}

int main(int argc, char **argv) {
    for (int i = 1; i < argc; i++) {
        int extra = count(argv[i]);                /* bytes - display width */
        int chars = (int)strlen(argv[i]) - extra;  /* char count, as in len.sh */
        printf("%*s %02d\n", 16 + extra, argv[i], chars);
    }
    return 0;
}