Python3: Strings, chars and bytes

Every programmer uses strings. A string consists of a sequence of characters. However, to many people’s surprise, Python 3 doesn’t have a char type at all.

You may then wonder what happens if we index a str in Python. Here’s the result tested with str using Python 3.5.2:

type('foo') == str
type('foo'[0]) == str    # Why not char?

In contrast, here’s the result with bytes:

type(b'foo') == bytes
type(b'foo'[0]) == int   # Why not bytes?

Why does the same operation have different effects on str and bytes?

A short survey of Unicode

Internally, Python uses the Unicode character set to represent strings. Unicode defines a codespace of 1,114,112 (0x110000) code points in the range 0x0 to 0x10FFFF. The codespace therefore fits in a 4-byte integer, but not in a 2-byte or 1-byte integer.
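
The size of the codespace can be checked from Python itself. The following is a minimal sketch, assuming Python 3.3 or later, where sys.maxunicode is always 0x10FFFF; the variable names are only illustrative.

```python
import sys

# The highest code point Python accepts matches the Unicode limit.
max_code_point = sys.maxunicode        # 0x10FFFF on Python 3.3+
codespace_size = max_code_point + 1    # number of code points in the codespace

# The last code point is still a valid argument to chr().
last_char = chr(max_code_point)

# One past the end of the codespace is rejected.
try:
    chr(max_code_point + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(hex(max_code_point), codespace_size, out_of_range_rejected)
```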

Usually, we don’t address Unicode characters directly in our programs. Instead, we use encodings. An encoding is a mapping from code points (indexes into the Unicode codespace) to code values (the values we actually store and write in our programs). The most widely used encodings include:

UTF-8: an 8-bit, variable-width encoding that maximizes compatibility with ASCII.
UTF-16: a 16-bit, variable-width encoding.
UTF-32: a 32-bit, fixed-width encoding.
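
To make the difference between variable-width and fixed-width encodings concrete, here is a small sketch that counts how many bytes each encoding needs for three sample characters. The sample characters and the helper name encoded_lengths are chosen only for illustration. The little-endian codec names ('utf-16-le', 'utf-32-le') are used so that no byte order mark is added to the output.

```python
# Three sample characters: ASCII 'a', the euro sign, and an emoji
# that lies outside the first 65,536 code points.
samples = ['a', '\u20ac', '\U0001F600']

def encoded_lengths(ch):
    """Return the encoded size in bytes of one character, per encoding."""
    encodings = ('utf-8', 'utf-16-le', 'utf-32-le')
    return {enc: len(ch.encode(enc)) for enc in encodings}

lengths = {ch: encoded_lengths(ch) for ch in samples}

for ch, sizes in lengths.items():
    print('U+%04X' % ord(ch), sizes)
# U+0061  {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC  {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

UTF-8 and UTF-16 grow with the code point, while UTF-32 always uses 4 bytes.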

There are other encodings as well:

UCS-2: a 16-bit, fixed-width encoding. An obsolete predecessor of UTF-16 that can only represent the first 65,536 code points.
UCS-4: a 32-bit, fixed-width encoding. Now identical to UTF-32.

What is a char

As it turns out, a char is simply a code value.

Depending on the charset and encoding, it can be 1-byte, 2-byte, 4-byte, etc.

Many people have the illusion that a char is an 8-bit integer, or a byte. This is a C legacy. When C was invented, the prevalent charset was ASCII, a 7-bit encoding supporting 128 different characters. That’s why a char fits nicely in a byte. But things have changed a lot since then. As more and more characters needed to be represented, ASCII quickly became insufficient. This is what motivated the invention of Unicode, which Python now uses internally.

Strings, chars and bytes

So what exactly is the relationship between strings, chars and bytes?

The answer is:

A string consists of a sequence of chars.
A char consists of a sequence of bytes.

This is actually a 3-tier relationship.
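
The three tiers can be made visible by splitting a string into its characters and each character into its bytes. This is only a sketch: the sample text is arbitrary, and UTF-8 is assumed as the encoding. A different encoding would give different bytes for the same characters.

```python
# Tier 1: a string. Here 'h', 'e' with an acute accent, and the euro sign.
text = 'h\u00e9\u20ac'

# Tier 2: the characters of the string.
# Tier 3: the bytes of each character (UTF-8 assumed).
tiers = [(ch, list(ch.encode('utf-8'))) for ch in text]

for ch, byte_values in tiers:
    print(repr(ch), byte_values)
# 'h' [104]
# 'é' [195, 169]
# '€' [226, 130, 172]
```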

Strings, chars and bytes in C

C equates chars to bytes except for signedness. This was a reasonable design in the ASCII era, as pointed out above. Thus we can safely say C only has 2 tiers: one is the string, the other is the char or byte, whichever you prefer.

To be precise, C doesn’t have strings, either. What C actually has is a char array. But since a char array contains multiple chars and is used as a string, we still see it as a different tier than chars.

Strings, chars and bytes in Python

Obviously there is a difference between strings and bytes in Python. You can encode str into bytes and decode bytes into str.
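
As a rough illustration of the encode and decode round trip, the sketch below converts a str to bytes and back. The sample text and the choice of UTF-8 are assumptions made for the example. It also decodes the same bytes with a different codec (Latin-1) to show that bytes carry no record of the encoding that produced them.

```python
text = 'caf\u00e9'                  # 4 characters, the last one is non-ASCII

data = text.encode('utf-8')         # str -> bytes
round_trip = data.decode('utf-8')   # bytes -> str, same codec

# Decoding with a different codec succeeds here but yields different text.
misdecoded = data.decode('latin-1')

print(len(text), len(data))         # 4 5
print(round_trip == text)           # True
print(misdecoded == text)           # False
```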

But where is char? The answer is: Python’s designers don’t think programmers need a standalone char type. They think what you need is a one-char string. Therefore they folded the char type into the str type. Wherever you expect a char, you will encounter a one-char string.

Does this limit Python programmers’ capabilities?

The answer is yes and no:

‘Yes’ because programmers cannot navigate the codespace directly. For example, C programmers can write 'a' + 1 to get 'b', but in Python the same expression raises a TypeError.
‘No’ because Python is designed to be encoding-agnostic. The Python designers don’t want programmers to rely on their knowledge of the underlying charset and encoding. A Python string is a conceptual string, not tied to any binary representation. Therefore programmers are not expected to navigate the codespace. Although functions such as ord and chr are provided, their usage is discouraged except in a few cases. And if a Python programmer wants to iterate from 'a' to 'z', they are better off using string.ascii_lowercase rather than adding 1 to the current character each time.
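
Both sides of this answer can be sketched in a few lines. The variable names below are only illustrative.

```python
import string

# Adding an int to a str is rejected, unlike 'a' + 1 in C.
try:
    'a' + 1
    mixed_addition_allowed = True
except TypeError:
    mixed_addition_allowed = False

# Navigating the codespace is still possible, but only explicitly,
# by going through ord() and chr().
next_letter = chr(ord('a') + 1)

# The approach recommended above: iterate over a ready-made alphabet.
letters = list(string.ascii_lowercase)

print(mixed_addition_allowed)                # False
print(next_letter)                           # b
print(letters[0], letters[-1], len(letters)) # a z 26
```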

From these facts, we can say Python has merged chars into strs as one-char strs. Therefore, Python also has 2 tiers: strings and bytes.

Answer to our questions

Back to our original questions:

type('foo') == str
type('foo'[0]) == str    # Why not char?

The answer is: Because Python has merged char into str as one-char str. Wherever you expect a char, you will encounter a one-char str which can be used as a char. And that one-char str is typed as str.
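
A short sketch of what this means in practice: indexing a str always gives back another str, so the result can be indexed again indefinitely.

```python
s = 'foo'
first = s[0]

# The "char" is a str of length 1.
first_is_str = type(first) is str
first_length = len(first)

# Indexing a one-char str returns an equal one-char str.
nested = s[0][0][0]

print(first_is_str, first_length, nested == first)   # True 1 True
```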

type(b'foo') == bytes
type(b'foo'[0]) == int   # Why not bytes?

The answer is: Because a bytes object is really a sequence of integers, each limited to the range 0 to 255. This question is actually trivial and serves only as a contrast to the one above.
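
The integer nature of bytes can be seen in the following sketch. It also shows that slicing, unlike indexing, returns a bytes object, and that values outside the range 0 to 255 are rejected.

```python
data = b'foo'

# Iterating or indexing yields ints.
as_ints = list(data)          # [102, 111, 111]
first_item = data[0]          # 102

# Slicing yields bytes, even for a single element.
first_slice = data[0:1]       # b'f'

# A bytes object can be rebuilt from the integers.
rebuilt = bytes(as_ints)      # b'foo'

# Integers outside 0..255 are not valid bytes.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(as_ints, first_item, first_slice, rebuilt, out_of_range_rejected)
```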