unicode normalization

some characters look the same to users but can be represented by different sequences of code points;

for example, both U+00E9 and U+0065U+0301 give the same character é, but the two strings do not compare equal, and neither do their encoded bytes:

>>> a = '\u00e9'
>>> b = '\u0065\u0301'
>>> a == b
    False
>>> a.encode() == b.encode()
    False
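
to see why the comparison fails, it can help to list the code points behind each string; the following is a minimal sketch, where `describe` is a hypothetical helper name chosen for illustration (only `unicodedata.name` comes from the standard library), and the byte output assumes the default utf-8 encoding:

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

a = '\u00e9'        # precomposed form
b = '\u0065\u0301'  # base letter followed by a combining accent

print(describe(a))  # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
print(describe(b))  # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
print(len(a), len(b))          # 1 2
print(a.encode(), b.encode())  # b'\xc3\xa9' b'e\xcc\x81'
```

so `a` is a single code point while `b` is two, which is why both the strings and their bytes differ;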

this problem is known as unicode equivalence, and converting text to one agreed form is called unicode normalization; the unicode normalization faq covers this topic very well, as does the wikipedia article on unicode equivalence;

the solution is to normalize both strings to the same form before comparing them; many languages have libraries for unicode normalization;

in python, we can write:

>>> import unicodedata
>>> a_nfc = unicodedata.normalize('NFC', a)
>>> b_nfc = unicodedata.normalize('NFC', b)
>>> a_nfc == b_nfc
    True
>>> a_nfc.encode() == b_nfc.encode()
    True

NFC stands for Normalization Form C (canonical decomposition followed by canonical composition); it is the most widely used normalization form and provides a compact representation; in this case it gives U+00E9; if we had used NFD instead, which stands for Normalization Form D (canonical decomposition), we would get U+0065U+0301;
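
the difference between the two forms can be checked directly; this is a small sketch rather than a definitive recipe, and `code_points` and `same_text` are hypothetical helper names introduced here for illustration (only `unicodedata.normalize` is from the standard library):

```python
import unicodedata

def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

def same_text(x, y, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

a = '\u00e9'
b = '\u0065\u0301'

for form in ('NFC', 'NFD'):
    a_norm = unicodedata.normalize(form, a)
    b_norm = unicodedata.normalize(form, b)
    print(form, code_points(a_norm), code_points(b_norm), a_norm == b_norm)
# NFC ['U+00E9'] ['U+00E9'] True
# NFD ['U+0065', 'U+0301'] ['U+0065', 'U+0301'] True

print(same_text(a, b))  # True
```

either form makes the two strings compare equal; what matters is that both sides are normalized to the same form;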

in javascript, strings have a built-in normalize() method, for example s.normalize('NFC'); see this page;