unicode normalization
some characters look the same to users but can be represented by different code point sequences; for example, both U+00E9 and U+0065 U+0301 render as the same character é, but the two strings do not compare equal, and neither do their encoded bytes:
>>> a = '\u00e9'
>>> b = '\u0065\u0301'
>>> a == b
False
>>> a.encode() == b.encode()
False
this problem is known as unicode equivalence, and unicode normalization is the standard way to deal with it; the unicode normalization faq covers this topic very well; see also the wikipedia article on unicode equivalence;
the solution is easy; many languages have libraries for unicode normalization;
in python, we can write:
>>> import unicodedata
>>> a_nfc = unicodedata.normalize('NFC', a)
>>> b_nfc = unicodedata.normalize('NFC', b)
>>> a_nfc == b_nfc
True
>>> a_nfc.encode() == b_nfc.encode()
True
NFC stands for Normalization Form Canonical Composition, which is the most widely used normalization form and provides a compact representation; in this case it gives U+00E9; if instead we had used NFD, which stands for Normalization Form Canonical Decomposition, then we would get U+0065 U+0301;
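the two forms can be compared side by side; the sketch below assumes we simply want to test whether two strings are canonically equivalent, and `same_text` is a hypothetical helper name, not a standard function:

```python
import unicodedata

def same_text(x, y, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

a = "\u00e9"        # precomposed
b = "\u0065\u0301"  # decomposed

a_nfd = unicodedata.normalize("NFD", a)  # splits é into base letter + accent
b_nfc = unicodedata.normalize("NFC", b)  # merges base letter + accent into é

print(len(a_nfd), len(b_nfc))
print(same_text(a, b), same_text(a, b, form="NFD"))
```

either form works for comparison as long as both sides use the same one; the choice mostly affects how the text is stored;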
in javascript, strings have a built-in normalize() method; see this page;