some characters look the same to users but can be represented by different code point sequences, and therefore have different encodings;

for example, both U+00E9 and U+0065U+0301 render as the same character é, but neither the two strings nor their encodings compare as equal:

>>> a = '\u00e9'
>>> b = '\u0065\u0301'
>>> a == b
False
>>> a.encode() == b.encode()
False
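
to see why the comparison fails, it can help to print what each string actually contains; the sketch below reuses the same two strings and relies only on the standard unicodedata module; the loop and its output layout are illustrative choices, not part of the original example:

```python
import unicodedata

a = "\u00e9"         # é as a single precomposed code point
b = "\u0065\u0301"   # e followed by a combining acute accent

for s in (a, b):
    # official Unicode name of every code point in the string
    names = [unicodedata.name(c) for c in s]
    # length in code points, the names, and the UTF-8 bytes
    print(len(s), names, s.encode("utf-8"))
```

the first string is one code point encoded as two UTF-8 bytes, the second is two code points encoded as three bytes, so neither the strings nor their encodings can match;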


this problem is known as unicode equivalence, and the fix for it is unicode normalization; the unicode normalization faq covers this topic very well, as does the wikipedia article on unicode equivalence;

the solution is easy; many languages have libraries for unicode normalization;

in python, we can write:

>>> import unicodedata
>>> a_nfc = unicodedata.normalize('NFC', a)
>>> b_nfc = unicodedata.normalize('NFC', b)
>>> a_nfc == b_nfc
True
>>> a_nfc.encode() == b_nfc.encode()
True


NFC stands for Normalization Form C (canonical composition); it is the most widely used normalization form and provides a compact representation; in this case it gives U+00E9; if we had used NFD instead, which stands for Normalization Form D (canonical decomposition), we would get U+0065U+0301;
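
the difference between the two forms can be checked directly; the following is a minimal sketch, and the helper name same_text is a hypothetical name introduced here for illustration, not something from a library:

```python
import unicodedata

def same_text(s1: str, s2: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

composed = "\u00e9"          # é as a single code point
decomposed = "\u0065\u0301"  # e followed by a combining acute accent

# NFD splits the precomposed character into base letter + combining mark
nfd = unicodedata.normalize("NFD", composed)
print([f"U+{ord(c):04X}" for c in nfd])   # ['U+0065', 'U+0301']

# NFC merges the pair back into one code point
nfc = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+00E9']

# either form works for comparison, as long as both sides use the same one
print(same_text(composed, decomposed))          # True
print(same_text(composed, decomposed, "NFD"))   # True
```

which form to pick matters less than picking one and applying it consistently, for example at the point where text enters the program;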