Unicode Normalization

EmEditor provides support for normalizing Unicode characters and sequences. Normalization is useful, for example, when a dataset contains Unicode input from many sources: normalizing all strings to a single form makes it easier to match equivalent characters.
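The matching problem can be illustrated outside EmEditor. The following is a minimal sketch using Python's standard unicodedata module; the sample strings and the helper name normalize_all are hypothetical, and the sketch does not describe how EmEditor implements its commands.

```python
import unicodedata

# Hypothetical sample data: the same word entered by two different sources.
# The first uses a precomposed ñ, the second uses n + COMBINING TILDE.
names = ["Espa\u00f1a", "Espan\u0303a"]

def normalize_all(strings, form="NFC"):
    """Return the strings normalized to a single Unicode form."""
    return [unicodedata.normalize(form, s) for s in strings]

# Before normalization the strings look alike but compare as different.
raw_match = names[0] == names[1]

# After normalizing both to the same form, they compare as equal.
normalized = normalize_all(names)
normalized_match = normalized[0] == normalized[1]

print(raw_match)         # False
print(normalized_match)  # True
```

Any one form works for matching, as long as every string is normalized to the same form.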

UAX #15, Unicode Normalization Forms, describes four forms for normalizing characters and sequences: canonical composition (NFC), canonical decomposition (NFD), compatibility composition (NFKC), and compatibility decomposition (NFKD).

Decomposition is the process of breaking a character into its smaller units. For example, if you apply canonical decomposition to the single character ñ (LATIN SMALL LETTER N WITH TILDE) and then view the Character Code Value (Ctrl+I), the sequence is now two characters: LATIN SMALL LETTER N followed by COMBINING TILDE. Applying canonical composition to that sequence restores the single character ñ.
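The ñ example can be reproduced outside EmEditor as a rough check. This sketch assumes Python's standard unicodedata module, where "NFD" is canonical decomposition and "NFC" is canonical composition; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00f1"  # ñ as a single code point

# Canonical decomposition splits ñ into a base letter and a combining mark.
decomposed = unicodedata.normalize("NFD", composed)
decomposed_names = [unicodedata.name(ch) for ch in decomposed]

# Canonical composition joins the pair back into the single character.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(composed), len(decomposed))  # 1 2
print(decomposed_names)  # ['LATIN SMALL LETTER N', 'COMBINING TILDE']
print(recomposed == composed)  # True
```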

All canonically equivalent sequences are also compatibility equivalent, but not all compatibility-equivalent sequences are canonically equivalent. Canonically equivalent forms are identical in appearance and meaning, as in the previous example with ñ.

On the other hand, two compatibility-equivalent forms may look slightly different, and they have the same meaning only in certain contexts. ¼ (VULGAR FRACTION ONE QUARTER) and the three-character sequence 1⁄4, written with a FRACTION SLASH, are compatibility equivalent but not canonically equivalent. ¼ looks slightly different from 1⁄4. Whereas ¼ means “a quarter,” 1⁄4 sometimes means “one divided by four,” so they are interchangeable only in certain contexts.
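The difference between the canonical and compatibility forms shows up when both are applied to ¼. This is a minimal sketch, again assuming Python's standard unicodedata module ("NFD" for canonical decomposition, "NFKD" for compatibility decomposition) and not EmEditor's own implementation.

```python
import unicodedata

quarter = "\u00bc"  # ¼ VULGAR FRACTION ONE QUARTER

# Canonical decomposition leaves ¼ alone: it has no canonical equivalent.
canonical = unicodedata.normalize("NFD", quarter)

# Compatibility decomposition replaces it with digit, slash, digit.
compat = unicodedata.normalize("NFKD", quarter)
compat_code_points = [f"U+{ord(ch):04X}" for ch in compat]

print(canonical == quarter)  # True
print(len(compat))           # 3
# The middle character is U+2044 FRACTION SLASH, not the keyboard slash U+002F.
print(compat_code_points)    # ['U+0031', 'U+2044', 'U+0034']
```

Because the compatibility forms discard the distinction between ¼ and its three-character spelling, they are best applied only where that loss is acceptable, such as search or matching.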

The normalization commands are accessed through Convert > Encode/Decode.