Normalization in HTML and CSS

Quick check

Check for normalization mismatches in id and class names

Look for the "Class & id names" field in the Information table.

Normalization is something you need to be aware of if you are authoring HTML pages with CSS style sheets in UTF-8 (or any other Unicode encoding), particularly if you are dealing with text in a script that uses accents or other diacritics. This article explains what normalization is, and why you need to be aware of it.

What are normalization forms?

In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word világ. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).
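
The two ways of storing világ can be seen directly by building both strings from explicit code points. The following is a minimal Python sketch using only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

# The same word, stored two ways. Escapes are used so that the code points
# are unambiguous regardless of how this file itself was saved.
precomposed = "vil\u00e1g"   # á as one code point: U+00E1
decomposed = "vila\u0301g"   # a (U+0061) followed by U+0301

# Both strings render as the same word, but they differ in length
# and do not compare equal code point for code point.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# Names of the two characters that make up the fourth letter
# in the decomposed form.
fourth_letter_names = [unicodedata.name(c) for c in decomposed[3:5]]
print(fourth_letter_names)
```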

The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.

There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The C stands for (pre-)composed, and the D for decomposed. The K stands for compatibility. To improve interoperability, the W3C recommends the use of NFC normalized text on the Web.
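
As a rough illustration of how the four forms differ, the sketch below applies each of them to világ using Python's unicodedata.normalize. The ligature example is an addition chosen here to show what the K (compatibility) forms do; it does not come from the text above.

```python
import unicodedata

word = "vil\u00e1g"  # precomposed input

# Apply each of the four normalization forms to the same word.
forms = {
    name: unicodedata.normalize(name, word)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}
for name, text in forms.items():
    print(name, len(text), [f"U+{ord(c):04X}" for c in text])

# The compatibility (K) forms additionally replace compatibility characters
# such as U+FB01 LATIN SMALL LIGATURE FI, which NFC leaves untouched.
ligature = "\ufb01"
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)
```

The C forms keep the word at five code points and the D forms expand it to six. The K forms change more than the composition of accents, which is one reason they are usually not what you want for authored content.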

What do I need to know about normalization?

Unfortunately, normalization doesn't always take place before content is compared. A particularly important case is the use of selectors and class names or ids in HTML and CSS. If the word világ is used in precomposed form in the HTML (eg. <span class="világ">), but in decomposed form in the CSS (eg. .világ { font-style: italic; }), then the selector won't match the class name.
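
The mismatch can be modelled in a few lines of Python. This is a simplified sketch, not a description of how any particular browser is implemented: the hypothetical selector_matches function just assumes that names are compared code point for code point, with no normalization step.

```python
import unicodedata

html_class = "vil\u00e1g"   # class name in the markup, precomposed
css_class = "vila\u0301g"   # class name in the selector, decomposed

def selector_matches(css_name, html_name):
    """Hypothetical model of matching: plain comparison, no normalization."""
    return css_name == html_name

# The two names look identical on screen, but the comparison fails.
matches_as_authored = selector_matches(css_class, html_class)

# If both sides are converted to the same normalization form first,
# the names become character-for-character identical.
matches_after_nfc = selector_matches(
    unicodedata.normalize("NFC", css_class),
    unicodedata.normalize("NFC", html_class),
)

print(matches_as_authored, matches_after_nfc)
```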

What this means is that when producing content you should ensure that selectors and class or id names are character-for-character the same. This is particularly likely to be an issue if the markup and the CSS are being authored or maintained by different people.

The best way to ensure that these match is to use one particular Unicode normalization form for all authored content. As we said above, the W3C recommends NFC.

Most keyboards for European languages output text in NFC already, but this is less likely to be the case if you are dealing with many non-European languages.

In some cases your editor may allow you to save data in a choice of normalization forms. The picture below shows an option for setting a particular normalization form as the default when opening new files in Dreamweaver (NFC is selected). You are shown a similar choice when saving a document.

Figure: Unicode normalization form preferences on a dialog panel, showing NFC selected.

How can I check pages for problems?

You can find out whether an HTML page contains class names and id values that are not normalized according to NFC by using the W3C Internationalization Checker.
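
If you want a quick local check as well, something along the following lines may be enough. It is a minimal sketch built on Python's standard html.parser module, and it is not how the W3C Internationalization Checker works; the names NameCollector and find_non_nfc_names are made up for this example.

```python
import unicodedata
from html.parser import HTMLParser

class NameCollector(HTMLParser):
    """Collects class names and id values that are not in NFC."""

    def __init__(self):
        super().__init__()
        self.not_nfc = []

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            if value is None:
                continue
            if attr == "class":
                names = value.split()   # a class attribute may hold several names
            elif attr == "id":
                names = [value]
            else:
                continue
            for name in names:
                # A name is in NFC if normalizing it to NFC changes nothing.
                if unicodedata.normalize("NFC", name) != name:
                    self.not_nfc.append((tag, attr, name))

def find_non_nfc_names(html_text):
    collector = NameCollector()
    collector.feed(html_text)
    collector.close()
    return collector.not_nfc

# Sample markup: the first span uses a decomposed name, the second a precomposed one.
sample = (
    '<p id="intro"><span class="vila\u0301g note">x</span> '
    '<span class="vil\u00e1g">y</span></p>'
)
problems = find_non_nfc_names(sample)
print(problems)
```

Only the decomposed class name is reported. A similar check would be needed on the selectors in the style sheet, which this sketch does not attempt.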

If you do have problems, you should find an editor or conversion tool that allows you to specify the normalization form, and use that to re-save your page.
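
Where no such tool is to hand, a short script can do the conversion. The sketch below assumes the file is encoded in UTF-8 and that it is acceptable to normalize the whole file rather than only the class names, ids and selectors. The function name resave_as_nfc and the file name style.css are illustrative.

```python
import tempfile
import unicodedata
from pathlib import Path

def resave_as_nfc(path, encoding="utf-8"):
    """Rewrite the file in NFC. Returns True if the content changed."""
    # newline="" keeps the file's existing line endings untouched.
    with open(path, encoding=encoding, newline="") as f:
        original = f.read()
    normalized = unicodedata.normalize("NFC", original)
    if normalized == original:
        return False
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(normalized)
    return True

# Usage example on a throwaway style sheet containing a decomposed selector.
with tempfile.TemporaryDirectory() as tmp:
    css_file = Path(tmp) / "style.css"
    css_file.write_text(".vila\u0301g { font-style: italic; }\n", encoding="utf-8")

    changed_first = resave_as_nfc(css_file)    # file is rewritten
    changed_second = resave_as_nfc(css_file)   # already NFC, nothing to do
    result = css_file.read_text(encoding="utf-8")

print(changed_first, changed_second)
```

Keep a backup or use version control before running a script like this on real files, since it overwrites them in place.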