Validating HTML5 charset declarations

By: Oli Studholme, 2009-04-21

Summary

Use <meta charset="utf-8">, preferably with an HTTP-header character encoding declaration, and ignore the W3C validator's character encoding warnings.

Detail

The W3C validator can be a bit fussy with HTML5, giving several spurious character encoding-related warnings depending on the input method used. The first thing to note is that some of these are due to a bug in an underlying Perl library (W3C Validator bug, Perl HTML::Encoding bug). You can double-check against Validator Nu. Next, the validator is supposed to take the character encoding (charset) from the following sources, in order:

  1. HTTP Content-Type (headers)
  2. then (if applicable) xml declaration
  3. then look for a meta
  4. then fall back to utf-8
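The lookup order above can be sketched in a few lines of Python. This is only an illustration of the precedence, not the validator's actual code: the function name resolve_charset, the is_xml flag and the regular expressions are assumptions made for this example, and the regex-based meta detection is far cruder than a real HTML parser.

```python
import re

# Rough patterns, assumed for illustration; a real parser is stricter.
CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)
XML_DECL = re.compile(
    r'^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
META_TAG = re.compile(r'<meta\b[^>]*>', re.IGNORECASE)


def resolve_charset(content_type, document, is_xml=False):
    """Return (charset, source) following the four-step order in the list."""
    # 1. HTTP Content-Type header
    if content_type:
        match = CHARSET.search(content_type)
        if match:
            return match.group(1).lower(), "http-header"
    # 2. XML declaration, only where applicable (e.g. XHTML served as XML)
    if is_xml:
        match = XML_DECL.search(document)
        if match:
            return match.group(1).lower(), "xml-declaration"
    # 3. A meta element, either <meta charset> or the http-equiv form
    for tag in META_TAG.findall(document):
        match = CHARSET.search(tag)
        if match:
            return match.group(1).lower(), "meta"
    # 4. Nothing declared anywhere: fall back to UTF-8
    return "utf-8", "fallback"
```

For example, resolve_charset(None, '<meta charset="utf-8">') reports that the encoding came from the meta element, while passing a Content-Type value that carries a charset makes the header win regardless of what the document says.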

Also, only one of these methods is required to set the character encoding in HTML5.

To prevent the W3C validator from showing spurious warnings:

Current warning-free requirements, by input method:
  • Validate by URI: the W3C validator requires both an Apache HTTP-header of Content-Type: text/html; charset=UTF-8 and a <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> element
  • Validate by File Upload: the W3C validator requires a <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> element
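As a rough self-check before submitting a page, the requirements above can be encoded in a small Python helper. This is a sketch of the findings reported in this article, not of the validator's real logic; the function name is_warning_free, the method labels "uri", "upload" and "direct", and the regular expressions are all assumptions for illustration.

```python
import re

# Matches only the long http-equiv form of the meta element (assumed pattern).
HTTP_EQUIV_META = re.compile(
    r'<meta\b(?=[^>]*\bhttp-equiv\s*=\s*["\']?content-type)'
    r'(?=[^>]*\bcharset\s*=)[^>]*>',
    re.IGNORECASE)
HEADER_CHARSET = re.compile(r'charset\s*=', re.IGNORECASE)


def is_warning_free(input_method, document, content_type=None):
    """Apply the per-input-method requirements listed in this article."""
    has_meta = bool(HTTP_EQUIV_META.search(document))
    has_header = bool(content_type and HEADER_CHARSET.search(content_type))
    if input_method == "uri":
        # By URI: both the HTTP header and the http-equiv meta are needed.
        return has_header and has_meta
    if input_method == "upload":
        # By File Upload: there is no HTTP header, so only the meta counts.
        return has_meta
    if input_method == "direct":
        # Direct Input always produces at least one encoding warning.
        return False
    raise ValueError("unknown input method: " + repr(input_method))
```

Note that the short <meta charset="utf-8"> form deliberately does not satisfy this check, because it mirrors what the validator accepted without warnings at the time, not what HTML5 requires.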

Note that for “Validate by Direct Input” the W3C validator will always give at least one character encoding-related warning.

Testing results

Each entry gives the HTTP-header declaration, the in-document declaration, and the W3C Validator result.

HTTP header: Content-Type: text/html; charset=UTF-8
In-document: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  • Pass!

HTTP header: Content-Type: text/html; charset=UTF-8
In-document: <meta charset="utf-8">
  • Informative: No Character encoding declared at document level

HTTP header: Content-Type: text/html; charset=UTF-8
In-document: None
  • Informative: No Character encoding declared at document level

HTTP header: None
In-document: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  • Pass!

HTTP header: None
In-document: <meta charset="utf-8">
  • Warning: No Character Encoding Found! Falling back to UTF-8
  • Informative: No Character encoding declared at document level

HTTP header: None
In-document: None
  • Warning: No Character Encoding Found! Falling back to UTF-8
  • Informative: No Character encoding declared at document level
  • Error: No explicit character encoding declaration has been seen yet (assumed utf-8) but the document contains non-ASCII

Note that you can simulate a missing HTTP-header by using “Validate by File Upload”.

Also note that if you copy & paste into the Direct Input field, character encoding meta elements are ignored and you’ll always get the warning “Using Direct Input mode: UTF-8 character encoding assumed”:

Unlike the “by URI” and “by File Upload” modes, the “Direct Input” mode of the validator provides validated content in the form of characters pasted or typed in the validator's form field. This will automatically make the data UTF-8, and therefore the validator does not need to determine the character encoding of your document, and will ignore any charset information specified.

Checking and fixing HTTP-headers

For more information about setting a character encoding, and why it’s a good thing, refer to the W3C’s “Character Encoding in HTML” and “I18N FAQ: Setting charset information in .htaccess” articles. You can check a page’s HTTP-headers with:
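A short script is one option. The Python sketch below sends a HEAD request using only the standard library and reports whether the Content-Type header carries a charset. The function names fetch_content_type and describe are made up for this example, and some servers answer HEAD requests differently from GET, so treat the result as a hint and not as proof.

```python
import re
import urllib.request


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type value, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def describe(content_type):
    """Summarise what a Content-Type value says about the charset."""
    if content_type is None:
        return "no Content-Type header"
    match = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                      content_type, re.IGNORECASE)
    if match:
        return "charset declared: " + match.group(1).lower()
    return "Content-Type sent without a charset"
```

To try it, call print(describe(fetch_content_type(url))) with url set to the address of your own page.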

You can add a Content-Type HTTP-header with this .htaccess declaration (W3C I18N FAQ again):

AddType 'text/html; charset=UTF-8' html
