This section describes the HTML syntax in detail. In places, it also notes differences between the HTML syntax and the XML syntax, but it does not describe the XML syntax in detail (the XML syntax is instead defined by rules in the [XML] specification and in the [Namespaces in XML] specification).
This section is divided into the following parts:
A doctype (sometimes capitalized as “DOCTYPE”) is a special instruction which, for legacy reasons that have to do with processing modes in browsers, is a required part of any document in the HTML syntax; it must match the characteristics of one of the following three formats:
- normal doctype
- deprecated doctype
- legacy-tool-compatible doctype
A normal doctype consists of the following parts, in exactly the following order:
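1. Any case-insensitive match for the string "<!DOCTYPE".
2. One or more space characters.
3. Any case-insensitive match for the string "HTML".
4. Optionally, one or more space characters.
5. A ">" character.
The following is an example of a conformant normal doctype.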
<!DOCTYPE html>
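As an informal illustration that is not part of this reference, the normal doctype format can be approximated with a regular expression. The function name `is_normal_doctype` is hypothetical, and the sketch assumes that the space characters are space, tab, line feed, form feed, and carriage return.

```python
import re

# Assumed set of space characters: space, tab, LF, FF, CR.
_WS = r"[ \t\n\f\r]"

# "<!DOCTYPE" and "HTML" are matched case-insensitively.
_NORMAL_DOCTYPE = re.compile(rf"<!DOCTYPE{_WS}+HTML{_WS}*>", re.IGNORECASE)


def is_normal_doctype(text: str) -> bool:
    """Return True if text consists of exactly one normal doctype."""
    return _NORMAL_DOCTYPE.fullmatch(text) is not None


print(is_normal_doctype("<!DOCTYPE html>"))   # True
print(is_normal_doctype("<!DOCTYPEhtml>"))    # False: no space before "html"
```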
A deprecated doctype consists of the following parts, in exactly the following order:
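1. Any case-insensitive match for the string "<!DOCTYPE".
2. One or more space characters.
3. Any case-insensitive match for the string "HTML".
4. One or more space characters.
5. Any case-insensitive match for the string "PUBLIC".
6. One or more space characters.
7. A quote mark, consisting of either a """ character or a "'" character.
8. The public ID.
9. A matching quote mark, identical to the quote mark used earlier (either a """ character or a "'" character).
10. Conditionally, if the permitted-public-ID-system-ID-combination includes a system ID: one or more space characters; a quote mark, consisting of either a """ character or a "'" character; the system ID; and a matching quote mark, identical to the quote mark used earlier (either a """ character or a "'" character).
11. Optionally, one or more space characters.
12. A ">" character.
A permitted-public-ID-system-ID-combination is any combination of a public ID (the first quoted string in the doctype) and system ID (the second quoted string, if any, in the doctype) such that the combination corresponds to one of the six deprecated doctypes in the following list of deprecated doctypes: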
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
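The idea of a permitted-public-ID-system-ID-combination can be sketched as a lookup against the six listed doctypes. This is an illustration only: the names `PERMITTED_COMBINATIONS` and `is_deprecated_doctype` are hypothetical, and the sketch assumes that the public and system IDs are compared exactly (case-sensitively) and that the parts are separated by space, tab, LF, FF, or CR.

```python
import re

# Assumed set of space characters: space, tab, LF, FF, CR.
_WS = r"[ \t\n\f\r]"
# A quoted string in either quote style; the closing quote matches the opening one.
_QUOTED = r"""(?:"([^"]*)"|'([^']*)')"""

# Keywords are case-insensitive; the quoted IDs are captured as written.
_DEPRECATED_DOCTYPE = re.compile(
    rf"(?i:<!DOCTYPE{_WS}+HTML{_WS}+PUBLIC){_WS}+{_QUOTED}(?:{_WS}+{_QUOTED})?{_WS}*>"
)

# (public ID, system ID) pairs from the six listed doctypes; None = no system ID.
PERMITTED_COMBINATIONS = {
    ("-//W3C//DTD HTML 4.0//EN", None),
    ("-//W3C//DTD HTML 4.0//EN", "http://www.w3.org/TR/REC-html40/strict.dtd"),
    ("-//W3C//DTD HTML 4.01//EN", None),
    ("-//W3C//DTD HTML 4.01//EN", "http://www.w3.org/TR/html4/strict.dtd"),
    ("-//W3C//DTD XHTML 1.0 Strict//EN",
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"),
    ("-//W3C//DTD XHTML 1.1//EN",
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"),
}


def is_deprecated_doctype(text: str) -> bool:
    """Return True if text is a deprecated doctype with a permitted ID combination."""
    match = _DEPRECATED_DOCTYPE.fullmatch(text)
    if match is None:
        return False
    public_dq, public_sq, system_dq, system_sq = match.groups()
    public_id = public_dq if public_dq is not None else public_sq
    system_id = system_dq if system_dq is not None else system_sq
    return (public_id, system_id) in PERMITTED_COMBINATIONS
```

For instance, a doctype that uses the HTML 4.01 Transitional public ID has the right shape but is rejected, because that combination is not among the six listed.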
A legacy-tool-compatible doctype consists of the following parts, in exactly the following order:
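1. Any case-insensitive match for the string "<!DOCTYPE".
2. One or more space characters.
3. Any case-insensitive match for the string "HTML".
4. One or more space characters.
5. Any case-insensitive match for the string "SYSTEM".
6. One or more space characters.
7. A quote mark, consisting of either a """ character or a "'" character.
8. The literal string "about:legacy-compat".
9. A matching quote mark, identical to the quote mark used earlier (either a """ character or a "'" character).
10. Optionally, one or more space characters.
11. A ">" character.
The following is an example of a conformant legacy-tool-compatible doctype.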
<!doctype HTML system "about:legacy-compat">
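A comparable sketch for the legacy-tool-compatible doctype is shown below. It is illustrative only; `is_legacy_compat_doctype` is a hypothetical name, and the sketch assumes that "about:legacy-compat" must appear exactly as written while the keywords are case-insensitive.

```python
import re

# Assumed set of space characters: space, tab, LF, FF, CR.
_WS = r"[ \t\n\f\r]"
# Opening quote mark; the backreference \1 below requires the same mark to close.
_QUOTE = r"""(["'])"""

_LEGACY_DOCTYPE = re.compile(
    rf"(?i:<!DOCTYPE{_WS}+HTML{_WS}+SYSTEM){_WS}+{_QUOTE}about:legacy-compat\1{_WS}*>"
)


def is_legacy_compat_doctype(text: str) -> bool:
    """Return True if text is exactly one legacy-tool-compatible doctype."""
    return _LEGACY_DOCTYPE.fullmatch(text) is not None


print(is_legacy_compat_doctype('<!doctype HTML system "about:legacy-compat">'))  # True
```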
A character encoding declaration is a mechanism for specifying the character encoding used to store or transmit a document.
The following restrictions apply to character encoding declarations:
The character encoding name must be a case-insensitive match for the Name or Alias field labeled as “preferred MIME name” or, if none of the Alias fields are so labeled, a case-insensitive match for a Name field in the registry.
If the document does not start with a U+FEFF BYTE ORDER MARK (BOM) character, and if its encoding is not explicitly given by a Content-Type HTTP header, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element in the encoding declaration state.
If the document contains a meta element with a charset attribute or a meta element in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding.
An ASCII-compatible character encoding is one that is a superset of US-ASCII (specifically, ANSI_X3.4-1968) for bytes in the set 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A.
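The byte set in this definition can be turned into a rough check. The sketch below is a heuristic and not part of this reference: it decodes each listed byte on its own with a Python codec and compares the result with the corresponding US-ASCII character. The name `is_ascii_compatible` is hypothetical, and an unknown codec name raises `LookupError`.

```python
# Bytes named in the definition above (Python ranges exclude their end value).
ASCII_COMPATIBLE_BYTES = [
    0x09, 0x0A, 0x0C, 0x0D,
    *range(0x20, 0x23),   # 0x20 - 0x22
    0x26, 0x27,
    *range(0x2C, 0x40),   # 0x2C - 0x3F
    *range(0x41, 0x5B),   # 0x41 - 0x5A
    *range(0x61, 0x7B),   # 0x61 - 0x7A
]


def is_ascii_compatible(encoding: str) -> bool:
    """Heuristic: every listed byte, decoded alone, yields the same ASCII character."""
    for byte in ASCII_COMPATIBLE_BYTES:
        try:
            decoded = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            return False  # a lone byte is not a complete character in this encoding
        if decoded != chr(byte):
            return False
    return True


print(is_ascii_compatible("utf-8"))   # True
print(is_ascii_compatible("cp037"))   # False (an EBCDIC-based code page)
```

Because each byte is decoded in isolation, the check says nothing about stateful encodings whose interpretation of a byte depends on earlier bytes.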
Documents should not use UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC.
Documents must not use CESU-8, UTF-7, BOCU-1, or SCSU encodings.
In a document in the XML syntax, the XML declaration, as defined in the XML specification [XML], should be used to provide character-encoding information, if necessary.
An element’s content model defines the element’s structure: what contents (if any) the element can contain, as well as what attributes (if any) the element can have. The HTML elements section of this reference describes the content models for all of the elements that are part of the HTML language. An element must not contain contents or attributes that are not part of its content model.
The contents of an element are any elements, character data, and comments that it contains. Attributes and their values are not considered to be the “contents” of an element.
The text content of an element is the value of the textContent IDL attribute of the element, as defined in [DOM4].
A void element is an element whose content model never allows it to have contents under any circumstances. Void elements can have attributes.
The following is a complete list of the void elements in HTML:
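The following list describes syntax rules for the HTML syntax. Rules for the XML syntax are defined in the XML specification [XML].
- Tag names are used within element start tags and end tags to give the element's name. HTML elements all have names that use only characters in the ranges 0–9, a–z, and A–Z.
- Start tags consist of the following parts, in exactly the following order:
  1. A "<" character.
  2. The element's tag name.
  3. Optionally, one or more attributes, each of which must be preceded by one or more space characters.
  4. Optionally, one or more space characters.
  5. Optionally, a "/" character, which may be present only if the element is a void element.
  6. A ">" character.
- End tags consist of the following parts, in exactly the following order:
  1. A "<" character.
  2. A "/" character.
  3. The element's tag name.
  4. Optionally, one or more space characters.
  5. A ">" character.
- If an element has both a start tag and an end tag, its end tag must be contained within the contents of the same element in which its start tag is contained. An end tag that is not contained within the same contents as its start tag is said to be a misnested tag.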
In the following example, the "</i>" end tag is a misnested tag, because it is not contained within the contents of the b element that contains its corresponding "<i>" start tag.
<b>foo <i>bar</b> baz</i>
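One way to detect misnested tags, sketched here for illustration only, is to track the open elements and flag an end tag whose element was still open when its containing element closed. The sketch uses the standard-library `html.parser.HTMLParser`; the class and function names are hypothetical, and it assumes markup in which end tags are written explicitly (elements that are never closed are simply not reported).

```python
from html.parser import HTMLParser


class MisnestedTagFinder(HTMLParser):
    """Collects the names of end tags that fall outside their start tag's container."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # entries are [tag name, container already closed?]
        self.misnested = []

    def handle_starttag(self, tag, attrs):
        self.open_elements.append([tag, False])

    def handle_endtag(self, tag):
        # Find the nearest open element with this name, searching from the top.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                break
        else:
            return  # stray end tag with no open element; ignored in this sketch
        name, container_closed = self.open_elements.pop(index)
        if container_closed:
            self.misnested.append(name)
        # Elements opened inside the one just closed, and still open,
        # have now lost their container.
        for entry in self.open_elements[index:]:
            entry[1] = True


def find_misnested_end_tags(markup: str) -> list:
    finder = MisnestedTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.misnested


print(find_misnested_end_tags("<b>foo <i>bar</b> baz</i>"))  # ['i']
```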
Attributes for an element are expressed inside the element’s start tag. Attributes have a name and a value.
There must never be two or more attributes on the same start tag whose names are a case-insensitive match for each other.
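The following list describes syntax rules for attributes in documents in the HTML syntax. Syntax rules for attributes in documents in the XML syntax are defined in the XML specification [XML].
- Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, """, "'", ">", "/", "=", the control characters, and any characters that are not defined by Unicode.
- XML-compatible attribute names are those that match the Name production defined in the XML specification [XML] and that contain no ":" characters, and whose first three characters are not a case-insensitive match for the string "xml".
In the HTML syntax, attributes can be specified in four different ways:
- empty attribute syntax
- unquoted attribute value syntax
- single-quoted attribute value syntax
- double-quoted attribute value syntax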
Certain attributes may be specified by providing just the attribute name, with no value.
In the following example, the disabled attribute is given with the empty attribute syntax:
<input disabled>
Note that empty attribute syntax is exactly equivalent to specifying the empty string as the value for the attribute, as in the following example.
<input disabled="">
An unquoted attribute value is specified by providing the following parts in exactly the following order:
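1. An attribute name.
2. Zero or more space characters.
3. A single "=" character.
4. Zero or more space characters.
5. An attribute value.
In addition to the general requirements for attribute values, an unquoted attribute value has the following restrictions:
- must not contain any literal space characters
- must not contain any """, "'", "=", ">", "<", or "`" characters
- must not be the empty string
In the following example, the value attribute is given with the unquoted attribute value syntax: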
<input value=yes>
If the value of an attribute using the unquoted attribute syntax is followed by a "/" character, then there must be at least one space character after the value and before the "/" character.
A single-quoted attribute value is specified by providing the following parts in exactly the following order:
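1. An attribute name.
2. Zero or more space characters.
3. A single "=" character.
4. Zero or more space characters.
5. A single "'" character.
6. An attribute value.
7. A "'" character.
In addition to the general requirements for attribute values, a single-quoted attribute value has the following restriction:
- must not contain any literal "'" characters
In the following example, the type attribute is given with the single-quoted attribute value syntax: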
<input type='checkbox'>
A double-quoted attribute value is specified by providing the following parts in exactly the following order:
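1. An attribute name.
2. Zero or more space characters.
3. A single "=" character.
4. Zero or more space characters.
5. A single """ character.
6. An attribute value.
7. A """ character.
In addition to the general requirements for attribute values, a double-quoted attribute value has the following restriction:
- must not contain any literal """ characters
In the following example, the title attribute is given with the double-quoted attribute value syntax: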
<code title="U+003C LESS-THAN SIGN">&lt;</code>
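The restrictions on the three value syntaxes can be summarized in a small sketch that picks a syntax a given value fits. It is illustrative only: all names are hypothetical, the set of space characters (space, tab, LF, FF, CR) and the rule that an unquoted value is non-empty and free of space characters are assumptions of the sketch, and the general requirements for attribute values (such as the rules on ampersands) are not checked.

```python
# Assumed set of space characters: space, tab, LF, FF, CR.
SPACE_CHARACTERS = set(" \t\n\f\r")
# Characters that an unquoted attribute value must not contain.
UNQUOTED_FORBIDDEN = set("\"'=><`")


def fits_unquoted(value: str) -> bool:
    return value != "" and not (set(value) & (SPACE_CHARACTERS | UNQUOTED_FORBIDDEN))


def fits_single_quoted(value: str) -> bool:
    return "'" not in value


def fits_double_quoted(value: str) -> bool:
    return '"' not in value


def serialize_attribute(name: str, value: str) -> str:
    """Write name=value using the first syntax whose restrictions the value meets."""
    if fits_unquoted(value):
        return f"{name}={value}"
    if fits_double_quoted(value):
        return f'{name}="{value}"'
    if fits_single_quoted(value):
        return f"{name}='{value}'"
    raise ValueError("value contains both quote characters; a character reference is needed")


print(serialize_attribute("value", "yes"))                    # value=yes
print(serialize_attribute("title", "U+003C LESS-THAN SIGN"))  # title="U+003C LESS-THAN SIGN"
```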
Text in element contents (including in comments) and attribute values must consist of Unicode characters, with the following restrictions:
Character data contains text, in some cases in combination with character references, along with certain additional restrictions. There are three types of character data that can occur in documents:
- normal character data
- replaceable character data
- non-replaceable character data
Certain elements contain normal character data. Normal character data can contain the following:
Normal character data has the following restrictions:
- must not contain any "<" characters
In documents in the HTML syntax, the title and textarea elements can contain replaceable character data. Replaceable character data can contain the following:
- text, optionally including "<" characters
Replaceable character data has the following restrictions:
- must not contain any occurrences of the string "</" followed by characters that are a case-insensitive match for the tag name of the element containing the replaceable character data (for example, "</title" or "</textarea"), followed by a space character, ">", or "/"
Replaceable character data, as described in this reference, is a feature of the HTML syntax that is not available in the XML syntax. Documents in the XML syntax must not contain replaceable character data as described in this reference; instead they must conform to all syntax constraints described in the XML specification [XML].
In documents in the HTML syntax, the script and style elements can contain non-replaceable character data. Non-replaceable character data can contain the following:
- text, optionally including "<" characters
Non-replaceable character data has the following restrictions:
- must not contain any occurrences of the string "</" followed by characters that are a case-insensitive match for the tag name of the element containing the non-replaceable character data (for example, "</script" or "</style"), followed by a space character, ">", or "/"
Non-replaceable character data, as described in this reference, is a feature of the HTML syntax that is not available in the XML syntax. Documents in the XML syntax must not contain non-replaceable character data as described in this reference; instead they must conform to all syntax constraints defined in the XML specification [XML].
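Character references are a form of markup for representing individual characters. There are three types of character references:
- named character references
- decimal numeric character references
- hexadecimal numeric character references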
Named character references consist of the following parts in exactly the following order:
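1. An "&" character.
2. One of the names listed in the “Named character references” section of the HTML5 specification [HTML5], using the same case.
3. A ";" character.
For further information about named character references, see [XML Entities].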
The following is an example of a named character reference for the character "†" (U+2020 DAGGER).
&dagger;
Decimal numeric character references consist of the following parts, in exactly the following order:
1. An "&" character.
2. A "#" character.
3. One or more digits in the range 0–9, representing a base-ten integer that itself is a Unicode code point that is not U+0000, U+000D, in the range U+0080–U+009F, or in the range 0xD800–0xDFFF (surrogates).
4. A ";" character.
The following is an example of a decimal numeric character reference for the character "†" (U+2020 DAGGER).
&#8224;
Hexadecimal numeric character references consist of the following parts, in exactly the following order:
1. An "&" character.
2. A "#" character.
3. Either an "x" character or an "X" character.
4. One or more digits in the ranges 0–9, a–f, and A–F, representing a base-sixteen integer that itself is a Unicode code point that is not U+0000, U+000D, in the range U+0080–U+009F, or in the range 0xD800–0xDFFF (surrogates).
5. A ";" character.
The following is an example of a hexadecimal numeric character reference for the character "†" (U+2020 DAGGER).
&#x2020;
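As an informal check, the three kinds of character reference for U+2020 DAGGER can be decoded with Python's standard `html.unescape`, which follows the HTML5 rules for character references. The helper `is_permitted_code_point` is a hypothetical name that restates the code-point exclusions listed for numeric references, taking the surrogate range as 0xD800 to 0xDFFF.

```python
import html


def is_permitted_code_point(code_point: int) -> bool:
    """Apply the exclusions listed for numeric character references."""
    if not 0 <= code_point <= 0x10FFFF:
        return False  # not a Unicode code point at all
    if code_point in (0x0000, 0x000D):
        return False
    if 0x0080 <= code_point <= 0x009F:
        return False
    if 0xD800 <= code_point <= 0xDFFF:
        return False  # surrogates
    return True


# Named, decimal, and hexadecimal references for the same character.
for reference in ("&dagger;", "&#8224;", "&#x2020;"):
    print(reference, html.unescape(reference) == "\u2020")  # True for all three

print(is_permitted_code_point(0x2020))  # True
print(is_permitted_code_point(0x0085))  # False
```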
Character references are not themselves text, and no part of a character reference is text.
An ambiguous ampersand is an "&" character followed by one or more characters in the range "0" to "9", the range "a" to "z", or the range "A" to "Z", followed by a ";" (semicolon) character, where these characters do not match any of the names given in the “Named character references” section of the HTML5 specification [HTML5].
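The definition above can be sketched as a search for name-like references whose names are unknown. This is an illustration under one stated assumption: Python's `html.entities.html5` table, whose keys include the trailing ";", stands in for the “Named character references” section of [HTML5]. The function name is hypothetical.

```python
import re
from html.entities import html5

# "&", then one or more ASCII letters or digits, then ";".
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def find_ambiguous_ampersands(text: str) -> list:
    """Return each "&name;" sequence whose name is not a known named reference."""
    return [
        match.group(0)
        for match in _CANDIDATE.finditer(text)
        if match.group(1) + ";" not in html5
    ]


print(find_ambiguous_ampersands("a &nosuchname; b &amp; c &dagger;"))  # ['&nosuchname;']
```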
SVG and MathML elements are elements from the SVG and MathML namespaces.
The math element from the MathML namespace and the svg element from the SVG namespace are allowed in documents wherever phrasing content is allowed.
SVG and MathML elements can be used both in documents in the HTML syntax and in documents in the XML syntax. Syntax rules for SVG and MathML elements in documents in the XML syntax are defined in the XML specification [XML]. The following list describes additional syntax rules that specifically apply to SVG and MathML elements in documents in the HTML syntax.
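- SVG and MathML elements whose start tags have a single "/" character before the closing ">" character are said to be marked as self-closing.
CDATA sections in SVG and MathML contents in documents in the HTML syntax consist of the following parts, in exactly the following order:
1. A "<![CDATA[" string.
2. Optionally, text, with the additional restriction that the text must not contain the string "]]>".
3. A "]]>" string.
CDATA sections are allowed only in the contents of elements from the SVG and MathML namespaces.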
The following shows an example of a CDATA section.
<annotation encoding="text/latex">
<![CDATA[\documentclass{article}
\usepackage{amsmath}
\begin{document}
The absolute value of $x$:
\[
\left|x\right|=
\begin{cases}-x& \text{if $x<0$}\\
x& \text{otherwise}\end{cases}
\]
\end{document}]]>
</annotation>
4.7. Comments
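Comments consist of the following parts, in exactly the following order:
1. A "<!--" string.
2. Optionally, text, subject to the restrictions listed below.
3. A "-->" string.
The text part of comments has the following restrictions:
- must not start with a ">" character
- must not start with the string "->"
- must not contain the string "--"
- must not end with a "-" character
The following is an example of a comment.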