Based on Character Model for the World Wide Web 1.0: Fundamentals [CharMod], this Architectural Specification provides authors of specifications, software developers, and content developers with a common reference on string identity matching on the Web. The goal of this specification is to improve interoperable text manipulation on the World Wide Web.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This version of this document represents a significant edit and replacement from the previous edition of Character Model. Much of the content is changed and the recommendations are significantly altered.
This is an updated W3C Working Draft of this document. The main difference from previous versions of this document is that it no longer proposes to rely exclusively on Early Uniform Normalization. Work to address the comments received during the second Last Call relevant to this document is still ongoing. We ask reviewers with outstanding comments to wait for us to finalise the disposition of their comments. A list of last call comments with their status can be found in the disposition of comments (public version, Members only version). Comments may be submitted by email to firstname.lastname@example.org (public archive).
This document is published as part of the W3C Internationalization Activity by the Internationalization Core Working Group. The Working Group expects to advance this Working Draft to Recommendation Status (see W3C document maturity levels).
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced under the 5 February 2004 W3C Patent Policy. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.
EDNOTE: This TOC is out of date and does not reflect document structure.
1.1 Goals and Scope
1.3 Terminology and Notation
3.1.1 Why do we need character normalization?
3.1.2 The choice of Normalization Form
3.2 Definitions for W3C Text Normalization
3.2.1 Normalizing Transcoder
3.2.2 Unicode-normalized text
3.2.3 Include-normalized text
3.2.4 Fully-normalized text
3.2.5 Normalization-sensitive operations
3.2.6 Text-processing component
3.2.7 Certified and suspect text
3.3.1 General examples
3.3.2 Examples of XML in a Unicode encoding form
3.3.3 Examples of restrictions on the use of combining characters
3.4 Responsibility for Normalization
4 String Identity Matching
The goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.
The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications. It defines conformance criteria for W3C specifications as well as other specifications.
Other audiences of this specification include software developers, content developers, and authors of specifications outside the W3C. Software developers and content developers implement and use W3C specifications. This specification defines some conformance criteria for implementations (software) and content that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in W3C specifications.
The character model described in this specification provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a globally accessible Web.
This part of the Character Model for the World Wide Web covers string identity matching in content, which includes considerations for case folding and the use of Unicode normalization.
At the core of the character model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [Unicode] and ISO/IEC 10646 [ISO/IEC 10646]. In this document, Unicode is used as a synonym for the Universal Character Set. The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world.
This section provides some historical background on the topics addressed in this specification.
Starting with Internationalization of the Hypertext Markup Language [RFC 2070], the Web community has recognized the need for a character model for the World Wide Web. The first step towards building this model was the adoption of Unicode as the document character set for HTML.
The choice of Unicode was motivated by the fact that Unicode:
is the only universal character repertoire available,
provides a way of referencing characters independent of the encoding of the text,
is being updated/completed carefully,
is widely accepted and implemented by industry.
W3C adopted Unicode as the document character set for HTML in [HTML 4.0]. The same approach was later used for specifications such as XML 1.0 [XML 1.0] and CSS2 [CSS2]. W3C specifications and applications now use Unicode as the common reference character set.
When data transfer on the Web remained mostly unidirectional (from server to browser), and where the main purpose was to render documents, the use of Unicode without specifying additional details was sufficient. However, the Web has grown:
Data transfers among servers, proxies, and clients, in all directions, have increased.
Non-ASCII characters [ISO/IEC 646] are being used in more and more places.
Data transfers between different protocol/format elements (such as element/attribute names, URI components, and textual content) have increased.
More and more APIs are defined, not just protocols and formats.
In short, the Web may be seen as a single, very large application (see [Nicol]), rather than as a collection of small independent applications.
While these developments strengthen the requirement that Unicode be the basis of a character model for the Web, they also create the need for additional specifications on the application of Unicode to the Web. Some aspects of Unicode that require additional specification for the Web include:
Choice of Unicode encoding forms (UTF-8, UTF-16, UTF-32).
Counting characters, measuring string length in the presence of variable-length character encodings and combining characters.
Duplicate encodings of characters (e.g. precomposed vs decomposed).
Use of control codes for various purposes (e.g. bidirectionality control, symmetric swapping, etc.).
It should be noted that such aspects also exist for legacy encodings (where legacy encoding is taken to mean any character encoding not based on Unicode), and in many cases have been inherited by Unicode in one way or another from such legacy encodings.
The remainder of this specification presents additional requirements to ensure an interoperable character model for the Web, taking into account earlier work (from W3C, ISO and IETF).
The first few chapters of the Unicode Standard [Unicode] provide very useful background reading. The policies adopted by the IETF on the use of character sets on the Internet are documented in [RFC 2277].
For information about the requirements that informed the development of important parts of this specification, see Requirements for String Identity Matching and String Indexing [CharReq].
For the purpose of this specification, the producer of text data is the sender of the data in the case of protocols, and the tool that produces the data in the case of formats. The recipient of text data is the software module that receives the data.
NOTE: A software module may be both a recipient and a producer.
Unicode code points are denoted as U+hhhh, where "hhhh" is a sequence of at least four, and at most six hexadecimal digits.
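The notation can be illustrated with a minimal sketch. The helper name format_code_point is an assumption made for this example and is not defined by this specification.

```python
def format_code_point(ch: str) -> str:
    """Return the U+hhhh notation for a single character (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Zero-pad to at least four hexadecimal digits; the largest code point,
    # U+10FFFF, needs six, so the result never exceeds six digits.
    return f"U+{ord(ch):04X}"


print(format_code_point("\u00E7"))      # U+00E7
print(format_code_point("\U0001F600"))  # U+1F600
```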
Characters have been used in various examples that will not appear as intended unless you have the appropriate font. Care has been taken to ensure that the examples nevertheless remain understandable.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].
NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and must be complied with unless there are specific reasons not to: "This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."
This specification places conformance criteria on specifications, on software and on Web content. To aid the reader, all conformance criteria are preceded by '[X]' where 'X' is one of 'S' for specifications, 'I' for software implementations, and 'C' for Web content. These markers indicate the relevance of the conformance criteria and allow the reader to quickly locate relevant conformance criteria by searching through this document.
Specifications conform to this document if they:
do not violate any conformance criteria preceded by [S],
document the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED,
make it a conformance requirement for implementations to conform to this document,
make it a conformance requirement for content to conform to this document.
Software conforms to this document if it does not violate any conformance criteria preceded by [I].
Content conforms to this document if it does not violate any conformance criteria preceded by [C].
NOTE: Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications.
Where this specification contains a procedural description, it is to be understood as a way to specify the desired external behavior. Implementations can use other means of achieving the same results, as long as observable behavior is not affected.
This chapter discusses how string matching and string search should be implemented in formal languages and content, and, consequently, how Unicode normalization and Unicode case folding affect such processing.
One important operation that depends on the use of consistent character sequences is string identity matching [CharReq], which is a subset of the more general problem of string matching. There are various degrees of specificity for string matching, from approximate matching such as regular expressions or phonetic matching, to more specific matches such as case-insensitive or accent-insensitive matching and finally to identity matching.
Examples of string identity matching abound: parsing element and attribute names in Web documents, matching CSS selectors to the nodes in a document, matching font names in a style sheet to the names known to the operating system, matching URI pieces to the resources in a server, matching strings embedded in an ECMAScript program to strings typed in by a Web form user, matching the parts of an XPath expression (element names, attribute names and values, content, etc.) to what is found in an XML instance, etc.
String identity matching can take several forms. The most trivial of these is performed by comparing two strings code unit-by-code unit (in multibyte encodings, this is byte-by-byte). The existence on the Web of multiple means of representing characters means that this process is actually less trivial than it appears. Binary comparison does not work if the strings are not in the same character encoding (e.g. an EBCDIC style sheet being directly applied to an ASCII document, or a font specification in a Shift_JIS style sheet directly used on a system that maintains font names in UTF-16). It also fails if the strings are in the same character encoding but the 'same' string contains variations allowed for by the use of combining characters or by the constructs of Web languages. Even when a consistent character encoding based on Unicode is used, defining string equality is hampered by the availability of multiple Unicode representations of the same canonical character or grapheme.
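The failure modes of binary comparison can be reproduced with a short sketch. The sample strings, and the choice of Python's cp037 codec as a stand-in for "EBCDIC", are assumptions made for illustration only.

```python
import unicodedata

# 1) Same text, different character encodings: the bytes differ.
name = "Times"
ebcdic_bytes = name.encode("cp037")   # cp037 is one EBCDIC code page
ascii_bytes = name.encode("ascii")
bytes_match = ebcdic_bytes == ascii_bytes
# Decoding both to Unicode code points first makes the comparison meaningful.
decoded_match = ebcdic_bytes.decode("cp037") == ascii_bytes.decode("ascii")

font = "\u660E\u671D"                 # a Japanese font name, for illustration
sjis_match = font.encode("shift_jis") == font.encode("utf-16-le")

# 2) Same encoding, different code point sequences for the 'same' string.
precomposed = "St\u00FCck"            # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "Stu\u0308ck"            # 'u' followed by U+0308 COMBINING DIAERESIS
same_code_points = precomposed == decomposed
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))

print(bytes_match, decoded_match, sjis_match)   # False True False
print(same_code_points, same_after_nfc)         # False True
```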
Incorrect string identity matching can have far-reaching consequences, including the creation of security holes. Consider a contract, encoded in XML, for buying goods: each item sold is described in a Stück element; unfortunately, "Stück" is subject to different representations in the character encoding of the contract. Suppose that the contract is viewed and signed by means of a user agent that looks for Stück elements, extracts them (matching on the element name), presents them to the user and adds up their prices. If different instances of the Stück element happen to be represented differently in a particular contract, then the buyer and seller may see (and sign) different contracts if their respective user agents perform string identity matching differently, which is fairly likely in the absence of a well-defined specification for string matching. The absence of a well-defined specification would also mean that there would be no way to resolve the ensuing contractual dispute.
Ideally identity would only occur if the compared strings contained no user-identifiable distinctions. That is, strings do not match when they differ in case or accentuation, but do match when they differ only in non-semantically significant ways such as character encoding, use of character escapes (of potentially different kinds), or use of precomposed vs. decomposed character sequences.
One main consideration in string identity matching is whether the comparison is case sensitive or case insensitive. Case sensitive matching is RECOMMENDED as the default for new specifications and formats. However, cases exist in which case-insensitivity is desirable.
Case sensitive matching compares Unicode code points and consists of the following steps:
Step 1 ensures 1) that the identity matching process can produce correct results using the next three steps and 2) that a minimum of effort is spent on solving the problem.
NOTE: The expansion of character escapes and includes (step 3 above) is dependent on context, i.e. on which markup or programming language is considered to apply when the string matching operation is performed. Consider a search for the string 'suçon' in an XML document containing su&#xE7;on but not suçon. If the search is performed in a plain text editor, the context is plain text (no markup or programming language applies), the &#xE7; character escape is not recognized, hence not expanded and the search fails. If the search is performed in an XML browser, the context is XML, the character escape (defined by XML) is expanded and the search succeeds.
An intermediate case would be an XML editor that purposefully provides a view of an XML document with entity references left unexpanded. In that case, a search over that pseudo-XML view will deliberately not expand entities: in that particular context, entity references are not considered includes and need not be expanded.
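The context dependence of escape expansion can be sketched as follows. The document string is hypothetical, and Python's html.unescape is used here only as a convenient stand-in for the escape expansion an XML processor would perform; it follows HTML rules, which agree with XML for numeric character references such as &#xE7;.

```python
import html

document = "un su&#xE7;on"      # hypothetical source text containing an escape
needle = "su\u00E7on"           # the search string, with U+00E7 as a literal

# Plain-text context: no character escapes are recognized, so none are expanded.
found_as_plain_text = needle in document

# Markup context: character escapes are expanded before matching.
found_as_markup = needle in html.unescape(document)

print(found_as_plain_text)  # False
print(found_as_markup)      # True
```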
Case-insensitive matching is sometimes useful in contexts where case may vary in a way that is not semantically meaningful or in which case distinctions cannot be controlled by the user. Case insensitive matching takes several forms:
One non-semantically significant distinction that can occur in Unicode text is variation in the character sequence due to the use of combining marks or compatibility characters. Version 1.0 of this document focused on defining Early Uniform Normalization as a means of ensuring that string matching processes on the Web would be straightforward, performant, and consistent. Ideally, Unicode normalization would be used to ensure that character sequences that Unicode defines as canonically equivalent are also treated as canonically equivalent ("match") when used on the Web.
However, since W3C specifications and implementations generally do not enforce Unicode normalization, it is up to content authors to supply text that uses a consistent representation in order to obtain predictable results. Using or applying a Unicode normalization form for text is one way to accomplish this.
When text is encoded in a computer system, it can often use one of many different character encodings. In addition, some character encodings allow multiple representations for the 'same' string. Web languages and formats also have escape mechanisms that introduce even more equivalent representations. For instance, in ISO 8859-1 the letter 'ç' can only be represented as the single character E7 'ç', but in a Unicode encoding it can be represented as the single character U+00E7 'ç' or the sequence U+0063 'c' U+0327 '¸'. In HTML it could additionally be represented as &#xE7; or even using the combining mark, as c&#x327; (and so forth).
The basis for most W3C document formats is text that includes some form of structural markup. This includes formats in diverse forms, such as HTML, CSS, JSON, RDF, and so forth. When processing document formats based on text, operations such as string matching, indexing, searching, sorting, regular expression matching, selection (and so forth) become sensitive to the different ways in which text might be represented in the document: the proper functioning of the Web (and of much other software) depends to a large extent on string matching.
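To make the variety of representations concrete, the following sketch collects several encodings of the letter 'ç' and shows that they reduce to two distinct code point sequences, and to a single one after NFC normalization. The particular HTML escapes listed are assumptions chosen for illustration.

```python
import html
import unicodedata

from_latin1 = b"\xE7".decode("iso-8859-1")   # ISO 8859-1 byte E7
precomposed = "\u00E7"                       # single code point
decomposed = "c\u0327"                       # 'c' + U+0327 COMBINING CEDILLA

# Illustrative HTML escapes: named, hexadecimal, decimal, and combining mark.
escapes = ["&ccedil;", "&#xE7;", "&#231;", "c&#x327;"]
expanded = [html.unescape(e) for e in escapes]

all_forms = [from_latin1, precomposed, decomposed, *expanded]
distinct_sequences = set(all_forms)
distinct_after_nfc = {unicodedata.normalize("NFC", s) for s in all_forms}

print(len(distinct_sequences))   # 2
print(len(distinct_after_nfc))   # 1
```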
Because the document formats, specifications, and implementations on the Web do not supply Unicode normalization of the content being exchanged or in their string matching algorithms, users and content must provide a consistent representation in order to avoid problems later. Since users can, on their own, have difficulty ensuring that content uses a consistent representation, implementations and tools should consider the difficulties users experience when visually and logically equivalent strings are not considered to match, and should provide a means for users to see the differences and/or canonicalize them as appropriate.
Given that there are many character sequences that content authors or applications could choose when inputting or exchanging text, what form is most appropriate for content on the Web? When providing text in a normalized form, there are different options for the normalization form to be used.
The Unicode Consortium provides four standard normalization forms (see Unicode Normalization Forms [UTR #15]). These forms differ in 1) whether they normalize towards decomposed characters (NFD, NFKD) or precomposed characters (NFC, NFKC) and 2) whether the normalization process erases compatibility distinctions (NFKD, NFKC) or not (NFD, NFC).
For use on the Web, it is important not to lose compatibility distinctions, which are often important to the content (see [UXML] Chapter 5 for a discussion). The NFKD and NFKC normalization forms are therefore excluded. Among the remaining two forms, NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software, is already in this form; NFC also has a slight compactness advantage and is a better match to user expectations with respect to the character vs. grapheme issue. This document therefore recommends, when possible, that all content be stored and exchanged in Unicode Normalization Form C (NFC).
NOTE: Roughly speaking, NFC is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character. Text in a Unicode encoding form is said to be in NFC if it doesn't contain any combining sequence that could be replaced and if any remaining combining sequence is in canonical order.
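The difference between the four normalization forms can be observed with Python's standard unicodedata module. This is a minimal sketch; the helper name forms and the two sample strings are assumptions made for the example.

```python
import unicodedata

def forms(text: str) -> dict:
    """Return the text in each of the four Unicode normalization forms."""
    return {name: unicodedata.normalize(name, text)
            for name in ("NFC", "NFD", "NFKC", "NFKD")}

# A combining sequence: the C forms compose it, the D forms leave it decomposed.
cedilla = forms("c\u0327")
# A compatibility character (U+FB01 LATIN SMALL LIGATURE FI): only the K forms
# erase the distinction, which is why NFKC and NFKD are excluded above.
ligature = forms("\uFB01")

for name in ("NFC", "NFD", "NFKC", "NFKD"):
    # Print the code points of each result in hexadecimal.
    print(name,
          [f"{ord(c):04X}" for c in cedilla[name]],
          [f"{ord(c):04X}" for c in ligature[name]])
```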
In the Web environment, where multiple character encodings are used to represent strings, including some character encodings which allow multiple representations for the same thing, it's important to establish a consistent process for evaluating string identity.
C3xx [S][I] Case sensitive string identity matching MUST be performed as if the following steps were followed:
To avoid unnecessary conversions and, more importantly, to ensure predictability and correctness, it is necessary for all components of the Web to use the same identity testing mechanism. Conformance to the rule that follows meets this requirement and supports the above definition of identity.
Where caseless matching is desired, there are several implementation choices that a formal language needs to consider. If the namespace of strings to be compared is limited to the Basic Latin (ASCII) subset of Unicode, ASCII-case-insensitivity can be used.
If the namespace of strings to be compared is not limited, then Unicode case folding should be applied. Unicode case folding can take several forms. Unicode defines the "common" (C) case foldings for "simple" 1:1 mappings of a character to a case folded form. However, some characters have a 1:many mapping. For compatibility with certain types of implementation, Unicode defines both a "simple" (S) case fold (that is always 1:1) and a "full" case fold.
Because the "simple" case fold mapping removes information that can be important to forming an identity match, the "Common plus Full" (or CF) case fold is preferred for Unicode case insensitive matching.
A namespace is considered to be "ASCII-only" if and only if all tokens and identifiers are defined by the specification directly and these identifiers or tokens use only the Basic Latin subset of Unicode. If user-defined identifiers are permitted, the full range of Unicode characters (limited, as appropriate, for security or interchange concerns, see UAX#xx) SHOULD be allowed and Unicode case insensitivity used for identity matching.
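The two kinds of caseless matching discussed above can be contrasted in a short sketch. The function names are hypothetical. Python's str.casefold performs Unicode full case folding and is used here as an approximation of the "Common plus Full" fold; note that it does not apply Unicode normalization.

```python
def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare after folding only the Basic Latin letters A-Z to a-z."""
    def fold(s: str) -> str:
        return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)
    return fold(a) == fold(b)

def unicode_caseless_equal(a: str, b: str) -> bool:
    """Compare after Unicode full case folding (1:many mappings included)."""
    return a.casefold() == b.casefold()

# ASCII-only tokens: both approaches agree.
print(ascii_case_insensitive_equal("DIV", "div"))           # True
# Non-ASCII text: the ASCII-only fold leaves U+00C9 and U+00E9 distinct.
print(ascii_case_insensitive_equal("\u00C9t\u00E9", "\u00E9t\u00E9"))  # False
print(unicode_caseless_equal("\u00C9t\u00E9", "\u00E9t\u00E9"))        # True
# A 1:many mapping: U+00DF folds to "ss" under the full case fold.
print(unicode_caseless_equal("STRASSE", "stra\u00DFe"))     # True
```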
These requirements pertain to content and are intended as guidelines for content authors.
[C] Text content SHOULD be stored and exchanged in Unicode Normalization Form C (NFC).
NOTE: In order to be processed correctly content must use a consistent sequence of code points to represent text. While content can be in any normalization form or may use a de-normalized (but valid) Unicode character sequence, inconsistency of representation will cause implementations to treat the different sequence as "different". The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC.
[I] Implementations which transcode text from a legacy encoding to a Unicode encoding form SHOULD use a normalizing transcoder that produces Unicode Normalization Form C (NFC).
[C] Content developers SHOULD avoid composing characters at the beginning of constructs that might be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.
[C] Identifiers in content SHOULD use consistent case (upper, lower, mixed case) to facilitate matching, even if case-insensitive matching is supported by the format or implementation.
These requirements pertain to specifications for document formats or programming/scripting languages and their implementations.
[S][I] Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of:
[S] Specifications of text-based formats and protocols MAY specify that the format or protocol requires content to be normalized using Unicode Normalization Form C (NFC).
Specifications are generally discouraged from including a normalization requirement unless there are specific, clear reasons why the additional requirement is necessary. As other content on the Web is not generally normalized, content authors may rely on denormalized character sequences and a normalization step could negatively affect such content.
NOTE: Specifying NFC requires additional care on the part of the spec developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content ought to be carefully considered and well specified in these cases.
[S] Specifications SHOULD NOT specify case-insensitive comparison of strings.
[S] Specifications that specify case-insensitive comparison for non-ASCII namespaces SHOULD specify Unicode C+F case folding.
In some limited cases, locale- or language-specific tailoring might also be appropriate. However, such cases are generally linked to natural language processing operations. Because they produce potentially different results from the generic case folding rules, these should be avoided in formal languages, where predictability is at a premium.
[S] Specifications that specify case-insensitive comparison for ASCII-only namespaces MAY specify ASCII-only case-insensitive comparison.
This requirement applies to formal languages whose keywords are all ASCII and which do not allow user-defined names or identifiers. An example of this is HTML, which defines the use of ACI comparison for element and attribute names defined by the HTML specification.
[S][I] Specifications and implementations MUST NOT specify ASCII-only case-sensitive or case-insensitive (ASC or ACI) forms for values or constructs that permit non-ASCII characters.
These requirements pertain to any specification that specifies normalization explicitly (which should include all new specifications):
[S] Specifications that do not normalize MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.
[S][I] Specifications and implementations MUST NOT assume that content is in any particular normalization form.
The normalization form or lack of normalization for any given content has to be considered intentional in these cases.
[S][I] For namespaces and values that are not restricted to Basic Latin (ASCII), case-insensitive matching MUST specify either UniCF or locale-sensitive string comparison.
[I] Implementations MUST NOT alter the normalization form of content being exchanged, read, parsed, or processed except when required to do so as a side-effect of transcoding the content to a Unicode character encoding, as content might depend on the de-normalized representation.
[S] Specifications MUST specify that string matching takes the form of "code point-by-code point" comparison of the Unicode character sequence, or, if a specific Unicode character encoding is specified, code unit-by-code unit comparison of the sequences.
For specifications of text-based formats and protocols that define Unicode Normalization as a requirement, the following requirements apply:
[S] Specifications of text-based formats and protocols that, as part of their syntax definition, require that the text be in normalized form MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC.
[S] [I] A normalizing text-processing component which receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
[I] A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.
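The two requirements above can be sketched with a pair of hypothetical helpers, assuming NFC as the normalized form and Python 3.8 or later (for unicodedata.is_normalized). The first confirms or restores normalization of suspect text; the second shows why a modification such as concatenation needs a normalization step even when both inputs are already normalized.

```python
import unicodedata

def certify(suspect_text: str) -> str:
    """Return text known to be in NFC, re-normalizing only when inspection fails."""
    if unicodedata.is_normalized("NFC", suspect_text):
        return suspect_text
    return unicodedata.normalize("NFC", suspect_text)

def append_normalized(left: str, right: str) -> str:
    """Concatenate, then behave as if normalization took place afterwards."""
    return unicodedata.normalize("NFC", left + right)

left, right = "c", "\u0327x"   # each string is in NFC on its own
naive = left + right           # plain concatenation is no longer in NFC

print(unicodedata.is_normalized("NFC", naive))       # False
print(append_normalized(left, right) == "\u00E7x")   # True
```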
[S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.
[I] Authoring tool implementations for a formal language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.
[S] Where operations can produce denormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.
[S] Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.
For use on the Web, this document defines Web-related text normalization forms by starting with Unicode Normalization Form C (NFC), and additionally addressing the issues of legacy encodings, character escapes, includes, and character and markup boundaries. Examples illustrating some of these definitions can be found in 3.3 Examples. This section defines various useful terms and concepts that can be applied to normalization of Web content, particularly when implementing more complex string processing that is normalization sensitive.
A normalizing transcoder is a transcoder that converts from a legacy encoding to a Unicode encoding form and ensures that the result is in Unicode Normalization Form C (see 3.2.2 Unicode-normalized text). For most legacy encodings, it is possible to construct a normalizing transcoder (by using any transcoder followed by a normalizer); it is not possible to do so if the encoding's repertoire contains characters not represented in Unicode.
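The construction "any transcoder followed by a normalizer" can be sketched as follows. The function name normalizing_transcode is hypothetical, and windows-1258 (Python codec cp1258) is chosen as the sample legacy encoding because it represents some accented letters with combining marks, so trivial transcoding does not yield NFC.

```python
import unicodedata

def normalizing_transcode(data: bytes, legacy_encoding: str) -> str:
    """Transcode legacy bytes to Unicode, then normalize the result to NFC."""
    text = data.decode(legacy_encoding)          # the plain transcoder
    return unicodedata.normalize("NFC", text)    # the normalizer

# Build sample legacy data: 'a' followed by a combining acute accent.
legacy_bytes = "a\u0301".encode("cp1258")

trivially_transcoded = legacy_bytes.decode("cp1258")
normalized = normalizing_transcode(legacy_bytes, "cp1258")

print(len(trivially_transcoded))   # 2 code points: 'a' + U+0301
print(len(normalized))             # 1 code point: U+00E1
```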
Text is, for the purposes of this specification, Unicode-normalized if it is in a Unicode encoding form and is in Unicode Normalization Form C, according to a version of Unicode Standard Annex #15: Unicode Normalization Forms [UTR #15] at least as recent as the oldest version of the Unicode Standard that contains all the characters actually present in the text, but no earlier than version 3.2 [Unicode 3.2].
Markup languages, style languages and programming languages often offer facilities for including a piece of text inside another. An include is an instance of a syntactic device specified in a language to include text at the position of the include, replacing the include itself. Examples of includes are entity references in XML, @import rules in CSS and the #include preprocessor statement in C/C++. Character escapes are a special case of includes where the included entity is predetermined by the language.
Text is include-normalized if:
NOTE: A consequence of this definition is that legacy text (i.e. text in a legacy encoding) is always include-normalized unless i) a normalizing transcoder cannot exist for that encoding (e.g. because the repertoire contains characters not in Unicode) or ii) the text contains character escapes or includes which, once expanded, result in un-normalized text.
NOTE: The specification of include-normalization relies on the syntax for character escapes and includes defined by the (computer) language in use. For plain text (no character escapes or includes) in a Unicode encoding form, include-normalization and Unicode-normalization are equivalent.
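As a rough illustration of the two notes above, the following sketch checks include-normalization for text in a Unicode encoding form. It assumes, based on the notes and on the examples in 3.3, that such text is include-normalized when it is Unicode-normalized both as written and after its character escapes are expanded. It handles only XML/HTML numeric character references; named entities and other includes are outside its scope. All function names are hypothetical.

```python
import re
import unicodedata

# Matches XML/HTML numeric character references such as &#x327; or &#807;.
_NUMERIC_ESCAPE = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def expand_numeric_escapes(text: str) -> str:
    """Replace each numeric character reference by the character it denotes."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    return _NUMERIC_ESCAPE.sub(replace, text)

def is_unicode_normalized(text: str) -> bool:
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

def is_include_normalized(text: str) -> bool:
    # Assumption: the text must be in NFC as written, and must still be in
    # NFC once its numeric character references have been expanded.
    return is_unicode_normalized(text) and is_unicode_normalized(
        expand_numeric_escapes(text)
    )
```

For plain text without escapes the second check is redundant, which matches the statement that include-normalization and Unicode-normalization coincide in that case.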
Formal languages define constructs, which are identifiable pieces, occurring in instances of the language, such as comments, identifiers, element tags, processing instructions, runs of character data, etc. During the normal processing of include-normalized text, these various constructs may be moved, removed (e.g. removing comments) or merged (e.g. merging all the character data within an element as done by the string() function of XPath), creating opportunities for text to become denormalized. The software performing those operations, or other software down the line that needs to perform normalization-sensitive operations, then has to re-normalize the result, which is a burden. One way to avoid such denormalization is to make sure that the various important constructs never begin with a character such that appending that character to a normalized string can cause the string to become denormalized.
A composing character is a character that is one or both of the following:
the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15], or
of non-zero canonical combining class as defined in [Unicode] .
Please consult Appendix B Composing Characters for a discussion of composing characters, which are not exactly the same as Unicode combining characters.
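The two-part definition of a composing character can be approximated with Python's `unicodedata` module, as in the sketch below. It rests on two assumptions that are not taken from this specification: that a character lies outside the composition exclusions exactly when NFC recomposes its canonical pair back into that character, and that the Hangul conjoining vowels and trailing consonants have to be added by hand because Hangul syllables decompose algorithmically. Results reflect the Unicode version bundled with the Python runtime, which is normally later than Unicode 3.2. The names are hypothetical.

```python
import sys
import unicodedata

def _second_characters_of_composites() -> frozenset:
    """Collect characters that are second in a non-excluded canonical pair."""
    seconds = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters without a mapping and compatibility mappings ("<tag> ...").
        if not decomp or decomp.startswith("<"):
            continue
        parts = [chr(int(p, 16)) for p in decomp.split()]
        # Singleton mappings have no second character.
        if len(parts) != 2:
            continue
        # Assumption: the character is not excluded from composition exactly
        # when NFC recomposes the pair into the original character.
        if unicodedata.normalize("NFC", "".join(parts)) == ch:
            seconds.add(parts[1])
    # Hangul vowels (U+1161..U+1175) and trailing consonants (U+11A8..U+11C2).
    seconds.update(chr(cp) for cp in range(0x1161, 0x1176))
    seconds.update(chr(cp) for cp in range(0x11A8, 0x11C3))
    return frozenset(seconds)

_COMPOSING_SECONDS = _second_characters_of_composites()

def is_composing_character(ch: str) -> bool:
    """True if ch has non-zero combining class or composes with a predecessor."""
    return unicodedata.combining(ch) != 0 or ch in _COMPOSING_SECONDS
```

Building the set scans every code point once at import time; a production implementation would more likely ship a precomputed table.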
Text is fully-normalized if:
the text is in a Unicode encoding form, is include-normalized and none of the constructs comprising the text begin with a composing character or a character escape representing a composing character; or
NOTE: Full-normalization is specified against the context of a (computer) language (or the absence thereof), which specifies the form of character escapes and includes and the separation into constructs. For plain text (which has no includes, no constructs, and no character escapes) in a Unicode encoding form, full-normalization and Unicode-normalization are equivalent.
Identification of the constructs that should be prohibited from beginning with a composing character (the relevant constructs) is language-dependent.
Although Unicode normalization is not required of W3C specifications, .... (health warning)
As specified in 3.4 Responsibility for Normalization, it is the responsibility of the specification for a language to specify exactly what constitutes a relevant construct. This may be done by specifying important boundaries, taking into account which operations would benefit the most from being protected against denormalization. The relevant constructs are then defined as the spans of text between the boundaries. At a minimum, for those languages which have these notions, the important boundaries are entity (include) boundaries as well as the boundaries between most markup and character data. Many languages will benefit from defining more boundaries and therefore finer-grained full-normalization constructs.
NOTE: In general, it will be advisable not to include character escapes designed to express arbitrary characters among the relevant constructs; the reason is that including them would prevent the expression of combining sequences using character escapes (e.g. 'q&#x30C;' for q-caron), which is especially important in legacy encodings that lack the desired combining marks.
NOTE: Full-normalization is closed under concatenation: the concatenation of two fully-normalized strings is also fully-normalized. As a result, a side benefit of including entity boundaries in the set of boundaries important for full-normalization is that the state of normalization of a document that includes entities can be assessed without expanding the includes, if the included entities are known to be fully-normalized. If all the entities are known to be include-normalized and not to start with a composing character, then it can be concluded that including the entities would not denormalize the document.
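By contrast, Unicode-normalization alone is not closed under concatenation, which is the situation full-normalization is designed to rule out. The short sketch below, which uses only Python's `unicodedata` module and hypothetical variable names, joins two strings that are each in NFC and shows that the result is not.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

head = "suc"
tail = "\u0327on"   # begins with U+0327 COMBINING CEDILLA, a composing character

both_parts_nfc = is_nfc(head) and is_nfc(tail)   # each part is fine on its own
joined_is_nfc = is_nfc(head + tail)              # 'c' + U+0327 should be U+00E7
renormalized = unicodedata.normalize("NFC", head + tail)
```

Because `tail` begins with a composing character it is not fully-normalized, so the closure property stated in the note does not apply to it.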
An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized. These operations are any that involve comparison of characters or character counting, as well as some other operations such as ‘delete first character’ or ‘delete last character’.
EXAMPLE: Consider the string normalisé, where the 'é' may be a single character (in NFC) or two. The following are three examples of normalization-sensitive operations involving this string.
Counting the number of characters may yield either 9 or 10, depending on the state of normalization.
Deleting the last character may yield either normalis or normalise (no accent).
Comparing two copies of the string matches if both are in the same state of normalization, but doesn't match otherwise.
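The three operations in this example can be reproduced directly. The sketch below uses Python, where string length and indexing operate on code points, so "character" here means code point; the variable names are illustrative only.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "normalis\u00e9")   # 'é' as U+00E9
decomposed = unicodedata.normalize("NFD", composed)          # 'e' + U+0301

# Counting characters depends on the normalization state.
counts = (len(composed), len(decomposed))

# Deleting the last character removes either 'é' or only the accent.
trimmed = (composed[:-1], decomposed[:-1])

# A code-point comparison fails across normalization states,
# and succeeds once both sides are brought to the same form.
naive_match = composed == decomposed
normalized_match = composed == unicodedata.normalize("NFC", decomposed)
```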
EXAMPLE: Examples of operations that are not normalization-sensitive are normalization, and the copying or deletion of an entire document.
A text-processing component is a component that recognizes data as text. This specification does not specify the boundaries of a text-processing component, which may be as small as one line of code or as large as a complete application. A text-processing component may receive text, produce text, or both.
In the following definitions, the word 'normalized' may stand for either 'include-normalized' or 'fully-normalized', depending on which is most appropriate for the specification or implementation under consideration.
Certified text is text which satisfies at least one of the following conditions:
it has been confirmed through inspection that the text is in normalized form
the source of the text (a text-processing component ) is known to produce only normalized text.
Suspect text is text which is not certified.
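One way a text-processing component might track these two states is to carry a flag alongside the text. The sketch below is only an illustration: the `Text` type and both functions are hypothetical, and 'normalized' is taken to mean Unicode-normalized plain text for simplicity.

```python
import unicodedata
from dataclasses import dataclass

@dataclass(frozen=True)
class Text:
    value: str
    certified: bool = False   # text is suspect unless shown otherwise

def certify_by_inspection(text: Text) -> Text:
    """First condition: inspect the text and certify it only if it is in NFC."""
    in_nfc = unicodedata.normalize("NFC", text.value) == text.value
    return Text(text.value, certified=in_nfc)

def normalizing_source(raw: str) -> Text:
    """Second condition: a source that only ever emits normalized text."""
    return Text(unicodedata.normalize("NFC", raw), certified=True)
```

A consumer can then skip re-checking any `Text` whose flag is set, and inspect or normalize everything else before performing normalization-sensitive operations.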
NOTE: To normalize text, it is in general sufficient to store the last seen character, but in certain cases (a sequence of combining marks) a buffer of theoretically unlimited length is necessary. However, for normalization checking no such buffer is necessary, only a few variables. C Resources for Normalization points to some compact code that shows how to check normalization without an expanding buffer.
In some of the following examples, '¸' is used to depict the character U+0327 COMBINING CEDILLA, for the purposes of illustration. Had a real U+0327 been used instead of this spacing (non-combining) variant, some browsers might combine it with a preceding 'c', resulting in a display indistinguishable from a U+00E7 'ç' and a loss of understandability of the examples. In addition, if the sequence c + combining cedilla were present, this document would not be include-normalized.
It is also assumed that the example strings are relevant constructs for the purposes of full-normalization.
suçon (U+0073 U+0075 U+00E7 U+006F U+006E) encoded in a Unicode encoding form, is Unicode-normalized, include-normalized and fully-normalized. The same string encoded in a legacy encoding for which there exists a normalizing transcoder would be both include-normalized and fully-normalized but not Unicode-normalized (since not in a Unicode encoding form).
In an XML or HTML context, the string su&#xE7;on is also include-normalized, fully-normalized and, if encoded in a Unicode encoding form, Unicode-normalized. Expanding &#xE7; yields suçon as above, which contains no replaceable combining sequence.
suc¸on (U+0073 U+0075 U+0063 U+0327 U+006F U+006E), where U+0327 is the COMBINING CEDILLA, encoded in a Unicode encoding form, is not Unicode-normalized (since the combining sequence 'c¸' (U+0063 U+0327) should appear instead as the precomposed 'ç' (U+00E7)). As a consequence this string is neither include-normalized (since in a Unicode encoding form but not Unicode-normalized) nor fully-normalized (since not include-normalized). Note however that the string sub¸on (U+0073 U+0075 U+0062 U+0327 U+006F U+006E) in a Unicode encoding form is Unicode-normalized since there is no precomposed form of 'b' plus cedilla. It is also include-normalized and fully-normalized.
In plain text the string suc&#x327;on is Unicode-normalized, since plain text doesn't recognize that &#x327; represents a character in XML or HTML and considers it just a sequence of non-replaceable characters.
In an XML or HTML context, however, expanding &#x327; yields suc¸on (U+0073 U+0075 U+0063 U+0327 U+006F U+006E) which is not Unicode-normalized ('c¸' is replaceable by 'ç'). As a consequence the string is neither include-normalized nor fully-normalized. As another example, if the entity &word-end; refers to an entity containing ¸on (U+0327 U+006F U+006E), then the string suc&word-end; is not include-normalized for the same reasons.
In an XML or HTML context, expanding &#x327; in the string sub&#x327;on yields the string sub¸on which is Unicode-normalized since there is no precomposed character for 'b cedilla' in NFC. This string is therefore also include-normalized. Similarly, the string sub&word-end; (with &word-end; as above) is include-normalized, for the same reasons.
In an XML or HTML context, the strings ¸on (U+0327 U+006F U+006E) and &#x327;on are not fully-normalized, as they begin with a composing character (after expansion of the character escape for the second). However, both are Unicode-normalized (if expressed in a Unicode encoding form) and include-normalized.
The following table consolidates the above examples. Normalized forms are indicated using 'Y'; a hyphen means 'not normalized'.
Here is another summary table, with more examples but limited to XML in a Unicode encoding form. The following list describes what the entities contain and special character usage. Normalized forms are indicated using 'Y'. There is no precomposed 'b with cedilla' in NFC.
"&ccedil;" LATIN SMALL LETTER C WITH CEDILLA
"&cedilla;" CEDILLA (combining)
"&c;" LATIN SMALL LETTER C
"&b;" LATIN SMALL LETTER B
"¸" CEDILLA (combining)
"/" (immediately before 'on' in last example) COMBINING LONG SOLIDUS OVERLAY
NOTE: From the last example in the table above, it follows that it is impossible to produce a normalized XML or HTML document containing the character U+0338 COMBINING LONG SOLIDUS OVERLAY immediately following an element tag, comment, CDATA section or processing instruction, since the U+0338 '/' combines with the '>' (yielding U+226F NOT GREATER-THAN). It is noteworthy that U+0338 COMBINING LONG SOLIDUS OVERLAY also combines with '<', yielding U+226E NOT LESS-THAN. Consequently, U+0338 COMBINING LONG SOLIDUS OVERLAY should remain excluded from the initial character of XML identifiers.
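The interaction described in this note is easy to observe. The sketch below, using Python's `unicodedata` module and illustrative variable names, normalizes a start tag immediately followed by U+0338 and shows that the tag's closing '>' is absorbed into a different character.

```python
import unicodedata

# U+0338 COMBINING LONG SOLIDUS OVERLAY placed right after the tag's '>'.
tag_then_overlay = "<b>" + "\u0338on"
normalized = unicodedata.normalize("NFC", tag_then_overlay)

# The same composition happens with '<'.
less_than_overlay = unicodedata.normalize("NFC", "<\u0338")
```

After normalization the text no longer contains a '>' at all, so a parser would not see a complete tag; the '>' and U+0338 have been composed into U+226F NOT GREATER-THAN.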
Include-normalization and full-normalization create restrictions on the use of combining characters. The following examples discuss various such potential restrictions and how they can be addressed.
Full-normalization prevents the markup of an isolated combining
mark, for example for styling it differently from its base character (
Benoi<span style='color: blue'>^</span>t, where '^' represents a combining circumflex). However,
the equivalent effect can be achieved by assigning a class to the accents in an
SVG font or using equivalent technology.
Full-normalization prevents the use of entities for expressing composing characters. This limitation can be circumvented by using character escapes or by using entities representing complete combining character sequences. With appropriate entity definitions, instead of A&acute;, write &Aacute; (or better, use 'Á' directly).
As specified in 3.2.4 Fully-normalized text, a composing character is any character that is
the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15], or
of non-zero canonical combining class (as defined in [Unicode]).
These two categories are highly but not exactly overlapping. The first category includes a few class-zero characters that do compose with a previous character in NFC; this is the case for some vowel and length marks in Brahmi-derived scripts, as well as for the modern non-initial conjoining jamos of the Korean Hangul script. The second category includes some combining characters that do not compose in NFC, for the simple reason that there is no precomposed character involving them. They must nevertheless be taken into account as composing characters because their presence may make reordering of combining marks necessary, to maintain normalization under concatenation or deletion. Therefore, composing characters as defined in 3.2.4 Fully-normalized text include all characters of non-zero canonical combining class plus the following (as of Unicode 3.2):
|09BE||া||BENGALI VOWEL SIGN AA|
|09D7||ৗ||BENGALI AU LENGTH MARK|
|0B3E||ା||ORIYA VOWEL SIGN AA|
|0B56||ୖ||ORIYA AI LENGTH MARK|
|0B57||ୗ||ORIYA AU LENGTH MARK|
|0BBE||ா||TAMIL VOWEL SIGN AA|
|0BD7||ௗ||TAMIL AU LENGTH MARK|
|0CC2||ೂ||KANNADA VOWEL SIGN UU|
|0CD5||ೕ||KANNADA LENGTH MARK|
|0CD6||ೖ||KANNADA AI LENGTH MARK|
|0D3E||ാ||MALAYALAM VOWEL SIGN AA|
|0D57||ൗ||MALAYALAM AU LENGTH MARK|
|0DCF||ා||SINHALA VOWEL SIGN AELA-PILLA|
|0DDF||ෟ||SINHALA VOWEL SIGN GAYANUKITTA|
|102E||ီ||MYANMAR VOWEL SIGN II|
|Hangul vowels (all code points from 1161 to 1175)|
|1161||ᅡ||HANGUL JUNGSEONG A|
|1175||ᅵ||HANGUL JUNGSEONG I|
|Hangul trailing consonants (all code points from 11A8 to 11C2)|
|11A8||ᆨ||HANGUL JONGSEONG KIYEOK|
|11C2||ᇂ||HANGUL JONGSEONG HIEUH|
NOTE: The characters in the second column of the above table may or may not appear, or may appear as blank rectangles, depending on the capabilities of your browser and on the fonts installed in your system.
The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions.
Asmus Freytag and, in early stages, Ian Jacobs provided significant help in the authoring and editing process of the first version of the Character Model and their work thus contributes to this document.
The previous version of this document was edited by: