Entity and Character References
June 2, 2004
XSLT stylesheet developers often ask how they can leave entity references in the source document unchanged as the stylesheet passes them to the result document. For example, they want an &nbsp; entity reference in the source document to still be &nbsp; in the result document. The usual answer is that you shouldn't need to do this, because the substitution of the entity values for the entity references shouldn't make any difference.
It shouldn't, if everyone always played by the rules, but not everyone does. I have my own schema for presentation slides, and I once wrote a stylesheet to convert the XML versions of my slides into HTML that Microsoft PowerPoint would recognize and import so that I could then save the slides as a binary PPT file. PowerPoint's import didn't treat code point 160 the same way it treated the entity reference &nbsp;, so I absolutely had to have the entity reference in the HTML created by my stylesheet.
In the March 2001 Transforming XML column, I explained why source document entity references can't be preserved in the transformation: the XML parser that an XSLT processor depends on to read in the source document (and the stylesheet itself) converts any entity references to their entity values as it reads them in, before it puts the source document in the source tree where the XSLT processor can get at it. Replacing entity references with entity values is part of the XML parser's job.
That column went on to demonstrate the use of the disable-output-escaping attribute to add an entity reference to the output when you absolutely must. The use of this attribute, like the use of CDATA sections, is usually a bit kludgy. Also, the example I gave in that column showed how to add an entity reference to the result tree, but it didn't show how to convert something from the source document into an entity reference.
Many new features in XSLT 2.0 respond to wishes expressed by XSLT developers since 1.0 was released. The wish to leave entity references alone can't be granted completely, because redefining an XML parser's responsibilities is outside of the scope of the W3C XSL Working Group's responsibilities. They have added a great new feature, though, called character maps, that makes conversion of specific source document characters to entity references (or to any strings you like) very simple.
A character map lets you tell the XSLT processor 'when this character is on its way to the result tree, put this string there instead.' If I'd had this when I was trying to create PPT files from the XML of my slide presentation, I could have used this to map code point 160 to the string '&nbsp;' in the HTML that I was preparing for import into PowerPoint. The following XSLT 2.0 stylesheet does this and maps three other characters to strings.
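A minimal sketch of such a stylesheet, pieced together from the description in this column rather than taken from a verified copy of the original listing. The map name sample is an assumption, as is the choice to enter the em dash literally.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Apply the character map named "sample" (assumed name) at serialization. -->
  <xsl:output use-character-maps="sample"/>

  <xsl:character-map name="sample">
    <!-- Non-breaking space (code point 160) becomes the entity reference. -->
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
    <!-- e with acute accent (code point 233) becomes a numeric character reference. -->
    <xsl:output-character character="&#233;" string="&amp;#233;"/>
    <!-- o with circumflex, entered literally instead of as a reference. -->
    <xsl:output-character character="ô" string="&amp;#244;"/>
    <!-- Em dash becomes a pair of hyphens. -->
    <xsl:output-character character="—" string="--"/>
  </xsl:character-map>

  <!-- Identity template: copy every node and attribute to the result tree. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```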
The stylesheet's single template rule does a verbatim copy of the source tree to the result tree. The new XSLT 2.0 parts are the xsl:character-map element, which defines the mappings to execute, and the use-character-maps attribute of the xsl:output element.
The xsl:output element is an old friend from XSLT 1.0 that has let us perform tasks such as setting an output file's encoding, including or omitting the XML declaration in the result document, and setting a DOCTYPE declaration for the result document. The new use-character-maps attribute lets you name one or more character maps, with their names separated by spaces, to use in converting characters bound for the result tree into alternative strings.
The ability to use more than one character map lets you group mappings into modules and mix and match them for different output media. For example, imagine a stylesheet that was nothing but xsl:character-map elements, each one being a set of character mappings for a particular purpose. Other stylesheets could use xsl:include to reference that file and then name which sets of mappings they wanted to use in the xsl:output element's use-character-maps attribute. For example:
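A sketch of that modular arrangement might look like the following two files. The file names (charmaps.xsl, slides2html.xsl) and the map names (html-entities, ascii-punctuation) are hypothetical, chosen only for illustration.

```xml
<!-- charmaps.xsl: a library stylesheet that holds nothing but character maps. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Mappings meant for HTML output. -->
  <xsl:character-map name="html-entities">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

  <!-- Mappings that reduce typographic punctuation to 7-bit ASCII. -->
  <xsl:character-map name="ascii-punctuation">
    <xsl:output-character character="&#8212;" string="--"/>
  </xsl:character-map>

</xsl:stylesheet>
```

```xml
<!-- slides2html.xsl: pulls in the library and picks the maps it wants. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:include href="charmaps.xsl"/>

  <!-- Space-separated list of the character maps to apply. -->
  <xsl:output method="html"
              use-character-maps="html-entities ascii-punctuation"/>

  <!-- Identity template standing in for the real transformation rules. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

A stylesheet aimed at a different output medium could include the same library file and name only one of the maps.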
The xsl:character-map element has the name attribute to store the name used to reference it from the xsl:output element's use-character-maps attribute and a collection of xsl:output-character child elements. Each of these children has a character attribute to identify the character to map and a string attribute storing the string to map it to. Because an XML parser that sees the string '&nbsp;' in any attribute value interprets it as a non-breaking space character and not as the entity reference string for this character, this must be written as '&amp;nbsp;' in the xsl:output-character element's string attribute value to show that we want the entity reference added to the result tree.
The first two xsl:output-character elements specify their character values with a numeric character reference. Remember, just because something begins with an ampersand and ends with a semicolon, that doesn't make it an entity reference. An entity is a name that is declared and referenced, whether it's predeclared like the lt or amp entities or declared in a DTD as one might do with nbsp or ntilde. A numeric character reference is a way to indicate a character by using its code point number; no declaration is necessary, so it's not a named unit of storage, and therefore not an entity.
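The distinction can be seen in a small document like this one, offered as an illustration; the doc and p element names and the sample text are assumptions.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- ntilde is an entity: a name with a declaration behind it. -->
  <!ENTITY ntilde "&#241;">
]>
<doc>
  <!-- Entity reference: only works because of the declaration above. -->
  <p>Espa&ntilde;a</p>
  <!-- Numeric character reference: code point 241, no declaration needed. -->
  <p>Espa&#241;a</p>
  <!-- Predeclared entities: usable in any XML document. -->
  <p>&lt; and &amp;</p>
</doc>
```

After parsing, the first two p elements hold identical text. Removing the ENTITY declaration would make the first one a well-formedness error while leaving the second untouched.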
For the third xsl:output-character element, instead of a numeric character reference, I just entered the character 'ô' in there to show which character I wanted mapped. The fourth xsl:output-character element maps the character for the punctuation mark known as the 'em dash'—a punctuation mark that I probably use too much—to a pair of hyphens, which is the traditional way to represent an em dash when all you have are 7-bit ASCII characters.
Of the four strings that this character map can add to the result tree, only &nbsp; is an entity reference. If the document created from the result tree is XML, a parser that reads that document will choke on '&nbsp;' if it never saw a declaration for it, so the stylesheet should include an xsl:output top-level element with a doctype-system attribute so that the result document points to a DTD with the appropriate declaration. If the result is HTML, though, this isn't necessary, because all web browsers understand the &nbsp; entity reference.
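For XML output, the xsl:output element might look something like this sketch. The DTD file name slides.dtd and the map name sample are hypothetical.

```xml
<!-- Top-level element in the stylesheet: doctype-system makes the serializer
     emit a DOCTYPE declaration that points at the named DTD. -->
<xsl:output method="xml"
            doctype-system="slides.dtd"
            use-character-maps="sample"/>
```

The DTD that the result document points to then needs a declaration for the entity that the character map writes out:

```xml
<!-- slides.dtd: declares the entity so that a parser can resolve it. -->
<!ENTITY nbsp "&#160;">
```

With this in place, the result document would begin with a DOCTYPE declaration of the form <!DOCTYPE doc SYSTEM "slides.dtd">, where doc stands for whatever the result's document element is named. A parser that reads the external DTD can then resolve the entity reference.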
The second xsl:output-character element almost looks like it's mapping '&#233;' to itself. It isn't, though; the XML parser that hands this stylesheet to the XSLT processor will turn the character attribute value into the character itself as it puts it on the source tree, and then the XSLT processor, as it reads the source tree and creates data for the result tree, will convert that character to the string specified in that xsl:output-character element's string attribute.
As a test, I ran the stylesheet on this source document:
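A source document along these lines would exercise all four mappings. It is a reconstruction from the description in this column, so the exact wording of the second p element is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<doc word="côté">
  <!-- Both accented vowels appear in an attribute value and in element content. -->
  <p>côté</p>
  <!-- One em dash pair and one non-breaking space (written here as &#160;). -->
  <p>An em dash—like this one—and a non-breaking&#160;space.</p>
</doc>
```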
I often use the word 'côté' as an example for English-speaking people who don't take accented characters seriously enough, because the presence or absence of each of these accents can turn it into three different words. Using Saxon 7, the stylesheet turns the document into this:
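As an illustration, assume a source document whose doc element has a word attribute of 'côté', a first p element containing the same word, and a second p element containing em dashes and a non-breaking space. With a character map that sends 'ô' and 'é' to numeric character references, the em dash to two hyphens, and code point 160 to the nbsp entity reference, the serialized result would look roughly like this. The XML declaration and whitespace vary by processor, and the text of the second p element is assumed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<doc word="c&#244;t&#233;">
  <p>c&#244;t&#233;</p>
  <p>An em dash--like this one--and a non-breaking&nbsp;space.</p>
</doc>
```

Note that this output is not well-formed XML on its own, because nothing declares the nbsp entity.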
The two vowels in 'côté' have been converted to numeric character references, both in the doc element's word attribute and in the content of the first p element. The em dash was converted to two hyphens, and the non-breaking space was converted to an entity reference for it.
Why would you need XSLT 2.0's character mapping feature? The most common problem it solves is the mangling of non-ASCII characters by processes that can't handle the encoding of a file that they're reading. The PowerPoint case above is one example. Confusion of some processes between the UTF-8 encoding, in which characters such as French vowels with accents are represented with two bytes, and Latin-1, which represents them with one byte, has often led to two strange bytes showing up in my browser window where I expected to see one accented vowel. (One more example as I finish up the last draft of this column: I pulled up the XHTML that I wrote into Microsoft Word to see if its spell checker would catch anything that I missed, and despite the XML declaration at the top of this file indicating that it's in UTF-8, Word thinks it's Latin-1, and shows the foreign characters and em dashes as two garbage bytes each.) Mapping these characters to their numeric character references, in which any Unicode code point can be represented using 7-bit ASCII (an ampersand followed by a pound sign, the code point number for the character, and a semicolon), can help to maintain the integrity of the representation all the way to the delivery application.
A straight mapping of a character to a 7-bit ASCII representation, as we saw with the em dash example, can also provide a compromise between a typographically slick representation of something and the possibility that garbage character(s) will appear in its place. Ultimately, XSLT 2.0 character mapping gives us more control over how our characters look and get represented, and it will be very handy.