Questions about Atom and text escaping

I spent part of this exciting Father’s Day Eve studying the Atom specification for a project I’m working on. The following questions arose. I make some conjectures below but I’m hoping some knowledgeable folks can chime in.
First, may content marked with mode="escaped" be entity encoded?
So far the examples I’ve seen that use mode="escaped", such as Movable Type’s default template, put their content in a CDATA section. In section 3.1.2 the Atom specification states that

  A mode attribute with the value “escaped” indicates that the element’s content is an escaped string.

but does not provide a definition or pointer for the term “escaped string”. Assuming the same usage as the XML specification, I believe that entity encoding is allowed so long as non-default entities are properly declared (the defaults are: amp, lt, gt, apos, quot). Is this correct?
Second, should mode="escaped" be interpreted as redundant information when combined with CDATA?
If I read the XML spec correctly, CDStart indicates unambiguously how to treat the text that follows. If mode="escaped" is redundant, then consuming code should take the text inside the CDATA section and run with it. However, if mode="escaped" is not redundant, consuming code should take the text inside the CDATA section and unescape it. But the latter approach seems too complicated so I’m pretty sure mode="escaped" is redundant when CDATA is used. Is this correct?
Any answers, pointers or clarifications would be greatly appreciated.

8 thoughts on “Questions about Atom and text escaping”

  1. I’m not an expert, but I sometimes understand when they try to explain things to me. Here’s how they’ve explained the two parts of that:

    CDATA versus entity escaping: the best way to think of it is that there is *absolutely no difference*. You aren’t interested in how it looks in an XML file, you are interested in what comes out of your XML parser, and what comes out of it is exactly the same thing, either way. It makes a difference when you are creating a feed, since you either escape less-than and ampersand, or you escape greater-than if it appears after two right square brackets, but other than that? No difference (pace using a regex to parse broken XML). Make yourself a couple of simple XML documents, one with CDATA and one with entity escaping, point your XML parser at them, and dump out the text that comes out: you’ll see it’s exactly the same either way (or, you’ll have a bug to report).

    Atom’s mode attribute talks about what you will get coming out of the parser (or, looked at another way, it’s “what I did to it before I put it in there”): mode="escaped" means that the stuff you get from that open tag until the close tag will be the type that the type attribute says, just like it comes out of the parser. If type="text/html", other than risky content issues, take the string your XML parser hands you, and throw it at a browser. If type="text/plain" then you need to escape the things you would escape while putting text into HTML (less-than and ampersands). If mode="xml" then unless you are using a DOM parser with innerXML or XPath, you have to stick what comes out of the parser back together to get something that’s the type it says it is: with a SAX parser, you look at the first element inside entry, usually dump it because it’s an html:body or an extra html:div to hang the namespace off, and from then on build up a string of the element names with pointy brackets around them and the character data between them, cursing all the while (not a big fan of processing inline XML, me). If mode="base64", you take the string your XML parser hands you, decode it, and you’ve got whatever the type attribute says you should have.
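The "make two documents and compare" experiment suggested above can be sketched in a few lines. This is only an illustration, assuming Python's standard `xml.etree.ElementTree` parser; the sample markup and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same HTML fragment, once with entity escaping and once inside CDATA.
ENTITY_DOC = (
    '<content type="text/html" mode="escaped">'
    "&lt;p&gt;Fish &amp; chips&lt;/p&gt;"
    "</content>"
)
CDATA_DOC = (
    '<content type="text/html" mode="escaped">'
    "<![CDATA[<p>Fish & chips</p>]]>"
    "</content>"
)

# The parser hands back plain character data in both cases.
entity_text = ET.fromstring(ENTITY_DOC).text
cdata_text = ET.fromstring(CDATA_DOC).text

print(entity_text)
print(cdata_text)
print(entity_text == cdata_text)
```

Both calls yield the string `<p>Fish & chips</p>`, so the application cannot tell which serialization the feed used.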


  2. Okay, so suppose I’m using an XML parser to process the content element of an Atom feed. If the mode attribute exists and is set to "escaped", I tell the parser to give me the unescaped payload. If mode="base64", I tell the parser to give me the raw or unescaped payload (doesn’t matter) and I decode it. If mode is absent or mode="xml", I tell the parser to hand me the raw payload (which must be valid XML). Next I look at the type attribute to decide how to render what I’ve got. Sound about right?

    If that’s true, mode="escaped" is not only not redundant with CDATA, it’s actually required for correct behavior.


  3. Close. “escaped” and “base64” mean your application will only receive Character Data. “xml” means your application will receive a mixture of XML elements and Character Data.

    With “base64”, you can immediately take one more step and convert the Character Data to octets.

    At this point, the “decoding” of the mode is complete and how you display is based on the type.

    If you are sending output to a browser, and type="text/html", you can send the content directly to the browser to be interpreted (noting security issues).

    If you are sending output to a browser, and type="text/plain", you must re-encode characters that the browser might interpret as markup, i.e., replacing '&' and '<'.

    If you are sending output to a text display, you’ll likely want to strip element markup from type="text/html", while displaying the characters directly for type="text/plain".

    Like Phil said, most XML parsers won’t tell you when the XML instance used CDATA or character entities to represent Character Data.
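The two stages described in this comment (first undo the mode, then decide how to display based on the type) might look roughly like the sketch below. It is a minimal illustration, not a conforming Atom processor: the function names `decode_content` and `render_for_browser` are hypothetical, the defaults (mode "xml", type "text/plain") are assumptions taken from this thread, and namespaces and security filtering are ignored.

```python
import base64
import html
import xml.etree.ElementTree as ET


def decode_content(element):
    """Undo the mode; return (type, payload). Hypothetical helper."""
    mode = element.get("mode", "xml")           # assumed default
    ctype = element.get("type", "text/plain")   # assumed default
    if mode == "escaped":
        # The parser has already resolved entities and CDATA.
        return ctype, element.text or ""
    if mode == "base64":
        # One more step: character data to octets.
        return ctype, base64.b64decode(element.text or "")
    if mode == "xml":
        # Stick the inline XML back together as a string.
        parts = [element.text or ""]
        for child in element:
            # tostring() also emits the text that follows the child.
            parts.append(ET.tostring(child, encoding="unicode"))
        return ctype, "".join(parts)
    raise ValueError("unknown mode: " + mode)


def render_for_browser(ctype, payload):
    """Display step for string payloads sent to a browser."""
    if ctype == "text/html":
        return payload  # real code would sanitize risky content first
    # text/plain: re-encode characters the browser would read as markup.
    return html.escape(payload, quote=False)


plain = ET.fromstring(
    '<content type="text/plain" mode="escaped">1 &lt; 2 &amp; 3</content>'
)
ctype, payload = decode_content(plain)
print(payload)
print(render_for_browser(ctype, payload))
```

Note that `html.escape` also replaces '>', which is harmless here but goes slightly beyond the two characters named above.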


  4. Maybe a less abstract situation would help: what (or what sort, what API) XML parser are you using? I only really use SAX, so “tell it to give me” doesn’t come into play: it gives, I deal with. Sounds like you might be using a DOM parser?


  5. Pingback: house of warwick
  6. I’m not parsing. Initially, as it turns out, I’ll be looking at things from a generator’s pov. As for the example above, which looks at the feed from a parser’s pov, it just provided a handy way to talk about how the spec is supposed to work. The revelation for me, so far, is that you really do need mode="escaped".
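From the generator's side, the two ways of writing an escaped content element could be sketched as follows. The helper names (`as_entity_escaped`, `as_cdata`, `content_element`) are invented for this illustration, and the output is built by string concatenation purely to keep the example short; a real generator would more likely use an XML writer.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_entity_escaped(markup):
    # escape() replaces '&', '<' and '>' with entity references.
    return escape(markup)


def as_cdata(markup):
    # "]]>" may not appear inside a CDATA section, so split it
    # across two adjacent sections.
    return "<![CDATA[" + markup.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def content_element(markup, use_cdata=False):
    body = as_cdata(markup) if use_cdata else as_entity_escaped(markup)
    return '<content type="text/html" mode="escaped">' + body + "</content>"


# Round trip: either form parses back to the original markup.
markup = "<p>a ]]> b & c</p>"
for use_cdata in (False, True):
    xml_text = content_element(markup, use_cdata)
    print(xml_text)
    print(ET.fromstring(xml_text).text == markup)
```

Either way the element carries mode="escaped", because that attribute describes what the consumer gets from the parser, not how the characters were written in the file.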


