Friday, March 25, 2011

Scala XML Gotchas

Scala's built-in XML support is perhaps flawed, but still offers very convenient syntax for simple XML manipulation. Even ignoring performance concerns and concurrency issues, there are still weird gotchas that the average user may need to deal with...

CDATA magically escaped

scala> val xml = <xml><test><![CDATA[a < b]]></test></xml>
xml: scala.xml.Elem = <xml><test>a &lt; b</test></xml>  <-- WTF?
Same when loading from a String:
scala> val xml = XML.loadString("<xml><test><![CDATA[a < b]]></test></xml>")
xml: scala.xml.Elem = <xml><test>a &lt; b</test></xml>
This is not what you want. The contents of a CDATA section are meant to be left alone. Instead, the CDATA wrapper is eaten and its contents are magically escaped. This causes lots of grief if the CDATA contains JavaScript, for example.
One workaround is to use the built-in ConstructingParser to load XML:
scala> import scala.io.Source
scala> import scala.xml.parsing.ConstructingParser
scala> val xml2 = ConstructingParser.fromSource(Source.fromString("<xml><test><![CDATA[a < b]]></test></xml>"), preserveWS = true).document.docElem
xml2: scala.xml.Node = <xml><test><![CDATA[a < b]]></test></xml>
Looks good.
You can also use <xml:unparsed> inside an XML literal. Check out this Scala XML FAQ for more.
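When the node is built in code rather than parsed, the same two ideas are available as the scala.xml node classes PCData and Unparsed. Here is a minimal sketch, assuming Scala 2 with the scala-xml library on the classpath. The object name CDataExamples and the script string are made up for illustration.

```scala
import scala.xml.{Elem, PCData, Unparsed}

object CDataExamples {
  // Hypothetical payload containing a character that XML would normally escape
  val script: String = "if (a < b) alert('hi');"

  // PCData serializes its text inside a CDATA section
  val viaPCData: Elem = <script>{PCData(script)}</script>

  // Unparsed writes its text verbatim, with no escaping at all
  val viaUnparsed: Elem = <script>{Unparsed(script)}</script>

  def main(args: Array[String]): Unit = {
    println(viaPCData)
    println(viaUnparsed)
  }
}
```

Note that Unparsed does no escaping whatsoever, so the serialized result is only well-formed XML if the text you hand it already is. PCData is probably the safer choice when the output has to be parsed again.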

XML Comments eaten

When loading XML from a string, XML comments disappear. Example:
scala> val looksGood = <xml><test><!-- comment --></test></xml>
looksGood: scala.xml.Elem = <xml><test><!-- comment --></test></xml>

scala> val wtf = XML.loadString("<xml><test><!-- comment --></test></xml>")
wtf: scala.xml.Elem = <xml><test></test></xml>
Again, ConstructingParser can fix this:
scala> val correct = ConstructingParser.fromSource(Source.fromString("<xml><test><!-- comment --></test></xml>"), preserveWS = true).document.docElem
correct: scala.xml.Node = <xml><test><!-- comment --></test></xml>
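If you end up doing this a lot, the ConstructingParser call can be wrapped in a small helper. This is just a sketch, assuming Scala 2 with scala-xml on the classpath. The names PreservingXml and load are hypothetical; the parser call itself is the same one used in the REPL sessions above.

```scala
import scala.io.Source
import scala.xml.Node
import scala.xml.parsing.ConstructingParser

object PreservingXml {
  // Hypothetical helper: parse a string with ConstructingParser so that
  // CDATA sections and comments are kept in the resulting tree.
  def load(text: String): Node =
    ConstructingParser
      .fromSource(Source.fromString(text), preserveWS = true)
      .document()
      .docElem

  def main(args: Array[String]): Unit = {
    // Both inputs are the ones from the examples in this post
    println(load("<xml><test><![CDATA[a < b]]></test></xml>"))
    println(load("<xml><test><!-- comment --></test></xml>"))
  }
}
```

Keeping the call in one place also makes it easier to swap in a different parser later without touching the rest of the code.</xml>").toString == "<xml><test><![CDATA[a < b]]></test></xml>")
assert(PreservingXml.load("<xml><test><!-- comment --></test></xml>").toString == "<xml><test><!-- comment --></test></xml>")
</test>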
There are some alternatives if you run into these issues.

  • As described above, use scala.xml.parsing.ConstructingParser to load XML
  • Use the Lift web framework's PCDataMarkupParser (extends Scala's built-in MarkupParser with various improvements)
  • Daniel Spiewak's Anti-XML project looks promising
  • Use any of the million Java XML parsers that are out there (but give up the convenient scala.xml syntax)
