Friday, March 25, 2011

Scala XML Gotchas

Scala's built-in XML support has its flaws, but it still offers a very convenient syntax for simple XML manipulation. Even ignoring performance concerns and concurrency issues, there are some weird gotchas that the average user may need to deal with.

CDATA magically escaped

scala> val xml = <xml><test><![CDATA[a < b]]></test></xml>
xml: scala.xml.Elem = <xml><test>a &lt; b</test></xml>  <-- WTF?
Same when loading from a String:
scala> val xml = XML.loadString("<xml><test><![CDATA[a < b]]></test></xml>")
xml: scala.xml.Elem = <xml><test>a &lt; b</test></xml>
This is not what you want. The contents of a CDATA section are meant to be left alone. Instead, the CDATA wrapper is eaten and its contents are escaped as ordinary text. This causes lots of grief if the CDATA holds JavaScript, for example.
One workaround is to load the XML with the built-in ConstructingParser (scala.xml.parsing.ConstructingParser, reading from a scala.io.Source):
scala> val xml2 = ConstructingParser.fromSource(Source.fromString("<xml><test><![CDATA[a < b]]></test></xml>"), preserveWS = true).document.docElem
xml2: scala.xml.Node = <xml><test><![CDATA[a < b]]></test></xml>
Looks good.
You can also use <xml:unparsed> inside an XML literal to emit text without any escaping. Check out this Scala XML FAQ for more.
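If you are building the XML in code rather than parsing it, a rough sketch of the same idea is to create the node yourself. The example below assumes the scala.xml classes PCData (serializes as a CDATA section) and Unparsed (the programmatic counterpart of <xml:unparsed>); the helper names cdataScript and rawScript are made up for illustration and are not part of any library.

```scala
import scala.xml.{Elem, PCData, Unparsed}

// Hypothetical helper names, for illustration only.

// Wraps the text in a CDATA section so "<" and "&" survive serialization.
def cdataScript(js: String): Elem = <script>{PCData(js)}</script>

// Emits the text exactly as given: no escaping and no CDATA wrapper.
def rawScript(js: String): Elem = <script>{Unparsed(js)}</script>

val js = "if (a < b) alert(1);"
val withCData = cdataScript(js)
val raw = rawScript(js)

println(withCData) // the CDATA wrapper should be kept
println(raw)       // the text should appear unescaped
```

Note that the Unparsed variant can produce output that is not well-formed XML (a bare < in this case), so PCData is probably the safer choice when the result has to be parsed again.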

XML Comments eaten

When loading XML from a string, XML comments disappear. Example:
scala> val looksGood = <xml><test><!-- comment --></test></xml>
looksGood: scala.xml.Elem = <xml><test><!-- comment --></test></xml>

scala> val wtf = XML.loadString("<xml><test><!-- comment --></test></xml>")
wtf: scala.xml.Elem = <xml><test></test></xml>
Again, ConstructingParser can fix this:
scala> val correct = ConstructingParser.fromSource(Source.fromString("<xml><test><!-- comment --></test></xml>"), preserveWS = true).document.docElem
correct: scala.xml.Node = <xml><test><!-- comment --></test></xml>
There are some alternatives if you run into these issues.

  • As described above, use scala.xml.parsing.ConstructingParser to load XML
  • Use the Lift web framework's PCDataMarkupParser (extends Scala's built-in MarkupParser with various improvements)
  • Daniel Spiewak's Anti-XML project looks promising
  • Use any of the million Java XML parsers that are out there (but give up the convenient scala.xml syntax)
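For the first option, the ConstructingParser call can be wrapped in a small helper so it becomes a drop-in replacement for XML.loadString. This is only a sketch: the name loadPreserving is hypothetical, and the behaviour it relies on is the one shown in the REPL sessions above.

```scala
import scala.io.Source
import scala.xml.Node
import scala.xml.parsing.ConstructingParser

// Hypothetical helper: parse a string while keeping whitespace,
// comments and CDATA sections, unlike XML.loadString.
def loadPreserving(xml: String): Node =
  ConstructingParser
    .fromSource(Source.fromString(xml), preserveWS = true)
    .document()
    .docElem

val doc = loadPreserving("<xml><test><!-- comment --><![CDATA[a < b]]></test></xml>")
println(doc)
```

The result is typed as Node rather than Elem, as in the REPL output above, so code that needs an Elem has to pattern match or cast.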

ssh client config: hosts

I'm slightly embarrassed that in more than a decade of daily ssh use I've never used the ssh client config to simplify connecting to commonly used hosts. The idea is that you can just type ssh foo rather than ssh -p 12345 fluffy@foo.blah-blah.on.ca. This is especially useful if you connect to a lot of EC2 hosts frequently and don't want to remember their ugly names (or set up DNS). It is best understood by example:

Contents of ~/.ssh/config:
host ec2-webserver
    hostname ec2-123-456-78-90.compute-1.amazonaws.com
    user root
    identityfile ~/my-ec2-key.pem
    compression yes
    protocol 2

host home
    hostname my.place.com
    port 51000
    user fluffy
    identityfile ~/.ssh/id_dsa
    ServerAliveInterval 15
    ServerAliveCountMax 4
    compression yes
    protocol 2

Once this is set up, you can simply type ssh ec2-webserver or ssh home rather than the full ssh command. There are a million other ssh client config options you can set, too (see man ssh_config). As expected, the other ssh tools, such as scp, honour these settings.