DOMDocument whitespace text nodes

Posted on Mar 31, 2008 in PHP

DOMDocument is a convenient way of manipulating an XML file. While this post is about PHP, I’ve come across some great parsers for use on the iPhone using libxml2, but we'll leave that for another day. One issue I ran into is that when loading an XML file, DOMDocument treats the tabs, spaces and newlines that make the XML readable as whitespace-only text nodes. This presents a problem when you try to traverse the DOM using properties like firstChild and nextSibling. For example:

Here is the XML file, “example.xml”:

<root>
    <tag>Value</tag>
</root>

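Ideally, to get the value within <tag>, we would want to do this:

$dom = new DOMDocument();
//Load the file, otherwise the document is empty
$dom->load("example.xml");
$tag = $dom->firstChild->firstChild->nodeValue;
echo $tag;
//Expected : "Value"
//Returned : whitespace only (the newline and indentation after <root>)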
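To see what the parser actually produced, you can list the children of the root element. This is only a small sketch: the inline string is an assumption standing in for example.xml (a <root> element wrapping an indented <tag>Value</tag>), and the labels in $kinds are made up for illustration.

```php
<?php
// Assumed stand-in for the contents of example.xml.
$xml = "<root>\n    <tag>Value</tag>\n</root>";

$dom = new DOMDocument();
$dom->loadXML($xml);

// Walk the children of <root> and record what kind of node each one is.
$kinds = [];
foreach ($dom->documentElement->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        // Whitespace between tags shows up here as a text node.
        $kinds[] = 'text (' . strlen($child->nodeValue) . ' chars)';
    } else {
        $kinds[] = 'element <' . $child->nodeName . '>';
    }
}

echo implode(', ', $kinds), PHP_EOL;
```

With the default settings, <root> ends up with three children rather than one: a whitespace text node, the <tag> element, and another whitespace text node.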

This happens because the whitespace between the <root> and <tag> tags is treated as a text node, so firstChild selects that text node instead of <tag>. This is annoying, and I spent some time looking, to no avail, for a setting that would ignore this “pretty” (but useless) whitespace. So to get around this, I just used a good old regular expression to trim the whitespace between tags:

//Put the XML file into a string
$XmlFileText = file_get_contents("example.xml");
//Replace the whitespace between tags with nothing (note the backslash in \s)
$XmlFileText = preg_replace('/>\s+</', "><", $XmlFileText);
//Load the trimmed string, then do some DOM magic
$dom = new DOMDocument();
$dom->loadXML($XmlFileText);
$tag = $dom->firstChild->firstChild->nodeValue;
echo $tag;
//Expected : "Value"
//Returned : "Value"

What happens here is that you read the file into a string first, trim the whitespace between tags, and then load the string into the DOMDocument object. Now you can traverse without worrying about the annoying whitespace! It would make sense to extend DOMDocument, or to create a function that does this every time you create a new document, but this is pretty much what you’d have to do.
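The helper function idea could be sketched roughly like this. The name loadXmlWithoutWhitespace is made up for illustration, and the inline string is an assumption standing in for example.xml. For comparison, the last few lines try DOMDocument's own preserveWhiteSpace property, which, when set to false before loading, asks the parser to drop whitespace-only text nodes and may make the regex unnecessary in simple cases like this one.

```php
<?php
// Hypothetical helper, not part of PHP: strips whitespace between tags,
// then loads the result into a new DOMDocument.
function loadXmlWithoutWhitespace(string $xml): DOMDocument
{
    $trimmed = preg_replace('/>\s+</', '><', $xml);
    $dom = new DOMDocument();
    $dom->loadXML($trimmed);
    return $dom;
}

// Assumed stand-in for the contents of example.xml.
$xml = "<root>\n    <tag>Value</tag>\n</root>";

// Regex approach, wrapped in the helper.
$viaRegex = loadXmlWithoutWhitespace($xml)->firstChild->firstChild->nodeValue;

// Alternative: let the parser skip blank nodes. Must be set before loading.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadXML($xml);
$viaSetting = $dom->firstChild->firstChild->nodeValue;

echo $viaRegex, PHP_EOL;
echo $viaSetting, PHP_EOL;
```

One caveat with the regex: it is blunt, so it also removes whitespace between tags in mixed content, and it would touch any "> <" sequence that happens to appear inside a CDATA section. For documents like that, the parser setting is probably the safer choice.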