Processing XML with a single PHP call - Generic XML parser class package blog

XML is a Pain but we still Need to Deal with it

PHP XML support

XML Parser class Solution to Validate and Extract Data in a Single Call

Implementing Custom Validation Rules

Conclusion

XML is a Pain but we still Need to Deal with it

XML is a format created in the late 1990s for applications that need to interchange information in a form that is human-readable and independent of the operating systems on which the applications run. XML resembled HTML, which everybody already knew, but imposed stricter structure rules.

Many data formats were created around XML. However, XML is still a pain to write manually. Personally, one of the things I find most painful is having to write every tag name twice: once to open the tag and once to close it.

A few years ago there was an attempt to define a new specification for the eventual XML 2.0. That attempt seems to have died. I recall that I proposed to allow tags to have a short close notation to avoid repeating the tag name. It would be like this:

<tag>data</>

The proposal was rejected. It seems that the organizers had no interest in removing one of the XML pain points. Given that, I would not be surprised if XML 2.0 is really dead.

Nowadays, developers tend to use simpler formats like YAML and JSON to exchange data in a human-readable format. Ten years ago they would have used XML instead.

Despite the change in the mentality of the developers, once in a while we still need to deal with XML for some reason.

For instance, a few years ago I needed to develop a client and a server for the OpenID protocol. That is a single sign-on protocol, i.e. it allows you to log in to multiple sites using the same account. It was used to let users log in to the JSClasses site with the same account they use on the PHPClasses site.

OpenID itself is yet another story of a protocol design that is a pain to implement, but I will leave that story for a later article after I publish the OpenID classes that I developed.

What matters is that OpenID uses the XRDS protocol, which is meant to allow clients to discover the addresses and the features supported by the servers they need to access. XRDS is based on XML too.

Another situation in which XML may be better suited than simpler formats is when you need to represent entities that do not map to the data types of your programming language. Otherwise JSON would probably be a better choice.

For instance, when you need to represent templates with special placeholders, you can use special XML tags to represent the placeholders. This is not a common situation, but when it comes up XML may still be a good option.

PHP XML support

Nowadays PHP has so many extensions to deal with XML that it is hard to figure out which one to use for each purpose.

There are several extensions for parsing XML, but your applications do not just need to parse the documents. They also need to determine if the documents are valid according to what is expected, as well as to extract data from the documents.

And when I say that you need to determine if the document is valid, I do not mean just checking that the document is well-formed, but also that the values in the document are valid according to application-specific rules.

The DOMDocument extension provides functions for validating XML documents based on a DTD or on a RelaxNG definition (which is itself another XML format). Having to write a DTD or another XML format just to validate is like dealing with pain with more pain. This is certainly not something that developers (or anybody else) would enjoy.
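To make the gap between DTD validity and application validity concrete, here is a minimal sketch using an inline DTD. The document content is made up for illustration, and the age rule is only an example of an application-specific check.

```php
<?php
// Illustrative document with an inline DTD. The DTD can say that <age>
// contains text, but not that the text must be an integer of at least 18.
$xml = <<<'XML'
<!DOCTYPE person [
  <!ELEMENT person (name, address+, age)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT age (#PCDATA)>
]>
<person>
  <name>Some Name</name>
  <address>Address line 1</address>
  <age>twelve</age>
</person>
XML;

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$well_formed = $document->loadXML($xml);

// validate() checks the document against its DTD.
$valid = $well_formed && $document->validate();

// Application rule that the DTD cannot express: age is an integer >= 18.
$age = $document->getElementsByTagName('age')->item(0)->textContent;
$age_ok = preg_match('/^\d+$/', $age) === 1 && (int) $age >= 18;

var_dump($valid, $age_ok);
```

The document passes DTD validation even though the age value is not usable, so the application check still has to be written by hand.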

For extracting data from XML documents, the original Expat-based XML parser extension has the xml_parse_into_struct function, which fills a flat array with the XML document elements. That is not a very straightforward way to extract data from XML documents.
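As a rough illustration of why that array is awkward to work with, the sketch below parses a small made-up document and prints each entry. The document is an assumption for the example; the functions are the standard ones from the PHP XML parser extension.

```php
<?php
// Made-up sample document, written without whitespace between tags.
$xml = '<person><name>Some Name</name><age>33</age></person>';

$parser = xml_parser_create();
// Keep tag names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
// $values and $index are filled by reference.
xml_parse_into_struct($parser, $xml, $values, $index);

// $values is a flat list of "open", "complete" and "close" entries.
// The nesting has to be rebuilt from the 'level' field of each entry.
foreach ($values as $entry) {
    printf(
        "%-8s %-9s level %d%s\n",
        $entry['tag'],
        $entry['type'],
        $entry['level'],
        isset($entry['value']) ? ' value "' . $entry['value'] . '"' : ''
    );
}
```

Each entry only knows its tag, type and depth, so any code that wants "the age of the person" has to walk the list and track the parent tags itself.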

Newer PHP XML extensions can return XML documents as nested objects. While this is a more useful way to browse data in XML documents, you still have to write a lot of custom code to process the documents in the XML format you need.

The DOMDocument extension provides XPath support. This means that you can search and extract document values if you can anticipate their location in the document. This usually makes things even more complicated than they already are. It also does not avoid the need to write custom code to validate and extract data from your XML documents.
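For comparison, this is roughly what XPath extraction looks like with DOMXPath. It is a sketch over a made-up document, and the variable names and the age rule are only illustrative.

```php
<?php
// Made-up sample document.
$xml = '<person><name>Some Name</name><address>Address line 1</address>'
     . '<address>Address line 2</address><age>33</age></person>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// query() returns a node list, so repeated tags need a loop.
$addresses = array();
foreach ($xpath->query('/person/address') as $node) {
    $addresses[] = $node->textContent;
}

// evaluate() can return a scalar for expressions such as string().
$name = $xpath->evaluate('string(/person/name)');
$age = $xpath->evaluate('string(/person/age)');

// XPath locates the values, but every type and range check is still manual.
$errors = array();
if (preg_match('/^\d+$/', $age) !== 1 || (int) $age < 18) {
    $errors[] = 'age must be an integer of at least 18';
}

var_dump($name, $addresses, (int) $age, $errors);
```

Every value needs its own expression and its own checks, which is the custom code the paragraph above refers to.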

XML Parser class Solution to Validate and Extract Data in a Single Call

Development of this XML parser class started a long time ago, in 1999, in the PHP 3 days, when only the Expat-based PHP XML parser extension existed.

What it does is parse an XML document and build a single associative array that contains all the document elements.

It does not build a hierarchical data structure of nested elements, as some approaches do, because that makes it impossible to access any document element immediately.

This class uses an approach that consists of encoding the path of each element as a string of comma-separated numbers. Each number is the position of an element inside its parent element. So, the root tag element has the path '0'. The first child element of the root tag has the path '0,0'. The second child element path is '0,1', and so on.
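The path scheme can be sketched independently of the class. The helper below, index_by_path, is hypothetical and not part of the class. It uses DOMDocument and counts only element children, which is a simplifying assumption; the real class may number other kinds of nodes as well.

```php
<?php
// Hypothetical helper, not part of xml_parser_class: records the tag name
// of every element in one flat array keyed by its comma-separated path.
function index_by_path(DOMElement $element, string $path, array &$index): void
{
    $index[$path] = $element->tagName;
    $position = 0;
    foreach ($element->childNodes as $child) {
        // Assumption: only element nodes are counted in the path.
        if ($child instanceof DOMElement) {
            index_by_path($child, $path . ',' . $position, $index);
            $position++;
        }
    }
}

$xml = '<person><name>Some Name</name><address>Address line 1</address>'
     . '<address>Address line 2</address><age>33</age></person>';

$document = new DOMDocument();
$document->loadXML($xml);

$index = array();
index_by_path($document->documentElement, '0', $index);

// Any element is now reachable with a single array lookup by its path.
echo $index['0'], ' ', $index['0,0'], ' ', $index['0,3'], "\n";
```

With a flat index like this, looking up the fourth child of the root is a single array access with the key '0,3' instead of a walk through nested objects.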

Over the years I used this class to parse many types of XML formats. I always had to write custom code to traverse this XML document structure array, validate the data as I expected it to be in the correct format, and extract the data values.

This is still a tedious approach but fortunately I realized there was a pattern in the code that I have written around this parser class to validate and extract data from all XML formats that I tried. Therefore, I decided to build a more generic solution to do it all in a single call.

This solution consists of passing an array with the meta-data about the structure of the XML document that I expect. I pass it to a function of the class and, if the document is valid, the function returns an array with all the extracted document data. It cannot be simpler than this.

Let me give you a practical example. Let's say you want to parse a simple XML file named person.xml with a format like this:


 <person>
  <name>Some Name</name>
  <address>Address line 1</address>
  <address>Address line 2</address>
  <age>33</age>
 </person>

To process this XML document you need to implement two simple steps: 1) parsing and 2) extracting the data.

Parsing is simple. You just need to call the class function named ParseFile. You can also use the ParseData function to parse an XML document from a string.


 $file_name = 'person.xml';
 $parser = new xml_parser_class;
 $error = $parser->ParseFile($file_name);
 if(strlen($error))
  die('Error while parsing the file '.$file_name.': '.$error);

Validating and extracting the data from the XML document is done in a single call. You need to pass a nested array that defines the structure of the XML that you expect starting from the root tag.


 $types = array(
   'person'=>array(
     'type'=>'hash',
     'types'=>array(
       'name'=>array(
         'type'=>'text'
       ),
       'address'=>array(
         'type'=>'text',
         'maximum'=>'*'
       ),
       'age'=>array(
         'type'=>'integer',
         'minimumvalue'=>'18'
       )
     )
   )
 );

Let me explain this in more detail:


 $types = array(
   'person'=>array(

You will want to return tags with child elements as associative arrays, so you need to set the 'type' parameter for the root tag to 'hash'.


     'type'=>'hash',

The 'types' parameter must be specified to define the types of the child element tags.


     'types'=>array(

For each child element tag there needs to be an entry that defines the type of element, as well as other type specific parameters. Several built-in basic types are supported: 'integer', 'text', 'date' and 'decimal'.

The types 'hash' and 'array' represent alternative forms to return child tags of a parent tag.

A special type named 'path' can be used to return the path of the element rather than its value.


       'name'=>array(
         'type'=>'text'
       ),

A child tag element may appear multiple times within its parent tag. The 'minimum' and 'maximum' parameters tell the class how many times the tag can appear. Both default to 1. A minimum of 0 means the tag is optional. A maximum of '*' means there is no maximum.


       'address'=>array(
         'type'=>'text',
         'maximum'=>'*'
       ),

Each type may have specific parameters. In the case of the 'age' tag, I can require that the minimum value be 18.


       'age'=>array(
         'type'=>'integer',
         'minimumvalue'=>'18'
       )
     )
   )
 );

Now you just need to call the ExtractElementData function to validate and extract the parsed data all at once.


 $start_element_path = ''; // start from the root element
 $type_of_document = 'person'; // used just in error messages
 $hash = true; // return the results in an associative array

 $error = $parser->ExtractElementData(
   $start_element_path,
   $type_of_document,
   $types,
   $hash,
   $person);

 if(strlen($error))
 {
   die('Error while extracting parsed data from the file '.
     $file_name.': '.$error);
 }

 var_dump($person);

This script would output something like this:


 array(1) {
   ["person"]=>
   array(3) {
     ["name"]=>
     string(9) "Some Name"
     ["address"]=>
     array(2) {
       [0]=>
       string(14) "Address line 1"
       [1]=>
       string(14) "Address line 2"
     }
     ["age"]=>
     int(33)
   }
 }

As you can see, it is very easy to define the XML structure you want to parse and extract data from the document.

You can define more complex structures with nested tags by setting the 'type' parameter to 'hash' on the definition of each tag that has child tags, and defining the parameters of those child tags with its 'types' parameter.
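As a sketch of such a nested definition, the array below adds an optional tag with its own child tags to the person format. The 'contact' and 'email' tag names are made up for this example. Only parameters already described above are used, and I am assuming that 'minimum' can be combined with 'hash' in the same way as with the basic types.

```php
<?php
// Sketch of a nested definition. The 'contact' and 'email' tags are
// hypothetical and only illustrate the nesting.
$types = array(
  'person'=>array(
    'type'=>'hash',
    'types'=>array(
      'name'=>array(
        'type'=>'text'
      ),
      'contact'=>array(
        'type'=>'hash',     // this tag has child tags of its own
        'minimum'=>0,       // the whole tag is optional
        'types'=>array(
          'email'=>array(
            'type'=>'text',
            'maximum'=>'*'  // any number of email tags
          )
        )
      )
    )
  )
);

// The definition mirrors the nesting of the expected document.
var_dump(array_keys($types['person']['types']['contact']['types']));
```

A document matching this definition could carry a contact tag with several email tags inside person, or omit the contact tag entirely.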

Implementing Custom Validation Rules

As you may have noticed in this example, the processed document uses only basic data types that can be validated with just the supported built-in rules.

More common types and validation rules may be added in the future. However, in more realistic XML documents you may always need to implement validation rules that are not supported by the class and probably will never be because those rules are usually very application specific.

The way to address that problem is to create a sub-class and override the function ValidateElementData. That function exists specifically for the purpose of implementing custom validation rules.

Let's say you wanted to process an XML document with multiple people, like the one below, and you needed to disallow processing several people with the same name.


 <people>

  <person>
   <name>Some Name</name>
   <address>Address line 1</address>
   <address>Address line 2</address>
   <age>33</age>
  </person>

  <person>
   <name>Some Name</name>
   <address>Address line 1</address>
   <address>Address line 2</address>
   <age>66</age>
  </person>

 </people>

First you need to create a sub-class of xml_parser_class that implements the function ValidateElementData. Then you need to create a parser object of the sub-class that you defined.


 class my_custom_xml_parser_class extends xml_parser_class
 {
   // keep track of the names of the people
   var $people = array();

   Function ValidateElementData($validation, $path, &$value,
     &$result)
   {
     switch($validation)
     {
       case 'unique person':
         // check if another person was set with the same name
         if(IsSet($this->people[$value]))
         {
           // set the error entry of the return result parameter
           $result['error'] =
             'multiple people were defined with the name "'.
             $value.'"';
         }
         else
         {
           $this->people[$value] = $path;
         }
         // return an empty string when there was no error
         return('');
       default:
         return($validation.
           ' is not a supported type of validation');
     }
   }
 };

 $parser = new my_custom_xml_parser_class;

Then you would define the types to make the class call the new sub-class to validate each person name with this custom code. Notice the 'validation' parameter set to 'unique person' for the 'name' tag.


 $types = array(
   'people'=>array(
     'type'=>'hash',
     'types'=>array(
       'person'=>array(
         'type'=>'hash',
         'maximum'=>'*',
         'types'=>array(
           'name'=>array(
             'type'=>'text',
             'validation'=>'unique person'
           ),
           'address'=>array(
             'type'=>'text',
             'maximum'=>'*'
           ),
           'age'=>array(
             'type'=>'integer',
             'minimumvalue'=>'18'
           )
         )
       )
     )
   )
 );
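To see how the uniqueness rule behaves for the two-person document above, the validation function can be exercised on its own. This is only a sketch. The empty xml_parser_class declared here is a stand-in so that the code runs without the package and must be removed when the real class is loaded. The class name unique_name_parser_sketch and the element paths are illustrative assumptions.

```php
<?php
// Stand-in for the real class so this sketch runs by itself.
// Remove it when the actual xml_parser_class is loaded.
class xml_parser_class
{
}

class unique_name_parser_sketch extends xml_parser_class
{
    // names seen so far, mapped to the path where each was first found
    public $people = array();

    function ValidateElementData($validation, $path, &$value, &$result)
    {
        if ($validation !== 'unique person') {
            return $validation . ' is not a supported type of validation';
        }
        if (isset($this->people[$value])) {
            // report the invalid value through the result parameter
            $result['error'] =
                'multiple people were defined with the name "' . $value . '"';
        } else {
            $this->people[$value] = $path;
        }
        // an empty string means the check itself ran without problems
        return '';
    }
}

$parser = new unique_name_parser_sketch;
$name = 'Some Name';

// Illustrative paths for the two name tags of the document above.
$first = array();
$parser->ValidateElementData('unique person', '0,0,0', $name, $first);

$second = array();
$parser->ValidateElementData('unique person', '0,1,0', $name, $second);

var_dump(isset($first['error']), isset($second['error']));
```

The first call records the name, and the second call finds it already recorded and sets the error entry, which is how the repeated name in the sample document would be rejected.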

Conclusion

Processing XML documents can always be a pain. This class tries to minimize that pain by avoiding the need to write custom code to traverse XML document nodes in order to validate and extract data from any XML format.

The class implements other features that were not covered here, like XML attribute extraction, XML namespaces and error handling. Feel free to check the available example scripts and the documentation to learn more about those details.

For other questions and comments, feel free to post a comment to this blog post or ask for support in this class discussion forum.
