Java Fundamentals Tutorial: XML Processing

18. XML Processing

18.1. Parsing XML with Java

  • XML parsers for Java developers

    • Core XML functionality spread across packages: javax.xml, org.w3c.dom, and org.xml.sax
    • IBM’s XML for Java EA2
    • NanoXML – a lightweight XML parser
    • JDOM – simplifies XML generation and parsing
  • Toolkits and libraries for XML processing fall into two categories

    • Event-driven processors
    • Object model construction processors

18.2. Event-Driven Approach

  • A parser reads the data and notifies specialized handlers (callbacks), which take action when different parts of the XML document are encountered:

    • Start/end of document
    • Start/end of element
    • Text data
    • Processing instruction, comment, etc.
  • SAX (Simple API for XML) – a standard event-driven XML API

For example, consider the following XML document:

<?xml version="1.0"?>
<dvd>
  <title>
      Matrix, The
  </title>
</dvd>

  • The following events are triggered:

    1. Start document
    2. Start element (dvd)
    3. Characters (line feed and spaces)
    4. Start element (title)
    5. Characters (Matrix, The, including the surrounding line feeds and spaces)
    6. End element (title)
    7. Characters (line feed)
    8. End element (dvd)
    9. End document
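
The event sequence above can be reproduced with a small tracing handler. The following is a minimal sketch, not a definitive implementation: the class name SaxEventTrace and its trace method are hypothetical names chosen for illustration. Because a parser is allowed to deliver one run of text in several characters() calls, the sketch records consecutive calls as a single entry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the SAX events triggered by a document.
class SaxEventTrace {

  /** Parses the XML text and returns the names of the SAX events, in order. */
  static List<String> trace(String xml) throws Exception {
    final List<String> events = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override public void startDocument() { events.add("startDocument"); }
      @Override public void endDocument() { events.add("endDocument"); }
      @Override public void startElement(String uri, String localName,
          String qName, Attributes attributes) {
        events.add("startElement(" + qName + ")");
      }
      @Override public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
      }
      @Override public void characters(char[] ch, int start, int length) {
        // Record consecutive characters() calls as one entry, since a parser
        // may split a single run of text into several calls.
        String last = events.isEmpty() ? "" : events.get(events.size() - 1);
        if (!last.equals("characters")) {
          events.add("characters");
        }
      }
    };
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return events;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<?xml version=\"1.0\"?>\n<dvd>\n  <title>\n      Matrix, The\n  </title>\n</dvd>";
    for (String event : trace(xml)) {
      System.out.println(event);
    }
  }
}
```

Running main prints one line per event, in the same order as the numbered list.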

18.3. Overview of SAX

  • Initially developed for Java
  • XML is processed sequentially (event-based)

    • Small memory footprint
    • Potentially more work for the developer who may need to keep track of the document processing state
  • SAX is read-only; it cannot be used to change a document
  • DOM parsers may use SAX in order to build internal models
  • SAX events are handled by a special object called a content handler that developers implement

The way processors notify applications about elements, attributes, character data, processing instructions and entities is parser-specific and can greatly influence the programming style of the XML-related modules.

SAX is important because of the standardization of interfaces and classes that are used during the parsing process:

  • javax.xml.parsers.SAXParserFactory
  • javax.xml.parsers.SAXParser
  • org.xml.sax.ContentHandler
  • org.xml.sax.DTDHandler
  • org.xml.sax.EntityResolver
  • org.xml.sax.ErrorHandler
  • org.xml.sax.helpers.DefaultHandler
  • org.xml.sax.Attributes
  • org.xml.sax.SAXException

18.4. Parsing XML with SAX

  • Create a SAX factory:

    SAXParserFactory factory = SAXParserFactory.newInstance();
  • Create a SAX parser:

    SAXParser parser = factory.newSAXParser();
  • Parse input (File, InputStream, etc.) through a custom instance of a DefaultHandler subclass:

    parser.parse(in, myDvdHandler);

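import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Builds a Dvd object from the SAX events of a <dvd> document.
public class MyDvdHandler extends DefaultHandler {
  private Dvd dvd = null;
  // Accumulates the character data of the element currently being read
  private StringBuilder currentText = new StringBuilder();
  public Dvd getDvd() { return this.dvd; }

  @Override
  public void startElement(String uri, String localName,
    String qName, Attributes attributes) throws SAXException {
    if (qName.equals("dvd")) { this.dvd = new Dvd(); }
    // Start a fresh buffer so text from earlier elements is discarded
    this.currentText = new StringBuilder();
  }

  @Override
  public void characters(char[] ch, int start, int length)
    throws SAXException {
    // May be called several times for one run of text, so append
    this.currentText.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName,
    String qName) throws SAXException {
    if (qName.equals("title")) {
      this.dvd.setTitle(this.currentText.toString().trim());
    }
  }
}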
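
The three steps (factory, parser, parse) can be tied together in one runnable sketch. It assumes nothing about the Dvd class: instead, a hypothetical TitleHandler collects the trimmed text of every title element into a list, and the input comes from an in-memory string rather than a file. The names SaxTitleExample, TitleHandler, and parseTitles are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitleExample {

  /** Hypothetical handler: collects the trimmed text of every <title> element. */
  static class TitleHandler extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder currentText = new StringBuilder();

    List<String> getTitles() { return titles; }

    @Override
    public void startElement(String uri, String localName,
        String qName, Attributes attributes) {
      // Reset the buffer at the start of each element
      currentText = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      currentText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      if ("title".equals(qName)) {
        titles.add(currentText.toString().trim());
      }
    }
  }

  /** Runs the factory, parser, and parse steps on an XML string. */
  static List<String> parseTitles(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    TitleHandler handler = new TitleHandler();
    parser.parse(new InputSource(new StringReader(xml)), handler);
    return handler.getTitles();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<dvds><dvd><title> Matrix, The </title></dvd>"
        + "<dvd><title>Brazil</title></dvd></dvds>";
    System.out.println(parseTitles(xml));
  }
}
```

Keeping the result inside the handler and exposing it through a getter mirrors the getDvd() pattern used above: the parser itself returns nothing, so the handler is the only place where parsed state can live.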

18.5. Object-Model Approach

  • Based on the idea of parsing the whole document and constructing an object representation of it in memory

    • A series of parent-child nodes of different types
  • DOM (Document Object Model)

    • Standard, language-independent specification written in OMG’s IDL for the constructed object model
  • Also known as the tree-based approach

For example, consider the following XML document:

<?xml version="1.0"?>
<dvd>
  <title>
      Matrix, The
  </title>
</dvd>

The following DOM object tree is constructed:

Document
  Element Node “dvd”
    Text Node (white-space)
    Element Node “title”
      Text Node “Matrix, The” (trimmed white-space)
    Text Node (white-space)
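
The white-space text nodes in this tree are easy to overlook. As a rough sketch, the following lists the direct children of a node by type; DomChildren and describeChildren are hypothetical names, and the sketch assumes the default, non-validating DocumentBuilderFactory configuration, which keeps white-space-only text nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomChildren {

  /** Parses an XML string into a DOM Document. */
  static Document parse(String xml) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(xml)));
  }

  /** Describes each direct child of the given node, with text shown trimmed. */
  static List<String> describeChildren(Node parent) {
    List<String> result = new ArrayList<>();
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      switch (n.getNodeType()) {
        case Node.ELEMENT_NODE:
          result.add("Element " + n.getNodeName());
          break;
        case Node.TEXT_NODE:
          // White-space-only text nodes show up as "Text []"
          result.add("Text [" + n.getNodeValue().trim() + "]");
          break;
        default:
          result.add("Other " + n.getNodeName());
      }
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<?xml version=\"1.0\"?>\n<dvd>\n  <title>\n      Matrix, The\n  </title>\n</dvd>";
    Document document = parse(xml);
    Node dvd = document.getDocumentElement();
    System.out.println(describeChildren(dvd));
  }
}
```

For the dvd element this reports a white-space text node on either side of the title element, matching the tree above.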

18.6. Overview of DOM

  • DOM defines a series of interfaces for working with XML documents

    • DOM is an object model, not just a data model: nodes carry operations (methods) as well as data
    • DOM can be updated in memory (unlike SAX)
  • The application manipulates the DOM after it is constructed, as opposed to handling XML events at parse time
  • Typically slower than SAX and uses more memory, because the whole document is held in memory, but easier to use (random access to XML data)

  • DOM is a language- and platform-independent interface that allows programs to dynamically access and update the content, structure, and style of documents
  • DOM parser implementations are free to choose whatever internal representation they like, as long as they comply with the DOM interfaces
  • Compared to SAX, DOM trades higher memory consumption for fast, repeated access to the data after parsing
  • The DOM factory and parser are defined in the javax.xml.parsers package as DocumentBuilderFactory and DocumentBuilder
  • DOM interfaces are defined in the org.w3c.dom package
  • Fundamental core interfaces: Node, Document, DocumentFragment, Element, DOMImplementation, NodeList, NamedNodeMap, CharacterData, DOMException, Attr, Text, and Comment
  • Extended interfaces: CDATASection, DocumentType, EntityReference, ProcessingInstruction
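
To illustrate the point that a DOM can be updated in memory, here is a minimal sketch using only org.w3c.dom interfaces. The class DomUpdate, its update method, and the year element are hypothetical and not part of the tutorial's DVD example; the changes affect only the in-memory tree, not the XML source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomUpdate {

  /** Parses an XML string into a DOM Document. */
  static Document parse(String xml) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(xml)));
  }

  /** Replaces the title text and appends a (hypothetical) <year> child element. */
  static void update(Document document, String newTitle, String year) {
    Element root = document.getDocumentElement();
    Node title = root.getElementsByTagName("title").item(0);
    if (title != null) {
      title.setTextContent(newTitle);   // replaces the existing text node
    }
    Element yearElement = document.createElement("year");
    yearElement.setTextContent(year);
    root.appendChild(yearElement);      // becomes the last child of the root
  }

  public static void main(String[] args) throws Exception {
    Document document = parse("<dvd><title>Matrix, The</title></dvd>");
    update(document, "The Matrix", "1999");
    Element root = document.getDocumentElement();
    System.out.println(root.getFirstChild().getTextContent());
    System.out.println(root.getLastChild().getNodeName());
  }
}
```

Nothing comparable is possible with SAX alone, since a SAX handler only observes events and never holds a modifiable document.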

18.7. Parsing XML with DOM

  • Create a DOM factory:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  • Create a DOM builder:

    DocumentBuilder builder = factory.newDocumentBuilder();
  • Parse Document from some XML input:

    Document document = builder.parse(in); // use document to access parsed DOM

private Node findNodeByName(Node node, String name) {
  if (name.equals(node.getNodeName())) {
    return node;
  } else {
    for (Node n = node.getFirstChild(); n != null;
              n = n.getNextSibling()) {
      Node found = findNodeByName(n, name);
      if (found != null) {
        return found;
      }
    }
  }
  return null;
}

...
Node dvdNode = findNodeByName(document, "dvd");
if (dvdNode != null) {
  Dvd dvd = new Dvd();
  Node titleNode = findNodeByName(dvdNode, "title");
  if (titleNode != null) {
    // getTextContent() keeps the surrounding white space, so trim it
    dvd.setTitle(titleNode.getTextContent().trim());
  }
}
...
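
As an alternative to a hand-written recursive search, the standard Document.getElementsByTagName method returns every matching element in document order. The sketch below runs the factory, builder, and parse steps on an InputStream and collects all title texts; DomTitles and its titles method are hypothetical names used for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DomTitles {

  /** Parses the stream and returns the trimmed text of every <title> element. */
  static List<String> titles(InputStream in) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(in);

    // Matches <title> elements at any depth, in document order
    NodeList nodes = document.getElementsByTagName("title");
    List<String> result = new ArrayList<>();
    for (int i = 0; i < nodes.getLength(); i++) {
      result.add(nodes.item(i).getTextContent().trim());
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<dvds><dvd><title>\n  Matrix, The\n</title></dvd>"
        + "<dvd><title>Brazil</title></dvd></dvds>";
    InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    System.out.println(titles(in));
  }
}
```

The recursive findNodeByName stops at the first match, whereas getElementsByTagName returns all matches, so the choice depends on whether the document can contain more than one element of interest.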