Parsing XML with Java
XML parsers for Java developers
- Core XML functionality spread across packages: javax.xml, org.w3c.dom, and org.xml.sax
- IBM’s XML for Java EA2
- NanoXML – a lightweight XML parser
- JDOM - simplifies XML generation and parsing
Toolkits and libraries for XML processing fall into two categories:
- Event-driven processors
- Object model construction processors
An event-driven parser reads the data and notifies specialized handlers (callbacks), which act when different parts of the XML document are encountered:
- Start/end of document
- Start/end of element
- Text data
- Processing instruction, comment, etc.
- SAX (Simple API for XML) – a standard event-driven XML API
For example, consider the following XML document:
<?xml version="1.0"?>
<dvd>
  <title>
    Matrix, The
  </title>
</dvd>
The following events are triggered:
- Start document
- Start element (dvd)
- Characters (line feed and spaces)
- Start element (title)
- Characters (Matrix, The)
- End element (title)
- Characters (line feed)
- End element (dvd)
- End document
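The event sequence above can be reproduced with a small tracing handler. This is a minimal sketch, assuming the JDK's built-in SAX parser; the class name SaxEventTrace and its trace method are illustrative and not part of any library. Because a parser may deliver one run of text in several characters() calls, the sketch buffers text and reports it once, just before the next element event.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX events in the order the parser reports them.
class SaxEventTrace {

    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Buffers adjacent characters() calls into one logical text event.
            private final StringBuilder text = new StringBuilder();

            private void flushText() {
                if (text.length() > 0) {
                    String trimmed = text.toString().trim();
                    events.add(trimmed.isEmpty()
                            ? "characters (white-space)"
                            : "characters (" + trimmed + ")");
                    text.setLength(0);
                }
            }

            @Override public void startDocument() { events.add("start document"); }

            @Override public void endDocument() { events.add("end document"); }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flushText();
                events.add("start element (" + qName + ")");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                events.add("end element (" + qName + ")");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<dvd>\n  <title>\n    Matrix, The\n  </title>\n</dvd>";
        trace(xml).forEach(System.out::println);
    }
}
```

Run against the sample document, the trace lists nine events in the same order as the list above, with the white-space-only text between elements reported as its own characters event.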
- Initially developed for Java
- XML is processed sequentially (event-based)
- Small memory footprint
- Potentially more work for the developer who may need to keep track of the document processing state
- SAX is read-only and it cannot be used to change a document
- DOM parsers may use SAX in order to build internal models
- SAX events are handled by a special object called a content handler that developers implement
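To illustrate the state-tracking point, here is a minimal sketch of a content handler that keeps a stack of open elements so it can tell a title inside a dvd apart from a title elsewhere. The class DvdTitleCollector, its collect method, and the library wrapper element in the usage note are assumptions made for this example only.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of <title> elements whose parent is <dvd>.
class DvdTitleCollector extends DefaultHandler {
    // SAX does not expose the current position in the document, so the handler tracks it.
    private final Deque<String> openElements = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private final List<String> titles = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.push(qName);
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
        // After the pop, peek() is the parent of the element that just closed.
        if (qName.equals("title") && "dvd".equals(openElements.peek())) {
            titles.add(text.toString().trim());
        }
        text.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        DvdTitleCollector handler = new DvdTitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }
}
```

For an input such as `<library><title>My films</title><dvd><title>Matrix, The</title></dvd></library>`, only the title nested in dvd is collected. With DOM the same question is answered by inspecting a node's parent; with SAX the handler has to remember that context itself.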
The way processors notify applications about elements, attributes, character data, processing instructions and entities is parser-specific and can greatly influence the programming style of the XML-related modules.
SAX is important because of the standardization of interfaces and classes that are used during the parsing process.
javax.xml.parsers.SAXParserFactory
javax.xml.parsers.SAXParser
org.xml.sax.ContentHandler
org.xml.sax.DTDHandler
org.xml.sax.EntityResolver
org.xml.sax.ErrorHandler
org.xml.sax.helpers.DefaultHandler
org.xml.sax.Attributes
org.xml.sax.SAXException
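As a small illustration of how the error-related types fit together: DefaultHandler implements ErrorHandler, and its default fatalError method rethrows the error as a SAXParseException (a subclass of SAXException). The sketch below relies on that behavior to check well-formedness; the class name WellFormednessCheck and the method firstError are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports the first fatal parse error, if any.
class WellFormednessCheck {

    /** Returns null if the XML is well-formed, otherwise a short description of the error. */
    static String firstError(String xml) throws Exception {
        try {
            // A plain DefaultHandler ignores content events and rethrows fatal errors.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // SAXParseException carries the position at which parsing failed.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }
}
```

Calling firstError on a document with a missing end tag, for example `<dvd><title>Matrix, The</dvd>`, returns a message that starts with the line number; the wording of the message itself is parser-specific.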
Create a SAX factory:
SAXParserFactory factory = SAXParserFactory.newInstance();
Create a SAX parser:
SAXParser parser = factory.newSAXParser();
Parse input (File, InputStream, etc.) through a custom instance of a DefaultHandler object:
parser.parse(in, myDvdHandler);
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MyDvdHandler extends DefaultHandler {
    private Dvd dvd = null;
    // Text of the current element; characters() may be called several times per text run
    private StringBuilder currentText = new StringBuilder();

    public Dvd getDvd() {
        return this.dvd;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equals("dvd")) {
            this.dvd = new Dvd();
        }
        // Start a fresh buffer for each element (a null buffer would fail in characters())
        this.currentText = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        this.currentText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals("title")) {
            this.dvd.setTitle(this.currentText.toString().trim());
        }
    }
}
Based on the idea of parsing the whole document and constructing an object representation of it in memory
- A series of parent-child nodes of different types
DOM (Document Object Model)
- Standard, language-independent specification written in OMG’s IDL for the constructed object model
- Also known as the tree-based approach
For example, consider the following XML document:
<?xml version="1.0"?>
<dvd>
  <title>
    Matrix, The
  </title>
</dvd>
The following DOM object tree is constructed:
Document
  Element Node “dvd”
    Text Node (white-space)
    Element Node “title”
      Text Node “Matrix, The” (trimmed white-space)
    Text Node (white-space)
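The tree can be inspected with a short recursive walk. This is a minimal sketch, assuming the JDK's default DocumentBuilder; DomTreeDump and its dump method are illustrative names, and the labels it prints are chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: renders a DOM tree as indented text, one node per line.
class DomTreeDump {

    static String dump(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        append(doc, 0, out);
        return out.toString();
    }

    private static void append(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                out.append("Document");
                break;
            case Node.ELEMENT_NODE:
                out.append("Element \"").append(node.getNodeName()).append('"');
                break;
            case Node.TEXT_NODE: {
                // Text is trimmed for display only; the node itself keeps its white-space.
                String text = node.getNodeValue().trim();
                out.append(text.isEmpty() ? "Text (white-space)" : "Text \"" + text + "\"");
                break;
            }
            default:
                out.append(node.getNodeName());
        }
        out.append('\n');
        // Visit children in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            append(child, depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<dvd>\n  <title>\n    Matrix, The\n  </title>\n</dvd>";
        System.out.print(dump(xml));
    }
}
```

For the sample document, the output has the same shape as the tree above: a Document node, the dvd element, and under it a white-space text node, the title element with its text, and a second white-space text node.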
DOM defines a series of interfaces for working with XML documents
- DOM is an object model with corresponding actions, not just a data model
- DOM can be updated in memory (unlike SAX)
- The application manipulates the DOM after it is constructed, as opposed to handling XML events during parsing
- Slower than SAX and uses more memory, but easier to use (random access to XML data)
- DOM is a language- and platform-independent interface that allows programs to dynamically access and update the content, structure, and style of documents
- DOM parser implementations are free to choose whatever internal representation they like, as long as they comply with the DOM interfaces
- Compared to SAX, there is basically a trade-off between memory consumption and fast multiple data accesses after parsing
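The point that a DOM can be updated in memory can be sketched as follows: parse a document, append a new element, and serialize the result with an identity Transformer from javax.xml.transform. The class DomUpdate, the method addYear, and the year element are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: adds a <year> child to the root element and returns the new XML.
class DomUpdate {

    static String addYear(String xml, String year) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Modify the tree in memory: new nodes are created through the owning Document.
        Element root = doc.getDocumentElement();
        Element yearElement = doc.createElement("year");
        yearElement.setTextContent(year);
        root.appendChild(yearElement);

        // A Transformer created without a stylesheet copies the source to the result unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(addYear("<dvd><title>Matrix, The</title></dvd>", "1999"));
    }
}
```

Nothing comparable is possible with SAX alone, since the handler only observes events and the parser keeps no document to modify.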
- DOM factory and parser are defined in the javax.xml.parsers package as DocumentBuilderFactory and DocumentBuilder
- DOM interfaces are defined in the org.w3c.dom package
- Fundamental core interfaces: Node, Document, DocumentFragment, Element, DOMImplementation, NodeList, NamedNodeMap, CharacterData, DOMException, Attr, Text, and Comment
- Extended interfaces: CDATASection, DocumentType, EntityReference, ProcessingInstruction
Create a DOM factory:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Create a DOM builder:
DocumentBuilder builder = factory.newDocumentBuilder();
Parse Document from some XML input:
Document document = builder.parse(in); // use document to access parsed DOM
private Node findNodeByName(Node node, String name) {
    if (name.equals(node.getNodeName())) {
        return node;
    }
    // Depth-first search through the children
    for (Node n = node.getFirstChild(); n != null; n = n.getNextSibling()) {
        Node found = findNodeByName(n, name);
        if (found != null) {
            return found;
        }
    }
    return null;
}
...
Node dvdNode = findNodeByName(document, "dvd");
if (dvdNode != null) {
    Dvd dvd = new Dvd();
    Node titleNode = findNodeByName(dvdNode, "title");
    if (titleNode != null) {
        // getTextContent() keeps the surrounding white-space, so trim it
        dvd.setTitle(titleNode.getTextContent().trim());
    }
}
...
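The hand-written search above can often be replaced by the standard Document.getElementsByTagName method, which returns all descendant elements with a given tag name in document order. The following is a self-contained sketch under that assumption; DvdTitleLookup and its title method are illustrative names, and it returns a plain String instead of the Dvd class used above so that it compiles on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts the first <title> text from an XML string.
class DvdTitleLookup {

    /** Returns the trimmed text of the first title element, or null if there is none. */
    static String title(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        // Matches every descendant element named "title", in document order.
        NodeList titles = document.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return null;
        }
        return titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<dvd>\n  <title>\n    Matrix, The\n  </title>\n</dvd>";
        System.out.println(title(xml));
    }
}
```

Unlike findNodeByName, getElementsByTagName matches elements only, which avoids accidental hits on other node types that happen to share a node name.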