- Number of slides: 76
Processing XML with Java
CS 236369, Spring 2010
Parsers
- What is a parser?
[Diagram: Input + Formal grammar -> Parser -> Analyzed Data]
- Analyzed data: the structure(s) of the input, according to the atomic elements and their relationships (as described in the grammar)
XML-Parsing Standards
- We will consider two standard parsing methods for accessing XML
- DOM (a W3C standard)
  - converts XML into a tree of objects
  - "random access" protocol
- SAX (a de facto standard, developed outside the W3C)
  - "serial access" protocol
  - event-driven parsing
XML Examples
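world.xml
[Slide: the example document world.xml, beginning with <?xml version="1.0"?>, with its root element marked; the markup itself is not preserved]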
XML Tree Model
[Figure: the document drawn as a tree; node kinds: element, attribute, simple content]
world.dtd
sales.xml
DOM – Document Object Model
DOM Parser
- DOM = Document Object Model
- The parser creates a tree object out of the document
- The user accesses data by traversing the tree
  - The tree and its traversal conform to a W3C standard
- The API allows for constructing, accessing and manipulating the structure and content of XML documents
The DOM Tree
Using a DOM Tree
[Diagram: XML File -> DOM Parser -> DOM Tree (in memory) -> API -> Application]
Creating a DOM Tree
- A DOM tree is generated by a DocumentBuilder
- The builder is generated by a factory, in order to be implementation independent
- The factory is chosen according to the system configuration

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse("world.xml");
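The three calls on this slide can be tried without any file on disk. The following is a minimal sketch, assuming the XML arrives as a string; the class name DomFromString, the helper rootName and the sample document are illustrative only and are not part of the course material.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomFromString {
    // Parses the given XML text into a DOM tree and returns the root element's name.
    static String rootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // An InputSource wrapping a Reader replaces the file name used on the slide
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document, not the course's world.xml
        System.out.println(rootName("<countries><country name='A'/></countries>"));
    }
}
```

Passing a file name, as the slide does, works the same way; only the argument to parse changes.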
Configuring the Factory
- The methods of the document-builder factory enable you to configure the properties of the document building
- For example
  - factory.setValidating(true)
  - factory.setIgnoringComments(false)
- Read more about the DocumentBuilderFactory class and the DocumentBuilder class
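To see that a factory setting changes the tree that is built, the sketch below parses the same text twice, once keeping comments and once ignoring them, and counts the children of the root. The class and method names are hypothetical, and the sample document is made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FactoryConfigSketch {
    // Returns how many child nodes the root element has under the given setting.
    static int rootChildCount(String xml, boolean ignoreComments) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(ignoreComments); // configure before creating the builder
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><!-- note --><b/></a>";
        System.out.println("comments kept:    " + rootChildCount(xml, false));
        System.out.println("comments ignored: " + rootChildCount(xml, true));
    }
}
```

The configuration is read when newDocumentBuilder() is called, so settings changed afterwards do not affect a builder that already exists.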
The Node Interface
- The nodes of the DOM tree include
  - a special root (denoted document)
    - The Document interface retrieved by builder.parse(…) actually extends the Node interface
  - element nodes
  - text nodes and CDATA sections
  - attributes
  - comments
  - and more...
- Every node in the DOM tree implements the Node interface
Interfaces in a DOM Tree
[Figure, as it appears in "The XML Companion" by Neil Bradley: the DOM interfaces Node, Document, DocumentFragment, DocumentType, Element, Attr, CharacterData, Text, CDATASection, Comment, Entity, EntityReference, Notation, ProcessingInstruction, NodeList, NamedNodeMap]
Interfaces in the DOM Tree
[Figure: an example DOM tree whose nodes are labeled Document Type, Element, Attribute, Text, Comment and Entity Reference]
Node Navigation
- Every node has a specific location in the tree
- The Node interface specifies methods for tree navigation
  - Node getFirstChild();
  - Node getLastChild();
  - Node getNextSibling();
  - Node getPreviousSibling();
  - Node getParentNode();
  - NodeList getChildNodes();
  - NamedNodeMap getAttributes();
Node Navigation (cont)
[Figure: a node and its neighbors, labeled with getParentNode(), getPreviousSibling(), getNextSibling(), getFirstChild(), getLastChild() and getChildNodes()]
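The navigation methods can be combined into a simple loop over the children of a node. This is a small sketch under the assumption that we only want the element children; NavigationSketch, parse and childElementNames are names invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NavigationSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Walks the children with getFirstChild()/getNextSibling() and keeps the elements.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node root = parse("<r><a/>some text<b/></r>").getDocumentElement();
        System.out.println(childElementNames(root));
        // Every child leads back to its parent
        System.out.println(root.getFirstChild().getParentNode().getNodeName());
    }
}
```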
Node Properties
- Every node has
  - a type
  - a name
  - a value
  - attributes
- The roles of these properties differ according to the node types
- Nodes of different types implement different interfaces (that extend Node)
Names, Values and Attributes

    Interface               nodeName                     nodeValue                   attributes
    Attr                    name of attribute            value of attribute          null
    CDATASection            "#cdata-section"             content of the section      null
    Comment                 "#comment"                   content of the comment      null
    Document                "#document"                  null                        null
    DocumentFragment        "#document-fragment"         null                        null
    DocumentType            doc-type name                null                        null
    Element                 tag name                     null                        NamedNodeMap
    Entity                  entity name                  null                        null
    EntityReference         name of entity referenced    null                        null
    Notation                notation name                null                        null
    ProcessingInstruction   target                       entire content              null
    Text                    "#text"                      content of the text node    null
Node Types - getNodeType()

    ELEMENT_NODE = 1             PROCESSING_INSTRUCTION_NODE = 7
    ATTRIBUTE_NODE = 2           COMMENT_NODE = 8
    TEXT_NODE = 3                DOCUMENT_NODE = 9
    CDATA_SECTION_NODE = 4       DOCUMENT_TYPE_NODE = 10
    ENTITY_REFERENCE_NODE = 5    DOCUMENT_FRAGMENT_NODE = 11
    ENTITY_NODE = 6              NOTATION_NODE = 12

    if (myNode.getNodeType() == Node.ELEMENT_NODE) {
        // process node
        …
    }

- Read more about the Node interface
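When more than one node type matters, a switch over getNodeType() is a common alternative to a chain of if tests. The sketch below is one way to write it; the class NodeTypeSketch and the wording of the descriptions are assumptions made for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Describes a node according to the constant returned by getNodeType().
    static String describe(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return "document";
            case Node.ELEMENT_NODE:
                return "element <" + n.getNodeName() + ">";
            case Node.TEXT_NODE:
                return "text \"" + n.getNodeValue() + "\"";
            case Node.COMMENT_NODE:
                return "comment";
            default:
                return "other (type " + n.getNodeType() + ")";
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<r>hi</r>");
        System.out.println(describe(doc));
        System.out.println(describe(doc.getDocumentElement()));
        System.out.println(describe(doc.getDocumentElement().getFirstChild()));
    }
}
```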
    import org.w3c.dom.*;
    import javax.xml.parsers.*;

    public class EchoWithDom {

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringElementContentWhitespace(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse("world.xml");
            new EchoWithDom().echo(doc);
        }

        private int depth = 0;

        // Indexed by the value returned from getNodeType() (1..12)
        private String[] NODE_TYPES = { "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA",
            "ENTITY_REF", "ENTITY", "PROCESSING_INST", "COMMENT", "DOCUMENT",
            "DOCUMENT_TYPE", "DOCUMENT_FRAG", "NOTATION" };

        // Prints one node, indented according to its depth in the tree
        private void print(Node n) {
            for (int i = 0; i < depth; i++) System.out.print("  ");
            System.out.print(NODE_TYPES[n.getNodeType()] + ": ");
            System.out.print("Name: " + n.getNodeName());
            System.out.print(" Value: " + n.getNodeValue() + "\n");
        }

        // Echoes a node, then its attributes (elements only), then its children
        private void echo(Node n) {
            print(n);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                NamedNodeMap atts = n.getAttributes();
                ++depth;
                for (int i = 0; i < atts.getLength(); i++) echo(atts.item(i));
                --depth;
            }
            depth++;
            for (Node child = n.getFirstChild(); child != null; child = child.getNextSibling())
                echo(child);
            depth--;
        }
    }
Normalizing the DOM Tree
- Normalizing a DOM tree has two effects:
  - Combine adjacent textual nodes
  - Eliminate empty textual nodes
- Such nodes are created by node manipulation…
- To normalize, apply the normalize() method to the document element
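Both effects of normalization can be observed on a tree that is built by hand. The following sketch appends two adjacent text nodes and one empty text node, then counts the children before and after normalize(); the class name and the element name greeting are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeSketch {
    // Returns {children before normalize(), children after normalize()}.
    static int[] childCounts() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element root = doc.createElement("greeting");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("Hello, ")); // adjacent text nodes...
        root.appendChild(doc.createTextNode("world"));
        root.appendChild(doc.createTextNode(""));        // ...and an empty one
        int before = root.getChildNodes().getLength();
        root.normalize();
        int after = root.getChildNodes().getLength();
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] counts = childCounts();
        System.out.println("before: " + counts[0] + ", after: " + counts[1]);
    }
}
```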
Node Manipulation
- Children of a node in a DOM tree can be manipulated: added, edited, deleted, moved, copied, etc.
- To construct new nodes, use the methods of Document
  - createElement, createAttribute, createTextNode, createCDATASection etc.
- To manipulate a node, use the methods of Node
  - appendChild, insertBefore, removeChild, replaceChild, setNodeValue, cloneNode(boolean deep) etc.
Node Manipulation (cont)
[Figure, as it appears in "The XML Companion" by Neil Bradley: replaceChild swapping an old node for a new one, and cloneNode with deep = 'false' versus deep = 'true']
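The difference between a shallow and a deep clone, and the effect of replaceChild, can be checked on a tiny hand-built tree. This is a sketch only; the element names list, item and entry, and the summary string returned by demo, are invented for the illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ManipulationSketch {
    // Returns "shallowChildren,deepChildren,firstChildName,listChildren".
    static String demo() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element list = doc.createElement("list");
        doc.appendChild(list);

        Element item = doc.createElement("item");
        item.setAttribute("id", "1");
        item.appendChild(doc.createTextNode("first"));
        list.appendChild(item);

        Node shallow = item.cloneNode(false); // the element and its attributes only
        Node deep = item.cloneNode(true);     // the element with its whole subtree
        list.appendChild(deep);               // clones have no parent until inserted

        list.replaceChild(doc.createElement("entry"), item); // new node first, old node second

        return shallow.getChildNodes().getLength() + ","
             + deep.getChildNodes().getLength() + ","
             + list.getFirstChild().getNodeName() + ","
             + list.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note the argument order of replaceChild: the replacement comes first and the node being replaced comes second.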
SAX Parser
- SAX = Simple API for XML
- XML is read sequentially
- When a parsing event happens, the parser invokes the corresponding method of the corresponding handler
- The handlers are the programmer's implementations of standard Java APIs (i.e., interfaces and classes)
SAX Parsers
[Diagram: an XML document (<?xml version="1.0"?> ...) is fed to a SAX parser, which has been told:]
- When you see the start of the document do …
- When you see the start of an element do …
- When you see the end of an element do …
[Figure: the parts of the SAX API, annotated with their roles]
- Used to create a SAX parser
- Handles document events: start tag, end tag, etc.
- Handles parser errors
- Handles DTD
- Handles entities
Creating a Parser
- The SAX interface is an accepted standard
- There are many implementations from many vendors
- We would like to be able to change the implementation used, without changing any code in the program
  - How is this done?
System Properties
- Properties are configuration values managed as key/value pairs. In each pair, the key and value are both String values.
- The System class maintains a Properties object that describes the configuration of the current working environment.
  - System properties include information about the current user, the current version of the Java runtime, and the character used to separate components of a file path name.
- See the Java standard system properties.
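Reading and setting system properties takes one call each. In the sketch below, java.version, user.name and file.separator are standard keys, while example.parser.driver and its value are hypothetical and exist only for this illustration.

```java
class SystemPropertiesSketch {
    // Stores a key/value pair in the system properties and reads it back.
    static String setAndGet(String key, String value) {
        System.setProperty(key, value);
        return System.getProperty(key);
    }

    public static void main(String[] args) {
        // Standard properties maintained by the runtime
        System.out.println("Java version:   " + System.getProperty("java.version"));
        System.out.println("User:           " + System.getProperty("user.name"));
        System.out.println("File separator: " + System.getProperty("file.separator"));

        // A made-up property, set programmatically
        System.out.println(setAndGet("example.parser.driver", "some.vendor.Parser"));
    }
}
```

The same property can also be supplied on the command line with -Dkey=value, which is what allows a setting to change without recompiling the program.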
Factory Design Pattern
- Have a "factory" class that creates the actual parsers
  - org.xml.sax.helpers.XMLReaderFactory
- The factory checks configurations, mainly the value of a system property, that specify the implementation
- In order to change the implementation, simply change the system property
- Read more about the XMLReaderFactory class
Creating a SAX Parser

    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class EchoWithSax {
        public static void main(String[] args) throws Exception {
            // The factory will look for this system property
            System.setProperty("org.xml.sax.driver",
                               "org.apache.xerces.parsers.SAXParser");
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.parse("world.xml");
        }
    }

- Read more about the XMLReader interface and the Xerces SAXParser class
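On current Java versions XMLReaderFactory is marked as deprecated (since Java 9), and the JAXP class SAXParserFactory is the suggested replacement. The sketch below obtains an XMLReader that way and uses it to check well-formedness; it is an alternative to the slide's code, not the course's own method, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxFactorySketch {
    // Creates an XMLReader through the JAXP factory; the implementation is
    // chosen by the platform configuration, as with XMLReaderFactory.
    static XMLReader newReader() throws ParserConfigurationException, SAXException {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // Returns true if the text parses without a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            XMLReader reader = newReader();
            reader.setErrorHandler(new DefaultHandler()); // rethrows fatal errors quietly
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));
        System.out.println(isWellFormed("<a>"));
    }
}
```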
Implementing the Content Handler
- A SAX parser invokes methods such as startDocument, startElement and endElement of its content handler as it runs
- In order to react to parsing events we must:
  - implement the ContentHandler interface
  - set the parser's content handler with an instance of our ContentHandler implementation
ContentHandler Methods
- startDocument - parsing begins
- endDocument - parsing ends
- startElement - an opening tag is encountered
- endElement - a closing tag is encountered
- characters - text (CDATA) is encountered
- ignorableWhitespace - white spaces that should be ignored (according to the DTD)
- and more...
- Read more about the ContentHandler interface
The Default Handler
- The class DefaultHandler implements all handler interfaces (usually, in an empty manner)
  - i.e., ContentHandler, EntityResolver, DTDHandler, ErrorHandler
- An easy way to implement the ContentHandler interface is to extend DefaultHandler
- Read more about the DefaultHandler class
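Because DefaultHandler already supplies every callback, a handler only needs to override the events it cares about. The sketch below counts elements by overriding startElement alone; ElementCounter and its count helper are hypothetical names, and the parser is obtained through SAXParserFactory for convenience.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounter extends DefaultHandler {
    private int elements = 0;

    // The only callback we override; all others keep their default behavior.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        elements++;
    }

    // Parses the text with a fresh counter and returns the number of elements seen.
    static int count(String xml) {
        ElementCounter counter = new ElementCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), counter);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return counter.elements;
    }

    public static void main(String[] args) {
        System.out.println(count("<a><b/><b/></a>"));
    }
}
```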
A Content Handler Example

    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.*;

    public class EchoHandler extends DefaultHandler {

        int depth = 0;

        // Prints a line indented according to the current depth.
        // The slide text is cut off after "for(int i=0; i"; the rest of this
        // method is reconstructed by analogy with EchoWithDom.print.
        public void print(String line) {
            for (int i = 0; i < depth; i++) System.out.print("  ");
            System.out.println(line);
        }

        public void startDocument() throws SAXException {
            print("BEGIN");
        }

        public void endDocument() throws SAXException {
            print("END");
        }

        // Attributes: we will discuss this interface later…
        public void startElement(String ns, String localName, String qName,
                                 Attributes attrs) throws SAXException {
            print("Element " + qName + "{");
            ++depth;
            for (int i = 0; i < attrs.getLength(); ++i)
                print(attrs.getLocalName(i) + "=" + attrs.getValue(i));
        }

        public void endElement(String ns, String lName, String qName)
                throws SAXException {
            --depth;
            print("}");
        }

        public void characters(char buf[], int offset, int len)
                throws SAXException {
            String s = new String(buf, offset, len).trim();
            ++depth;
            print(s);
            --depth;
        }
    }
Fixing The Parser

    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class EchoWithSax {
        public static void main(String[] args) throws Exception {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            // What would happen without this line?
            reader.setContentHandler(new EchoHandler());
            reader.parse("world.xml");
        }
    }
Empty Elements
- What do you think happens when the parser parses an empty element?
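One way to answer the question is to record the events and look at them. The sketch below logs start and end events for a document that contains an empty element; the class EmptyElementTrace and the sample document are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EmptyElementTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    // Parses the text and returns the element events in the order they were reported.
    static List<String> trace(String xml) {
        EmptyElementTrace handler = new EmptyElementTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        // <b/> is an empty element
        System.out.println(trace("<a><b/></a>"));
    }
}
```

The parser reports an empty element exactly as it would report a start tag immediately followed by an end tag, so a handler cannot tell `<b/>` from `<b></b>`.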
Attributes Interface
- The Attributes interface provides access to all attributes of an element
  - getLength() (the number of attributes), getQName(i), getValue(i), getType(i), getValue(qname), etc.
- The following are possible types for attributes: CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION
- Read more about the Attributes interface
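The index-based accessors are typically used in a loop from 0 to getLength(). The following sketch collects every attribute of every element into a sorted map; the key format element/@attribute and the class name AttributeDump are choices made for this example only.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeDump extends DefaultHandler {
    // Sorted, so the result does not depend on the order the parser reports
    private final Map<String, String> found = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        for (int i = 0; i < attrs.getLength(); i++) {
            found.put(qName + "/@" + attrs.getQName(i), attrs.getValue(i));
        }
    }

    // Parses the text and returns all attributes, keyed as "element/@attribute".
    static Map<String, String> attributes(String xml) {
        AttributeDump handler = new AttributeDump();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.found;
    }

    public static void main(String[] args) {
        System.out.println(attributes("<a x='1' y='2'><b z='3'/></a>"));
    }
}
```

The Attributes object passed to startElement is only guaranteed to be valid during that call, which is why the values are copied out immediately.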
ErrorHandler Interface
- We implement ErrorHandler to receive error events (similar to implementing ContentHandler)
- DefaultHandler implements ErrorHandler in an empty fashion, so we can extend it (as before)
- An ErrorHandler is registered with
  - reader.setErrorHandler(handler);
- Three methods:
  - void error(SAXParseException ex);
  - void fatalError(SAXParseException ex);
  - void warning(SAXParseException ex);
Parsing Errors
- Fatal errors prevent the parser from continuing to parse
  - For example, the document is not well formed, an unknown XML version is declared, etc.
- Errors (that is, recoverable ones) occur, for example, when the parser is validating and validity constraints are violated
- Warnings occur when abnormal (yet legal) conditions are encountered
  - For example, an entity is declared twice in the DTD
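The three severities map directly onto the three ErrorHandler methods. The sketch below registers a handler that records which method was called and then feeds the parser a document that is not well formed; ErrorCollector and the sample inputs are illustrative names and data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorCollector implements ErrorHandler {
    private final List<String> severities = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        severities.add("warning");
    }

    @Override
    public void error(SAXParseException e) {
        severities.add("error"); // recoverable: parsing continues
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        severities.add("fatal");
        throw e; // parsing cannot continue after a fatal error
    }

    // Parses the text and returns the severities that were reported, in order.
    static List<String> severities(String xml) {
        ErrorCollector collector = new ErrorCollector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(collector);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // already recorded by fatalError
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return collector.severities;
    }

    public static void main(String[] args) {
        System.out.println(severities("<a><b></a>")); // <b> is never closed
        System.out.println(severities("<a><b/></a>"));
    }
}
```

SAXParseException also carries getLineNumber() and getColumnNumber(), which a real handler would normally include in its report.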
EntityResolver and DTDHandler
- The interface EntityResolver enables the programmer to specify a new source for the translation of external entities (e.g., an external DTD)
- The interface DTDHandler enables the programmer to react to declarations of notations and unparsed entities inside the DTD (Notation Example)
  - These usually appear with external non-XML resources and describe their type
- Read more about the EntityResolver interface and the DTDHandler interface
Features and Properties
- SAX parsers can be configured by setting their features and properties
- Syntax:
  - reader.setFeature("feature-url", boolean)
  - reader.setProperty("property-url", Object)
- Standard feature URLs have the form: http://xml.org/sax/features/feature-name
- Standard property URLs have the form: http://xml.org/sax/properties/prop-name
Feature/Property Examples
- Features:
  - namespaces - are namespaces supported?
  - validation - does the parser validate (against the declared DTD)?
  - http://apache.org/xml/features/nonvalidating/load-external-dtd
  - Read more about Features
- Properties:
  - lexical-handler - see the next slide...
  - Read more about Properties
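As a concrete case, the standard namespaces feature controls whether startElement receives a namespace URI and a local name. The sketch below switches the feature on and reports what the handler sees for the root element; the class name, the prefix p and the URI urn:example are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class FeatureSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Returns "namespaceURI|localName" of the root element, parsed with namespaces on.
    static String rootNamespaceAndLocalName(String xml) {
        final String[] result = new String[1];
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true); // must be set before parsing starts
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attrs) {
                    if (result[0] == null) {
                        result[0] = uri + "|" + localName;
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(rootNamespaceAndLocalName("<p:a xmlns:p='urn:example'/>"));
    }
}
```

An unknown feature URL makes setFeature throw SAXNotRecognizedException, so parser-specific features such as the Apache one listed above are best set inside a try block.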
Lexical Events
- Lexical events have to do with the way that a document was written and not with its content
- Examples:
  - A comment is a lexical event (e.g., <!-- a comment -->)
  - The use of an entity is a lexical event (e.g., &gt;)
- These can be dealt with by implementing the LexicalHandler interface, and setting the lexical-handler property to an instance of the handler
LexicalHandler Methods
- comment(char[] ch, int start, int length)
- startCDATA()
- endCDATA()
- startEntity(java.lang.String name)
- endEntity(java.lang.String name)
- and more...
- Read more about the LexicalHandler interface
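A short sketch of receiving comment events follows. It extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the LexicalHandler methods, and registers the handler through the lexical-handler property; the class name CommentCollector and the sample document are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class CommentCollector extends DefaultHandler2 {
    private final List<String> comments = new ArrayList<>();

    // LexicalHandler callback: the text between "<!--" and "-->"
    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // Parses the text and returns the comments it contains, in document order.
    static List<String> comments(String xml) {
        CommentCollector handler = new CommentCollector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Lexical handlers are registered as a property, not with a setter
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.comments;
    }

    public static void main(String[] args) {
        System.out.println(comments("<a><!-- hi --><b/></a>"));
    }
}
```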
Different Approaches – SAX vs. DOM
The DOM Approach
- Tree-based API: map an XML document into an internal tree structure, then allow an application to navigate that tree.
- The application is active.
- Provides random access.
- Remember that it is possible to construct a parse tree using an event-based API, and it is possible to use an event-based API to traverse an in-memory tree. (Actually, DOM is a level above SAX.)
The DOM Tree
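The SAX Approach
- Event-based API.
- The application is passive.
- Provides serial access.
- Given the following XML document (the markup is not preserved in this transcript; it is reconstructed here from the event list that follows):

    <?xml version="1.0" encoding="UTF-8"?>
    <RootElement param="value">
        <FirstElement>Some Text</FirstElement>
        <SecondElement param2="something">
            Pre-Text <Inline>Inlined text</Inline> Post-text.
        </SecondElement>
    </RootElement>

- This XML document, when passed through a SAX parser, will generate the following sequence of events (pushing)…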
- XML Processing Instruction, named xml, with attributes version equal to "1.0" and encoding equal to "UTF-8"
- XML Element start, named RootElement, with an attribute param equal to "value"
- XML Element start, named FirstElement
- XML Text node, with data equal to "Some Text" (note: text processing, with regard to spaces, can be changed)
- XML Element end, named FirstElement
- XML Element start, named SecondElement, with an attribute param2 equal to "something"
- XML Text node, with data equal to "Pre-Text"
- XML Element start, named Inline
- XML Text node, with data equal to "Inlined text"
- XML Element end, named Inline
- XML Text node, with data equal to "Post-text."
- XML Element end, named SecondElement
- XML Element end, named RootElement
Pull vs. Push
- SAX is known as a push framework
  - the parser has the initiative
  - the programmer must react to events
- An alternative is a pull framework
  - the programmer has the initiative
  - the parser must react to requests
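The slides do not name a pull API. As one illustration, Java ships StAX (javax.xml.stream), in which the program asks the parser for the next event in a loop. The sketch below uses it to list element names; the choice of StAX and all names in the example are assumptions, not part of the course material.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullSketch {
    // Pulls events one at a time and collects the names of the elements encountered.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The program has the initiative: it requests the next event
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><b/><c/></a>"));
    }
}
```

Because the loop belongs to the program, it can stop early or keep its state in local variables, which is harder to arrange with callbacks.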
Parser Efficiency
- The DOM object built by DOM parsers is usually complicated and requires more memory storage than the XML file itself
  - A lot of time is spent on construction before use
  - For some very large documents, this may be impractical
- SAX parsers store only local information that is encountered during the serial traversal
- Hence, programming with SAX parsers is, in general, more efficient
Programming using SAX is Difficult
- In some cases, programming with SAX is difficult:
  - How can we find, using a SAX parser, elements e1 with ancestor e2?
  - How can we find, using a SAX parser, elements e1 that have a descendant element e2?
  - How can we find the element e1 referenced by the IDREF attribute of e2?
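The first question can be answered with SAX, but only by keeping state by hand. The sketch below maintains a stack of the currently open elements and counts each e1 that starts while an e2 is open; the class AncestorFinder and its helper are hypothetical, and the other two questions would need even more bookkeeping.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AncestorFinder extends DefaultHandler {
    private final String target;
    private final String ancestor;
    private final Deque<String> open = new ArrayDeque<>(); // names of the open elements
    private int matches = 0;

    AncestorFinder(String target, String ancestor) {
        this.target = target;
        this.ancestor = ancestor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The stack holds exactly the ancestors of the element that is starting
        if (qName.equals(target) && open.contains(ancestor)) {
            matches++;
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    // Counts the elements named target that have an ancestor named ancestor.
    static int count(String xml, String target, String ancestor) {
        AncestorFinder finder = new AncestorFinder(target, ancestor);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), finder);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return finder.matches;
    }

    public static void main(String[] args) {
        // Only the first e1 is inside an e2
        System.out.println(count("<r><e2><x><e1/></x></e2><e1/></r>", "e1", "e2"));
    }
}
```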
Node Navigation
- SAX parsers do not provide access to elements other than the one currently visited in the serial (DFS) traversal of the document
- In particular,
  - They do not read backwards
  - They do not enable access to elements by ID or name
- DOM parsers enable any traversal method
- Hence, using DOM parsers is usually more comfortable
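For comparison, here is a sketch of an ancestor query written against a DOM tree, where access by name and upward navigation are both available. It counts the elements e1 that have an ancestor e2; the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomAncestorQuery {
    // Counts the elements named target that have an ancestor element named ancestor.
    static int count(String xml, String target, String ancestor) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        int matches = 0;
        NodeList hits = doc.getElementsByTagName(target); // access by name
        for (int i = 0; i < hits.getLength(); i++) {
            // Walk upwards from each hit until an ancestor matches or the root is passed
            for (Node p = hits.item(i).getParentNode(); p != null; p = p.getParentNode()) {
                if (p.getNodeType() == Node.ELEMENT_NODE && p.getNodeName().equals(ancestor)) {
                    matches++;
                    break;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(count("<r><e2><x><e1/></x></e2><e1/></r>", "e1", "e2"));
    }
}
```

No extra state is needed here because the tree itself remembers where each node sits.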
More DOM Advantages
- You can save time and effort if you send and receive DOM objects instead of XML files
  - But DOM objects are generally larger than the source
- DOM parsers provide a natural integration of XML reading and manipulating
  - e.g., "cut and paste" of XML fragments
Which should we use? DOM vs. SAX
- If your document is very large and you only need a few elements, use SAX
- If you need to manipulate (i.e., change) the XML, use DOM
- If you need to access the XML many times, use DOM (assuming the file is not too large)
Resources used for this presentation
- The Hebrew University of Jerusalem, CS Faculty
- An Introduction to XML and Web Technologies (the course's literature)
- Wikipedia
- http://www.saxproject.org/event.html
- http://www.xml.com/


