Manual
This page is dedicated to those who are new to EzXML.jl. It is recommended to read this page before reading other pages to grasp the concepts of the package first. Once you have read it, the reference page would be a better place to find necessary functions. The developer notes page is for developers and most users do not need to read it.
In this manual, we use using EzXML
to load the package for brevity. However, it is recommended to use import EzXML
or something similar for non-trivial scripts or packages because EzXML.jl exports a number of names to your environment. These are useful in an interactive session but easily conflict with other names. If you would like to know the list of exported names, please go to the top of src/EzXML.jl, where you will see a long list of type and function names.
EzXML.jl is built on top of libxml2, a portable C library compliant to the XML standard. If you are no familiar with XML itself, the following links offer good resources to learn the basic concents of XML:
Data types
There are two types that constitute an XML document and its components: Document
and Node
, respectively. The Document
type represents a whole XML document. A Document
object points to the topmost node of the XML document, but note that it is different from the root node you see in an XML file. The Node
type represents almost everything in an XML document; elements, attributes, texts, CDATAs, comments, documents, etc. are all Node
type objects. These two type names are not exported from EzXML.jl because their names are very general and easily conflict with other names exported from other packages. However, the user can expect them as public APIs and use them with the EzXML.
prefix.
Here is an example to create an empty XML document using the XMLDocument
constructor:
julia> using EzXML
julia> doc = XMLDocument()
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fd9f1f14370>))
julia> typeof(doc)
EzXML.Document
julia> doc.node
EzXML.Node(<DOCUMENT_NODE@0x00007fd9f1f14370>)
julia> typeof(doc.node)
EzXML.Node
julia> print(doc) # print an XML-formatted text
<?xml version="1.0" encoding="UTF-8"?>
The text just before the @
sign shows the node type (in this example, DOCUMENT_NODE
), and the text just after @
shows the pointer address (0x00007fd9f1f14370
) to a node struct of libxml2.
Let's add a root node to the document and a text node to the root node:
julia> elm = ElementNode("root") # create an element node
EzXML.Node(<ELEMENT_NODE[root]@0x00007fd9f2a1b5f0>)
julia> setroot!(doc, elm)
EzXML.Node(<ELEMENT_NODE[root]@0x00007fd9f2a1b5f0>)
julia> print(doc)
<?xml version="1.0" encoding="UTF-8"?>
<root/>
julia> txt = TextNode("some text") # create a text node
EzXML.Node(<TEXT_NODE@0x00007fd9f2a81ee0>)
julia> link!(elm, txt)
EzXML.Node(<TEXT_NODE@0x00007fd9f2a81ee0>)
julia> print(doc)
<?xml version="1.0" encoding="UTF-8"?>
<root>some text</root>
Finally you can write the document object to a file using the write
function:
julia> write("out.xml", doc)
62
julia> print(String(read("out.xml")))
<?xml version="1.0" encoding="UTF-8"?>
<root>some text</root>
A Node
object has some properties. The most important one would be the type
property, which we already saw in the example above. Other properties (name
, path
, content
and namespace
) are demonstrated in the following example. The value of a property will be nothing
when there is no corresponding value.
julia> elm = ElementNode("element")
EzXML.Node(<ELEMENT_NODE[element]@0x00007fd9f44122f0>)
julia> println(elm)
<element/>
julia> elm.type
ELEMENT_NODE
julia> elm.name
"element"
julia> elm.path
"/element"
julia> elm.content
""
julia> elm.namespace === nothing
true
julia> elm.name = "ELEMENT" # set element name
"ELEMENT"
julia> println(elm)
<ELEMENT/>
julia> elm.content = "some text" # set content
"some text"
julia> println(elm)
<ELEMENT>some text</ELEMENT>
julia> txt = TextNode(" text ")
EzXML.Node(<TEXT_NODE@0x00007fd9f441f3f0>)
julia> println(txt)
text
julia> txt.type
TEXT_NODE
julia> txt.name
"text"
julia> txt.path
"/text()"
julia> txt.content
" text "
addelement!(<parent>, <child>, [<content>])
is handy when you want to add a child element to an existing node:
julia> user = ElementNode("User")
EzXML.Node(<ELEMENT_NODE[User]@0x00007fd9f427c510>)
julia> println(user)
<User/>
julia> addelement!(user, "id", "167492")
EzXML.Node(<ELEMENT_NODE[id]@0x00007fd9f41ad580>)
julia> println(user)
<User><id>167492</id></User>
julia> addelement!(user, "name", "Kumiko Oumae")
EzXML.Node(<ELEMENT_NODE[name]@0x00007fd9f42942d0>)
julia> println(user)
<User><id>167492</id><name>Kumiko Oumae</name></User>
julia> prettyprint(user)
<User>
<id>167492</id>
<name>Kumiko Oumae</name>
</User>
DOM
The DOM (Document Object Model) API regards an XML document as a tree of nodes. There is a root node at the top of a document tree and each node has zero or more child nodes. Some nodes (e.g. texts, attributes, etc.) cannot have child nodes.
For the demonstration purpose, save the next XML in "primates.xml" file.
<?xml version="1.0" encoding="UTF-8"?>
<primates>
<genus name="Homo">
<species name="sapiens">Human</species>
</genus>
<genus name="Pan">
<species name="paniscus">Bonobo</species>
<species name="troglodytes">Chimpanzee</species>
</genus>
</primates>
readxml(<filename>)
reads an XML file and builds a document object in memory. Likewise, parsexml(<string or byte array>)
parses an XML string or a byte array in memory and builds a document object:
julia> doc = readxml("primates.xml")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fd9f410a5f0>))
julia> data = String(read("primates.xml"));
julia> doc = parsexml(data)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fd9f4051f80>))
Before traversing a document we need to get the root of the document tree. The .root
property returns the root element (if any) of a document:
julia> primates = doc.root # get the root element
EzXML.Node(<ELEMENT_NODE[primates]@0x00007fd9f4086880>)
julia> genus = elements(primates) # `elements` returns all child elements.
2-element Array{EzXML.Node,1}:
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f4041a40>)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f40828e0>)
julia> genus[1].type, genus[1].name
(ELEMENT_NODE, "genus")
julia> genus[2].type, genus[2].name
(ELEMENT_NODE, "genus")
Attribute values can be accessed by its name like a dictionary; haskey
, getindex
, setindex!
and delete!
are overloaded for element nodes. Qualified name, which may or may not have the prefix of a namespace, can be used as a key name:
julia> haskey(genus[1], "name") # check whether an attribute exists
true
julia> genus[1]["name"] # get a value as a string
"Homo"
julia> genus[2]["name"] # same above
"Pan"
julia> println(genus[1]) # print a "genus" element before updating
<genus name="Homo">
<species name="sapiens">Human</species>
</genus>
julia> genus[1]["taxonID"] = "9206" # insert a new attribute
"9206"
julia> println(genus[1]) # the "genus" element has been updated
<genus name="Homo" taxonID="9206">
<species name="sapiens">Human</species>
</genus>
In this package, a Node
object is regarded as a container of its child nodes. This idea is reflected on its property and function names; for example, a property returning the first child node is named as .firstnode
instead of .firstchildnode
. All properties and functions provided by the EzXML
module are named in this way, and the tree traversal API of a node works on its child nodes by default. Properties (functions) with a direction prefix work on that direction; for example, .nextnode
returns the next sibling node and .parentnode
returns the parent node.
Distinction between nodes and elements is what every user should know about before using the DOM API. There are good explanations on this topic: http://www.w3schools.com/xml/dom_nodes.asp, http://stackoverflow.com/questions/132564/whats-the-difference-between-an-element-and-a-node-in-xml. Some properties (functions) have a suffix like node
or element
that indicate a node type the property (function) is interested in. For example, .firstnode
returns the first child node (if any), which may be a text node, but .firstelement
always returns the first element node (if any):
julia> primates.firstnode
EzXML.Node(<TEXT_NODE@0x00007fd9f409f200>)
julia> primates.firstelement
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f4041a40>)
julia> primates.firstelement == genus[1]
true
julia> primates.lastnode
EzXML.Node(<TEXT_NODE@0x00007fd9f404bec0>)
julia> primates.lastelement
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f40828e0>)
julia> primates.lastelement === genus[2]
true
Tree traversal properties return nothing
when there is no corresponding node:
julia> primates.firstelement.nextelement === primates.lastelement
true
julia> primates.firstelement.prevelement === nothing
true
Here is the list of tree traversal properties:
- The
Document
type:.root
.dtd
- The
Node
type:.document
.parentnode
.parentelement
.firstnode
.firstelement
.lastelement
.lastnode
.nextnode
.nextelement
.nextnode
.prevnode
If you would like to iterate over child nodes or elements, you can use the eachnode(<parent node>)
or the eachelement(<parent node>)
function. The eachnode
function generates all nodes including texts, elements, comments, and so on, while eachelement
selects only element nodes. nodes(<parent node>)
and elements(<parent node>)
are handy functions that return a vector of nodes and elements, respectively:
julia> for node in eachnode(primates)
@show node
end
node = EzXML.Node(<TEXT_NODE@0x00007fd9f409f200>)
node = EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f4041a40>)
node = EzXML.Node(<TEXT_NODE@0x00007fd9f4060f70>)
node = EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f40828e0>)
node = EzXML.Node(<TEXT_NODE@0x00007fd9f404bec0>)
julia> for node in eachelement(primates)
@show node
end
node = EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f4041a40>)
node = EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f40828e0>)
julia> nodes(primates)
5-element Array{EzXML.Node,1}:
EzXML.Node(<TEXT_NODE@0x00007fd9f409f200>)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f4041a40>)
EzXML.Node(<TEXT_NODE@0x00007fd9f4060f70>)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f40828e0>)
EzXML.Node(<TEXT_NODE@0x00007fd9f404bec0>)
julia> elements(primates)
2-element Array{EzXML.Node,1}:
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f4041a40>)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fd9f40828e0>)
XPath
XPath is a query language for XML. You can retrieve target elements using a short query string. For example, "//genus/species"
selects all "species" elements just under a "genus" element.
The findall
, findfirst
and findlast
functions are overloaded for XPath query and return a vector of selected nodes:
julia> primates = readxml("primates.xml")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fbeddc2a1d0>))
julia> findall("/primates", primates) # Find the "primates" element just under the document
1-element Array{EzXML.Node,1}:
EzXML.Node(<ELEMENT_NODE[primates]@0x00007fbeddc1e190>)
julia> findall("//genus", primates)
2-element Array{EzXML.Node,1}:
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fbeddc12c50>)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fbeddc16ea0>)
julia> findfirst("//genus", primates)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fbeddc12c50>)
julia> findlast("//genus", primates)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fbeddc16ea0>)
julia> println(findfirst("//genus", primates))
<genus name="Homo">
<species name="sapiens">Human</species>
</genus>
If you would like to change the starting node of a query, you can pass the node as the second argument of find*
:
julia> genus = findfirst("//genus", primates)
EzXML.Node(<ELEMENT_NODE[genus]@0x00007fbeddc12c50>)
julia> println(genus)
<genus name="Homo">
<species name="sapiens">Human</species>
</genus>
julia> println(findfirst("species", genus))
<species name="sapiens">Human</species>
find*(<xpath>, <node>)
automatically registers namespaces applied to <node>
, which means prefixes are available in the XPath query. This is especially useful when an XML document is composed of elements originated from different namespaces.
There is a caveat on the combination of XPath and namespaces: if a document contains elements with a default namespace, you need to specify its prefix to the find*
function. For example, in the following example, the root element and its descendants have a default namespace "http://www.foobar.org", but it does not have its own prefix. In this case, you need to assign a prefix to the namespance when finding elements in the namespace:
julia> doc = parsexml("""
<parent xmlns="http://www.foobar.org">
<child/>
</parent>
""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fdc67710030>))
julia> findall("/parent/child", doc.root) # nothing will be found
0-element Array{EzXML.Node,1}
julia> namespaces(doc.root) # the default namespace has an empty prefix
1-element Array{Pair{String,String},1}:
"" => "http://www.foobar.org"
julia> ns = namespace(doc.root) # get the namespace
"http://www.foobar.org"
julia> findall("/x:parent/x:child", doc.root, ["x"=>ns]) # specify its prefix as "x"
1-element Array{EzXML.Node,1}:
EzXML.Node(<ELEMENT_NODE[child]@0x00007fdc6774c990>)
Streaming API
In addition to the DOM API, EzXML.jl provides a streaming reader of XML files. The streaming reader processes, as the name suggests, a stream of XML data by incrementally reading data from a file instead of reading a whole XML tree into the memory. This enables processing extremely large files with limited memory.
Let's use the following XML file (undirected.graphml) that represents an undirected graph in the GraphML format (slightly simplified for brevity):
<?xml version="1.0" encoding="UTF-8"?>
<graphml>
<graph edgedefault="undirected">
<node id="n0"/>
<node id="n1"/>
<node id="n2"/>
<node id="n3"/>
<node id="n4"/>
<edge source="n0" target="n2"/>
<edge source="n1" target="n2"/>
<edge source="n2" target="n3"/>
<edge source="n3" target="n4"/>
</graph>
</graphml>
The API of a streaming reader is quite different from the DOM API. The first thing you needs to do is to create an EzXML.StreamReader
object using the open
function:
julia> reader = open(EzXML.StreamReader, "undirected.graphml")
EzXML.StreamReader(<READER_NONE@0x00007f9fe8d67340>)
The stream reader is stateful and parses components by pulling them from the stream. For example, when it reads an element from the stream, it changes the state to READER_ELEMENT
and some information becomes accessible. Its reading state is advanced by the iterate(reader)
method:
julia> reader.type # the initial state is READER_NONE
READER_NONE
julia> iterate(reader); # advance the reader's state
julia> reader.type # now the state is READER_ELEMENT
READER_ELEMENT
julia> reader.name # the reader has just read a "<graphml>" element
"graphml"
julia> iterate(reader);
julia> reader.type # now the state is READER_SIGNIFICANT_WHITESPACE
READER_SIGNIFICANT_WHITESPACE
julia> reader.name
"#text"
julia> iterate(reader);
julia> reader.type
READER_ELEMENT
julia> reader.name # the reader has just read a "<graph>" element
"graph"
julia> reader["edgedefault"] # attributes are accessible
"undirected"
While reading data, a stream reader provides the following properties:
.type
: node type it has read.depth
: depth of the current node.name
: name of the current node.content
: content of the current node.namespace
: namespace of the current node
iterate(reader)
returns nothing
to indicate that there are no more data available from the file. When you finished reading data, you need to call close(reader)
to release allocated resources:
julia> reader = open(EzXML.StreamReader, "undirected.graphml")
EzXML.StreamReader(<READER_NONE@0x00007fd642e80d90>)
julia> while (item = iterate(reader)) != nothing
@show reader.type, reader.name
end
(reader.type, reader.name) = (READER_ELEMENT, "graphml")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "graph")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "node")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "node")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "node")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "node")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "node")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "edge")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "edge")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "edge")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_ELEMENT, "edge")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_END_ELEMENT, "graph")
(reader.type, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(reader.type, reader.name) = (READER_END_ELEMENT, "graphml")
julia> reader.type, reader.name
(READER_NONE, nothing)
julia> close(reader) # close the reader
The open(...) do ... end
pattern can be written as:
julia> open(EzXML.StreamReader, "undirected.graphml") do reader
# do something
end
EzXML.jl overloads the Base.iterate
function to make a streaming reader iterable via the for
loop. Therefore, you can iterate over all components without explicitly calling iterate
as follows:
julia> reader = open(EzXML.StreamReader, "undirected.graphml")
EzXML.StreamReader(<READER_NONE@0x00007fd642e9a6b0>)
julia> for typ in reader
@show typ, reader.name
end
(typ, reader.name) = (READER_ELEMENT, "graphml")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "graph")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "node")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "node")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "node")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "node")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "node")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "edge")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "edge")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "edge")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_ELEMENT, "edge")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_END_ELEMENT, "graph")
(typ, reader.name) = (READER_SIGNIFICANT_WHITESPACE, "#text")
(typ, reader.name) = (READER_END_ELEMENT, "graphml")
julia> close(reader)