XML: eXtensible Markup Language

CS 321 2007 Lecture, Dr. Lawlor

XML (eXtensible Markup Language) is an international standard for representing data as plain-text markup. It looks a lot like HTML, but unlike HTML, XML lets you make up your own tags, like a new "crap" tag you use for describing statements you don't believe: "<crap>I am an egg</crap>".

A begin-end tag pair can contain either plain ASCII text or arbitrarily complicated other tags inside of it, which makes XML another example of a recursive-list structure like we described last class.
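
To make the recursive structure concrete, here is a minimal sketch of how a parsed document could be held in memory. The `Node` type and the `toXml` and `depth` helpers are hypothetical names made up for this illustration, not part of any XML library, and the sketch assumes the text never contains characters like '<' or '&' that would need escaping.

```cpp
#include <string>
#include <vector>

// Hypothetical node type: either a run of plain text (tag is empty)
// or a tag that owns an ordered list of child nodes.
struct Node {
    std::string tag;            // element name; empty for a text node
    std::string text;           // character data; used only by text nodes
    std::vector<Node> children; // nested nodes, in document order
};

// Recursively write a node back out as XML text.
// Assumption: text needs no escaping.
std::string toXml(const Node &n) {
    if (n.tag.empty()) return n.text;
    std::string out = "<" + n.tag + ">";
    for (const Node &child : n.children) out += toXml(child);
    out += "</" + n.tag + ">";
    return out;
}

// Nesting depth: a text node counts 0, a tag counts 1 plus its deepest child.
int depth(const Node &n) {
    if (n.tag.empty()) return 0;
    int deepest = 0;
    for (const Node &child : n.children) {
        int d = depth(child);
        if (d > deepest) deepest = d;
    }
    return 1 + deepest;
}
```

Both helpers have the same shape: handle the plain-text case directly, and otherwise recurse into the children. Most code that walks XML ends up looking like this.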

XML is usually parsed by a dedicated generic XML parsing library.  Libraries are built into JavaScript and .NET, but for C/C++ you have to download one.  I recommend "expat" 1.0 for a small and simple one-tag-at-a-time parser.  There's also a standard called the DOM (Document Object Model) for accessing XML files as trees, used by, for example, libxml2.  See the examples of both approaches below.

The list of valid tags that can be used in a document, and the ways in which tags can be nested, can be stored in a Document Type Definition, or DTD.   A "validating" XML parser can check the tags against the DTD, and give good error messages if the tags don't make sense--this means you don't have to do as much error checking in your own code.

There's also a standard called XSL, the eXtensible Stylesheet Language, for converting XML documents tag by tag into HTML (or other XML-style formats).

Simple Example of XML

<foo>This is foo data <bar>And bar nested inside foo</bar></foo>
Be sure to close all your tags!  You can usually get away with leaving tags open in HTML, but not XML!
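
As a rough illustration of what "close all your tags" means to a parser, the sketch below checks that tags are closed in the reverse of the order they were opened, using a stack. The function name `tagsBalanced` is made up for this example, and it assumes a very restricted input: no attributes, comments, self-closing tags, or CDATA sections. A real XML parser checks far more than this.

```cpp
#include <string>
#include <vector>

// Returns true if every <tag> is closed by a matching </tag> in the right order.
// Simplifying assumptions: no attributes, comments, self-closing tags, or CDATA.
bool tagsBalanced(const std::string &xml) {
    std::vector<std::string> open; // stack of currently open tag names
    std::string::size_type pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        std::string::size_type close = xml.find('>', pos);
        if (close == std::string::npos) return false; // unterminated tag
        std::string name = xml.substr(pos + 1, close - pos - 1);
        if (!name.empty() && name[0] == '/') {
            // End tag: must match the most recently opened tag.
            if (open.empty() || open.back() != name.substr(1)) return false;
            open.pop_back();
        } else {
            open.push_back(name); // start tag: remember it until it is closed
        }
        pos = close + 1;
    }
    return open.empty(); // anything still on the stack was never closed
}
```

With this sketch, the example line above passes, while the same line with the "</bar>" left out fails because "</foo>" arrives while "bar" is still open.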

Example of Real Data Stored in XML

For a project Dr. Hay and I were working on involving airspaces, we chose to store the airspace descriptions in XML format.  The "<!-- stuff -->" tags are human-readable comments.  Everything else is machine-readable XML.
<!-- An Airspace is a region of space where airplanes might fly. 
You can have any number of airspaces, which may overlap. -->
<Airspace>
<name>Fairbanks International Airport</name> <!-- Human-readable name; also listed in Restriction forAirspace blocks -->

<!-- Altitude range for this airspace. AGL is 'above ground level'. MSL is above 'mean sea level'. -->
<bottom>100ft AGL</bottom>
<top>2000ft MSL</top>

<!-- Ground outline of airspace -->
<Outline>
<!-- x,y map projection coordinates of points on outline -->
<coordinates>
460184.00,7190200.00
461256.00,7189160.00
458248.00,7185712.00
457056.00,7186720.00
458664.00,7188720.00
</coordinates>
<!-- Map projection used for specifying outline.
UTM-6, UTM zone 6 (150W-144W) is the only allowed value for now! -->
<projection>UTM-6</projection>
<!-- Map coordinate units: feet or meters -->
<units>meters</units>
</Outline>
</Airspace>
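
Note that XML only delivers the text inside "<coordinates>" as one string; turning it into numbers is still up to the application. Here is a minimal sketch of that step, assuming one "x,y" pair per whitespace-separated token as in the example above. The function name `parseCoordinates` is hypothetical and not taken from the actual airspace project.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse the text of a <coordinates> block into (x,y) pairs.
// Assumes one "x,y" pair per whitespace-separated token.
std::vector<std::pair<double,double> > parseCoordinates(const std::string &text) {
    std::vector<std::pair<double,double> > points;
    std::istringstream in(text);
    std::string token;
    while (in >> token) { // operator>> skips spaces and newlines between tokens
        std::istringstream fields(token);
        double x, y;
        char comma;
        if (fields >> x >> comma >> y && comma == ',')
            points.push_back(std::make_pair(x, y));
        // Tokens that do not look like "x,y" are silently skipped in this sketch.
    }
    return points;
}
```

A real application would probably report malformed tokens instead of skipping them, and would read the "<units>" and "<projection>" tags before interpreting the numbers.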

Parsing XML with Expat

"Expat" is a simple XML parser library.  To use the library, you build an XML_Parser object and register a set of functions for the library to call when it reaches a start tag, a run of character data, or an end tag.
Your "start", "data", and "end" functions can do anything with the XML--save it to a tree structure, look through until they find the data they're looking for, or just print or convert the stuff as it goes by.

Here's a complete example:
/**
Demo of how to call the "expat" library on
an XML file. I recommend expat 1.0, from
http://www.jclark.com/xml/expat.html

This file must be linked with "-lexpat".

Orion Sky Lawlor, olawlor@acm.org, 2007/04/10 (Public Domain)
*/
#include <cctype> /* isspace */
#include <iostream>
#include <fstream>
#include <string>
#include "expat.h" /* libexpat XML parser */

/** Accumulates data from parsed XML.
  Currently simple, since we just print stuff out,
  but in a real application would be more complex... */
class myStuff {
public:
    /* Name of the most recently started tag */
    std::string tag;
    /* Number of nested XML tags so far */
    int level;
    /* Indent by one space per nesting level, then return the output stream */
    std::ostream &print(void) {
        for (int i=0;i<level;i++)
            std::cout<<" ";
        return std::cout;
    }
    myStuff() {level=0;}
};

/** Called on a start tag, like <foo bar="baz"> */
void myStart(myStuff *s,
    const XML_Char *name, /* foo */
    const XML_Char **atts) /* bar = "baz" */
{
    s->tag=name;
    s->print()<<"Starting tag: '"<<s->tag<<"'\n";
    /* atts is a NULL-terminated list of name, value, name, value, ... */
    for (int i=0;atts[i];i+=2) {
        s->print()<<"-Attr '"<<
            atts[i]<<"'='"<<atts[i+1]<<"'\n";
    }
    s->level++; /* entering a new tag */
}


/* Remove whitespace from start and end of string */
std::string squeeze(const std::string &str) {
    std::string::size_type start=0,end=str.size();
    /* Cast to unsigned char: isspace is undefined for negative char values */
    while (start<end && isspace((unsigned char)str[start])) start++;
    while (start<end && isspace((unsigned char)str[end-1])) end--;
    return std::string(str,start,end-start);
}

/** Called on character data found in the file */
void myData(myStuff *s,
    const XML_Char *data,int len)
{
    std::string str=squeeze(std::string(data,len)); /*<- silly: expat will give you whitespace too */
    if (str.size()>0u)
        s->print()<<"Character data: '"<<str<<"'\n";
}

/** Called on an end tag, like </foo> */
void myEnd(myStuff *s,
    const XML_Char *name) /* foo */
{
    s->level--; /* leaving a tag */
    s->print()<<"Ending tag: '"<<name<<"'\n";
}



int main(int argc,char *argv[]) {
    const char *src="in.xml"; /* file name; unused in this demo, which parses the string below */
    if (argc>1) src=argv[1];
    XML_Parser p=XML_ParserCreate(NULL);
    myStuff s;
    XML_SetUserData(p,&s); /* pass s to our routines */
    /* Register our routines to get called as
      the file is parsed */
    XML_SetElementHandler(p,
        (XML_StartElementHandler)myStart,
        (XML_EndElementHandler)myEnd);
    XML_SetCharacterDataHandler(p,
        (XML_CharacterDataHandler)myData);

    /* Parse the XML data (e.g., read from file, network, etc) */
    std::string xmldata="<foo>This is foo data <bar>And bar nested inside foo</bar></foo>";
    /* XML_Parse returns 0 on error, so check both calls */
    if (!XML_Parse(p,xmldata.data(),(int)xmldata.size(),0) /* parse n bytes */
     || !XML_Parse(p,0,0,1)) /* end of input */
    {
        std::cerr<<"XML parse error: "
            <<XML_ErrorString(XML_GetErrorCode(p))<<"\n";
        XML_ParserFree(p);
        return 1;
    }
    XML_ParserFree(p); /* release the parser */
    return 0;
}

You can also call XML_Parse several times, with pieces of the XML file.  This is useful if you're reading in the file in pieces.
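
One consequence worth knowing: expat may deliver a single run of text to your character data handler in several calls, so handlers that look for specific data should accumulate text rather than assume it arrives all at once. Below is a minimal sketch of a user-data object that collects the text inside the first occurrence of one tag. The `TagFinder` type and its `onStart`, `onData`, and `onEnd` methods are hypothetical names for this illustration; the sketch deliberately avoids calling expat itself, and assumes you would forward to these methods from handlers like the myStart, myData, and myEnd functions above.

```cpp
#include <string>

// Hypothetical user-data object: collects the text inside the first <wanted> tag.
struct TagFinder {
    std::string wanted; // tag name we are looking for, e.g. "name"
    std::string found;  // text accumulated so far
    int inside;         // nesting depth inside the wanted tag (0 = outside)
    bool done;          // set once the wanted tag has closed

    explicit TagFinder(const std::string &w) : wanted(w), inside(0), done(false) {}

    // Forward from the start-tag handler.
    void onStart(const char *name) {
        if (inside > 0) inside++;                      // a tag nested in the wanted one
        else if (!done && wanted == name) inside = 1;  // first occurrence begins
    }
    // Forward from the character-data handler; may be called several times in a row.
    void onData(const char *data, int len) {
        if (inside > 0) found.append(data, len);       // data is not NUL-terminated
    }
    // Forward from the end-tag handler.
    void onEnd(const char * /*name*/) {
        if (inside > 0 && --inside == 0) done = true;  // the wanted tag just closed
    }
};
```

Run against the airspace file above with a wanted tag of "name", this should end up with "Fairbanks International Airport" in `found`, however the parser happens to split the text. Text from any tags nested inside the wanted tag is appended too, which may or may not be what a given application wants.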

Parsing XML with libxml

The GNOME XML parsing library, libxml, takes a different approach.  Rather than handing you every tag as it arrives ("here's a begin.  OK, now here's some data.  Now here's an end."), libxml takes the DOM approach: it reads the whole XML file and stores it in memory as a tree of "xmlNode" structs.

(WARNING: lots of the documentation on the web applies to libxml2, but libxml is all that's installed on most machines!)

Here's how it works.  The code is a lot simpler than the expat version, but keeping the whole document in memory is less efficient, and for real projects, I don't know if it's really that much better.
/**
Demo of how to call the "libxml" library on
an XML string.

This file must be linked with "-lxml".

Orion Sky Lawlor, olawlor@acm.org, 2007/04/10 (Public Domain)
*/
#include <iostream>
#include <fstream>

#include <gnome-xml/tree.h> /* see http://xmlsoft.org/ */
#include <gnome-xml/parser.h>
#include <gnome-xml/xpath.h>

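void print_node(xmlNode *n) {
    std::cout<<"named '"<<n->name<<"'";
    /* Returns a newly allocated copy of the node's text, which the caller
      should release (with xmlFree in libxml2); leaked here for brevity. */
    const xmlChar *content=xmlNodeGetContent(n);
    if (content)
        std::cout<<" with content '"<<content<<"'";
    std::cout<<"\n";
}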

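int main(int argc,char *argv[]) {
    /* you can also xmlParseFile("foo.xml") to parse a file */
    xmlDocPtr doc=xmlParseDoc((xmlChar *)
        "<foo>This is foo data <bar>And bar nested inside foo</bar></foo>"
    );
    if (doc==NULL) { /* parse failed */
        std::cerr<<"Could not parse XML\n";
        return 1;
    }

    xmlNode *root = xmlDocGetRootElement(doc);
    if (root==NULL) { /* empty document */
        xmlFreeDoc(doc);
        return 1;
    }
    std::cout<<"The root node is "; print_node(root);

    /* libxml 1.x names the child list "childs"; libxml2 renamed it "children" */
    for (xmlNode *cur=root->childs;cur!=NULL;cur=cur->next) {
        std::cout<<" Child node: "; print_node(cur);
    }

    xmlFreeDoc(doc);

    return 0;
}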