XML

CS 321 Lecture, Dr. Lawlor, 2006/04/14

XML (eXtensible Markup Language) is an international standard for representing data in a plain-text markup language. It looks suspiciously like HTML, but unlike HTML, XML lets you make up your own tags, like a new "crap" tag for describing statements you don't believe: "<crap>I am an egg</crap>".
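
As a small illustration of made-up tags, here is a sketch using Python's standard xml.etree.ElementTree module. Python is an assumption here, since the lecture doesn't name it, and is_well_formed is a hypothetical helper name. The parser accepts any tag name as long as the document is well-formed, and rejects mismatched tags:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A made-up tag is fine: XML has no fixed tag vocabulary.
root = ET.fromstring("<crap>I am an egg</crap>")
print(root.tag, "->", root.text)

# Every start tag must be closed by a matching end tag.
print(is_well_formed("<crap>I am an egg</crap>"))
print(is_well_formed("<crap>I am an egg</junk>"))
```

The second check fails because the closing tag does not match the opening tag, which a strict XML parser treats as an error.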

XML is usually parsed by a dedicated generic XML parsing library.  Libraries are built into JavaScript and .NET, but for C/C++ you have to download one.  I recommend "expat" 1.0, because it's a small and simple parser.  There's also a W3C standard called the DOM (Document Object Model), which represents a parsed XML document as a tree of objects rather than a stream of events.
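
To give a feel for the DOM style of access mentioned above, here is a minimal sketch using Python's standard xml.dom.minidom module. The "claims" wrapper tag and the second statement are invented for illustration. The whole document is loaded into an in-memory tree, which is then queried by tag name:

```python
from xml.dom import minidom

# Parse the entire document into a tree of node objects.
doc = minidom.parseString(
    "<claims><crap>I am an egg</crap><crap>The moon is cheese</crap></claims>"
)

# Query the tree: collect the text inside every "crap" element.
statements = [node.firstChild.data for node in doc.getElementsByTagName("crap")]
print(statements)
```

The tradeoff is memory: a tree-based parser holds the whole document at once, while an event-based parser like Expat only sees one piece at a time.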

The list of valid tags that can be used in a document, and the ways in which tags can be nested, can be stored in a Document Type Definition, or DTD.  A "validating" XML parser can check the tags against the DTD and give good error messages if the tags don't make sense, which means you don't have to do as much error checking in your own code.

There's also a standard for converting XML documents, tag by tag, to HTML (or other XML-style formats) called XSL, the eXtensible Stylesheet Language.

Parsing XML with Expat

"Expat" is a simple XML parser library.  To use the library, you build an XML_Parser object and register a set of functions for the library to call when it encounters the start of a tag, the character data inside a tag, or the end of a tag.
Your "start", "data", and "end" functions can do anything with the XML--save it to a tree structure, look through until they find the data they're looking for, or just print or convert the stuff as it goes by.
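
The callback pattern can be sketched with Python's standard xml.parsers.expat module, which wraps the same Expat library. This is not the C API itself: in C the handlers are registered with XML_SetElementHandler and XML_SetCharacterDataHandler, while the Python binding assigns them as attributes. The collect_events function and the sample "note" document are hypothetical names chosen for illustration:

```python
import xml.parsers.expat

def collect_events(xml_text):
    """Parse xml_text with Expat and return the list of callback events."""
    events = []

    def start(name, attrs):          # called at each start tag
        events.append(("start", name, dict(attrs)))

    def data(text):                  # called for text between tags
        if text.strip():             # skip whitespace-only runs
            events.append(("data", text))

    def end(name):                   # called at each end tag
        events.append(("end", name))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True        # deliver adjacent text as one chunk
    parser.StartElementHandler = start
    parser.CharacterDataHandler = data
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)     # True: this is the final chunk of input
    return events

events = collect_events('<note kind="doubt"><crap>I am an egg</crap></note>')
for event in events:
    print(event)
```

Here the handlers just record what goes by, but they could equally build a tree or stop paying attention once they find the data they want.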

Here's a tiny example (Directory, Zip, Tar-gzip)