String Parsing

CS 321 Lecture, Dr. Lawlor, 2006/03/22

Reading text files can be quite painful.  The problem is that you've often got to slog through the contents of the file yourself. 

Check out these string input examples (Directory, Zip, Tar-gzip).

For example, to read a std::string from the standard input, you could do (like in this example):
	std::string s;
	std::cin>>s;
But this stops reading at the first whitespace character.  If you want to allow spaces, and read all the way to, say, a semicolon, then somebody's got to walk through the characters in a little loop until the semicolon shows up.  Sometimes it's possible to find a library routine to do this--for example, std::istream::getline takes a "terminator" character you can set to semicolon, although it reads into a bare "char *", not a std::string.  So sometimes you have to build the little loop yourself, like in this example.
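Here's a minimal sketch of that little loop, reading everything up to a semicolon (this is my own version, not the linked example):

	#include <iostream>
	#include <string>
	
	int main() {
		std::string s;
		char c;
		/* Pull one character at a time until we hit a semicolon (or end of input) */
		while (std::cin.get(c) && c!=';') {
			s+=c; /* keep everything, including spaces */
		}
		std::cout<<"Read: \""<<s<<"\"\n";
		return 0;
	}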

The loop is nasty, and hence it's a good idea to hide it inside a subroutine, like in this example.  Putting stuff into a subroutine usually means generalizing what you could otherwise have hardcoded, so in this case we pass the set of terminator characters in as a string.  If that string gets long, checking each input character against every possible terminator gets slow.
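One way such a subroutine might look--the name read_until and its arguments are just made up for this sketch, not taken from the linked example:

	#include <iostream>
	#include <string>
	
	/* Read characters from 'in' until we hit any character listed in 'terminators'.
	   The terminator itself is consumed but not returned. */
	std::string read_until(std::istream &in, const std::string &terminators) {
		std::string s;
		char c;
		while (in.get(c)) {
			/* Linear search over the terminator list--fine while the list is short */
			if (terminators.find(c)!=std::string::npos) break;
			s+=c;
		}
		return s;
	}
	
	int main() {
		std::string field=read_until(std::cin,";,\n");
		std::cout<<"Field: \""<<field<<"\"\n";
		return 0;
	}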

So it sometimes makes sense to build a little table to speed up this character check.  The idea is to index the table by the next character, which immediately tells you whether you should stop reading.  See this example.
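A sketch of the table-driven version (again my own code, not the linked example)--the table has one entry per possible byte value, so the per-character check becomes a single array lookup:

	#include <iostream>
	#include <string>
	
	int main() {
		/* stop[c] is true if character c should end the read */
		bool stop[256]={false};
		const std::string terminators=";,\n";
		for (unsigned int i=0;i<terminators.size();i++)
			stop[(unsigned char)terminators[i]]=true;
	
		std::string s;
		char c;
		/* One array lookup per character--no searching through the terminator list */
		while (std::cin.get(c) && !stop[(unsigned char)c])
			s+=c;
		std::cout<<"Read: \""<<s<<"\"\n";
		return 0;
	}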

It's possible to use this table-driven approach to parse really complicated languages--take CS 331 (computer languages), or look at the parser code generated by YACC (Yet Another Compiler-Compiler) to see how this is done.

International Text

So far, we've treated strings as arrays of bytes, and assumed characters were the same as bytes.  That's fine for English (which is nowadays almost always encoded using ASCII), but some languages use accented characters, and others use little ideographic pictures, and so can't fit all their possible characters into a single byte.  Hence the invention of Unicode, which assigns each character a number too big for one byte; such a wide character is often stored in a "wchar_t".  To keep working with normal text files, people developed a way to encode Unicode characters into 8-bit chunks, called UTF-8.  UTF-8 is defined so that plain old ASCII files work as expected (one byte, one character), but high ASCII (bytes with the top bit set) is redefined to build multi-byte characters.  This means you can mostly get away with ignoring other character sets and just treating all text as ASCII; it only causes problems when rendering one character at a time, or doing some operation on each character.
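Here's a tiny sketch of that byte-versus-character distinction (the UTF-8 string literal is just an example I made up).  In UTF-8, continuation bytes always look like 10xxxxxx, so you can count characters by skipping them:

	#include <iostream>
	#include <string>
	
	int main() {
		/* "héllo" in UTF-8: the é is two bytes (0xC3 0xA9), everything else is one byte */
		std::string s="h\xC3\xA9llo";
	
		int chars=0;
		for (unsigned int i=0;i<s.size();i++) {
			unsigned char c=s[i];
			if ((c&0xC0)!=0x80) chars++; /* skip continuation bytes (10xxxxxx) */
		}
		std::cout<<"bytes: "<<s.size()<<"   characters: "<<chars<<"\n";
		return 0;
	}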

Some systems support "wide" character types like "wchar_t" (a single Unicode character), "std::wstring" (a wide string), and "std::wcin" and "std::wcout" (for wide input and output).  The idea is that using wide characters lets you treat all Unicode and ASCII characters the same way.  Sadly, these don't seem to work on my Linux machines yet...
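For what it's worth, here's what the wide-character version is supposed to look like--whether anything beyond plain ASCII actually prints correctly depends on the locale setup, which is exactly the flaky part:

	#include <iostream>
	#include <string>
	
	int main() {
		/* The L prefix makes a wide (wchar_t) string literal */
		std::wstring w=L"hello";
		std::wcout<<w<<L" is "<<w.size()<<L" characters long"<<std::endl;
		return 0;
	}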