File I/O in C++, C, UNIX, and Windows

CS 321 2007 Lecture, Dr. Lawlor

So a file really is just a big array of bytes.  This is a surprisingly simple concept--very 1970's.  Of course, you almost never want bytes, you want something useful.

ASCII: Bytes as European Letters

The ASCII (American Standard Code for Information Interchange) table is a way of interpreting bytes as letters.

ASCII x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x  (x9) tab, \t  (xA) \n  (xD) \r
1x  (xB) esc
2x space ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~ 
8x  (x3) ƒ  (x8) ˆ  (xA) Š  (xC) Œ  (xE) Ž
9x  (x8) ˜  (xA) š  (xC) œ  (xE) ž  (xF) Ÿ
Ax   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ASCII table.  The horizontal axis gives the low hex digit, the vertical axis gives the high hex digit, and the entry is the character for that byte.  E.g., "A" is 0x41.

"Standard" portable ASCII characters, shown in red, are the ones in rows 0x through 7x.

On a UNIX box, you can show the ASCII and hex interpretations of a string or file using the "od" command:
> echo -n "foo" | od -t cx1
0000000 f o o
66 6f 6f
0000003
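
If you don't have "od" handy, the same hex view is easy to produce yourself.  Here's a minimal sketch; the function name hex_bytes is made up for this example and isn't part of any library.

```cpp
#include <cstdio>
#include <string>

// Return the bytes of `s` as space-separated two-digit hex,
// roughly what "od -t x1" prints (hex_bytes is a hypothetical helper).
std::string hex_bytes(const std::string &s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); i++) {
        // Go through unsigned char so bytes >= 0x80 don't sign-extend.
        std::snprintf(buf, sizeof(buf), "%02x", (unsigned)(unsigned char)s[i]);
        if (i > 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes("foo") gives the string "66 6f 6f", matching the od output above.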

UNICODE: Bytes as Complicated Characters

So say your name is 是要找.  That's not listed in the ASCII table.  How do you represent your name?  As UNICODE, usually in "UTF-8" encoding.

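A UNICODE character in the range 0x0800 through 0xFFFF, with 16 bits xxxxyyyyyyzzzzzz, is stored as three UTF-8 bytes that have bits 1110xxxx 10yyyyyy 10zzzzzz.  (Smaller character codes take one or two bytes; bigger ones take four.)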

For example,
> echo -n "是" | od -t x1
0000000 e6 98 af
0000003
The single character 是 (which hopefully isn't offensive or something!) is stored as three bytes: 0xe6, 0x98, and 0xaf. 
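
The bit shuffling is short enough to write out.  Below is a sketch of just the three-byte case; utf8_encode3 is a name invented here, and the one-, two-, and four-byte cases are deliberately left out.  The character 是 is UNICODE code point 0x662F.

```cpp
#include <string>

// Encode one code point in the range 0x0800..0xFFFF as three UTF-8 bytes.
// Hypothetical helper: other ranges need 1, 2, or 4 bytes and are not handled.
std::string utf8_encode3(unsigned int c) {
    std::string out;
    out += (char)(0xE0 | ((c >> 12) & 0x0F)); // 1110xxxx: top 4 bits
    out += (char)(0x80 | ((c >> 6) & 0x3F));  // 10yyyyyy: middle 6 bits
    out += (char)(0x80 | (c & 0x3F));         // 10zzzzzz: low 6 bits
    return out;
}
```

Feeding in 0x662F produces the bytes 0xe6, 0x98, 0xaf, the same three bytes od printed.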

Raw Binary I/O Interfaces

So generally speaking, the "true" contents of a file are a bunch of bytes--anything else is subject to interpretation.  Luckily, you can read the raw bytes in a file using any OS:
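C Standard I/O
    Header:            #include <stdio.h>
    Open Binary File:  FILE *f=fopen("foo","rb");
    Read Binary Data:  int n=fread(buf,4,1,f);
    Write Binary Data: int n=fwrite(buf,4,1,f);
    Seek:              fseek(f,0, SEEK_SET);
    Close File:        fclose(f);

C++ STL I/O
    Header:            #include <fstream>
    Open Binary File:  std::ifstream s("foo",std::ios::binary);
    Read Binary Data:  s.read(buf,4);
    Write Binary Data: s.write(buf,4); /* needs a std::ofstream or std::fstream */
    Seek:              s.seekg(0);
    Close File:        /* automatic */

UNIX I/O
    Header:            #include <fcntl.h> and #include <unistd.h>
    Open Binary File:  int fd=open("demo",O_RDONLY);
    Read Binary Data:  int n=read(fd,buf,4);
    Write Binary Data: int n=write(fd,buf,4); /* file must be opened O_WRONLY or O_RDWR */
    Seek:              lseek(fd,0, SEEK_SET);
    Close File:        close(fd);

Windows I/O
    Header:            #include <windows.h>
    Open Binary File:  HANDLE h=CreateFile(.....);
    Read Binary Data:  int n=ReadFile(h,buf,...);
    Write Binary Data: int n=WriteFile(h,buf,...);
    Seek:              SetFilePointer(h, 0,0, FILE_BEGIN);
    Close File:        CloseHandle(h);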

See examples of all four I/O methods: Directory.
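
To see the C Standard I/O calls working together, here's a small sketch that writes some bytes, seeks back to the start, and reads them in again.  The function name round_trip and its parameters are made up for illustration; only fopen, fwrite, fseek, fread, and fclose come from the standard library.

```cpp
#include <cstdio>

// Write `n` bytes to `path`, seek back to the start, and read them into `out`.
// Returns true only if every step succeeded.  (Hypothetical helper.)
bool round_trip(const char *path, const char *data, std::size_t n, char *out) {
    // "w+b": create/truncate the file for both writing and reading, in binary mode.
    std::FILE *f = std::fopen(path, "w+b");
    if (!f) return false;
    bool ok = std::fwrite(data, 1, n, f) == n      // write raw bytes
           && std::fseek(f, 0, SEEK_SET) == 0      // rewind to byte 0
           && std::fread(out, 1, n, f) == n;       // read them back
    std::fclose(f);
    return ok;
}
```

Because the file is opened in binary mode, whatever bytes go in should come back out unchanged, including '\r' and '\n'.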

The Stupidity of Newlines

In the ASCII table above, there are two bytes that could legitimately be treated as indicating "this line of text is over.  start a new one.": '\n' (0x0A, newline) and '\r' (0x0D, carriage return).
Sadly, every major computer system nowadays treats newlines differently.

In DOS/Windows, '\r' only moves the cursor horizontally to the start of the line, and '\n' only moves the cursor vertically down.  This means to really start a new line of text in a DOS/Windows file, you need to use "\r\n", like this:
me> cat newline_win.txt
Hello
There!
me> od -t c newline_win.txt
0000000 H e l l o \r \n T h e r e ! \r \n
In UNIX, '\n' starts a new line.  '\r' can be used to overwrite the previous line if you really want to.
me> od -t c newline_unix.txt 
0000000   H   e   l   l   o  \n   T   h   e   r   e   !  \n
On Macs (classic Mac OS, before OS X), '\r' starts a new line.
me> od -t c newline_mac.txt 
0000000 H e l l o \r T h e r e ! \r
So the same text file written on three different machines will contain three different sets of bytes at the end of a line!

Luckily, more and more programs are accepting *anything* ("\n", "\r\n", or "\r") as indicating the end of a line.
Be careful, though!  Sometimes a compiler will choke on a particular line that looks fine in the editor because there's a stray foreign newline character there.  The editor might hide the foreign newline, but still write it out when the file is saved.  A binary file dump tool like "od" (on UNIX, like above) can be useful.
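
If you're writing a program that reads text, accepting all three newline styles isn't much code.  Here's one possible sketch; split_lines is a name invented for this example, and it assumes the whole file is already in a string.

```cpp
#include <string>
#include <vector>

// Split `text` into lines, accepting "\n", "\r\n", or "\r" as a line ending.
// (Hypothetical helper, not a standard library function.)
std::vector<std::string> split_lines(const std::string &text) {
    std::vector<std::string> lines;
    std::string cur;
    for (std::size_t i = 0; i < text.size(); i++) {
        char c = text[i];
        if (c == '\r' || c == '\n') {
            // Treat "\r\n" as one line ending, not two.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') i++;
            lines.push_back(cur);
            cur.clear();
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) lines.push_back(cur); // last line had no terminator
    return lines;
}
```

With this, the Windows, UNIX, and Mac versions of the "Hello / There!" file above all split into the same two lines.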

Newline Conversion

If you open a file in "text" mode, the Windows file output routines will silently insert "\r\n" in the file whenever your program says "\n".  The input routines will also silently translate "\r\n" into just "\n" when reading from text files.  No such conversion happens on any other machine, or if the file is opened in "binary mode".  In UNIX, there's no difference between opening a file in "binary" and "text" mode.
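
Here's a rough sketch of the effect of that translation, written as plain string functions so it behaves the same on any machine.  The names to_windows_text and from_windows_text are made up, and this only models the newline handling described above, not everything a real Windows text-mode stream does.

```cpp
#include <string>

// Model of Windows text-mode output: every "\n" is written as "\r\n".
// (Hypothetical helper.)
std::string to_windows_text(const std::string &s) {
    std::string out;
    for (char c : s) {
        if (c == '\n') out += '\r';
        out += c;
    }
    return out;
}

// Model of Windows text-mode input: every "\r\n" is read as just "\n".
// (Hypothetical helper.)
std::string from_windows_text(const std::string &s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); i++) {
        // Drop a '\r' only when it is immediately followed by '\n'.
        if (s[i] == '\r' && i + 1 < s.size() && s[i + 1] == '\n') continue;
        out += s[i];
    }
    return out;
}
```

Note that the translation can't tell text from data: any 0x0A byte gets a 0x0D stuck in front of it on output, which is exactly why binary files have to be opened in binary mode.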

If you FTP a file in "text" mode from one machine to another, the FTP program will silently replace your newlines with whatever the other side needs.  Of course, if you're transferring a binary file (like a program, zip archive, etc.) that just happens to contain 0x0D 0x0A, this same replacement will corrupt the file, so you've got to be careful to transfer in "binary" mode.