Chapter 17: Data Files

Many programs need to read and write data files. A program might read data files to initialize its configuration or to receive data from another program; a program might write data files to save its state or to send data to another program. In this chapter we'll explore techniques for reading and writing data files, and for designing data file formats so that they are functional, useful, and convenient.

What makes a good data file? There are many desirable attributes which we might want to achieve or to trade off against one another. We might want data files to be small (to save disk space) or to be efficient to read and write (to save computer time) or to be easy to read and write (that is, easy to implement, to save programmer time). We might want them to be human-readable, to make them easier to debug or modify, or so that they could be created "by hand" (i.e. using standard file-manipulation tools). On the other hand, if the files are to contain sensitive data, we might prefer that they not be human-readable. We might want the files to be portable across different machine architectures (if we will be moving data files from machine to machine). We might want to ensure that if the data file format ever changes (perhaps to add new information), newer versions of our software (that is, the software that reads and writes the data files) can still read the old files, and perhaps even that old versions of the software can at least partially read the new files. We'll see ways of achieving all of these attributes.

Roughly speaking, there are two large classes of data file formats: "text" and "binary". Text files, as their name implies, contain human-readable text; that is, if you were to read one into a text editor or dump one to your screen, it would consist of strings of printable characters, arranged into lines. (By "printable characters" we mean characters which display nicely on the screen, as opposed to "control characters." Generally speaking, the only control characters a text file will contain will be CR or LF or CRLF combinations to mark the ends of lines, and perhaps horizontal tabs. C represents the end-of-line character(s) by \n, and tabs by \t.)

Binary files, on the other hand, contain arbitrary patterns of bits and bytes, arranged for the computer's convenience, not the human's. The bytes making up a binary file are not intended to be interpreted as characters or text; if you dump one to the screen, you get all sorts of garbage. Some of the bit patterns will happen to represent printable characters, but others will be control characters, others may be special graphics characters, and still others may end up representing sequences which will switch the display into inverse video, clear the screen, etc. (Depending on your display environment, printing arbitrary binary characters may confuse the display so badly that it becomes unusable and must be reset.)

In a text file, we might represent the integer 12345 as the five characters 1 2 3 4 5 (that is, as the text string "12345"). In a binary file, on the other hand, we might represent it as two bytes with values 0x30 and 0x39, since 12345 base 10 is 3039 base 16. (Just to confuse the issue, it happens that in the ASCII character set the values 0x30 and 0x39 represent the characters '0' and '9', but this is sheer coincidence; the characters '0' and '9' of course have no meaningful relationship to the value 12345 that we're storing.)
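To make the difference concrete, here is a minimal sketch that produces both representations of the same value. The function names text_form and binary_form are made up for this illustration, and the binary form assumes a value that fits in 16 bits, written most-significant byte first.

```c
#include <stdio.h>

/* Text form: the decimal digits of n, as characters.
   Returns the number of characters produced (as snprintf does). */
int text_form(int n, char *buf, size_t bufsize)
{
	return snprintf(buf, bufsize, "%d", n);
}

/* Binary form: two bytes, most significant first.
   Assumes 0 <= n <= 65535. */
void binary_form(unsigned int n, unsigned char out[2])
{
	out[0] = (n >> 8) & 0xff;
	out[1] = n & 0xff;
}
```

For 12345, text_form yields the five characters "12345", while binary_form yields the two bytes 0x30 and 0x39.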

17.1: Text Data Files

17.2: Binary Data Files



17.1: Text Data Files

Text data files, it must be admitted, are not always as compact or as efficient to read and write as binary files. It can be a bit more work to set up the code which reads and writes them. But they have some powerful advantages: any time you need to, you can look at them using ordinary text editors and other tools. If program A is writing a data file which program B is supposed to be able to read but cannot, you can immediately look at the file to see if it's in the correct format and so determine whether it's program A's or B's fault. If program A has not been written yet, you can easily create a data file by hand to test program B with. Text files are automatically portable between machines, even those where integers and other data types are of different sizes or are laid out differently in memory. Because they're not expected to have the rigid formats of binary files, it tends to be more natural to arrange text files so that as the data file format changes slightly, newer (or older) versions of the software can read older (or newer) versions of the data file. Text data files are the focus of this chapter; they're what I use all the time, and they're what I recommend you use unless you have compelling reasons not to.

When we're using text data files, we acknowledge that the internal and external representations of our data are quite different. For example, a value of type int will usually be represented internally as a 2- or 4-byte (16- or 32-bit) piece of memory. Externally, though, that integer will be represented as a string of characters representing its decimal or hexadecimal value. Converting back and forth between the internal and external representations is easy enough. To go from the internal representation to the external, we'll almost always use printf or fprintf; for example, to convert an int we might use %d or %x format. To convert from the external representation back to the internal, we could use scanf or fscanf, or read the characters in some other way and then use functions like atoi, strtol, or sscanf.
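Of the conversion functions just mentioned, atoi is the simplest but gives no indication when the text isn't a number at all. As a hedged sketch of the more careful route, the hypothetical helper parse_int below (the name and the 0/-1 return convention are assumptions of this example) uses strtol and rejects empty input, trailing junk, and out-of-range values.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Convert the decimal text in s to an int.
   Returns 0 and stores the value in *out on success, -1 on failure.
   Trailing whitespace (such as the \n left by fgets) is allowed. */
int parse_int(const char *s, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if(end == s)
		return -1;		/* no digits at all */
	if(errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -1;		/* does not fit in an int */
	while(isspace((unsigned char)*end))
		end++;
	if(*end != '\0')
		return -1;		/* junk after the number */
	*out = (int)v;
	return 0;
}
```

The fragments in the rest of this section stick with atoi for brevity; a function like this could be dropped in wherever a corrupted field ought to be diagnosed rather than silently read as 0.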

We have a great many options when it comes to performing this mapping, that is, when converting between the internal and external representations. Our choice may be determined by the layout we want the data file to have, or by what's easiest to implement, or by some combination of these factors. Some of the choices are pretty arbitrary; but in any case, what matters most is obviously that the reading and writing code "match", that is, that the data file writing code write the data in the right format such that the data file reading code can accurately read it. For the rest of this section, we'll explore several ways of writing and reading data to and from text data files, using various combinations of the stdio functions (and perhaps one or two of our own).

Suppose we had an array of integers:

	int a[10];

and suppose it had been filled up with values, and suppose we wanted to write them out to a data file. We could write them all on one line, separated by spaces:

	fprintf(ofp, "%d %d %d %d %d %d %d %d %d %d\n",
		a[0], a[1], a[2], a[3], a[4], a[5],
			a[6], a[7], a[8], a[9]);

We could write them on 10 separate lines:

	for(i = 0; i < 10; i++)
		fprintf(ofp, "%d\n", a[i]);

Realizing that the loop is easier and more flexible, we could go back to writing them all on one line, using a loop:

	for(i = 0; i < 10; i++)
		fprintf(ofp, "%d ", a[i]);
	fprintf(ofp, "\n");

If we were worried about that trailing space at the end of the line, we could arrange to eliminate it:

	for(i = 0; i < 10; i++)
		{
		if(i > 0)
			fprintf(ofp, " ");
		fprintf(ofp, "%d", a[i]);
		}
	fprintf(ofp, "\n");

Recognizing that fprintf is overkill for printing single, fixed characters, we could replace two of the calls with putc:

	for(i = 0; i < 10; i++)
		{
		if(i > 0)
			putc(' ', ofp);
		fprintf(ofp, "%d", a[i]);
		}
	putc('\n', ofp);

When it came time to read the numbers in, we would have at least as many choices. We could read the ten values all at once, using fscanf:

	int r = fscanf(ifp, "%d %d %d %d %d %d %d %d %d %d",
		&a[0], &a[1], &a[2], &a[3], &a[4], &a[5],
			&a[6], &a[7], &a[8], &a[9]);
	if(r != 10)
		fprintf(stderr, "error in data file\n");

Since the scanf family treats all whitespace (spaces, tabs, and newlines) the same, this code would read either the format with all the numbers on one line, or the format with one number per line. Notice that we check fscanf's return value, to make sure that it successfully read in all the numbers we expected it to. Since data files come in from the outside world, it's possible for them to be corrupted, and programs should not blindly read them assuming that they're perfect. A program that crashes when it attempts to read a damaged data file is terribly frustrating; a program that diagnoses the problem is much more polite.
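As a small illustration of that whitespace rule, here is a sketch that reads the ten values with a loop rather than one long format string. The function name read_ten and its 0/-1 return convention are assumptions of this example.

```c
#include <stdio.h>

/* Read exactly ten integers, separated by any mix of spaces,
   tabs, and newlines.  Returns 0 on success, -1 if fewer than
   ten numbers could be read. */
int read_ten(FILE *ifp, int a[10])
{
	int i;

	for(i = 0; i < 10; i++)
		{
		if(fscanf(ifp, "%d", &a[i]) != 1)
			return -1;
		}
	return 0;
}
```

The same function accepts a file with all ten numbers on one line and a file with one number per line, because %d skips any leading whitespace before each number.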

We could also read the data file a line at a time, converting the text to integers via other means. If the integers were stored one per line, we could use code like this:

	#define MAXLINE 200

	char line[MAXLINE];
	for(i = 0; i < 10; i++)
		{
		if(fgets(line, MAXLINE, ifp) == NULL)
			{
			fprintf(stderr, "error in data file\n");
			break;
			}
		a[i] = atoi(line);
		}

(We could also use our own getline or fgetline function instead of fgets.) If the integers were stored all on one line, we could use the getwords function from chapter 10 to separate the numbers at the whitespace boundaries:

	char *av[10];

	if(fgets(line, MAXLINE, ifp) == NULL)
		fprintf(stderr, "error in data file\n");
	else if(getwords(line, av, 10) != 10)
		fprintf(stderr, "error in data file\n");
	else	{
		for(i = 0; i < 10; i++)
			a[i] = atoi(av[i]);
		}
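The getwords function itself isn't reproduced in this chapter. For readers without chapter 10 at hand, here is a minimal stand-in with the behavior the fragments in this section rely on: it splits the line in place at whitespace, returns the number of words found, and, if there are more words than slots, leaves the rest of the line in the last slot. This is a sketch written for illustration, not necessarily the chapter 10 code; trimming trailing whitespace from the last slot is a choice made here.

```c
#include <ctype.h>
#include <string.h>

/* Split line (in place) into at most maxwords whitespace-separated
   words, storing pointers to them in words[].  Returns the number
   of words found.  If the line holds more than maxwords words,
   the last slot gets the whole rest of the line. */
int getwords(char *line, char *words[], int maxwords)
{
	char *p = line;
	int nwords = 0;

	while(nwords < maxwords)
		{
		while(isspace((unsigned char)*p))
			p++;			/* skip leading whitespace */
		if(*p == '\0')
			break;
		words[nwords++] = p;
		if(nwords == maxwords)
			{
			/* last slot: keep the rest, minus trailing whitespace */
			char *end = p + strlen(p);
			while(end > p && isspace((unsigned char)end[-1]))
				end--;
			*end = '\0';
			break;
			}
		while(*p != '\0' && !isspace((unsigned char)*p))
			p++;			/* find the end of this word */
		if(*p == '\0')
			break;
		*p++ = '\0';			/* terminate the word */
		}
	return nwords;
}
```

Because the trailing \n left by fgets counts as whitespace, it never ends up inside any of the words.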

Suppose, now, that there were not always 10 elements in the array a; suppose we had a separate integer variable na to record how many elements the array a currently contains. When writing the data out, we would certainly then use a loop; we might also want to precede the data by the count, in case that will make it easier for the reading program:

	fprintf(ofp, "%d\n", na);
	for(i = 0; i < na; i++)
		fprintf(ofp, "%d\n", a[i]);

We could also print all of the numbers on one line:

	fprintf(ofp, "%d", na);
	for(i = 0; i < na; i++)
		fprintf(ofp, " %d", a[i]);
	fprintf(ofp, "\n");

(Notice that the presence of the extra value at the beginning of the line makes the space separator game easier to play.)

Now, when reading the data in, we would simply read the count first, then the data. Using fscanf:

	if(fscanf(ifp, "%d", &na) != 1)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}

	if(na > 10)
		{
		fprintf(stderr, "too many items in data file\n");
		return;
		}

	for(i = 0; i < na; i++)
		{
		if(fscanf(ifp, "%d", &a[i]) != 1)
			{
			fprintf(stderr, "error in data file\n");
			return;
			}
		}

(Here we assume that the code to read the array from the data file is part of a function, and that when we detect an error, we return early from the function. In practice, we would probably return some error code to the caller.)

If we chose to use fgets (or fgetline), the code might look like this for data on separate lines:

	if(fgets(line, MAXLINE, ifp) == NULL)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}
	na = atoi(line);
	if(na > 10)
		{
		fprintf(stderr, "too many items in data file\n");
		return;
		}

	for(i = 0; i < na; i++)
		{
		if(fgets(line, MAXLINE, ifp) == NULL)
			{
			fprintf(stderr, "error in data file\n");
			return;
			}
		a[i] = atoi(line);
		}

Or, if the data were all on one line, like this:

	int ac;
	char *av[11];

	if(fgets(line, MAXLINE, ifp) == NULL)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}

	ac = getwords(line, av, 11);
	if(ac < 1)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}
	na = atoi(av[0]);
	if(na > 10)
		{
		fprintf(stderr, "too many items in data file\n");
		return;
		}
	if(na != ac - 1)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}
	for(i = 0; i < na; i++)
		a[i] = atoi(av[i+1]);

But sometimes, you don't need to save the count (na) explicitly; the reading program can deduce the number of items from the number of items in the file. If the file contains only the integers in this array, then we can simply read integers until we reach end-of-file. For example, using fscanf:

	na = 0;
	while(na < 10 && fscanf(ifp, "%d", &a[na]) == 1)
		na++;

(This code is deceptively simple; we haven't carefully dealt with appropriate error messages for a data file with more than 10 values, or a data file with a non-numeric "value" for which fscanf returns 0.)
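Here is one way the missing checks might be filled in. This is a sketch under assumptions of its own: the function name read_ints and its return codes (-1 for a non-numeric value, -2 for too many values) are invented for the example.

```c
#include <stdio.h>

/* Read integers until end-of-file into a[], which has room for max.
   Returns the number read, or
	-1 if something that is not a number was found,
	-2 if the file holds more than max numbers. */
int read_ints(FILE *ifp, int a[], int max)
{
	int n = 0;
	int r = EOF;
	int extra;

	while(n < max && (r = fscanf(ifp, "%d", &a[n])) == 1)
		n++;

	if(n < max)
		return (r == EOF) ? n : -1;	/* r == 0: not a number */

	/* the array is full; see whether anything is left over */
	r = fscanf(ifp, "%d", &extra);
	if(r == EOF)
		return n;
	return (r == 1) ? -2 : -1;
}
```

The key is that fscanf's three possible results mean three different things here: 1 is another value, 0 is text that isn't a number, and EOF is the clean end of the data.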

Again, we could also use fgets. If the data is on separate lines:

	na = 0;
	while(na < 10 && fgets(line, MAXLINE, ifp) != NULL)
		a[na++] = atoi(line);

If the data is all on one line:

	if(fgets(line, MAXLINE, ifp) == NULL)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}
	na = getwords(line, av, 11);	/* av must have room for 11 pointers */
	if(na > 10)
		{
		fprintf(stderr, "too many items in data file\n");
		return;
		}
	for(i = 0; i < na; i++)
		a[i] = atoi(av[i]);

Notice that this last implementation does not require that the file consist of only data for the array a. One line of the file consists of data for the array a, but other lines of the file could contain other data.

We could also scatter a's data on multiple lines, without using an explicit count, and with the ability for the file to contain other data as well, if we marked the end of the array data with an explicit marker in the file, rather than assuming that the array's data continued until end-of-file. For example, we could write the data out like this:

	for(i = 0; i < na; i++)
		fprintf(ofp, "%d\n", a[i]);
	fprintf(ofp, "end\n");

and read it like this:

	na = 0;
	while(fgets(line, MAXLINE, ifp) != NULL)
		{
		if(strncmp(line, "end", 3) == 0)
			break;
		if(na >= 10)
			{
			fprintf(stderr, "too many items in data file\n");
			return;
			}
		a[na++] = atoi(line);
		}

(There's just one nuisance here in checking for the "end" marker: fgets leaves the \n in the line it reads, so a simple strcmp against "end" would fail. Here we use strncmp, which compares at most n characters, and we pass the third argument, n, as 3. Other solutions would be to use strcmp against the string "end\n", or to strip the \n somehow, or to use our old getline or fgetline functions, since they strip the \n for us.)

Now that we've seen many (too many!) options for writing and reading the array, how do you decide which to use? Should you use fscanf, or the slightly more ad hoc methods involving fgets, getwords, atoi, etc? It's largely a matter of personal preference. In the code fragments we've looked at so far, the ones using fscanf have seemed shorter, although in some cases that was because they weren't doing as much error checking as the ones that used fgets. In general, the methods using fgets will allow somewhat more flexibility, as we saw when checking for the explicit "end" marker, which would have been difficult or impossible using scanf or fscanf.

Now let's move to another example, a user-defined data structure. Suppose we have this structure:

	struct s
		{
		int i;
		float f;
		char s[20];
		};

To write an instance of this structure out, we could simply print its fields on one line:

	struct s x;
	...
	fprintf(ofp, "%d %g %s\n", x.i, x.f, x.s);

or on several lines:

	fprintf(ofp, "%d\n", x.i);
	fprintf(ofp, "%g\n", x.f);
	fprintf(ofp, "%s\n", x.s);

or simply

	fprintf(ofp, "%d\n%g\n%s\n", x.i, x.f, x.s);

(We use %g format for the float field because %g tends to print the most accurate representation in the smallest space, e.g. 1.23e6 instead of 1230000 and 1.23e-6 instead of 0.00000123 or 0.000001.)

To read this structure back in, we could again either use fscanf, or fgets and some other functions. As before, fscanf seems easier:

	if(fscanf(ifp, "%d %g %s", &x.i, &x.f, x.s) != 3)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}

Here we have a problem, though: what if the third, string field contains a space? In the scanf family, the %s format stops reading at whitespace, so if x.s had contained the string "Hello, world!", it would be read back in as "Hello,". As it happens, we could fix it by using the less-obvious format string "%d %g %[^\n]", where %[^\n] means "match any string of characters not including \n". But we also have another problem: what if the string is too long for the 20-character array we allocated for the s field, which has room for 19 characters plus the terminating \0? We could fix this by using %19s or %19[^\n] (the field width counts only the characters stored, not the \0 that scanf appends), although we'd have to remember to change the scanf format string if we ever changed the size of the array.
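Putting the two fixes together gives something like the following sketch. The wrapper function read_record and its 0/-1 return convention are assumptions of this example; the format string is the point.

```c
#include <stdio.h>

struct s
	{
	int i;
	float f;
	char s[20];
	};

/* Read one record written as "%d %g %s\n".  The string field may
   contain spaces; at most 19 characters of it are stored.
   Returns 0 on success, -1 on a malformed record. */
int read_record(FILE *ifp, struct s *x)
{
	if(fscanf(ifp, "%d %g %19[^\n]", &x->i, &x->f, x->s) != 3)
		return -1;
	return 0;
}
```

Note that if the string in the file is longer than 19 characters, the excess is simply left unread in the input stream, where it will confuse the next read unless the caller skips to the end of the line.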

Let's leave fscanf for a moment and look at our other alternatives. If we'd printed the data all on one line, we could use

	#include <stdlib.h>	/* for atof() */

	char *av[3];

	if(fgets(line, MAXLINE, ifp) == NULL)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}
	if(getwords(line, av, 3) != 3)
		{
		fprintf(stderr, "error in data file\n");
		return;
		}
	x.i = atoi(av[0]);
	x.f = atof(av[1]);
	strcpy(x.s, av[2]);	/* XXX */

Here we luck out on the question of what happens if the string contains a space, because it happens that our version of getwords (see chapter 10, p. 13) leaves the remaining words in the last "word" if there are more words in the string than we told it to find, i.e. more than the third argument to getwords which gives the size of the av array. Here, we told it it could only look for 3 words, so if the string contains spaces, making the line appear to have 4 or more words, words 3, 4, etc. will all be pointed to by av[2]. However, we still have the problem that we haven't guarded against overflow of x.s if the third (plus fourth, etc.) word on the data line is longer than 19 characters. (The comment /* XXX */ is a traditional marker which means "this line is inadequate and definitely won't work reliably in all situations but for one reason or another the person writing it is not going to take the trouble to do it right just yet.")

If the data is written on three lines, on the other hand, we obviously have to call fgets three times to read it:

	if(fgets(line, MAXLINE, ifp) == NULL)
		{ fprintf(stderr, "error in data file\n"); return; }
	x.i = atoi(line);

	if(fgets(line, MAXLINE, ifp) == NULL)
		{ fprintf(stderr, "error in data file\n"); return; }
	x.f = atof(line);

	if(fgets(line, MAXLINE, ifp) == NULL)
		{ fprintf(stderr, "error in data file\n"); return; }
	strcpy(x.s, line);	/* XXX */

Now the last line has two problems: besides the lingering problem of overflow (if the line is more than 18 characters long), we have the problem that fgets retains the \n (which is why x.s will overflow if the line is longer than 18 characters, not 19). In this case, one way to fix the overflow problem would be to have fgets read into x.s directly:

	if(fgets(x.s, 20, ifp) == NULL)
		{ fprintf(stderr, "error in data file\n"); return; }

If we didn't want to have to remember to change that 20 in the call to fgets if we ever re-sized the array, we could get clever and write fgets(x.s, sizeof(x.s), ifp). Also, we might as well figure out how to get rid of that pesky \n. One way is by calling the standard library function strchr, which searches for a certain character in a string. This will require that we #include <string.h>, and declare an extra char * variable:

	#include <string.h>
	char *p;
	p = strchr(x.s, '\n');
	if(p != NULL)
		*p = '\0';

strchr returns a pointer to the character that it finds, or a null pointer if it doesn't find the character. If there's a \n in the line at all, we know it's at the end, so it's safe to overwrite it with a \0, making the string one character shorter. (Since we know that the \n is at the end, we could also call the function strrchr, which finds a character starting from the right.)

For any of the methods we've been using so far, what if one day we add a new field to the structure s? Obviously, we'll have to rewrite the code which writes the structure out and also the code which reads it in. Also, unless we're careful, the modified code won't be able to read in any data files we might happen to have lying around which were written before the structure was changed. Depending on the nature of the data file and the way it's used, this can be a real problem. (In principle, it's possible to write a utility program to convert the old data files to the new format, but it can be a nuisance to write that program, and it can be a real nuisance to track down all of the old data files that need converting.)

Therefore, when a data file format must be changed, it's often a good idea if the new, improved data file reader can be made to automatically detect and read old-format files as well. (Automatic detection isn't a strict necessity, but it's certainly a nicety.) Furthermore, it's much easier to write a new & improved data file reader, that can read both old and new formats, if the possibility was thought of back when the original data file format was designed.

One thing that helps a lot is if data file formats have version numbers, and if each data file begins with a number, in a simple format and known location which won't change even if the rest of the format changes, indicating which version of the format this file uses. Having a file format version number at the beginning of each data file leads to two immediate advantages:

  1. Whenever a new program reads a data file, it can immediately and unambiguously decide how it's going to read it, whether it can use its new & improved reading routines or whether it might have to fall back on its backwards-compatibility, old-style reader.
  2. If there is a suite of several programs, all of which read the same data files, and if for some reason there's an old version of one of the programs still in use, the old program can print an unambiguous message along the lines of "this is a new data file which I am too old to read", rather than printing the (misleading, in this case) "error in data file" (or crashing).
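A version line need not be elaborate. The sketch below assumes a first line of the form "version N"; the keyword, the value of FORMAT_VERSION, and the function names are all choices made for this example rather than anything fixed by the discussion above.

```c
#include <stdio.h>

#define FORMAT_VERSION 2	/* hypothetical current format version */

/* Write the version line; returns 0 on success, -1 on error. */
int write_version(FILE *ofp)
{
	return (fprintf(ofp, "version %d\n", FORMAT_VERSION) < 0) ? -1 : 0;
}

/* Read the version line; returns the version, or -1 if the
   first line is missing or is not a version line. */
int read_version(FILE *ifp)
{
	char line[200];
	int v;

	if(fgets(line, sizeof(line), ifp) == NULL)
		return -1;
	if(sscanf(line, "version %d", &v) != 1 || v < 1)
		return -1;
	return v;
}

/* Returns 0 if we can read this file, 1 if it is too new,
   -1 if it does not look like one of our files at all. */
int check_version(FILE *ifp)
{
	int v = read_version(ifp);

	if(v < 0)
		{
		fprintf(stderr, "error in data file\n");
		return -1;
		}
	if(v > FORMAT_VERSION)
		{
		fprintf(stderr, "this is a new data file which I am too old to read\n");
		return 1;
		}
	return 0;
}
```

A reader that supports several versions would call read_version directly and switch on the result to pick the old-style or new-style reading routine.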

Another technique which can be immensely useful and which we'll explore next is to define a data file format in such a way that the overall format doesn't change even if new data is added to it.

It's easy to see why the simple data file fragments we've been looking at so far are not resilient in the face of newly-introduced data fields. In the case of struct s, the reader always assumed that the first field in the data file was i, the second field was f, and the third field was s. If we ever add any new fields, unless we're careful to add them at the end of the file (and lucky on top of that), the simpleminded reader will get confused.

One powerful way of getting around this problem is to tag each piece of data in the file, so that the reader knows unambiguously what it is. For example, suppose that we wrote instances of our struct s out like this:

	fprintf(ofp, "i %d\n", x.i);
	fprintf(ofp, "f %g\n", x.f);
	fprintf(ofp, "s %s\n", x.s);

Now, each line begins with a little code which identifies it. (The code in the data file happens to match the name of the corresponding structure member, but that's not necessary, nor is there any way of getting the compiler to make any correspondence automatically.)

If we simply modified one of our previous file-reading code fragments to read this new, tagged format, we might quickly end up with a mess. We'd be continually checking the tag on the line we just read against the tag we expected to read, and constantly printing error messages or trying to resynchronize. But in fact, there's no reason to expect the lines to come in a certain order, and it turns out that it's easier to read such a file a line at a time, without that assumption, taking each line as it comes and not worrying what order the lines come in. Here is how we might do it:

	x.i = 0; x.f = 0.0; x.s[0] = '\0';

	while(fgets(line, MAXLINE, ifp) != NULL)
		{
		if(*line == '#')
			continue;
		ac = getwords(line, av, 2);
		if(ac == 0)
			continue;
		if(strcmp(av[0], "i") == 0)
			x.i = atoi(av[1]);
		else if(strcmp(av[0], "f") == 0)
			x.f = atof(av[1]);
		else if(strcmp(av[0], "s") == 0)
			strcpy(x.s, av[1]);	/* XXX */
		}

This example also throws in a few new little features: a line beginning with # is ignored, so we will be able to place comment lines in data files by beginning them with #. The code also ignores blank lines (those for which getwords returns 0).

We're now treating the "data file" almost like a "command file"--the first word on each line is almost like a "command" telling us to do something: i means store this value in x.i; f means store this value in x.f, etc. Since we don't have any easy way of telling whether we ever got around to setting a particular field, we initialize each one to an appropriate default value before we start. Notice that we did not have a last line in the if/else/if/else chain saying

	else	fprintf(stderr, "error in data file\n");

Instead, we quietly ignore lines we don't recognize! This strategy is admittedly on the simpleminded side, and it would not be adequate under all circumstances, but it means that an old program can read a new data file containing fields it's never heard of. The old program will still be able to pluck out the data it does recognize and can use, while (deliberately) ignoring the (new) data it doesn't know about.

This code is not perfect. We still have the same sorts of problems with that string field, s: it might contain spaces, which we get around (this time) by calling getwords with a third argument of 2, so that all but the first word on the line end up "in" av[1]. Also, the code does not check to see that there actually was a second word on the line before using it to set x.i, x.f, or x.s. (In this case, we could fix that by complaining if getwords did not return 2.)

Finally, we still have the potential for overflow, and we might as well grit our teeth now and figure out how to fix it. Since we already initialized x.s to the empty string with the assignment x.s[0] = '\0', one way around the problem is to replace the call to strcpy with a call to strncat:

		...
		else if(strcmp(av[0], "s") == 0)
			strncat(x.s, av[1], 19);

(or, again, perhaps strncat(x.s, av[1], sizeof(x.s)-1)). The strcat and strncat functions are slightly misleadingly named: what they actually do is append the second string you hand them (i.e. the second argument) to the first, in place. In the case of strncat, it never copies more than n characters, where n is its third argument, although it does always append a \0, which is why we tell it to copy at most 19 characters, not 20. (Since x.s starts out empty, there's definitely room for 19, although we would still have to worry about the possibility of a corrupted data file which contained two s lines. You might wonder why we couldn't simply use strncpy, but it turns out that, for obscure historical reasons, strncpy does not always append the \0.)

Although it has a few imperfections (which are easily remedied, and are left as exercises) this last example (using fgets, getwords, and an if/strcmp/else... chain) is an excellent basis for a flexible, robust data file reader.
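For comparison, here is one possible shape for a reader with those imperfections addressed. It is only a sketch: to keep it self-contained it splits each line with the standard functions strspn and strcspn instead of getwords, and the function name read_tagged, the -1 return for a missing value, and the decision that a second s line replaces the first are all assumptions of this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 200

struct s
	{
	int i;
	float f;
	char s[20];
	};

/* Read a tagged record.  Returns 0 on success,
   -1 if an i or f line has no value after the tag. */
int read_tagged(FILE *ifp, struct s *x)
{
	char line[MAXLINE];
	char *tag, *val;

	x->i = 0; x->f = 0.0f; x->s[0] = '\0';

	while(fgets(line, MAXLINE, ifp) != NULL)
		{
		line[strcspn(line, "\n")] = '\0';	/* strip the \n */
		if(line[0] == '#')
			continue;			/* comment line */
		tag = line + strspn(line, " \t");
		if(*tag == '\0')
			continue;			/* blank line */
		val = tag + strcspn(tag, " \t");	/* end of the tag */
		if(*val != '\0')
			{
			*val++ = '\0';
			val += strspn(val, " \t");	/* start of the value */
			}

		if(strcmp(tag, "i") == 0)
			{
			if(*val == '\0')
				return -1;
			x->i = atoi(val);
			}
		else if(strcmp(tag, "f") == 0)
			{
			if(*val == '\0')
				return -1;
			x->f = (float)atof(val);
			}
		else if(strcmp(tag, "s") == 0)
			{
			x->s[0] = '\0';	/* a second s line replaces the first */
			strncat(x->s, val, sizeof(x->s) - 1);
			}
		/* any other tag is deliberately ignored */
		}
	return 0;
}
```

Resetting x->s before each strncat means that a file with two s lines cannot overflow the array, at the cost of silently keeping only the last one.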

One footnote about the troublesome string field, s: to get around the problem of fixed-size arrays, you might one day decide to declare the s field of struct s as a pointer rather than a fixed-size array. You would have to be careful while reading, however. It might seem that you could just write, for example,

	x.s = av[1];	/* assumes char *s, but also WRONG */

but this would not work; remember that whenever you use pointers you have to worry about memory allocation. If you assigned x.s in that way, where would be the memory that it points to? It would be wherever av[1] points, which is back into the line array. Not only is that (probably) a local array, valid only while the file-reading functions are active, but it's also overwritten with each new line in the data file. You'll obviously want x.s to retain a useful pointer value pointing to the text read from the file, which means that you'll still have to make a copy, after allocating some memory. In this case, you might do

	x.s = malloc(strlen(av[1]) + 1);
	if(x.s == NULL)
		{ fprintf(stderr, "out of memory\n"); return; }
	strcpy(x.s, av[1]);

To some extent, the problems we've been having with field s are fundamental. In particular, any time you use text formats which are based on whitespace-separated "words," string fields which might contain spaces are always tricky to handle.


17.2: Binary Data Files

Normally, when writing notes like these, I progress from the easy to the hard, or the boring to the interesting, or the deficient to the recommended. This chapter is the reverse; I heartily recommend that you use the text data files of the previous section whenever possible. This section on binary data files is included for completeness, and you're welcome to skip it if you're not interested in using binary data files or if it doesn't make sense.

We've already seen two examples of writing and reading binary data files, in section 16.7 of the previous chapter. To write out an array of integers, we called

	fwrite(array, sizeof(int), na, fp);

To read them back in, we called

	na = fread(array, sizeof(int), 10, fp);

To write out a structure, we called

	fwrite(&x, sizeof(struct s), 1, fp);

To read it back in, we called

	fread(&x, sizeof(struct s), 1, fp);

(which returns 1 if it succeeds).

These examples certainly seem attractive: they will result in compact data files, they will probably be quite efficient, and they are certainly simple for the programmer to write. However, data files created in this way fare quite badly when evaluated against our other criteria. They will not be human-readable; they will contain sets of inscrutable byte values which are exact copies of the memory regions used to contain the data structures. They will not be at all portable; they cannot be correctly read (at least, not with the simple calls to fread) on machines where basic types such as int have different sizes, or where the basic types are laid out differently in memory (e.g. "big endian" vs. "little endian", or different floating-point representations). They may not even be able to be read by the same code compiled under a different compiler on the same machine, since different compilers may use different sizes for integers, or lay out the fields of structures differently in memory. (The fields will always be in the order you expect, but different compilers may, for various reasons, leave different amounts of empty space or "padding" between certain fields.) These binary files will have no provision whatsoever for backwards or forwards compatibility; any change to the structure definition will completely change the implied format of the data file, with no hope of reading older (or newer) files. The only other benefit these files have is that if the data is for any reason sensitive, it will certainly be a bit better concealed from prying eyes.
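The padding mentioned above is easy to observe. The following sketch uses a deliberately awkward structure, struct rec, invented for this illustration (it is not the struct s of the previous section), and reports how much of it belongs to no member.

```c
#include <stddef.h>

struct rec	/* hypothetical structure, for illustration only */
	{
	char c;
	int i;
	};

/* Number of bytes in a struct rec that belong to no member. */
size_t rec_padding(void)
{
	return sizeof(struct rec) - (sizeof(char) + sizeof(int));
}

/* Where the compiler actually placed the member i. */
size_t rec_i_offset(void)
{
	return offsetof(struct rec, i);
}
```

On many machines with 4-byte ints these report 3 bytes of padding and an offset of 4, but neither number is guaranteed, and that is exactly why a file written with fwrite(&r, sizeof(struct rec), 1, fp) ties the file format to one particular compiler and machine.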

We can get around these disadvantages of binary data files, but in so doing we'll lose many of the advantages, such as blinding efficiency or programmer convenience. If we care about data file portability or backwards or forwards compatibility, we will have to write structures one field at a time, not in one fell swoop. Furthermore, if we have an int to write, we may choose not to write it using fwrite:

	fwrite(&i, sizeof(int), 1, fp);

but rather a byte at a time, using putc:

	putc(i / 256, fp);
	putc(i % 256, fp);

In this way, we'd have precise control over the order in which the two halves of the int are being written. (We're assuming here that there's no more than two bytes' worth of data in the int, which is a safe assumption if we're portably assuming that ints can only hold up to +-32767.) When it came time to read the int back in, we might do that a byte at a time, too:

	i = getc(fp);
	i = 256 * i + getc(fp);

(We could not collapse this to i = 256 * getc(fp) + getc(fp), because we wouldn't know which order the two calls to getc would occur in.)
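The two fragments above can be wrapped up with a little error checking, since getc can return EOF in the middle of a value if the file is truncated. In this sketch the function names put16 and get16 are made up, and the value is treated as an unsigned quantity in the range 0 to 65535; negative values would need a convention of their own, which is not attempted here.

```c
#include <stdio.h>

/* Write v as two bytes, most significant first.
   Assumes 0 <= v <= 65535.  Returns 0 on success, -1 on error. */
int put16(unsigned int v, FILE *fp)
{
	if(putc((int)((v >> 8) & 0xff), fp) == EOF)
		return -1;
	if(putc((int)(v & 0xff), fp) == EOF)
		return -1;
	return 0;
}

/* Read two bytes, most significant first, into *vp.
   Returns 0 on success, -1 on end-of-file or error. */
int get16(FILE *fp, unsigned int *vp)
{
	int hi, lo;

	hi = getc(fp);		/* two separate statements, so the */
	if(hi == EOF)
		return -1;
	lo = getc(fp);		/* order of the reads is certain */
	if(lo == EOF)
		return -1;
	*vp = ((unsigned int)hi << 8) | (unsigned int)lo;
	return 0;
}
```

Writing 12345 with put16 produces the bytes 0x30 and 0x39 in that order on any machine, whatever the machine's own byte order or int size.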

We might also choose to use tags to mark the various "fields" within our binary data file; the tags would be more likely to be byte codes such as 0x00, 0x01, and 0x02 than the character or string codes we used in the tagged text data file of the previous section.

If you do choose to use binary data files, you must open them for writing with fopen mode "wb" and for reading with "rb" (or perhaps one of the + modes; the point is that you do need the b). Remember that, in the default mode, the standard I/O functions all assume text files, and translate between \n and the operating system's end-of-line representation. If you try to read or write a binary data file in text mode, whenever your internal data happens to contain a byte which matches the code for \n, or your external data happens to contain bytes which match the operating system's end-of-line representation, they may be translated out from under you, screwing up your data.

