Chapter 9: The C Preprocessor

Conceptually, the "preprocessor" is a translation phase that is applied to your source code before the compiler proper gets its hands on it. (Once upon a time, the preprocessor was a separate program, much as the compiler and linker may still be separate programs today.) Generally, the preprocessor performs textual substitutions on your source code, in three sorts of ways:

The next three sections will introduce these three preprocessing functions.

The syntax of the preprocessor is different from the syntax of the rest of C in several respects. First of all, the preprocessor is "line based." Each of the preprocessor directives we're going to learn about (all of which begin with the # character) must begin at the beginning of a line, and each ends at the end of the line. (The rest of C treats line ends as just another whitespace character, and doesn't care how your program text is arranged into lines.) Secondly, the preprocessor does not know about the structure of C--about functions, statements, or expressions. It is possible to play strange tricks with the preprocessor to turn something which does not look like C into C (or vice versa). It's also possible to run into problems when a preprocessor substitution does not do what you expected it to, because the preprocessor does not respect the structure of C statements and expressions (but you expected it to). For the simple uses of the preprocessor we'll be discussing, you shouldn't have any of these problems, but you'll want to be careful before doing anything tricky or outrageous with the preprocessor. (As it happens, playing tricky and outrageous games with the preprocessor is considered sporting in some circles, but it rapidly gets out of hand, and can lead to bewilderingly impenetrable programs.)

9.1 File Inclusion

9.2 Macro Definition and Substitution

9.3 Conditional Compilation


9.1 File Inclusion

[This section corresponds to K&R Sec. 4.11.1]

A line of the form

	#include <filename.h>
or
	#include "filename.h"
causes the contents of the file filename.h to be read, parsed, and compiled at that point. (After filename.h is processed, compilation continues on the line following the #include line.) For example, suppose you got tired of retyping external function prototypes such as
	extern int getline(char [], int);
at the top of each source file. You could instead place the prototype in a header file, perhaps getline.h, and then simply place
	#include "getline.h"
at the top of each source file where you called getline. (You might not find it worthwhile to create an entire header file for a single function, but if you had a package of several related function, it might be very useful to place all of their declarations in one header file.) As we may have mentioned, that's exactly what the Standard header files such as stdio.h are--collections of declarations (including external function prototype declarations) having to do with various sets of Standard library functions. When you use #include to read in a header file, you automatically get the prototypes and other declarations it contains, and you should use header files, precisely so that you will get the prototypes and other declarations they contain.

The difference between the <> and "" forms is where the preprocessor searches for filename.h. As a general rule, it searches for files enclosed in <> in central, standard directories, and it searches for files enclosed in "" in the "current directory," or the directory containing the source file that's doing the including. Therefore, "" is usually used for header files you've written, and <> is usually used for headers which are provided for you (which someone else has written).

The extension ".h", by the way, simply stands for "header," and reflects the fact that #include directives usually sit at the top (head) of your source files, and contain global declarations and definitions which you would otherwise put there. (That extension is not mandatory--you can theoretically name your own header files anything you wish--but .h is traditional, and recommended.)

As we've already begun to see, the reason for putting something in a header file, and then using #include to pull that header file into several different source files, is when the something (whatever it is) must be declared or defined consistently in all of the source files. If, instead of using a header file, you typed the something in to each of the source files directly, and the something ever changed, you'd have to edit all those source files, and if you missed one, your program could fail in subtle (or serious) ways due to the mismatched declarations (i.e. due to the incompatibility between the new declaration in one source file and the old one in a source file you forgot to change). Placing common declarations and definitions into header files means that if they ever change, they only have to be changed in one place, which is a much more workable system.

What should you put in header files?

However, there are a few things not to put in header files:

Since header files typically contain only external declarations, and should not contain function bodies, you have to understand just what does and doesn't happen when you #include a header file. The header file may provide the declarations for some functions, so that the compiler can generate correct code when you call them (and so that it can make sure that you're calling them correctly), but the header file does not give the compiler the functions themselves. The actual functions will be combined into your program at the end of compilation, by the part of the compiler called the linker. The linker may have to get the functions out of libraries, or you may have to tell the compiler/linker where to find them. In particular, if you are trying to use a third-party library containing some useful functions, the library will often come with a header file describing those functions. Using the library is therefore a two-step process: you must #include the header in the files where you call the library functions, and you must tell the linker to read in the functions from the library itself.


9.2 Macro Definition and Substitution

[This section corresponds to K&R Sec. 4.11.2]

A preprocessor line of the form

	#define name text
defines a macro with the given name, having as its value the given replacement text. After that (for the rest of the current source file), wherever the preprocessor sees that name, it will replace it with the replacement text. The name follows the same rules as ordinary identifiers (it can contain only letters, digits, and underscores, and may not begin with a digit). Since macros behave quite differently from normal variables (or functions), it is customary to give them names which are all capital letters (or at least which begin with a capital letter). The replacement text can be absolutely anything--it's not restricted to numbers, or simple strings, or anything.

The most common use for macros is to propagate various constants around and to make them more self-documenting. We've been saying things like

	char line[100];
	...
	getline(line, 100);
but this is neither readable nor reliable; it's not necessarily obvious what all those 100's scattered around the program are, and if we ever decide that 100 is too small for the size of the array to hold lines, we'll have to remember to change the number in two (or more) places. A much better solution is to use a macro:
	#define MAXLINE 100
	char line[MAXLINE];
	...
	getline(line, MAXLINE);
Now, if we ever want to change the size, we only have to do it in one place, and it's more obvious what the words MAXLINE sprinkled through the program mean than the magic numbers 100 did.

Since the replacement text of a preprocessor macro can be anything, it can also be an expression, although you have to realize that, as always, the text is substituted (and perhaps evaluated) later. No evaluation is performed when the macro is defined. For example, suppose that you write something like

	#define A 2
	#define B 3
	#define C A + B
(this is a pretty meaningless example, but the situation does come up in practice). Then, later, suppose that you write
	int x = C * 2;
If A, B, and C were ordinary variables, you'd expect x to end up with the value 10. But let's see what happens.

The preprocessor always substitutes text for macros exactly as you have written it. So it first substitites the replacement text for the macro C, resulting in

	int x = A + B * 2;
Then it substitutes the macros A and B, resulting in
	int x = 2 + 3 * 2;
Only when the preprocessor is done doing all this substituting does the compiler get into the act. But when it evaluates that expression (using the normal precedence of multiplication over addition), it ends up initializing x with the value 8!

To guard against this sort of problem, it is always a good idea to include explicit parentheses in the definitions of macros which contain expressions. If we were to define the macro C as

	#define C (A + B)
then the declaration of x would ultimately expand to
	int x = (2 + 3) * 2;
and x would be initialized to 10, as we probably expected.

Notice that there does not have to be (and in fact there usually is not) a semicolon at the end of a #define line. (This is just one of the ways that the syntax of the preprocessor is different from the rest of C.) If you accidentally type

	#define MAXLINE 100;			/* WRONG */
then when you later declare
	char line[MAXLINE];
the preprocessor will expand it to
	char line[100;];			/* WRONG */
which is a syntax error. This is what we mean when we say that the preprocessor doesn't know much of anything about the syntax of C--in this last example, the value or replacement text for the macro MAXLINE was the 4 characters 1 0 0 ; , and that's exactly what the preprocessor substituted (even though it didn't make any sense).

Simple macros like MAXLINE act sort of like little variables, whose values are constant (or constant expressions). It's also possible to have macros which look like little functions (that is, you invoke them with what looks like function call syntax, and they expand to replacement text which is a function of the actual arguments they are invoked with) but we won't be looking at these yet.


9.3 Conditional Compilation

[This section corresponds to K&R Sec. 4.11.3]

The last preprocessor directive we're going to look at is #ifdef. If you have the sequence

	#ifdef name
	program text
	#else
	more program text
	#endif
in your program, the code that gets compiled depends on whether a preprocessor macro by that name is defined or not. If it is (that is, if there has been a #define line for a macro called name), then "program text" is compiled and "more program text" is ignored. If the macro is not defined, "more program text" is compiled and "program text" is ignored. This looks a lot like an if statement, but it behaves completely differently: an if statement controls which statements of your program are executed at run time, but #ifdef controls which parts of your program actually get compiled.

Just as for the if statement, the #else in an #ifdef is optional. There is a companion directive #ifndef, which compiles code if the macro is not defined (although the "#else clause" of an #ifndef directive will then be compiled if the macro is defined). There is also an #if directive which compiles code depending on whether a compile-time expression is true or false. (The expressions which are allowed in an #if directive are somewhat restricted, however, so we won't talk much about #if here.)

Conditional compilation is useful in two general classes of situations:

Conditional compilation can be very handy, but it can also get out of hand. When large chunks of the program are completely different depending on, say, what operating system the program is being compiled for, it's often better to place the different versions in separate source files, and then only use one of the files (corresponding to one of the versions) to build the program on any given system. Also, if you are using an ANSI Standard compiler and you are writing ANSI-compatible code, you usually won't need so much conditional compilation, because the Standard specifies exactly how the compiler must do certain things, and exactly which library functions it much provide, so you don't have to work so hard to accommodate the old variations among compilers and libraries.