Chapter 15: User-Defined Data Structures

So far, we have been using C's basic types--char, int, long int, double, etc.--and a few derived types--arrays of basic types, pointers to basic types, and functions returning basic types. In this chapter, we'll learn about another way to derive types: by building user-defined types.

There's an old joke about Henry Ford saying that you could get the Model T in any color you wanted, as long as it was black. User-defined data types have a little bit of the same restriction: you don't have ultimate flexibility; you can define your own data types any way you want, as long as they're collections of other types. (What you couldn't define would be data types that held, say, tractors or teddy bears or Model T Fords, because of course computers have no way of holding those objects. You're ultimately restricted to the primitive types of data which computers can represent, and C's basic types cover most of those.)

Very roughly speaking, a structure is a little bit like an array. An array is a collection of associated values, all of the same type. A structure is a collection of associated values, but the values can all have different types.

15.1: Structures

15.2: Accessing Members of Structures

15.3: Operations on Structures

15.4: Pointers to Structures

15.5: Linked Data Structures


Read sequentially: prev next top

15.1: Structures

[This section corresponds to K&R Sec. 6.1]

The basic user-defined data type in C is the structure, or struct. (C structures are analogous to the records found in some other languages.) Defining structures is a two-step process: first you define a "template" which describes the new type, then you declare variables having the new type (or functions returning the new type, etc.).

As a simple example, suppose we wanted to define our own type for representing complex numbers. (If you're blissfully ignorant of these beasts, a complex number consists of a "real" and "imaginary" part, where the imaginary part is some multiple of the square root of negative 1. You don't have to understand complex numbers to understand this example; you can think of the real and imaginary parts as the x and y coordinates of a point on a plane.) FORTRAN has a built-in complex type, but C does not. How might we add one? Since a complex number consists of a real and imaginary part, we need a way of holding both these quantities in one data type, and a structure will do just the trick. Here is how we might declare our complex type:

	struct complex
		{
		double real;
		double imag;
		};

A structure declaration consists of up to four parts, of which we can see three in the example above. The first part is the keyword struct which indicates that we are talking about a structure. The second part is a name or tag by which this structure (that is, this new data type) will be known. The third part is a list of the structure's members (also called components or fields). This list is enclosed in braces {}, and contains what look like the declarations of ordinary variables. Each member has a name and a type, just like ordinary variables, but here we are not declaring variables; we are setting up the structure of the structure by defining the collection of data types which will make up the structure. Here we see that the complex structure will be made up of two members, both of type double, one named real and one named imag.

It's important to understand that what we've defined here is just the new data type; we have not yet declared any variables of this new type! The name complex (the second part of the structure declaration) is not the name of a variable; it's the name of the structure type. The names real and imag are not the names of variables; they're identifiers for the two components of the structure.

We declare variables of our new complex type with declarations like these:

	struct complex c1;
or
	struct complex c2, c3;

These look almost like our previous declarations of variables having basic types, except that instead of a type keyword like int or double, we have the two-word type name struct complex. The keyword struct indicates that we're talking about a structure, and the identifier complex is the name for the particular structure we're talking about. c1, c2, and c3 will all be declared as variables of type struct complex; each one of them will have real and imaginary parts buried inside them. (We'll see how to get at those parts in the next section.) Using our graphic, "labeled box" notation, we could draw representations of c1, c2, and c3 like this:



Actually, these pictures are a bit misleading; the outer box indicating each composite structure suggests that there might be more inside them than just the two members, real and imag (that is, more than the two values of type double). A simpler but more representative picture would be:


The only memory allocated is for two values of type double (the two boxes); all the names are just for our convenience and the compiler's reference; none are typically stored in the program's memory at run time.

Notice that when we define structures in this way we have not quite defined a new type on a par with int or double. We can not say

	complex c1;		/* WRONG */

The name complex does not become a full-fledged type name like int or double; it's just the name of a particular structure, and we must use the keyword struct and the name of a particular structure (e.g. complex) to talk about that structure type. (There is a way to define new full-fledged type names like int and double, and in C++ a new structure does automatically become a full-fledged type, but we won't worry about these wrinkles for now.)

I said that a structure definition consisted of up to four parts. We saw the first three of them in the first example; the fourth part of a full strucure declaration is simply a list of variables, which are to be declared as having the structure type at the same time as the structure itself is defined. For example, if we had written

	struct complex
		{
		double real;
		double imag;
		} c1, c2, c3;

we would have defined the type struct complex, and right away declared three variables c1, c2, and c3 all of type struct complex.

In fact, three of the four parts of a structure declaration (all but the keyword struct) are optional. If a declaration contains the keyword struct, a structure tag, and a brace-enclosed list of members (as in the first structure definition we saw), it's a definition of the structure itself (that is, just the template). If a declaration contains the keyword struct, a structure tag, and a list of variable names (as in the first declarations of c1, c2, and c3 we saw), it's a declaration of those variables having that structure type (the structure type itself must of course typically be declared elsewhere). If a declaration contains all four elements (as in the second declaration of c1, c2, and c3 we saw), it's a definition of the structure type and a declaration of some variables. It's also possible to use the first, third, and fourth parts:

	struct 	{
		double real;
		double imag;
		} c1, c2, c3;

Here we declare c1, c2, and c3 as having a structure type with no tag name, which is not usually very useful, because without a tag name we won't be able to declare any other variables or functions of this type for c1, c2, and c3 to play with. (Finally, it's also possible to declare just the tag name, leaving both the list of members and any declarations of variables for later, but this is only needed in certain fairly rare and obscure situations.)

Because a structure definition can also declare variables, it's important not to forget the semicolon at the end of a structure definition. If you accidentally write

	struct complex
		{
		double real;
		double imag;
		}

without a semicolon, the compiler will keep looking for something later in the file and try to declare it as being of type struct complex, which will either result in a confusing error message or (if the compiler succeeds) a confusing misdeclaration.


15.2: Accessing Members of Structures

We said that a structure was a little bit like an array: a collection of members (elements). We access the elements of an array by using a numeric subscript in square brackets []. We access the elements of a structure by name, using the structure selection operator which is a dot (a period). The structure selection operator is a little like the other binary operators we've seen, but much more restricted: on its left must be a variable or object of structure type, and on its right must be the name of one of the members of that structure. For example, if c1 is a variable of type struct complex as declared in the previous section, then c1.real is its real part and c1.imag is its imaginary part.

Like subscripted array references, references to the members of structure variables (using the structure selection operator) can appear anywhere, either on the right or left side of assignment operators. We could say

	c1.real = 1

to set the real part of c1 (that is, the real member within c1) to 1, or

	c1.imag = c2.imag

to fetch the imaginary part of c2 and assign it to the imaginary part of c1, or

	c1.real = c2.real + c3.real

to take the real parts of c2 and c3, add them together, and assign the result to the real part of c1.


15.3: Operations on Structures

[This section corresponds roughly to K&R Sec. 6.2]

There is a relatively small number of operations which C directly supports on structures. As we've seen, we can define structures, declare variables of structure type, and select the members of structures. We can also assign entire structures: the expression

	c1 = c2

would assign all of c2 to c1 (both the real and imaginary parts, assuming the preceding declarations). We can also pass structures as arguments to functions, and declare and define functions which return structures. But to do anything else, we typically have to write our own code (often as functions). For example, we could write a function to add two complex numbers:

	struct complex
	cpx_add(struct complex c1, struct complex c2)
	{
	struct complex sum;
	sum.real = c1.real + c2.real;
	sum.imag = c1.imag + c2.imag;
	return sum;
	}

We could then say things like

	c1 = cpx_add(c2, c3)

(Two footnotes: Typically, you do need a temporary variable like sum in a function like cpx_add above that returns a structure. In C++, it's possible to couple your own functions to the built-in operators, so that you could write c1 = c2 + c2 and have it call your complex add function.)

One more thing you can do with a structure is initialize a structure variable while declaring it. As for array initializations, the initializer consists of a comma-separated list of values enclosed in braces {}:

	struct complex c1 = {1, 2};
	struct complex c2 = {3, 4};

Of course, the type of each initializer in the list must be compatible with the type of the corresponding structure member.


15.4: Pointers to Structures

[This section corresponds to K&R Sec. 6.4]

Pointers in C are general; we can have pointers to any type. It turns out that pointers to structures are particularly useful.

We declare pointers to structures the same way we declare any other pointers: by preceding the variable name with an asterisk in the declaration. We could declare two pointers to struct complex with

	struct complex *p1, *p2;

And, as before, we could set these pointers to point to actual variables of type complex:

	p1 = &c2;
	p2 = &c3;

Then,

	*p1 = *p2

would copy the structure pointed to by p2 to the structure pointed to by p1 (i.e. c3 to c2), and

	p1 = p2

would set p1 to point wherever p2 points. (None of this is new, these are the obvious analogs of how all pointer assignments work.) If we wanted to access the member of a pointed-to structure, it would be a tiny bit messy--first we'd have to use * to get the structure pointed to, then . to access the desired member. Furthermore, since . has higher precedence than *, we'd have to use parentheses:

	(*p1).real

(Without the parentheses, i.e. if we wrote simply *p1.real, we'd be taking the structure p1, selecting its member named real, and accessing the value that p1.real points to, which would be doubly nonfunctional, since the real member is in our ongoing example not a pointer, and p1 is not a structure but rather a pointer to a structure, so the . operator won't work on it.)

Since pointers to structures are common, and the parentheses in the example above are a nuisance, there's another structure selection operator which works on pointers to structures. If p is a pointer to a structure and m is a member of that structure, then

	p->m

selects that member of the pointed-to structure. The expression p->m is therefore exactly equivalent to

	(*p).m


15.5: Linked Data Structures

[This section corresponds to K&R Sec. 6.5]

One reason that pointers to structures are useful and common is that they can be used to build linked data structures, in which a structure contains a pointer to another instance of the same structure (or perhaps a different structure). The simplest example is a singly-linked list, which we might declare like this:

struct listnode
	{
	char *item;
	struct listnode *next;
	};

This structure describes one node in a list; a list may consist of many nodes, one for each item in the list. Here, each item in the list (the field we've named item) will be a string, represented (i.e. pointed to) by a char *. Each node in the list is linked to its successor by its next field, which is a pointer to another struct listnode. (The compiler is perfectly happy to place a pointer to a structure inside that very same structure; it would only complain if you tried to stuff an entire struct listnode into a struct listnode, which would tend to make the struct listnode explode.) We'll use a null pointer as the next field of the last node in the list, since by definition a null pointer doesn't point anywhere.

We could set up a tiny list with these declarations:

struct listnode node2 = {"world", NULL};
struct listnode node1 = {"hello", &node2};
struct listnode *head = &node1;

A box-and-arrows picture of the resulting list would look like this:



The list has two nodes, allocated by the compiler in response to the first two declarations we gave. We've also allocated a pointer variable, head, which points at the "head" of the list.

Once we've built a list, we'll naturally want to do things with it. One of the simplest operations is to print the list back out. The code to do so is very simple. We declare another list pointer lp and then cause it to step over each node in the list, in turn:

	struct listnode *lp;
	for(lp = head; lp != NULL; lp = lp->next)
		printf("%s\n", lp->item);

This for loop deserves some attention, especially if you haven't seen one like it before. Many for loops step an int variable (often called i) through a series of integer values. However, the three controlling expressions of a for loop are not limited to that pattern; you may in fact use any expressions at all, although it's best if they conform to the expected initialize;test;increment pattern. The list-printing loop above certainly does: the expression lp = head initializes lp to point to the head of the loop; the expression lp != NULL tests whether lp still points to a real node (or whether it has reached the null pointer which marks the end of the list); and the expression lp = lp->next sets lp to point to the next node in the list, one past where it did before.

The two-element list above is pretty useless; its only worth is as a first example. The real power of linked lists (and other linked data structures) is that they can grow on demand, in response to the data that your program finds itself working with. For a linked list to grow on demand, however, we'll have to allocate its nodes dynamically, because we won't know in advance how many of them we'll need. (In the first example, we had two static nodes, because we knew in advance, at compile time, that our list would have two elements. But that static allocation won't do for a dynamic list. How would we know how many struct listnode variables to allocate?)

The general solution, of course, is to call malloc. Here is a scrap of code which inserts a new word at the head of a list:

	#include <stdio.h>		/* for fprintf, stderr */
	#include <stdlib.h>		/* for malloc */

	char *newword = "test";
	struct listnode *newnode = malloc(sizeof(struct listnode));
	if(newnode == NULL)
		{
		fprintf(stderr, "out of memory\n");
		exit(1);
		}

	newnode->item = newword;
	newnode->next = head;
	head = newnode;

The expression sizeof(struct listnode) in the call to malloc asks the compiler to compute the number of bytes required to store one struct listnode, and that's exactly how many bytes of memory we ask for from malloc. Make sure you see how the last two lines work to splice the new node in at the head of the list, by making the new node's next pointer point at the old head of the list, and then resetting the head of the list to be the new node. (Another word for a list where we always add new items at the beginning is a stack.)

Naturally, we'd like to encapsulate this operation of prepending an item to a list as a function. Doing so is just a little bit tricky, because the list's head pointer is modified every time. There are several ways to achieve this modification; the way we'll do it is to have our list-prepending function return a pointer to the new head of the list. Here is the function:

#include <stdio.h>		/* for fprintf, stderr */
#include <stdlib.h>		/* for malloc, exit */
#include <string.h>		/* for strlen, strcpy */

struct listnode *prepend(char *newword, struct listnode *oldhead)
{
struct listnode *newnode = malloc(sizeof(struct listnode));
if(newnode == NULL)
	{
	fprintf(stderr, "out of memory\n");
	exit(1);
	}

newnode->item = malloc(strlen(newword) + 1);
if(newnode->item == NULL)
	{
	fprintf(stderr, "out of memory\n");
	exit(1);
	}
strcpy(newnode->item, newword);

newnode->next = oldhead;
return newnode;
}

Since we want this to be a general-purpose function, we also allocate new space for the new string (word, item) being stored. Otherwise, we'd be depending on the caller to arrange that the pointer to the new string remain valid for as long as the list was in use. As we'll see, that's not always a safe assumption. By allocating our own memory, which "belongs" to the list, we ensure that the list isn't dependent on the caller in this way. (Notice, too, that the number of bytes we ask for is strlen(newword) + 1.)

(As an aside, it's a mild blemish on the above code that it contains two identical calls to fprintf, complaining about two separate potential failures in two calls to malloc. It's quite possible to combine these two cases, and many C programmers prefer to do so, although the expression may be a bit scary-looking at first:

struct listnode *newnode;
if((newnode = malloc(sizeof(struct listnode))) == NULL ||
		(newnode->item = malloc(strlen(newword) + 1)) == NULL)
	{
	fprintf(stderr, "out of memory\n");
	exit(1);
	}

How does this work? First it calls malloc(sizeof(struct listnode)), and assigns the result to newnode. Then it calls malloc(strlen(newword) + 1), and assigns the result to newnode->item. This code relies on two special guarantees of C's || operator, namely that it always evaluates its left-hand side first, and that if the left-hand side evaluates as true, it doesn't bother to evaluate the right-hand side, because once the left-hand side is true, the final result is definitely going to be true. Therefore, we're guaranteed that we'll allocate space for newnode to point to before we try to fill in newnode->item, but if the first call to malloc fails, we won't call malloc a second time or try to fill in newnode->item at all. These guarantees--of left-to-right evaluation, and of skipping the evaluation of the right-hand side if the left-hand side determines the result--are unique to the || and && operators. It's perfectly acceptable to rely on these guarantees when using || and &&, but don't assume that other operators will operate deterministically from left to right, because most of them don't.)

Now that we have our prepend function, we can build a list by calling it several times in succession:

	struct listnode *head = NULL;
	head = prepend("world", head);
	head = prepend("hello", head);

This code builds essentially the same list as our first, static example. Notice how we initialize the list head pointer with a null pointer, which is synonymous with an empty list. (Notice also that the code we wrote up above, for printing a list, would also deal correctly with an empty list.)

Using static calls to prepend is hardly more interesting than building a static link by hand. To make things truly interesting, let's read words or strings typed by the user (or redirected from a file), and prepend those to a list. The code is not hard:

	#define MAXLINE 200

	char line[MAXLINE];
	struct listnode *head = NULL;
	struct listnode *lp;

	while(getline(line, MAXLINE) != EOF)
		head = prepend(line, head);

	for(lp = head; lp != NULL; lp = lp->next)
		printf("%s\n", lp->item);

(getline is the line-reading function we've been using. If you don't have a copy handy, you can use the fgets function from the standard library.)

If you type in this code and run it, you will find that it prints lines back out in the reverse order that you typed them. (In doing so, of course, it slurps all the lines into memory, so you might run out of memory if you tried to use this technique for reversing all the lines in a huge file.) Notice that when we call prepend in this way, it is important that prepend allocate memory for, and stash away, each string. Can you see what would happen if prepend did not, that is, if it simply said newnode->item = newword?

Linked lists are only the simplest example of linked data structures. There are also queues, doubly-linked lists, trees, and circular lists. We'll see more concrete examples of linked structures in the adventure game example. The set of objects sitting in a room (or held by the player) will be represented by a linked list of objects, and the rooms will be linked to each other to indicate the passages between rooms.


Read sequentially: prev next top