This page is a draft.

Introduction

This article documents my experiments with reading files in the Doom WAD format, using portable, ANSI-standard C.

ANSI C is widely regarded as a portable language: most of the pitfalls you will encounter when writing portable C lie not in the language itself, but in assumptions about things the language does not dictate (such as variable sizes and endianness).

This article highlights just how much is actually necessary to write portable standard C.

I also point out some idiosyncrasies that WAD-building tools have.

WAD overview

A WAD file consists of

a 12-byte header
lumps of data
a directory of lumps

The header consists of

4 bytes magic (IWAD or PWAD)
number of lumps in the WAD
byte-offset to the directory

The directory consists of 16-byte entries. Each entry consists of

the byte-offset for the lump
the size of the lump
an 8-byte ASCII name (padded with '\0')

All the numbers mentioned above are 4 bytes wide, unsigned, and arranged in little-endian byte order (i.e., the native byte order for DOS).

WAD header

As mentioned, the WAD header consists of

a 4-byte magic number
a 4-byte number of items in the WAD
a 4-byte offset to the directory

magic number

WAD files have a 12-byte header. The first byte is one of ASCII "P" or "I", depending on whether it's an ID WAD or a "patch" WAD. The next three bytes are invariably WAD.
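The magic check described above could be sketched as follows. The function name wad_kind and its return codes are illustrative assumptions, not taken from the original program; the sketch only compares the first four bytes of a buffer against the two known magic strings.

```c
#include <string.h>

/* Hypothetical return codes for this sketch. */
#define WAD_NONE 0
#define WAD_IWAD 1
#define WAD_PWAD 2

/* Classify a buffer holding at least the first 4 bytes of a file.
   memcmp is used rather than strcmp because the magic is not
   NUL-terminated inside the file. */
static int wad_kind(const unsigned char *header)
{
    if (memcmp(header, "IWAD", 4) == 0)
        return WAD_IWAD;
    if (memcmp(header, "PWAD", 4) == 0)
        return WAD_PWAD;
    return WAD_NONE;
}
```

Comparing all four bytes at once means a file that merely starts with "P" or "I" is not mistaken for a WAD.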

Program #1

The first code snippet is the largest of the lot, because we deal with all the overhead of opening, reading from and closing files, plus the C boilerplate.

This is a program to tell you whether the filename supplied on the command-line is an IWAD, a PWAD, or not a DOOM WAD at all.

Some examples of usage:

C:\tcc>1 1.c
1.c is not a WAD file

C:\tcc>1 c:/doom/gib.wad
c:/doom/gib.wad: PWAD

C:\tcc>1 c:/doom/doom2.wad
c:/doom/doom2.wad: IWAD

C:\tcc>1 asfasdf
cannot open C:\TCC\1.EXE: No such file or directory
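The file-handling part of such a program might look roughly like the sketch below. This is not the original listing: the names read_wad_header and load_wad_header are hypothetical, and error reporting is reduced to a return code.

```c
#include <stdio.h>

#define WAD_HEADER_SIZE 12

/* Read the 12-byte header from an already-open stream into buf.
   Returns 1 on success, 0 if fewer than 12 bytes were available. */
static int read_wad_header(FILE *f, unsigned char *buf)
{
    return fread(buf, 1, WAD_HEADER_SIZE, f) == (size_t)WAD_HEADER_SIZE;
}

/* Open path in binary mode, read the header, and close the file.
   Returns 1 on success, 0 if the file cannot be opened or is too
   short. "rb" matters on DOS and Windows, where text mode may
   translate or truncate the bytes read. */
static int load_wad_header(const char *path, unsigned char *buf)
{
    FILE *f = fopen(path, "rb");
    int ok;

    if (f == NULL)
        return 0;
    ok = read_wad_header(f, buf);
    fclose(f);
    return ok;
}
```

A caller would pass argv[1] as path and, on failure, could report the reason with perror or strerror(errno).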

Parsing the integers

All the integers in the WAD data are 4-byte (32-bit) wide, unsigned integers, arranged in little-endian order.

Sizes

The size (or width) of integer types in C is dependent on the compiler being used and the CPU being compiled for. Here's a table of some integer types and their sizes in various environments:

type	size in bytes
	32bit	16bit	64bit
int	4	2	4
short int	2	2	2
long int	4	4	8
long long	8	4	?
void *	?	?	8

The C99 standard includes a header called inttypes.h (which in turn includes stdint.h) that defines types of known widths, for example uint32_t, which is an unsigned integer exactly 32 bits (4 bytes) wide, irrespective of the environment.

I wanted to avoid dependencies on things outside of the C89 standard for my experiments, however. Luckily, the C89 standard defines a minimum size for some types, even where it doesn't fix their exact width. The long type must be able to hold at least 32 bits (4 bytes). The table above confirms this for the environments I've tested.

We can therefore safely use unsigned long to hold the integers defined in the WAD structure: on those platforms where this takes up more than 4 bytes, the spare space merely goes to waste.
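The assumption can be written down in code using limits.h, which is part of C89. The following is a small sketch, and the helper name ulong_bits is made up for illustration: the preprocessor check should never fire on a conforming compiler, and the function simply reports how wide unsigned long is on the current platform.

```c
#include <limits.h>

/* C89 requires ULONG_MAX to be at least 4294967295, so this check
   documents the assumption rather than guarding a likely failure. */
#if ULONG_MAX < 0xFFFFFFFFUL
#error "unsigned long cannot hold a 32-bit WAD integer"
#endif

/* Hypothetical helper: width of unsigned long in bits, counting
   every bit of storage the type occupies. */
static unsigned int ulong_bits(void)
{
    return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}
```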

endianness

If p is a pointer to the beginning of an integer in the WAD structure, then p[0] is the least-significant byte and p[3] the most.

C's bit-shifting operators behave in an endian-independent way, although they are described in K&R in terms of big-endian environments. The left shift operator, <<, increases the value of the item being shifted (the bits become more significant), whatever order the bytes are stored in.

We can therefore use the left-bitshift operator and the bitwise OR operator to interpret the number in an endian-independent way, as follows:

static unsigned long parseint(unsigned char *p) {
    return   (unsigned long)p[0]
           | (unsigned long)p[1] << 8
           | (unsigned long)p[2] << 16
           | (unsigned long)p[3] << 24;
}

The casts are necessary because, without them, each p[n] is promoted to int and the shift is carried out in int. In environments where an int is smaller than a long (such as Turbo C, where an int is 2 bytes), shifting by 16 or 24 would overflow.
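As a worked example, the four bytes 78 56 34 12 (hexadecimal, in file order) encode the value 0x12345678. The sketch below repeats parseint from above so that it compiles on its own; parse_example is a hypothetical helper added only to make the example checkable.

```c
static unsigned long parseint(unsigned char *p) {
    return   (unsigned long)p[0]
           | (unsigned long)p[1] << 8
           | (unsigned long)p[2] << 16
           | (unsigned long)p[3] << 24;
}

/* Hypothetical helper: parse a fixed sample and return the result. */
static unsigned long parse_example(void)
{
    /* File order: least-significant byte first. */
    unsigned char bytes[4] = { 0x78, 0x56, 0x34, 0x12 };

    /* 0x78 | 0x56 << 8 | 0x34 << 16 | 0x12 << 24 == 0x12345678 */
    return parseint(bytes);
}
```

The result is the same on little-endian and big-endian hosts, because the shifts operate on values rather than on the bytes' positions in memory.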

Program #2

This snippet just extends program #1 to print out the number of lumps in the WAD and the directory offset, by interpreting the numbers in the header.

The program uses the parseint procedure defined above, and adds the following few lines:

entries = parseint(header+4);
doffset = parseint(header+8);
printf("containing %lu entries, "
       "directory at offset %lu\n", 
       entries,doffset);
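To show those lines in context, here is a self-contained sketch under a few assumptions: header is taken to be a 12-byte buffer already read from the file, and describe_header is a hypothetical wrapper that writes the summary into a string instead of printing it, so the result can be inspected.

```c
#include <stdio.h>

/* Same logic as parseint above, with a const-qualified parameter,
   repeated so that this block compiles on its own. */
static unsigned long parseint(const unsigned char *p)
{
    return   (unsigned long)p[0]
           | (unsigned long)p[1] << 8
           | (unsigned long)p[2] << 16
           | (unsigned long)p[3] << 24;
}

/* Hypothetical wrapper: decode the two counts from a 12-byte header
   and write the summary line into out. Each value is at most 10
   decimal digits, so an 80-byte buffer is large enough. */
static void describe_header(const unsigned char *header, char *out)
{
    unsigned long entries = parseint(header + 4); /* bytes 4..7  */
    unsigned long doffset = parseint(header + 8); /* bytes 8..11 */

    sprintf(out, "containing %lu entries, directory at offset %lu\n",
            entries, doffset);
}
```

The %lu conversion matches unsigned long on every platform, which is one more reason to prefer that type over int here.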

Reading the directory

Reading the directory from this point is fairly trivial: each directory entry has a fixed width and we know how to interpret its numerical contents.
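A minimal sketch of decoding one entry follows, assuming the 16 bytes of the entry are already in memory. The names wad_dirent, parse_dirent and dirent_pos are hypothetical. Note that an 8-character lump name fills the whole field and has no terminating '\0' in the file, so the sketch adds one.

```c
#include <string.h>

#define WAD_DIRENT_SIZE 16

struct wad_dirent {          /* hypothetical name */
    unsigned long offset;    /* byte offset of the lump */
    unsigned long size;      /* size of the lump in bytes */
    char name[9];            /* 8 name bytes plus a terminating NUL */
};

/* Same logic as parseint above, with a const-qualified parameter. */
static unsigned long parseint(const unsigned char *p)
{
    return   (unsigned long)p[0]
           | (unsigned long)p[1] << 8
           | (unsigned long)p[2] << 16
           | (unsigned long)p[3] << 24;
}

/* Decode one 16-byte directory entry starting at raw. */
static void parse_dirent(const unsigned char *raw, struct wad_dirent *e)
{
    e->offset = parseint(raw);
    e->size   = parseint(raw + 4);
    memcpy(e->name, raw + 8, 8);
    e->name[8] = '\0';
}

/* File position of entry i, given the directory offset from the header. */
static unsigned long dirent_pos(unsigned long doffset, unsigned long i)
{
    return doffset + i * WAD_DIRENT_SIZE;
}
```

A reader would seek to dirent_pos(doffset, i), read 16 bytes, and pass them to parse_dirent, once for each of the entries counted in the header.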