Chapter 6. Platform Dependencies

One of our criteria for choosing the formats discussed in this book was whether they are used for data exchange (both between applications and across platforms). This criterion necessarily ruled out formats incorporating hardware-specific instructions (for example, printer files). Although the formats we discuss here do not raise many hardware issues, several machine dependency issues do come up with some regularity. Two of these issues have practical implications beyond being simply sources of annoyance. This chapter describes those issues. It also touches on how filenames differ from one platform to another. These differences are significant only because filenames may offer clues about the origins of files you may receive and need to convert.

Contents:
Byte Order
File Size and Memory Limitations
Floating-Point Formats
Bit Order
Filenames
For Further Information

Byte Order

We generally think of information in memory or on disk as being organized into a series of individual bytes of data. The data is read sequentially in the order in which the bytes are stored. This type of data is called byte-oriented data and is typically used to store character strings and data created by 8-bit CPUs.

Few computers look at the universe through an 8-bit window, however. For reasons of efficiency, 16-, 32-, and 64-bit CPUs prefer to work with bytes organized into 16-, 32-, and 64-bit cells, which are called words, doublewords, and quadwords, respectively. The order of the bytes within word-, doubleword-, and quadword-oriented data is not always the same; it varies depending upon the CPU that created it. (Note, however, that CPUs do exist in which byte ordering can be changed.)

Byte-oriented data has no particular order and is therefore read the same on all systems. Word-oriented data does present a potential problem--probably the most common portability problem you will encounter when moving files between platforms. The problem arises when binary data is written to a file on a machine with one byte order and is then read on a machine assuming a different byte order. Obviously, the data will be read incorrectly.

It is the order of the bytes within each word and doubleword of data that determines the "endianness" of the data. The two main categories of byte-ordering schemes are called big-endian and little-endian.[1] Big-endian machines store the most significant byte (MSB) at the lowest address in a word, usually referred to as byte 0. Big-endian machines include those based on the Motorola MC68000A series of CPUs (the 68000, 68020, 68030, 68040, and so on), including the Commodore Amiga, the Apple Macintosh, and some UNIX machines.

[1] The terms big-endian and little-endian were originally found in Jonathan Swift's book, Gulliver's Travels, as satirical descriptions of politicians who disputed whether eggs should be broken at their big end or their little end. These terms were first applied to computer architecture by Danny Cohen. (See "For Further Information" below.)

Little-endian machines store the least significant byte (LSB) at the lowest address in a word. The two-byte word value, 1234h, written to a file in little-endian format, would be read as 3412h on a big-endian system. This occurs because the big-endian system assumes that the MSB, in this case the value 12h, is at the lowest address within the word. The little-endian system, however, places the MSB at the highest address in the word. When the file is read, the big-endian machine therefore effectively swaps the positions of the two bytes in the word. Little-endian machines include those based on the Intel iAPX86 series of CPUs (the 8088, 80286, 80386, 80486, and so forth), including the IBM PC and clones.
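The 1234h example can be sketched in C. The function names below are hypothetical, chosen only for illustration. The first two functions assemble a word from two bytes with explicit shifts, so they give the same result on any host; the third is one common way to check the byte order of the machine running the code.

```c
#include <stdint.h>

/* Interpret two bytes as a 16-bit word, least significant byte first
   (little-endian). Hypothetical helper name. */
static uint16_t word_from_le(const uint8_t b[2])
{
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* Interpret the same two bytes, most significant byte first
   (big-endian). Hypothetical helper name. */
static uint16_t word_from_be(const uint8_t b[2])
{
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Return 1 if the host stores the LSB at the lowest address
   (little-endian), otherwise 0. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}
```

A little-endian writer stores 1234h as the bytes 34h, 12h. Passing those two bytes to word_from_le() recovers 1234h, while word_from_be() yields 3412h, which is the misreading described above.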

A third term, middle-endian, has been coined to refer to all byte-ordering schemes that are neither big-endian nor little-endian. Such middle-endian ordering schemes include the 3-4-1-2, 2-1-4-3, 2-3-0-1, and 1-0-3-2 packed-decimal formats. The Digital Equipment Corporation PDP-11 is an example of a middle-endian machine. The PDP-11 has a DWORD byte-ordering scheme of 2-3-0-1.
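As a rough illustration of the 2-3-0-1 ordering, the sketch below stores and loads a doubleword in that layout. It assumes that the digits number the bytes by significance, with 0 as the least significant byte, so that the doubleword is held as two little-endian words with the high word first. The function names are hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit value in 2-3-0-1 order: the high 16-bit word first,
   with each word stored least significant byte first.
   Assumption: byte 0 is the least significant byte of the doubleword. */
static void store_u32_pdp(uint32_t v, uint8_t out[4])
{
    out[0] = (uint8_t)(v >> 16); /* byte 2 */
    out[1] = (uint8_t)(v >> 24); /* byte 3, most significant */
    out[2] = (uint8_t)(v);       /* byte 0, least significant */
    out[3] = (uint8_t)(v >> 8);  /* byte 1 */
}

/* Rebuild the 32-bit value from four bytes laid out in 2-3-0-1 order. */
static uint32_t load_u32_pdp(const uint8_t in[4])
{
    return ((uint32_t)in[1] << 24) | ((uint32_t)in[0] << 16) |
           ((uint32_t)in[3] << 8)  |  (uint32_t)in[2];
}
```

Under that assumption, the value 0A0B0C0Dh is laid out in memory as 0Bh, 0Ah, 0Dh, 0Ch, which matches neither the big-endian nor the little-endian layout.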

The I/O routines in the C standard library always read word data in the native byte order of the machine hosting the application. This means that functions such as fread() and fwrite() have no knowledge of byte order and cannot provide needed conversions. Most C libraries, however, contain a function named swab(), which swaps each pair of adjacent bytes in an array of bytes. While swab() can be used to convert words of data from one byte order to another, doing so can be inefficient, because words greater than two bytes in size require multiple calls.
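One common way around this limitation is to read the data as individual bytes and assemble each value with shifts, so that the result does not depend on the byte order of the host. The following is a minimal sketch of that approach, not a library interface; all of the function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Reverse the four bytes of a doubleword in a single step. */
static uint32_t swap_u32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Assemble a doubleword from four bytes stored most significant first. */
static uint32_t u32_from_be(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Assemble a doubleword from four bytes stored least significant first. */
static uint32_t u32_from_le(const uint8_t b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Read a big-endian doubleword from a stream, independent of host order.
   Returns 1 on success and 0 if fewer than four bytes were available. */
static int read_u32_be(FILE *fp, uint32_t *out)
{
    uint8_t b[4];

    if (fread(b, 1, sizeof b, fp) != sizeof b)
        return 0;
    *out = u32_from_be(b);
    return 1;
}
```

Because fread() is asked only for bytes here, it never has to interpret a word. swap_u32() is included for the case in which a value has already been read in the wrong order and needs to be reversed in place.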

Programmers working with bitmap files need to be concerned about byte order, because many popular formats such as Macintosh Paint (MacPaint), Interchange File Format (IFF or AmigaPaint), and Sun Raster image files are always read and written in big-endian byte order. The TIFF file format is unique, however, in that any TIFF file can be written in either byte order, and any TIFF reader must be able to read either byte order correctly regardless of the system on which the code is executing.
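A TIFF reader can determine the byte order of a file from its first four bytes. The sketch below relies on details taken from the TIFF 6.0 specification rather than from this chapter: the header begins with the characters "II" for little-endian or "MM" for big-endian data, followed by the 16-bit value 42 stored in the indicated order. The type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum tiff_order { TIFF_ORDER_UNKNOWN = 0, TIFF_ORDER_LITTLE, TIFF_ORDER_BIG };

/* Inspect the first four header bytes and report the file's byte order.
   Returns TIFF_ORDER_UNKNOWN if the bytes do not form a valid TIFF header. */
static enum tiff_order tiff_byte_order(const uint8_t *hdr, size_t len)
{
    if (len < 4)
        return TIFF_ORDER_UNKNOWN;
    if (hdr[0] == 'I' && hdr[1] == 'I' && hdr[2] == 42 && hdr[3] == 0)
        return TIFF_ORDER_LITTLE;
    if (hdr[0] == 'M' && hdr[1] == 'M' && hdr[2] == 0 && hdr[3] == 42)
        return TIFF_ORDER_BIG;
    return TIFF_ORDER_UNKNOWN;
}

/* Decode a 16-bit field using the byte order found in the header.
   Any order other than TIFF_ORDER_BIG is treated as little-endian. */
static uint16_t tiff_u16(const uint8_t b[2], enum tiff_order order)
{
    if (order == TIFF_ORDER_BIG)
        return (uint16_t)((b[0] << 8) | b[1]);
    return (uint16_t)(b[0] | (b[1] << 8));
}
```

Passing the detected order to every field decoder, rather than testing the host's byte order, lets the same reader run unchanged on big-endian and little-endian systems.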