Chapter 9. Data Compression

Compression is the process used to reduce the physical size of a block of information. In computer graphics, we're interested in reducing the size of a block of graphics data so we can fit more information in a given physical storage space. We also might use compression to fit larger images in a block of memory of a given size. You may find when you examine a particular file format specification that the term data encoding is used to refer to algorithms that perform compression. Data encoding is actually a broader term than data compression. Data compression is a type of data encoding, and one that is used to reduce the size of data. Other types of data encoding are involved in encryption (cryptography) and data transmission (e.g., Morse code).

Contents:
Data Compression Terminology
Pixel Packing
Run-Length Encoding (RLE)
Lempel-Ziv-Welch (LZW) Compression
CCITT (Huffman) Encoding
JPEG Compression
JBIG Compression
ART Compression
Fractal Image Compression
For Further Information About Data Compression

A compressor, naturally enough, performs compression, and a decompressor reconstructs the original data. Although this may seem obvious, a decompressor can operate only by using knowledge of the compression algorithm used to convert the original data into its compressed form. What this means in practice is that there is no way for you to avoid understanding the compression algorithms in the market today if you are interested in manipulating data files. At a minimum, you need general knowledge of the conceptual basis of the algorithms.

If you read through a number of specification documents, you'll find that most formats incorporate some kind of compression method, no matter how rudimentary. You'll also find that only a few different compression schemes are in common use throughout the industry. The most common of these schemes are variants of the following methods, which we discuss in the sections below.

Run-length encoding (RLE)
Lempel-Ziv-Welch (LZW)
CCITT (one type of CCITT compression is a variant on Huffman encoding)
Discrete Cosine Transform (DCT) (used in the JPEG compression we discuss in this chapter)
JBIG
ART
Fractal

In addition, we discuss pixel packing, which is not a compression method per se but an efficient way to store data in contiguous bytes of memory.

Data compression works somewhat differently for the three common types of graphics data: bitmap, vector, and metafile. In bitmap files, only the image data is usually compressed; the header and any other data (color map, footer, and so on) are always left uncompressed. This uncompressed data makes up only a small portion of a typical bitmap file.

Vector files normally do not incorporate a native form of data compression. Vector files store a mathematical description of an image rather than the image data itself. There are reasons why vector files are rarely compressed:

The expression of the image data is already compact in design, so data compression would have little effect.
Vector images are typically fast to read but slow to reconstruct; adding the overhead of decompression would make rendering them still slower.

If a vector file is compressed at all, you will usually find the entire file compressed, header and all.

In metafiles, data compression schemes often resemble those used to compress bitmap files, depending upon the type of data the metafiles contain.

Programmers commonly confuse compression algorithms with the files in which they're used. Many programmers ask vendors or newsgroups for specifications for the CCITT or JPEG file formats. There are no such specifications. Compression algorithms define only how data is encoded, not how it is stored on disk. For detailed information on how data is stored on disk, look to an actual image file format specification, such as BMP or GIF, which will define file headers, byte order, and other issues not covered by discussions of compression algorithms. More complex formats, such as TIFF, may incorporate several different compression algorithms.

The following sections introduce the terms used in discussions of data compression and each of the main types of data compression algorithms used for graphics file formats today.

Data Compression Terminology

This section describes the terms you'll encounter when you read about data compression schemes in this chapter and in graphics file format specifications.

The terms unencoded data and raw data describe data before it has been compressed, and the terms encoded data and compressed data describe the same information after it has been compressed.

The term compression ratio is used to refer to the ratio of uncompressed data to compressed data. Thus, a 10:1 compression ratio is considered five times more efficient than 2:1. Of course, data compressed using an algorithm yielding 10:1 compression is five times smaller than the same data compressed using an algorithm yielding 2:1 compression. In practice, because only image data is normally compressed, analysis of compression ratios provided by various algorithms must take into account the absolute sizes of the files tested.

Physical and Logical Compression

Compression algorithms are often described as squeezing, squashing, crunching, or imploding data, but these are not very accurate descriptions of what is actually happening. Although the major use of compression is to make data use less disk space, compression does not actually physically cram the data into a smaller size package in any meaningful sense.

Instead, compression algorithms are used to re-encode data into a different, more compact representation conveying the same information. In other words, fewer words are used to convey the same meaning, without actually saying the same thing.

The distinction between physical and logical compression methods is made on the basis of how the data is compressed or, more precisely, how the data is rearranged into a more compact form. Physical compression is performed on data exclusive of the information it contains; it only translates a series of bits from one pattern to another, more compact one. While the resulting compressed data may be related to the original data in a mechanical way, that relationship will not be obvious to us.

Physical compression methods typically produce strings of gibberish, at least relative to the information content of the original data. The resulting block of compressed data is normally smaller than the original because the physical compression algorithm has removed the redundancy that existed in the data itself. Most of the compression methods discussed in this chapter are physical methods.

Logical compression is accomplished through the process of logical substitution--that is, replacing one alphabetic, numeric, or binary symbol with another. Changing "United States of America" to "USA" is a good example of logical substitution, because "USA" is derived directly from the information contained in the string "United States of America" and retains some of its meaning. In a similar fashion "can't" can be logically substituted for "cannot". Logical compression works only on data at the character level or higher and is based exclusively on information contained within the data. Logical compression is generally not used in image data compression.

Symmetric and Asymmetric Compression

Compression algorithms can also be divided into two categories: symmetric and asymmetric. A symmetric compression method uses roughly the same algorithms, and performs the same amount of work, for compression as it does for decompression. For example, a data transmission application where compression and decompression are both being done on the fly will usually require a symmetric algorithm for the greatest efficiency.

Asymmetric methods require substantially more work to go in one direction than they require in the other. Usually, the compression step takes far more time and system resources than the decompression step. In the real world this makes sense. For example, if we are making an image database in which an image will be compressed once for storage, but decompressed many times for viewing, then we can probably tolerate a much longer time for compression than for decompression. An asymmetric algorithm that uses much CPU time for compression, but is quick to decode, would work well in this case.

Algorithms that are asymmetric in the other direction are less common but have some applications. In making routine backup files, for example, we fully expect that many of the backup files will never be read. A fast compression algorithm that is expensive to decompress might be useful in this case.

Adaptive, Semi-Adaptive, and Non-Adaptive Encoding

Certain dictionary-based encoders, such as CCITT compression algorithms (see the section later in this chapter called "CCITT (Huffman) Encoding") are designed to compress only specific types of data. These non-adaptive encoders contain a static dictionary of predefined substrings that are known to occur with high frequency in the data to be encoded. A non-adaptive encoder designed specifically to compress English language text would contain a dictionary with predefined substrings such as "and", "but", "of", and "the", because these substrings appear very frequently in English text.

An adaptive encoder, on the other hand, carries no preconceived heuristics about the data it is to compress. Adaptive compressors, such as LZW, achieve data independence by building their dictionaries completely from scratch. They do not have a predefined list of static substrings and instead build phrases dynamically as they encode.

Adaptive compression is capable of adjusting to any type of data input and of returning output using the best possible compression ratio. This is in contrast to non-adaptive compressors, which are capable of efficiently encoding only a very select type of input data for which they are designed.

A mixture of these two dictionary encoding methods is the semi-adaptive encoding method. A semi-adaptive encoder makes an initial pass over the data to build the dictionary and a second pass to perform the actual encoding. Using this method, an optimal dictionary is constructed before any encoding is actually performed.

Lossy and Lossless Compression

The majority of compression schemes we deal with in this chapter are called lossless. What does this mean? When a chunk of data is compressed and then decompressed, the original information contained in the data is preserved. No data has been lost or discarded; the data has not been changed in any way.

Lossy compression methods, however, throw away some of the data in an image in order to achieve compression ratios better than that of most lossless compression methods. Some methods contain elaborate heuristic algorithms that adjust themselves to give the maximum amount of compression while changing as little of the visible detail of the image as possible. Other less elegant algorithms might simply discard a least significant portion of each pixel, and, in terms of image quality, hope for the best.

The terms lossy and lossless are sometimes erroneously used to describe the quality of a compressed image. Some people assume that if any image data is lost, this could only degrade the image. The assumption is that we would never want to lose any data at all. This is certainly true if our data consists of text or numerical data that is associated with a file, such as a spreadsheet or a chapter of our great American novel. In graphics applications, however, under certain circumstances data loss may be acceptable, and even recommended.

In practice, a small change in the value of a pixel may well be invisible, especially in high-resolution images where a single pixel is barely visible anyway. Images containing 256 or more colors may have selective pixel values changed with no noticeable effect on the image.

In black-and-white images, however, there is obviously no such thing as a small change in a pixel's value: each pixel can only be black or white. Even in black-and-white images, however, if the change simply moves the boundary between a black and a white region by one pixel, the change may be difficult to see and therefore acceptable.

As mentioned in Chapter 2, Computer Graphics Basics, the human eye is limited in the number of colors it can distinguish simultaneously, particularly if those colors are not immediately adjacent in the image or are sharply contrasting. An intelligent compression algorithm can take advantage of these limitations, analyze an image on this basis, and achieve significant data size reductions based on the removal of color information not easily noticed by most people.

In case this sounds too much like magic, rest assured that much effort has gone into the development of so-called lossy compression schemes in recent years, and many of these algorithms can achieve substantial compression ratios while retaining good quality images. This is an active field of research, and we are likely to see further developments as our knowledge of the human visual system evolves, and as results from the commercial markets regarding acceptance of lossy images make their way back to academia.

For more information about lossy compression, see the JPEG section later in this chapter.