HDF
Mike Folk

A fundamental requirement of scientific data management is the ability to access as much information as possible, in as many ways as possible, and as quickly and easily as possible. To make this possible, there needs to be a data storage and retrieval system that facilitates these capabilities. Specific needs of such a system include the following.

* Support for scientific data and metadata. Scientific data is characterized by a variety of data types and representations, data sets (including images) that can be extremely large and complex, and the need to attach accompanying attributes, parameters, notebooks, and other metadata.

* Support for a range of hardware platforms. Data can originate on one machine, only to be used later by many different machines. Scientists must be able to access data and metadata on as many hardware platforms as possible.

* Support for a range of software tools. Scientists need a variety of software tools and utilities for easily searching, analyzing, archiving, and transporting the data and metadata. These tools range from a library of routines for reading and writing data and metadata, to small utilities that simply display an image on a console, to full-blown database retrieval systems that provide multiple views of thousands of sets of data and metadata.

* Rapid data transfer. Both the size and the dispersion of scientific data sets require mechanisms for moving data from place to place rapidly.

* Extensibility. As new types of information are generated and new kinds of science are done, a means must be provided to support them.

The HDF project at NCSA is one important part of a larger effort to address these needs. HDF ("Hierarchical Data Format") itself is a self-describing, extensible file format based on the use of tagged objects that have standard meanings. The set of available data objects encompasses both primary data and secondary data (metadata). Most HDF objects are machine- and medium-independent physical representations of data and metadata.

HDF data objects and structures. HDF is designed with the assumption that we cannot know a priori what types of data objects will be needed in the future, nor can we know how scientists will want to view their data. As new science is done, new types of data objects are needed and new tags must be created. To avoid unnecessary proliferation of tags, and to ensure that all tags are available to potential users who need to share data, a public domain, portable library is available that interprets all public tags. The library contains user interfaces designed to provide views of the data that are most natural for users. As we learn more about the way scientists need to view their data, we can add user interfaces that reflect data models consistent with those views.

HDF currently supports the most common types of data and metadata that scientists use, including multidimensional gridded data, 2D and 3D raster images, polygonal mesh data, multivariate datasets, sparse matrices, finite-element data, spreadsheets, splines, non-Cartesian coordinate data, and text. In the future there will almost certainly be a need to incorporate new types of data, such as voice and video, some of which might actually be stored on media other than the central file itself. In this sense, it may be desirable to employ the concept of a "virtual file", which functions like a file but does not fit our normal notion of a file as a monolithic sequence of bits stored entirely on a disk or tape somewhere.

HDF supports a hierarchical grouping structure called Vset that allows scientists to organize data objects within HDF files to fit their views of how the objects go together, much as a person in an office or laboratory organizes information in folders, drawers, journal boxes, and on a desktop. Unlike the way we might organize information in our offices, however, more than one structure can be used to describe the same data, providing different views of the data depending on the scientists' needs.

HDF Software. The usefulness of a file storage format is enhanced enormously by the provision of tools for accessing and analyzing data stored in the format. HDF provides a public domain, portable library of routines for accessing scientific data and converting it to or from host-specific formats. By using these routines to store and retrieve their data in HDF files, scientists gain access to the full capabilities of the library. The HDF library currently runs on about a dozen machines, ranging from the Macintosh to the Cray, and is continually being ported to new platforms.
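As a concrete, simplified illustration, the sketch below writes a small gridded dataset and its descriptive metadata as tagged objects, then reads the data back, using the DFSD (scientific data set) routines of the HDF C library. It is only a sketch: the file name, data values, and label strings are illustrative, and header layout and type names may vary across HDF releases.

    /* Write a small 2-D float grid plus label/unit metadata to an HDF file
     * as tagged objects, then read the grid back.  The library, not the
     * program, handles byte order and floating-point representation. */
    #include "hdf.h"                    /* DFSD prototypes and HDF types */

    int main(void)
    {
        int32   dims[2] = {3, 4};       /* rank-2 grid: 3 rows, 4 columns */
        float32 grid[3][4], copy[3][4];
        int     i, j;

        for (i = 0; i < 3; i++)
            for (j = 0; j < 4; j++)
                grid[i][j] = (float32)(i * 4 + j);

        /* Attach label, unit, format, and coordinate-system metadata,
         * then append the data set to the file. */
        DFSDsetdims(2, dims);
        DFSDsetdatastrs("temperature", "kelvin", "F7.2", "cartesian");
        DFSDadddata("example.hdf", 2, dims, (VOIDP)grid);

        /* Any supported platform can now read the same file. */
        DFSDgetdata("example.hdf", 2, dims, (VOIDP)copy);
        return 0;
    }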
Within NCSA the HDF project is closely tied to other projects that support scientific data analysis and management, including the NCSA Software Tools Group (STG) and the data management facility project. STG provides software to enhance communications and to serve the special needs of computational science, especially in the area of scientific visualization. By using HDF as their data storage medium, these tools are able to link personal computers, remote workstations, and supercomputers into a cooperative computing environment. They enable scientists to generate images from datasets, analyze and animate those images, present the results of their research to colleagues and students, and share data files.

Visualization tools + file structures. One example of this synergy between file format and visualization tools is HDF's Vset structure and its use by NCSA's Polyview tool. Fig. 1 (I don't yet have the slide for this...sorry) shows a Polyview view of the results of a simulation of protein folding in cytochrome C. The HDF Vset structure stores irregularly structured, non-uniform datasets that can be described by vertices or polygons. In this case, the "polygon" is a schematic representation of cytochrome C. Polyview is an STG tool for Silicon Graphics workstations that displays polygons, or points, as a 2D or 3D image and allows a user to render, manipulate, and display the objects described by the Vset. In the cytochrome C simulation, the protein folding process was simulated over 1000 time steps, each time step corresponding to a new polygon. By using Polyview to animate the 1000-step folding process, scientists were able to discover events and relationships that were not apparent when the data was viewed in any other form.
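To suggest how such a series of time steps might be grouped, the following sketch creates a nested pair of Vset groups (vgroups) with the HDF C library. The file name, group names, and the single nested time step are illustrative assumptions, not a description of the actual cytochrome C file.

    /* Group related objects in an HDF file with vgroups, much as folders
     * group documents: a vgroup for the simulation run contains a vgroup
     * for each time step. */
    #include "hdf.h"

    int main(void)
    {
        int32 file_id, run_vg, step_vg;

        file_id = Hopen("folding.hdf", DFACC_CREATE, 0);
        Vstart(file_id);                      /* initialize the Vset interface */

        run_vg = Vattach(file_id, -1, "w");   /* -1 requests a new vgroup */
        Vsetname(run_vg, "cytochrome_c_run");

        step_vg = Vattach(file_id, -1, "w");
        Vsetname(step_vg, "time_step_0001");

        /* Nest the time step inside the run.  Previously written tagged
         * objects (a vertex list, say, identified by its tag/ref pair)
         * could be added to the step with Vaddtagref(step_vg, tag, ref). */
        Vinsert(run_vg, step_vg);

        Vdetach(step_vg);
        Vdetach(run_vg);
        Vend(file_id);
        Hclose(file_id);
        return 0;
    }

Because more than one vgroup can refer to the same tagged objects, the same data can appear in several such hierarchies, giving the multiple views of a dataset described earlier.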
File format standards. Although HDF was designed to be a local NCSA standard, the national and international clientele of the center has taken HDF into a much broader arena. Numerous collaborations with commercial vendors (e.g., Thinking Machines Corporation, Convex), research centers (e.g., LLNL, CNSF, SDSC, SLATEC), and various universities have resulted from the enthusiastic reception of HDF's functionality. Of course, HDF is only one of many standard file formats in the scientific community. The problem of interoperability among many standards is a particularly vexing one, and one that must be dealt with increasingly as scientists share data and tools.

One approach to the problem, from the perspective of HDF, is to increase the scope of HDF by incorporating the functionality of other data handling systems into its extensible structure. For example, the astronomical community has developed a standard data format called FITS for interchanging astronomical images and other digital arrays. Since many users of FITS can benefit from the functionality that has grown up around HDF, translators have been written to convert FITS images to an HDF format. Another widely used and fast-growing standard is netCDF, which provides a data model that very effectively meets the needs of many different scientific communities. Merging the calling interface of netCDF with HDF would provide these communities with access to both a standard format and a set of powerful tools (the STG tools are based on HDF).
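The sketch below is meant only to convey the flavor of such a netCDF-style calling interface layered over HDF, expressed here with the SD routines that appear in later HDF releases and follow the netCDF data model; the file, variable, and attribute names are illustrative assumptions rather than part of the merged interface described above.

    /* Create a named, dimensioned variable with an attached attribute and
     * write an array to it -- the netCDF way of thinking, expressed
     * through HDF's SD calls. */
    #include "mfhdf.h"

    int main(void)
    {
        int32  dims[2]  = {180, 360};          /* latitude x longitude grid */
        int32  start[2] = {0, 0};
        static float32 field[180][360];        /* zero-filled sample data   */
        int32  sd_id, sds_id;

        sd_id  = SDstart("climate.hdf", DFACC_CREATE);
        sds_id = SDcreate(sd_id, "surface_temperature", DFNT_FLOAT32, 2, dims);

        /* Attributes (metadata) travel with the variable in the same file. */
        SDsetattr(sds_id, "units", DFNT_CHAR8, 6, "kelvin");

        SDwritedata(sds_id, start, NULL, dims, (VOIDP)field);

        SDendaccess(sds_id);
        SDend(sd_id);
        return 0;
    }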