Status and Plans for HDF-EOS, NASA's Format for EOS Standard Products
In the early 1990s, NASA's Earth Science Data Information Systems (ESDIS) Project
began to address the technological challenges involved in producing, distributing,
analyzing and archiving the Earth Observing System (EOS) Standard Products.
Since data interuse and inter-disciplinary science investigation are central
to Earth systems science in general and particularly to the goals of EOS, NASA
found that a standard data format would best facilitate data exchange and interoperability.
Furthermore, the existence of a common format would encourage the development
of tools for analysis that could be applied across the spectrum of data sets.
When it was determined that a common format could provide important benefits
to EOS, ESDIS engaged a number of DAACs and science teams in examining and testing
their data products using a variety of common scientific formats. In 1993, after
careful review of more than a dozen alternatives, NASA chose the Hierarchical
Data Format (HDF) as the file format for EOS Standard Products. The HDF is a
file format, application programming interface (API) and implementing library
developed by the National Center for Supercomputing Applications (NCSA). HDF
is well suited as a standard for Earth Science data. It is self describing,
it is portable across many computing systems, and it is designed explicitly
for scientific use with predefined structures common to scientific data. Furthermore,
EOS teams have found HDF to be actively and effectively supported by NCSA,
a national leader in the advancement of applications computing.
To further facilitate data sharing, certain "idioms" with respect
to geo-referencing, data organization, and metadata storage are encouraged.
The EOS standard use of HDF for satellite swath data, gridded data, and point
data are implemented by HDF-EOS, developed by NASA under the EOSDIS Core System
(ECS) contract. HDF-EOS is an API and library of routines that invoke HDF to
create standard groups of HDF objects that form HDF-EOS idioms. Within HDF-EOS
are 'structural metadata' that provide a common mechanism for attaching geo-referencing
information to science data. In addition, a software library, called the ECS
Science Data Processing Toolkit is provided to implement a standard for attaching
inventory metadata to HDF-based files. At the time NASA selected this standard
for use by EOS, HDF was in version 4. In this article, I refer to the version
of HDF-EOS built on HDF4 as HE4.
In 1999, NCSA released HDF version 5, a greatly improved, but structurally incompatible
format. That is, the data model or internal storage implemented by HDF version
5 is very different from that implemented by HDF version 4. For many, the new
data format has been a cause for concern, either about the cost of transitioning
from HDF4 to HDF5 or about potential termination of support for HDF4. However,
NASA will not terminate support for HDF4 as long as the format is needed, and
NASA, ECS and NCSA are working together to develop tools to help in data transition.
HDF5, the new standard, is different from HDF4, the format for EOS standard
products adopted for TRMM, Terra, ACRIM, SAGE III and Aqua. The new standard
is not backward compatible with the old, either in code or in the underlying
conceptual model. This is not good news for long-term science endeavors. Changes
in computing technologies pose a real and serious challenge to the maintenance of
long-series data collections.
Why HDF5?
As science computing systems evolved, it became clear to NCSA's HDF group that
HDF4 would have difficulty evolving to meet the demands of these systems. The
future of Earth Observing systems is likely to include parallel processing environments,
very large data sets, data spanning multiple computing environments, new data
models, and complex data analysis and visualization capabilities requiring industry
standard interfaces. But HDF4 supports only datasets smaller than 2 gigabytes,
with fewer than 20,000 datasets in any one file, and is not capable of efficiently
performing I/O in parallel computing environments. Size and complexity are also
an issue. The HDF4 library consists of over 300,000 lines of mature, heritage code
that represents a variety of disparate scientific data models. The lack of underlying
commonality in the implementation of these models contributes to the complexity
of the code. This conceptual complexity in turn makes it difficult to adapt
the library to modern high performance computing architectures.
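The HDF4 limits quoted above can be turned into a quick feasibility check for a planned product. The following is a minimal sketch only: it reads "2 gigabytes" as 2 GiB (an assumption), and the function names and the example product plan are hypothetical, not part of any HDF library.

```python
from math import prod

# Limits as stated in the text; "2 gigabytes" is read as 2 GiB (assumption).
HDF4_MAX_DATASET_BYTES = 2 * 1024**3
HDF4_MAX_DATASETS_PER_FILE = 20_000


def dataset_bytes(shape, itemsize):
    """Uncompressed size in bytes of an array with the given shape."""
    return prod(shape) * itemsize


def hdf4_problems(datasets):
    """Return the reasons a planned file would exceed the limits above.

    `datasets` maps a dataset name to a (shape, itemsize) pair.
    """
    problems = []
    if len(datasets) >= HDF4_MAX_DATASETS_PER_FILE:
        problems.append(f"{len(datasets)} datasets in one file")
    for name, (shape, itemsize) in datasets.items():
        size = dataset_bytes(shape, itemsize)
        if size >= HDF4_MAX_DATASET_BYTES:
            problems.append(f"{name} needs {size} bytes")
    return problems


# Hypothetical plan: a 0.01-degree global grid of 32-bit floats
# next to a much smaller swath-sized array.
plan = {
    "global_grid": ((18000, 36000), 4),
    "swath_band": ((2030, 1354), 4),
}
print(hdf4_problems(plan))
```

In this made-up plan only the global grid is flagged, since 18000 x 36000 four-byte values come to roughly 2.6 billion bytes.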
NCSA spent three years looking for ways to extend and adapt HDF4 to meet these
challenges, but in the end it was clear that such an adaptation would only result
in an extremely complex format and I/O library, which would not only be difficult
to maintain, but would not meet these new requirements nearly as well as a completely
new design would. Indeed, it was felt that, if the HDF libraries were not completely
overhauled, the data format and software would gradually become unable to support
the modern computing needs of scientists. These pressures, together with the
lessons learned by NCSA in developing and supporting HDF4 over many years,
led to the development of HDF5, a new data paradigm built on a solid foundation
of computing science data principles.
The good news is that HDF5 is clearly superior to HDF4. The underlying concepts
are more robust and the workmanship is cleaner, more compact, more direct and simply
more maintainable. HDF5 will be a powerful, flexible and pragmatic data format
for many years, or decades. There are no plans for a future transition to another,
different "HDF6". We believe that HDF5 "got it right"; that
the capabilities built into HDF5 will directly benefit the Earth Science Community
because they directly map into our needs in the near and distant future.
Just as NASA defined certain aggregates of HDF4 structures to represent HDF-EOS
point, swath and grid in HE4, so HE5 is a standard usage of HDF5 to implement
these same structures. In October 2000, the EOS Aura Data Systems Working Group
adopted HE5 as the Aura platform standard. Aura is the first EOS mission to
use HDF5 and HE5 for all standard products. The Aura instrument science teams
are working together to further standardize their products, defining standard file
metadata and other conventions to assure compatibility among the teams.
Continued Support for HDF4
HDF4 heritage and transition from HDF4 to HDF5 are important considerations.
Many years of effort have gone into developing high quality data production
software based on the HDF4 and HE4 standards, and any change to a new standard
is a rightful concern. NASA and NCSA understand this, and are striving to assure
that these challenges will be no more burdensome than necessary for science data
producers and, especially, for science data end users. We will do that by
developing compatibility and transition tools, by working closely with teams
that make the transitions, and by continuing to maintain the HDF4 code as long
as required.
Despite the arrival of HDF5, HDF4 and HE4 are not "dead" or even "heritage"
formats. All standard products from Terra use these formats, as do the standard
products from other missions mentioned above. The standard products from the
upcoming Aqua mission will also use HE4. NASA's support for HDF4 and HE4 will
continue for as long as significant data holdings use this standard. The level
of support for HDF4/HE4 includes correcting errors, porting to new operating
systems, adding functionality as required and providing help-desk support for
installation and use.
NCSA understands the requirements of the EOS community and NASA's Earth Science
Enterprise. The overriding concern is that we must assure that our data remain
usable over time. NCSA shares our goal of maintaining viability of scientific
data over an extended period. At NASA's request, they have produced detailed
documentation of the HDF4 data format and software. This documentation is indispensable
for long term preservation of the data because it will retain the format design
independent of implementing code. This is necessary insurance, but we do not
intend to rely on it. NCSA will continue to maintain HDF4 for as long as NASA
identifies this as a requirement. The most recent release of HDF4 (4.1r4) is
dated November 2000 and the next is expected this coming fall. That release
will include some minor new functionality as well as bug fixes and tools updates.
The HE4 libraries will also continue to be supported under NASA's direction,
under the ECS contract and in the future. The HE4 API is stable at HDF-EOS 2.72,
released March 1, 2001. The next release is planned for September 2001; it will
contain no major new capabilities, only bug fixes. NASA anticipates that HE4 will
continue to have maintenance releases that include error corrections and porting
to new hardware or operating systems at a rate of once or twice a year for as
long as support for HDF4 is required.
Facilitating the transition to HDF5 and HE5
HDF-EOS helps
The lack of direct backward compatibility of either the API or the file format
complicates the transition from HDF4 to HDF5. However, by minimizing the science
impact of the changing format standard, HDF-EOS simplifies things enormously.
To the extent that EOS products adhere to the HDF-EOS standard format and API,
we will be able to provide tools that automatically convert to the new representation
or provide common read access across implementations. That is, if applications
use only the HDF-EOS swath, grid, and point objects, and HDF-EOS metadata, the
transition from HDF4 to HDF5 is greatly simplified. Another benefit for product
developers and users is that the HE5 API bears a very strong resemblance to
the HE4 API. This is one of the primary reasons for defining standard EOS types
using HDF.
Transition tools
Together with our software developers, NASA and NCSA are investigating several
transition tool strategies. In evaluating potential tools, we are considering
the following kinds of questions:
We think that the answers to these questions, like so many technical details
that relate to such a large and varied field, will differ by application area,
science discipline and product.
NASA and NCSA are providing direct help in the form of "compatibility tools"
to many science product developers and vendors of tools that work with EOS data.
With compatibility tools in place, the science team can confidently time any
contemplated conversion to the point of least disruption. It is likely that
such a change would be in conjunction with scientifically necessary reprocessing.
Such reprocessing generally makes previous instances of the product obsolete
and so interuse between product versions is not as great a concern. Still, data
interoperability requires that a data product in HE5 remain usable alongside older
HE4 products.
The most basic transition tool is a data file converter. This is a utility or
API whose input is a file in one format and whose output is a file in another
format.
Converting simple HDF-EOS files from HE4 to HE5.
The ECS contractor has developed a tool called heconvert that converts HE4 Grid,
Point and Swath objects to the same respective HE5 objects. The heconvert program
incorporates both HE libraries. We are exploring ways to publish this capability
for HDF-EOS objects as a separate API. Such a high-level compatibility library
would rely on the HDF4 API but would incorporate code to detect whether the
file is HDF4 or HDF5 and call the corresponding HE library.
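The detection step described above can be sketched by inspecting the file signature. This is an illustrative sketch, not the heconvert implementation: the function names are hypothetical, and the dispatch returns only a label where a real compatibility library would call into the HE4 or HE5 library. It relies on the published signatures of the two formats (a 4-byte magic number at the start of an HDF4 file, and an 8-byte HDF5 signature at byte 0 or, when a user block is present, at byte 512, 1024, 2048 and so on).

```python
HDF4_MAGIC = b"\x0e\x03\x13\x01"      # first 4 bytes of an HDF4 file
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"     # 8-byte HDF5 superblock signature


def detect_hdf_version(path):
    """Return 4, 5, or None depending on the file signature found."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == HDF4_MAGIC:
            return 4
        if head == HDF5_MAGIC:
            return 5
        # HDF5 permits a user block, so the signature may instead sit at
        # byte 512, 1024, 2048, ... (each candidate offset doubles).
        f.seek(0, 2)
        size = f.tell()
        offset = 512
        while offset + len(HDF5_MAGIC) <= size:
            f.seek(offset)
            if f.read(len(HDF5_MAGIC)) == HDF5_MAGIC:
                return 5
            offset *= 2
    return None


def he_library_for(path):
    """Name the HDF-EOS library a compatibility layer would call."""
    version = detect_hdf_version(path)
    if version is None:
        raise ValueError(f"{path} is neither HDF4 nor HDF5")
    return {4: "HE4", 5: "HE5"}[version]
```

A signature check of this kind says nothing about whether the file actually contains HDF-EOS objects; it only selects which library should attempt to open it.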
Converting simple HDF4 files to HDF5.
Likewise, NCSA has developed a tool called "h4toh5" that converts
an entire HDF4 file into a corresponding HDF5 file. "h4toh5" can be
used as a standalone utility, or it can be incorporated into other software.
For example, the NCSA HDF5 viewer/editor, H5View, uses h4toh5 to read an HDF4
file, convert it to HDF5, and view it. Hence, it is already possible with H5View
to view an HDF4 file as if it were an HDF5 file. NCSA is now also developing
an "h4toh5" library, which consists of a collection of function calls
for converting individual HDF4 objects to HDF5. This library provides more fine-grained
conversions than does the h4toh5 tool, and can, for example, be used to convert
only selected HDF4 objects to HDF5, or to change the relative placement of objects
in moving them from HDF4 to HDF5.
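As a rough illustration of using the standalone utility from other software, the sketch below wraps a whole-file conversion in a small Python helper. It assumes that an "h4toh5" executable is on the PATH and accepts the input file followed by the output file as positional arguments; the helper names and the choice of a ".h5" output suffix are hypothetical conveniences, not part of the NCSA tool.

```python
import subprocess
from pathlib import Path


def h4toh5_command(src, dst=None, tool="h4toh5"):
    """Build the argument list for converting one HDF4 file.

    If no output path is given, reuse the input name with a .h5 suffix
    (a naming choice made here, not something the tool requires).
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix(".h5")
    return [tool, str(src), str(dst)]


def convert(src, dst=None):
    """Run the converter and return the output path.

    Raises subprocess.CalledProcessError if the tool reports failure.
    """
    cmd = h4toh5_command(src, dst)
    subprocess.run(cmd, check=True)
    return Path(cmd[2])
```

Keeping the command construction separate from its execution makes the helper easy to check without the tool installed, for example when planning a batch conversion.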
Converting complex HE4 files to HE5.
The EOS standard does not require products to contain HDF-EOS objects exclusively.
Typically, science data product developers have placed both HE4 and HDF4 objects
into a single product. These products pose special challenges for automated
conversion. Each HE4 object (a set of related HDF4 objects) must be replaced
with its corresponding HE5 object and then each remaining HDF4 object must
be converted to an HDF5 object. If there were implicit relationships between
the HDF-EOS objects and the other HDF objects, the conversion software would
likely not know about these and so may not act appropriately. In these cases
the data format conversion requires engineering intervention to retain the intent
of the designers. One tool that facilitates complex conversions is an HE file cracker
tool called the Java EOS Browser. This application can open and display the contents
of either HE4 or HE5 granules.
Computer Aided Data Engineering Tool
We are studying the possibility of creating a computer aided data engineering
tool. This tool would combine components of several tools already developed
to display the structure of an EOS product in HDF4 and permit the user to interactively
transform selected objects into equivalent HDF5 objects. Guided by standard
transformations, the user would efficiently create a standard translation for
a particular product type. One result of this interaction would be the production
of the framework of a C language program that could be used to transform other
instances of the same EOS product. This code could be developed into a stand-alone
converter, an import front end to a higher level tool, or inserted into a later
version of product generation software.
Wrapping the HDF4 and HDF5 libraries into one I/O library.
It has been suggested that a single library be implemented that could entirely
support both the HDF4 and HDF5 APIs, making it possible for tools to read data
either from HDF4 or from HDF5 without knowing the difference. This approach
has been studied extensively by NCSA, and unfortunately it was found to be nearly
impossible to create a wrapper for all HDF read and write operations. This is
because of fundamental differences between the HDF4 and HDF5 data models and
APIs. While the HDF4 and HDF5 libraries are not backward compatible in general,
it may yet be possible to create a top layer that wraps both libraries for specific
HDF read and write operations (i.e. the HDF-EOS operations). Such a narrow compatibility
may be sufficient to provide most necessary product capabilities to general
purpose application layers such as Matlab, IDL or other data analysis or visualization
environments.
That said, it is almost certain that there will be products that use HDF4 capabilities
in unique ways. High level HDF4 to HDF5 compatibility for these products may
not be possible for a general library.
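The idea of a narrow top layer can be illustrated with a small facade that exposes a single operation and delegates it to whichever backend matches the file. This is a design sketch under stated assumptions, not an existing library: every class and function name here is hypothetical, and the backends are stand-ins that return canned data rather than calling the real HE4 or HE5 libraries.

```python
from typing import Callable, Dict, List, Protocol


class FieldBackend(Protocol):
    """What the facade needs from each underlying library binding."""

    def read_field(self, path: str, obj: str, field: str) -> List[float]: ...


class UnsupportedFile(Exception):
    """Raised when no backend is registered for the detected format."""


class NarrowReader:
    """Wraps one operation only: reading a named field of an HDF-EOS object."""

    def __init__(self, detect: Callable[[str], int],
                 backends: Dict[int, FieldBackend]) -> None:
        self._detect = detect
        self._backends = backends

    def read_field(self, path: str, obj: str, field: str) -> List[float]:
        version = self._detect(path)
        backend = self._backends.get(version)
        if backend is None:
            raise UnsupportedFile(f"no backend for HDF version {version!r}")
        return backend.read_field(path, obj, field)


class FakeBackend:
    """Stand-in for a real HE4 or HE5 binding; returns canned data."""

    def __init__(self, data: List[float]) -> None:
        self._data = data

    def read_field(self, path: str, obj: str, field: str) -> List[float]:
        return list(self._data)


def detect_by_suffix(path: str) -> int:
    """Toy detector for the demo only; a real one would inspect the file."""
    return 5 if path.endswith(".he5") else 4


reader = NarrowReader(detect_by_suffix,
                      {4: FakeBackend([1.0, 2.0]), 5: FakeBackend([3.0])})
print(reader.read_field("granule.hdf", "Grid1", "Temperature"))
print(reader.read_field("granule.he5", "Grid1", "Temperature"))
```

The facade deliberately offers nothing beyond the one shared operation, which mirrors the point made above: products that use HDF4 capabilities in unique ways fall outside what such a layer can cover.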
Writing and accessing metadata.
An example of specific high level compatibility is the writing and access of
metadata in HE4 and HE5 files. HDF-EOS files contain inventory and archive metadata
conformant to the EOSDIS data model and stored in Object Description Language
(ODL). The inventory metadata is a copy of the metadata that is produced when
the instance of the data product is created and is stored in a database for the purpose
of locating data in the archives. The ECS Science Data Processing Toolkit version
5.2.7.3 can access EOSDIS metadata in either HE4 or HE5 files. The API will
recognize which version of HDF-EOS is being processed and respond accordingly.
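To give a feel for what ODL-encoded inventory metadata looks like to a reader program, here is a minimal parsing sketch. It is not the Toolkit API: the function name is hypothetical, the sample text is made up for illustration, and the parser handles only flat, single-valued OBJECT blocks (no arrays, multi-line values, or nested containers).

```python
# Made-up sample in the OBJECT / VALUE / END_OBJECT layout used for
# inventory metadata; the values are illustrative only.
SAMPLE_ODL = """\
GROUP = INVENTORYMETADATA
  GROUPTYPE = MASTERGROUP
  OBJECT = SHORTNAME
    NUM_VAL = 1
    VALUE = "EXAMPLE"
  END_OBJECT = SHORTNAME
  OBJECT = RANGEBEGINNINGDATE
    NUM_VAL = 1
    VALUE = "2001-03-01"
  END_OBJECT = RANGEBEGINNINGDATE
END_GROUP = INVENTORYMETADATA
END
"""


def parse_odl_values(text):
    """Map each OBJECT name to its VALUE string (quotes removed)."""
    values = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue  # blank lines and the closing END carry no assignment
        key, _, val = line.partition("=")
        key, val = key.strip(), val.strip()
        if key == "OBJECT":
            current = val
        elif key == "END_OBJECT":
            current = None
        elif key == "VALUE" and current is not None:
            values[current] = val.strip('"')
    return values


print(parse_odl_values(SAMPLE_ODL))
```

Because the metadata is plain text with the same layout in HE4 and HE5 files, a reader of this kind does not need to know which HDF version held the text once it has been extracted.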
Conclusion
NASA recognizes that a burden has been placed on EOS data producers and users
by the introduction of HDF5. In the case of EOS Terra, Aqua and other current
HDF4 users, no requirement will be levied to convert data to the newer format.
HDF4 will be supported by NASA as long as a requirement exists. Data granules
based on HDF4 will remain accessible through EOSDIS and currently available services
will be supported indefinitely. Given the superior quality of the newer HDF5
format, we believe that science teams should, over the next few years, carefully
consider defining new products using the HDF5 standard, as the Aura team
has. For conversion of existing products, the change should probably be made in the context
of scientifically appropriate data processing or re-processing. To facilitate
transition to the new standard, a variety of tools will be available.
The status of the HDF and HDF-EOS standards is the subject of an annual workshop
open to all. This year, the fifth annual HDF/HDF-EOS Workshop is to be held
September 19-21 at the Hawthorne Suites hotel in Champaign, IL. Strategies and
plans for the new format standard will be a major focus of discussion. The workshop
also features demonstrations and tutorials. See the HDF-EOS web site for more
details.
The following URLs contain information about the HDF4 to HDF5 transition:
http://hdfgroup.com/h4toh5/
Links to the
HDF-EOS5 documentation and code