HDF5eis: A storage and input/output solution for big multidimensional time series data from environmental sensors
Author(s)
White, Malcolm CA; Zhang, Zhendong; Bai, Tong; Qiu, Hongrui; Chang, Hilary; Nakata, Nori
Publisher Policy
Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
Abstract
Modern high-performance computing (HPC) tasks overwhelm conventional geophysical data formats. We describe a new data schema called HDF5eis (read H-D-F-size) for handling big multidimensional time series data from environmental sensors in HPC applications and implement a freely available Python application programming interface (API) for building and processing HDF5eis files. HDF5eis augments the popular Hierarchical Data Format 5 with a minimal set of additional conventions that facilitate fast and flexible data input and output protocols for regularly sampled (in time) data with any number of dimensions. HDF5eis supports arbitrary ancillary data (e.g., metadata) storage in columnar format or as UTF-8 encoded byte streams alongside time series data. Our HDF5eis API enables simple and efficient access to big data sets distributed across a potentially large number of small heterogeneous files through a single point of access. HDF5eis outperforms conventional seismic data formats by up to two orders of magnitude in terms of random read access times. We contribute HDF5eis as an operational tool and an experimental draft proposal that will help establish the next generation of data standards in the earth sciences.
Date issued
2023-04-12
Department
Massachusetts Institute of Technology. Department of Earth, Atmospheric, and Planetary Sciences
Journal
Geophysics
Publisher
Society of Exploration Geophysicists
Citation
Malcolm C. A. White, Zhendong Zhang, Tong Bai, Hongrui Qiu, Hilary Chang, Nori Nakata; HDF5eis: A storage and input/output solution for big multidimensional time series data from environmental sensors. Geophysics 2023; 88 (3): F29–F38.
Version: Final published version