Publication Details |
Category | Text Publication |
Reference Category | Journals |
DOI | 10.5194/gmd-18-5873-2025 |
Licence | |
Title (Primary) | Statistical summaries for streamed data from climate simulations: one-pass algorithms |
Author | Grayson, K.; Thober, S. |
Source Title | Geoscientific Model Development |
Year | 2025 |
Department | CHS |
Volume | 18 |
Issue | 17 |
Page From | 5873 |
Page To | 5890 |
Language | English |
Topic | T5 Future Landscapes |
Data and Software links | https://doi.org/10.26050/WDCC/nextGEMS_cyc3 https://doi.org/10.5281/zenodo.12533197 https://doi.org/10.5281/zenodo.15438184 https://doi.org/10.5281/zenodo.15439803 |
Abstract | Global climate models (GCMs) are being increasingly run at finer resolutions to better capture small-scale dynamics and reduce uncertainties associated with parameterizations. Despite advances in high-performance computing (HPC), the resulting terabyte- to petabyte-scale data volumes now being produced from GCMs are overwhelming traditional long-term storage. To address this, some climate modelling projects are adopting a method known as data streaming, where model output is transmitted directly to downstream data consumers (any user of climate model data, e.g. an impact model) during model runtime, eliminating the need to archive complete datasets. This paper introduces the One_Pass Python package (v0.8.0), which enables users to compute statistics on streamed GCM output via one-pass algorithms – computational techniques that sequentially process data in a single pass without requiring access to the full time series. Crucially, packaging these algorithms independently, rather than relying on standardized statistics from GCMs, provides flexibility for a diverse range of downstream data consumers and allows for integration into various HPC workflows. We present these algorithms for four different statistics: mean, standard deviation, percentiles and histograms. Each statistic is presented in the context of a use case, showing its application to a relevant variable. For statistics that can be represented by a single floating point value (i.e. mean, standard deviation, variance), the results are identical to “conventional” approaches within numerical precision limits, while the memory savings scale linearly with the period of time covered by the statistic. For statistics that require a distribution (percentiles and histograms), we make use of the t-digest, an algorithm that ingests streamed data and reduces them to a set of key clusters representing the distribution. Using this algorithm, we achieve excellent accuracy for variables with near-normal distributions (e.g. wind speed) and acceptable accuracy for skewed distributions such as precipitation. We also provide guidance on the best compression factor (the memory vs. accuracy trade-off) to use for each variable. We conclude by exploring the concept of convergence in streamed statistics, an essential factor for downstream applications such as bias-adjusting streamed data. |
Persistent UFZ Identifier | https://www.ufz.de/index.php?en=20939&ufzPublicationIdentifier=31338 |
Grayson, K., Thober, S., Lacima-Nadolnik, A., Alsina-Ferrer, I., Lledó, L., Sharifi, E., Doblas-Reyes, F. (2025): Statistical summaries for streamed data from climate simulations: one-pass algorithms. Geosci. Model Dev. 18 (17), 5873 - 5890. https://doi.org/10.5194/gmd-18-5873-2025 |
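
The abstract describes one-pass algorithms that update a statistic incrementally as each time step of streamed model output arrives, so the full time series never needs to be held in memory. The following minimal sketch illustrates that idea for the mean and standard deviation using Welford's single-pass update; it is not the One_Pass package's API, and the class name and interface are illustrative assumptions only.

```python
import numpy as np


class OnePassMeanVar:
    """Illustrative streaming mean/variance via Welford's single-pass update.

    Each incoming value (e.g. one model time step) updates the running
    summary, so only three scalars are kept in memory instead of the
    full time series.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford update: numerically stable accumulation in a single pass
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # population variance of the values seen so far
        return self.m2 / self.count if self.count > 0 else 0.0

    @property
    def std(self):
        return self.variance ** 0.5


# Feed values one at a time, as a data stream would deliver them
stats = OnePassMeanVar()
for value in np.random.default_rng(0).normal(size=1000):
    stats.update(value)
print(stats.mean, stats.std)
```

For statistics that need a distribution (percentiles, histograms), the paper instead uses the t-digest, which follows the same incremental pattern but reduces the stream to a compact set of centroids that can later be queried for quantiles.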