Publication Details |
Category | Text Publication |
Reference Category | Journals |
DOI | 10.5194/gmd-18-5873-2025 |
Licence | |
Title (Primary) | Statistical summaries for streamed data from climate simulations: one-pass algorithms |
Author | Grayson, K.; Thober, S. |
Source Title | Geoscientific Model Development |
Year | 2025 |
Department | CHS |
Volume | 18 |
Issue | 17 |
Page From | 5873 |
Page To | 5890 |
Language | English |
Topic | T5 Future Landscapes |
Data and Software links | https://doi.org/10.26050/WDCC/nextGEMS_cyc3 https://doi.org/10.5281/zenodo.12533197 https://doi.org/10.5281/zenodo.15438184 https://doi.org/10.5281/zenodo.15439803 |
Abstract | Global climate models (GCMs) are being increasingly run at finer resolutions to better capture small-scale dynamics and reduce uncertainties associated with parameterizations. Despite advances in high-performance computing (HPC), the resulting terabyte- to petabyte-scale data volumes now being produced from GCMs are overwhelming traditional long-term storage. To address this, some climate modelling projects are adopting a method known as data streaming, where model output is transmitted directly to downstream data consumers (any user of climate model data, e.g. an impact model) during model runtime, eliminating the need to archive complete datasets. This paper introduces the One_Pass Python package (v0.8.0), which enables users to compute statistics on streamed GCM output via one-pass algorithms – computational techniques that sequentially process data in a single pass without requiring access to the full time series. Crucially, packaging these algorithms independently, rather than relying on standardized statistics from GCMs, provides flexibility for a diverse range of downstream data consumers and allows for integration into various HPC workflows. We present these algorithms for four different statistics: mean, standard deviation, percentiles and histograms. Each statistic is presented in the context of a use case, showing its application to a relevant variable. For statistics that can be represented by a single floating point value (i.e. mean, standard deviation, variance), the results are identical to “conventional” approaches within numerical precision limits, while the memory savings scale linearly with the period of time covered by the statistic. For statistics that require a distribution (percentiles and histograms), we make use of the t-digest, an algorithm that ingests streamed data and reduces them to a set of key clusters representing the distribution. Using this algorithm, we achieve excellent accuracy for variables with near-normal distributions (e.g. wind speed) and acceptable accuracy for skewed distributions such as precipitation. We also provide guidance on the best compression factor (the memory vs. accuracy trade-off) to use for each variable. We conclude by exploring the concept of convergence in streamed statistics, an essential factor for downstream applications such as bias-adjusting streamed data. |
Persistent UFZ Identifier | https://www.ufz.de/index.php?en=20939&ufzPublicationIdentifier=31338 |
Grayson, K., Thober, S., Lacima-Nadolnik, A., Alsina-Ferrer, I., Lledó, L., Sharifi, E., Doblas-Reyes, F. (2025): Statistical summaries for streamed data from climate simulations: one-pass algorithms. Geosci. Model Dev. 18 (17), 5873 - 5890. https://doi.org/10.5194/gmd-18-5873-2025 |
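
The abstract describes one-pass algorithms that update a statistic incrementally as each time step of streamed model output arrives, so the full time series never needs to be held in memory. The following minimal sketch illustrates that idea for the mean and standard deviation using Welford's single-pass update; it is not the One_Pass package's API, and the class name and interface are illustrative assumptions only.

```python
import numpy as np


class OnePassMeanVar:
    """Illustrative streaming mean/variance via Welford's single-pass update.

    Each incoming value (e.g. one model time step) updates the running
    summary, so only three scalars are kept in memory instead of the
    full time series.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford update: numerically stable accumulation in a single pass
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # population variance of the values seen so far
        return self.m2 / self.count if self.count > 0 else 0.0

    @property
    def std(self):
        return self.variance ** 0.5


# Feed values one at a time, as a data stream would deliver them
stats = OnePassMeanVar()
for value in np.random.default_rng(0).normal(size=1000):
    stats.update(value)
print(stats.mean, stats.std)
```

For statistics that need a distribution (percentiles, histograms), the paper instead uses the t-digest, which follows the same incremental pattern but reduces the stream to a compact set of centroids that can later be queried for quantiles.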