Compress data without sacrificing benefits from mirroring
Most upstream data repositories either serve uncompressed legacy formats or explicitly gzip their data files. The former gives away roughly a factor of 3 in file size; the latter keeps disk use reasonable but adds decompression overhead every time the data is read.
An approach that at first glance helps would be to download the data and then convert it to reasonably sized netCDF4-classic files with deflation. This, however, breaks efficient use of, e.g., wget's mirroring capabilities, which rely on comparing upstream files (timestamps and sizes) to those already present on disk.
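A minimal sketch of the two steps, assuming NCO is installed; the upstream URL, deflate level, and chunk size below are placeholders, not the values from the actual conversion script:

    # Mirror the upstream tree; wget's timestamping (-N, implied by --mirror)
    # only re-downloads files whose upstream timestamp/size differ locally.
    wget --mirror --no-parent --directory-prefix=mirror/ \
        https://data.example.org/dataset/

    # Convert a mirrored file to deflated netCDF4-classic (-7) with
    # deflate level 1 (-L 1); the chunking here is illustrative only.
    ncks -7 -L 1 --cnk_dmn time,1 mirror/file.nc compressed/file.nc

Once the local copies are re-encoded like this, they no longer match the upstream files, so a later wget --mirror run would re-download everything.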
Another thing to keep in mind: compression with, e.g., ncks (see TM/TMSoftware/convert_to_deflated_nc4classic_with_small_chunks.sh) breaks Git hashing, because NCO appends a timestamped entry to the global history attribute, so two runs over identical input produce different files. Simply keeping one copy for the mirror, then compressing, and then tracking versions won't work without cleaning the netCDF files first. (We need content-based hashing ...)
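A sketch of the cleanup, assuming NCO's ncks/ncatted; the -h flag tells NCO not to append to the history attribute, and ncatted can delete an existing one:

    # Compress without appending to the global history attribute (-h).
    ncks -h -7 -L 1 input.nc output.nc

    # Alternatively, delete an existing history attribute in place;
    # -h keeps ncatted itself from writing a new history entry.
    ncatted -O -h -a history,global,d,, output.nc

Whether this alone yields stable Git hashes depends on the rest of the file being bit-reproducible; a content-based hash over the decoded variables would be more robust.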