Towards an efficient but functional representation on disk
I've been playing with `git worktree` to reduce the git overhead of tracking different versions of data repositories.
Checking out one work tree per version
Starting in a directory `$DATA`, the following example clones a bare copy of the ARGO project to `${DATA}/.bare/ARGO.git` and then checks out tag `v1.0` to a separate working directory `${DATA}/ARGO/v1.0/`.
```shell
mkdir -p ${DATA}/.bare
cd ${DATA}/.bare
git clone --bare git@git.geomar.de:data/ARGO.git
cd ARGO.git
git worktree add ${DATA}/ARGO/v1.0 v1.0
```
Disk usage
The resulting disk use is as follows:

- `${DATA}/.bare/ARGO.git` contains all data from all versions of all files in the ARGO project.
- `${DATA}/ARGO/v1.0` only contains a single working copy of the data at tag `v1.0`.
Scaling disk usage
Now, imagine that there is a `v1.1` which adds more temporal coverage (by appending profiles actually sampled after `v1.0` was created) and modifies a few files representing old profiles (by, e.g., retroactively flagging data as bad that have turned out to be problematic only recently). And imagine checking out `v1.1` to `${DATA}/ARGO/v1.1`.
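The whole two-version workflow can be tried out without access to the real ARGO project. The sketch below is a minimal, self-contained stand-in: a throwaway local repository replaces `git@git.geomar.de:data/ARGO.git`, and the profile file names are made up for illustration.

```shell
# Toy version of the workflow: a local stand-in repository replaces the
# remote, and the profile file names are hypothetical.
set -e
SANDBOX=$(mktemp -d)
DATA=${SANDBOX}/data

# Build a tiny "upstream" project with two tagged versions.
git init -q ${SANDBOX}/upstream
cd ${SANDBOX}/upstream
git config user.email "demo@example.com"
git config user.name "Demo"
echo "profile 001" > profile_001.txt
git add . && git commit -qm "initial profiles"
git tag v1.0
echo "profile 002" > profile_002.txt            # data appended after v1.0
echo "profile 001 (flagged)" > profile_001.txt  # retroactive correction
git add . && git commit -qm "add profile 002, flag profile 001"
git tag v1.1

# One bare clone plus one worktree per tag, as described above.
mkdir -p ${DATA}/.bare
git clone -q --bare ${SANDBOX}/upstream ${DATA}/.bare/ARGO.git
cd ${DATA}/.bare/ARGO.git
git worktree add ${DATA}/ARGO/v1.0 v1.0
git worktree add ${DATA}/ARGO/v1.1 v1.1

ls ${DATA}/ARGO/v1.0   # profile_001.txt
ls ${DATA}/ARGO/v1.1   # profile_001.txt  profile_002.txt
```

Each worktree holds exactly the files as of its tag: `v1.0` has only the original profile, while `v1.1` has the appended profile plus the corrected one.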
The resulting disk usage is:
- `${DATA}/.bare/ARGO.git` contains all data from all versions of all files in the ARGO project.
- `${DATA}/ARGO/v1.0` only contains a single working copy of the data at tag `v1.0`.
- `${DATA}/ARGO/v1.1` only contains a single working copy of the data at tag `v1.1`.
In concrete numbers:
- `${DATA}/.bare/ARGO.git`: approx. 80 GB
- `${DATA}/ARGO/v1.0`: approx. 80 GB
- `${DATA}/ARGO/v1.1`: approx. 80 GB
- sum for 2 versions with minor differences: approx. 3 × 80 GB, or 240 GB
For version bumps changing only a few files, this scales as (n + 1) for n checked-out versions (n working copies plus one bare repository), as opposed to 2n for the conventional approach of cloning into one directory per tag, where each clone carries a working copy plus its own complete `.git` directory. For version bumps changing the majority of the data, the scaling approaches but never exceeds 2n.
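The scaling argument amounts to simple arithmetic. A small sketch, assuming (as in the numbers above) that the bare repository and each working copy all weigh roughly the same size S when version bumps change only a few files:

```python
S = 80  # GB, approximate size of one copy of the ARGO data

def worktree_usage(n, size=S):
    """Worktree setup: one bare repository plus n working trees,
    i.e. (n + 1) * size."""
    return (n + 1) * size

def clone_usage(n, size=S):
    """Conventional setup: n full clones, each holding a working copy
    plus its own complete .git directory, i.e. 2 * n * size."""
    return 2 * n * size

print(worktree_usage(2))  # 240 GB, matching the sum above
print(clone_usage(2))     # 320 GB for two conventional clones
```

The gap widens with every additional version: for ten versions the worktree setup needs about 880 GB versus 1600 GB for ten conventional clones.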
Why keep the bare repos at all?
The described setup allows any user to `cd ${DATA}/ARGO/v1.0` and run `git log` to check data provenance themselves. Dropping the bare repositories would break this provenance trail.
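This dependence can be demonstrated directly: a worktree's `.git` is only a pointer into the bare repository, so deleting the bare copy leaves the data files in place but severs access to the history. A self-contained sketch (local toy repository, hypothetical file name):

```shell
# Toy demonstration that a worktree keeps full history access --
# and loses it as soon as the bare repository is removed.
set -e
SANDBOX=$(mktemp -d)
git init -q ${SANDBOX}/upstream
cd ${SANDBOX}/upstream
git config user.email "demo@example.com"
git config user.name "Demo"
echo data > file.txt
git add . && git commit -qm "initial data"
git tag v1.0

DATA=${SANDBOX}/data
mkdir -p ${DATA}/.bare
git clone -q --bare ${SANDBOX}/upstream ${DATA}/.bare/ARGO.git
git -C ${DATA}/.bare/ARGO.git worktree add ${DATA}/ARGO/v1.0 v1.0

# Provenance check works from inside the worktree:
git -C ${DATA}/ARGO/v1.0 log --oneline

# ...but the worktree's .git file merely points into the bare
# repository, so removing it cuts off the history:
rm -rf ${DATA}/.bare
git -C ${DATA}/ARGO/v1.0 log --oneline || echo "history gone"
```

The data files under `${DATA}/ARGO/v1.0` survive the deletion, but `git log` fails with "not a git repository": the working copy is still usable as plain files, while its provenance is gone.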