Towards an efficient but functional representation on disk
I've been playing with `git worktree` to reduce the git overhead of tracking different versions of data repositories.
Checking out one work tree per version
Starting in a directory `$DATA`, the following example clones a bare copy of the ARGO project to `${DATA}/.bare/ARGO.git` and then checks out tag `v1.0` to a separate working directory `${DATA}/ARGO/v1.0/`.
```shell
mkdir -p ${DATA}/.bare
cd ${DATA}/.bare
git clone --bare git@git.geomar.de:data/ARGO.git
cd ARGO.git
git worktree add ${DATA}/ARGO/v1.0 v1.0
```
Disk usage
The resulting disk use is as follows:

- `${DATA}/.bare/ARGO.git` contains all data from all versions of all files in the ARGO project.
- `${DATA}/ARGO/v1.0` only contains a single working copy of the data at tag `v1.0`.
Scaling disk usage
Now, imagine that there is a `v1.1` which adds more temporal coverage (by appending profiles actually sampled after `v1.0` was created) and modifies a few files representing old profiles (by, e.g., retroactively flagging data as bad that have turned out to be problematic only recently). And imagine checking out `v1.1` to `${DATA}/ARGO/v1.1`.
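The whole two-version workflow can be tried out without access to the real ARGO project. The sketch below is a minimal, self-contained stand-in: a throwaway local repository replaces `git@git.geomar.de:data/ARGO.git`, and the profile file names are made up for illustration.

```shell
# Toy version of the workflow: a local stand-in repository replaces the
# remote, and the profile file names are hypothetical.
set -e
SANDBOX=$(mktemp -d)
DATA=${SANDBOX}/data

# Build a tiny "upstream" project with two tagged versions.
git init -q ${SANDBOX}/upstream
cd ${SANDBOX}/upstream
git config user.email "demo@example.com"
git config user.name "Demo"
echo "profile 001" > profile_001.txt
git add . && git commit -qm "initial profiles"
git tag v1.0
echo "profile 002" > profile_002.txt            # data appended after v1.0
echo "profile 001 (flagged)" > profile_001.txt  # retroactive correction
git add . && git commit -qm "add profile 002, flag profile 001"
git tag v1.1

# One bare clone plus one worktree per tag, as described above.
mkdir -p ${DATA}/.bare
git clone -q --bare ${SANDBOX}/upstream ${DATA}/.bare/ARGO.git
cd ${DATA}/.bare/ARGO.git
git worktree add ${DATA}/ARGO/v1.0 v1.0
git worktree add ${DATA}/ARGO/v1.1 v1.1

ls ${DATA}/ARGO/v1.0   # profile_001.txt
ls ${DATA}/ARGO/v1.1   # profile_001.txt  profile_002.txt
```

Each worktree holds exactly the files as of its tag: `v1.0` has only the original profile, while `v1.1` has the appended profile plus the corrected one.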
The resulting disk usage is:
- `${DATA}/.bare/ARGO.git` contains all data from all versions of all files in the ARGO project.
- `${DATA}/ARGO/v1.0` only contains a single working copy of the data at tag `v1.0`.
- `${DATA}/ARGO/v1.1` only contains a single working copy of the data at tag `v1.1`.
In concrete numbers:
- `${DATA}/.bare/ARGO.git`: approx. 80 GB
- `${DATA}/ARGO/v1.0`: approx. 80 GB
- `${DATA}/ARGO/v1.1`: approx. 80 GB
- sum for 2 versions with minor differences: approx. 3 × 80 GB, or 240 GB
For version bumps changing only a few files, this scales as (n + 1) for n checked-out versions (n working copies plus one bare repository), as opposed to 2n for the conventional approach of cloning into one directory per tag, where each clone carries a working copy plus its own complete `.git` directory. For version bumps changing the majority of the data, the scaling approaches but never exceeds 2n.
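The scaling argument amounts to simple arithmetic. A small sketch, assuming (as in the numbers above) that the bare repository and each working copy all weigh roughly the same size S when version bumps change only a few files:

```python
S = 80  # GB, approximate size of one copy of the ARGO data

def worktree_usage(n, size=S):
    """Worktree setup: one bare repository plus n working trees,
    i.e. (n + 1) * size."""
    return (n + 1) * size

def clone_usage(n, size=S):
    """Conventional setup: n full clones, each holding a working copy
    plus its own complete .git directory, i.e. 2 * n * size."""
    return 2 * n * size

print(worktree_usage(2))  # 240 GB, matching the sum above
print(clone_usage(2))     # 320 GB for two conventional clones
```

The gap widens with every additional version: for ten versions the worktree setup needs about 880 GB versus 1600 GB for ten conventional clones.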
Why keep the bare repos at all?
The described setup allows any user to `cd ${DATA}/ARGO/v1.0` and run `git log` to check data provenance themselves. Dropping the bare repositories would break this provenance trail.
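This dependence can be demonstrated directly: a worktree's `.git` is only a pointer into the bare repository, so deleting the bare copy leaves the data files in place but severs access to the history. A self-contained sketch (local toy repository, hypothetical file name):

```shell
# Toy demonstration that a worktree keeps full history access --
# and loses it as soon as the bare repository is removed.
set -e
SANDBOX=$(mktemp -d)
git init -q ${SANDBOX}/upstream
cd ${SANDBOX}/upstream
git config user.email "demo@example.com"
git config user.name "Demo"
echo data > file.txt
git add . && git commit -qm "initial data"
git tag v1.0

DATA=${SANDBOX}/data
mkdir -p ${DATA}/.bare
git clone -q --bare ${SANDBOX}/upstream ${DATA}/.bare/ARGO.git
git -C ${DATA}/.bare/ARGO.git worktree add ${DATA}/ARGO/v1.0 v1.0

# Provenance check works from inside the worktree:
git -C ${DATA}/ARGO/v1.0 log --oneline

# ...but the worktree's .git file merely points into the bare
# repository, so removing it cuts off the history:
rm -rf ${DATA}/.bare
git -C ${DATA}/ARGO/v1.0 log --oneline || echo "history gone"
```

The data files under `${DATA}/ARGO/v1.0` survive the deletion, but `git log` fails with "not a git repository": the working copy is still usable as plain files, while its provenance is gone.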