Commit c80f2793 authored by Willi Rath

Remove drr contents

parent 91e387d2
*.egg-info/
*.pyc
.cache/
.coverage
.eggs/
image: continuumio/miniconda3:latest

before_script:
  - conda install -q -y -c conda-forge python=3.5 pytest pytest-pep8 pytest-cov pep257
  - pip install -e .

test:
  stage: test
  script:
    - pytest --cov=data_repo_renderer --cov-report term-missing --cov-fail-under=85 -v tests/

style:
  stage: test
  allow_failure: true
  script:
    - pep257 -v data_repo_renderer/
    - pep8 -v .
# data repo renderer
Render data repos from a single YAML file.
- master:
[![build status](https://git.geomar.de/data/data_repo_renderer/badges/master/build.svg)](https://git.geomar.de/data/data_repo_renderer/commits/master)
[![coverage report](https://git.geomar.de/data/data_repo_renderer/badges/master/coverage.svg)](https://git.geomar.de/data/data_repo_renderer/commits/master)
- develop:
[![build status](https://git.geomar.de/data/data_repo_renderer/badges/develop/build.svg)](https://git.geomar.de/data/data_repo_renderer/commits/develop)
[![coverage report](https://git.geomar.de/data/data_repo_renderer/badges/develop/coverage.svg)](https://git.geomar.de/data/data_repo_renderer/commits/develop)
## What is this?
This is a Python package which takes a relatively simple YAML file (see the
examples in [input_data/](input_data/)) and creates (renders) a full data
repository with scripts to download, update, pre- and post-process, and
version-control data. The idea is that adding data sets to a central database
becomes easy for a normal user, who only has to fill in a template YAML file
and then either take care of the repository themselves or submit it via an
issue in the [data/docs project](https://git.geomar.de/data/docs/).
## What to read?
- If you just want to ask for the addition of a new data set, have a look at
the examples in [input_data/](input_data/). In particular, look at [the
HadISST example](input_data/HadISST/) and the corresponding [rendered
repository](https://git.geomar.de/data/HadISST/), and try to provide the
relevant information.
- If you want to fully maintain your own repository or help develop this
project, read on.
## Installation
To install the renderer, make sure you have a recent Python 3 (tests currently
run successfully with `3.5`), then clone the repository and install it:
```bash
cd ~/src/
git clone https://git.geomar.de/data/data_repo_renderer.git
cd data_repo_renderer
pip install -e .
```
See also [setup.py](setup.py).
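To check that the installation worked, import the package and print its version
string. This is the same version that appears in the
`# Rendered with data_repo_renderer <version>` header of the generated scripts:
```python
# Quick post-installation sanity check.
import data_repo_renderer

print(data_repo_renderer.__version__)
```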
## Usage
After installation, help can be found with:
```bash
data_repo_renderer -h
```
Typically, you will want to use
```bash
data_repo_renderer \
--prefix <destination_path> \
--util <additional_scripts> YAML_FILE
```
- `<destination_path>` is the path where the repository will be rendered. If
`--prefix` is omitted, the repository will be rendered in `./rendered/`.
- `<additional_scripts>` is a path to a directory with additional scripts that
will be copied to `<destination_path>/util/`. This path is meant to hold
scripts called for pre- or post-processing (a programmatic equivalent of the
CLI call is sketched below).
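The renderer can also be driven from Python. Below is a minimal sketch based on
how the test suite calls the command-line entry point; all paths are
placeholders you would replace with your own:
```python
import data_repo_renderer

# Programmatic equivalent of the CLI call above; the paths are hypothetical.
data_repo_renderer.cli_run_renderer([
    "--prefix", "/data/rendered/HadISST",   # destination of the rendered repo
    "--util", "/home/me/extra_scripts",     # copied to <destination_path>/util/
    "input_data/HadISST/meta.yaml",         # YAML file describing the repo
])
```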
## A walkthrough
This will explain all steps to create <https://git.geomar.de/data/HadISST/>
from [input_data/HadISST/meta.yaml](input_data/HadISST/meta.yaml).
### Configuration file
The configuration file `HadISST/meta.yaml` defines the desired paths to the
repository on GEOMAR's Git server, a description, and URLs for the data files
and the documentation:
```yaml
repo_name: HadISST
people: Willi Rath (<wrath@geomar.de>)
http_path_remote: https://git.geomar.de/data/HadISST
git_path_remote: git@git.geomar.de:data/HadISST.git
repo_description: |
  Met Office Hadley Centre observations datasets
  <http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html>.
prefixes: data doc
data:
  - url: http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz
    prefix: data
    file_name: HadISST_sst.nc
    method: !!python/name:data_repo_renderer.CurlSingleFile
  - url: http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz
    prefix: data
    file_name: HadISST_ice.nc
    method: !!python/name:data_repo_renderer.CurlSingleFile
  - url: http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST1_SST_update.nc.gz
    prefix: data
    file_name: HadISST1_SST_update.nc
    method: !!python/name:data_repo_renderer.CurlSingleFile
  - url: http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST1_ICE_update.nc.gz
    prefix: data
    file_name: HadISST1_ICE_update.nc
    method: !!python/name:data_repo_renderer.CurlSingleFile
doc:
  - url: http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html
    file_name: download.html
    prefix: doc
    method: !!python/name:data_repo_renderer.CurlSingleFile
```
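The `method` entries use the `!!python/name:` tag to point at renderer classes
inside the package. As a minimal sketch of how such a file can be inspected by
hand (assuming PyYAML 5.x; the safe loader refuses Python-specific tags, so an
unsafe loader and a trusted file are required):
```python
import yaml

# Sketch only: !!python/name: tags resolve to Python objects.
with open("input_data/HadISST/meta.yaml") as stream:
    config = yaml.load(stream, Loader=yaml.UnsafeLoader)

print(config["repo_name"])          # -> HadISST
print(config["data"][0]["method"])  # -> the data_repo_renderer.CurlSingleFile class
```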
### Running the renderer
With the above configuration file `input_data/HadISST/meta.yaml`, run:
```bash
data_repo_renderer --prefix <path_with_enough_space>/HadISST input_data/HadISST/meta.yaml
```
### Resulting structure
Rendering will result in:
```
<path_with_enough_space>/HadISST
├── init.sh
├── meta.yaml
├── README.md
└── update.sh
```
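If you want to verify the result from Python, a small sketch (with the
placeholder path spelled out) could look like this:
```python
from pathlib import Path

# Placeholder for <path_with_enough_space>/HadISST.
rendered = Path("/path/with/enough/space/HadISST")

for name in ("init.sh", "meta.yaml", "README.md", "update.sh"):
    assert (rendered / name).is_file(), "missing " + name
```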
### Creating the remote
The rendered repository will use <https://git.geomar.de/data/HadISST/>
(or, preferably, the SSH version of this repo) as its remote, so we created the
project on the server and left it empty.
### Initialization of the repo
The `init.sh` script, which needs to be run exactly once (after creating the
empty repository on the server), will be:
```bash
#!/bin/bash
# Rendered with data_repo_renderer 0.1.1.dev40+g797d29f.d20170719
git init || exit 1
git remote add origin git@git.geomar.de:data/HadISST.git || exit 1
git config --add lfs.activitytimeout 30
```
Running it with
```bash
cd <path_with_enough_space>/HadISST
./init.sh
```
will add the remote, perform an initial commit, and push it to the master
branch of <https://git.geomar.de/data/HadISST/>.
### Updating the repo
To download the data and update the repo, an `update.sh` script is created:
```bash
#!/bin/bash
# Rendered with data_repo_renderer 0.1.1.dev40+g797d29f.d20170719
mkdir -p log
exec &> >(tee -a "log/update.log")
date -I'ns'
mkdir -p data doc
git pull
git lfs pull
git lfs track "data/**"
curl -o "data/HadISST_sst.nc" "http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_sst.nc.gz"
curl -o "data/HadISST_ice.nc" "http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz"
curl -o "data/HadISST1_SST_update.nc" "http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST1_SST_update.nc.gz"
curl -o "data/HadISST1_ICE_update.nc" "http://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST1_ICE_update.nc.gz"
curl -o "doc/download.html" "http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html"
target_branch=`git describe 2> /dev/null`_update_`date +%s%N`
git checkout -b ${target_branch}
git add .
git commit -m "Auto-update data"
git push -u origin ${target_branch}
```
Run it with:
```bash
./update.sh
```
This will first get the latest version of the master branch of
<https://git.geomar.de/data/HadISST/>, and then download the latest versions of
the data files, commit them in a new branch, and push the updated files to the
server.
### Structure after the update
```
<path_with_enough_space>/HadISST
├── data
│   ├── HadISST1_ICE_update.nc
│   ├── HadISST1_SST_update.nc
│   ├── HadISST_ice.nc
│   └── HadISST_sst.nc
├── doc
│   └── download.html
├── init.sh
├── log
│   └── update.log
├── meta.yaml
├── README.md
└── update.sh
```
### The resulting README
The resulting `README.md` will be:
> # HadISST
>
> People: Willi Rath (<wrath@geomar.de>)
>
> Met Office Hadley Centre observations datasets
> <http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html>.
>
>
> ## Known problems
>
> - Open and closed issues are here:
> <https://git.geomar.de/data/HadISST/issues?scope=all&state=all>
>
> - Found a problem? Report it here:
> <https://git.geomar.de/data/HadISST/issues/new>
>
>
> ## History
>
> - Download logs are in [log/update.log](log/update.log).
>
> - Also have a look at the
> [activity log](https://git.geomar.de/data/HadISST/activity).
>
>
> ## Original Documentation
>
> See [doc/](doc/) for any of the original documentation.
>
>
> ## Maintenance
>
> Update with
> ```bash
> update.sh
> ```
>
>
> For details on the configuration, look at [update.sh](update.sh) and
> [meta.yaml](meta.yaml).
>
> *Rendered with
> [data_repo_renderer](https://git.geomar.de/data/data_repo_renderer/)
> <version>*
>
"""Setup data_repo_renderer."""
from setuptools import setup
setup(name="data_repo_renderer",
description="Render data repos",
packages=["data_repo_renderer"],
package_dir={"data_repo_renderer": "data_repo_renderer"},
use_scm_version=True,
setup_requires=['setuptools_scm'],
install_requires=["setuptools", "pyyaml"],
entry_points={
"console_scripts":
["data_repo_renderer = data_repo_renderer:cli_run_renderer", ]},
zip_safe=False)
# -*- coding:utf-8 -*-
import data_repo_renderer
from pathlib import Path
import pytest
yaml_example_file = """
repo_name: test_repo
people: Jane Doe (<jane.doe@example.com>), John Doe (<john.doe@example.com>)
http_path_remote: http://www.example.com/git/group/reponame
git_path_remote: http://www.example.com/git/group/reponame.git
repo_description: |
A sample description.
With multiple lines.
## And some markdown.
asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf
asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf asdf
asdf asdf asdf asdf asdf asdf
acknowledgements: |
The TEST_REPO data was provided by EXAMPLE.COM.
citations:
- text: Doe, J., J. Doe, A new TEST dataset, J. Alchemy,
doi:12.345/987654asdf11, 2017
doi: 12.345/987654asdf11
- text: Doe, J., J. Doe, An old TEST dataset, J. Alchemy,
doi:12.345/987654asdf10, 2011
doi: 12.345/987654asdf10
prefixes: data doc
pre_processing:
- echo "Not doing anything for pre-processing."
credential_files:
- "~/.data_repo_creds/STH.cred"
- "~/.data_repo_creds/SOTH.cred"
data:
- url: https://www.example.com/files/nao_station_monthly.txt
prefix: data
file_name: nao_station_monthly.txt
method: !!python/name:data_repo_renderer.CurlSingleFile
- url: https://www.example.com/files/nao_station_djfm.txt
prefix: data
file_name: nao_station_djfm.txt
method: !!python/name:data_repo_renderer.CurlSingleFile
- url: https://www.example.com/files/nao_station_annual.txt
prefix: data
file_name: nao_station_annual.txt
method: !!python/name:data_repo_renderer.CurlSingleFile
- url: https://www.example.com/
cut_dirs: 2
prefix: data
accept_files: "*.nc,*.cdf,*.nc.gz"
method: !!python/name:data_repo_renderer.WgetRecursive
- url: https://www.example.com/restricted/
cut_dirs: 2
prefix: data
accept_files: "*.nc"
username_var: "STH_USER"
password_var: "STH_PWD"
method: !!python/name:data_repo_renderer.WgetRecursiveCred
doc:
- url: https://www.example.com/doc_01.html
file_name: doc_01.html
prefix: doc
method: !!python/name:data_repo_renderer.CurlSingleFile
- url: https://www.example.com/doc_02.html
file_name: doc_02.html
prefix: doc
method: !!python/name:data_repo_renderer.CurlSingleFile
post_processing:
- gunzip data/*.gz
- util/postprocessing_01.sh
- util/postprocessing_02.sh
"""
script_example_file = """#!/bin/bash
# Rendered with data_repo_renderer {version}
mkdir -p log
exec &> >(tee -a "log/update.log")
date -I'ns'
mkdir -p data doc
source "~/.data_repo_creds/STH.cred"
source "~/.data_repo_creds/SOTH.cred"
echo "Not doing anything for pre-processing."
git remote set-head origin -a
default_branch=`git symbolic-ref \\
--short refs/remotes/origin/HEAD | cut -d/ -f2-`
git checkout ${{default_branch}}
git pull
git lfs pull
git lfs track "data/**"
curl -o "data/nao_station_monthly.txt" \
"https://www.example.com/files/nao_station_monthly.txt"
curl -o "data/nao_station_djfm.txt" \
"https://www.example.com/files/nao_station_djfm.txt"
curl -o "data/nao_station_annual.txt" \
"https://www.example.com/files/nao_station_annual.txt"
wget -nv -r -c -np -nH --cut-dirs=2 --accept "*.nc,*.cdf,*.nc.gz" -P "data" \
"https://www.example.com/"
wget -nv -r -c -np -nH --cut-dirs=2 --user="$STH_USER" --password="$STH_PWD" \
--accept "*.nc" -P "data" "https://www.example.com/restricted/"
curl -o "doc/doc_01.html" \
"https://www.example.com/doc_01.html"
curl -o "doc/doc_02.html" \
"https://www.example.com/doc_02.html"
gunzip data/*.gz
util/postprocessing_01.sh
util/postprocessing_02.sh
target_branch=`git describe 2> /dev/null`_update_`date +%s%N`
git checkout -b ${{target_branch}}
git add .
git commit -m "Auto-update data"
git push -u origin ${{target_branch}}
""".format(version=data_repo_renderer.__version__)

@pytest.fixture
def tmp_path(tmpdir):
    return Path(str(tmpdir))


@pytest.fixture
def yaml_example(tmp_path):
    yaml_file = tmp_path / "example.yaml"
    with yaml_file.open(mode="w") as f:
        f.write(yaml_example_file)
    return yaml_file


@pytest.fixture
def util(tmp_path):
    util_path = tmp_path / "util"
    util_path.mkdir(parents=True, exist_ok=True)
    (util_path / "postprocessing_01.sh").touch()
    (util_path / "postprocessing_02.sh").touch()
    return util_path


def test_full_yaml_example_01(util, yaml_example, tmp_path):
    data_repo_renderer.cli_run_renderer(["--prefix",
                                         str(tmp_path / "rendered"),
                                         "--util", str(util),
                                         str(yaml_example)])
    with (tmp_path / "rendered" / "update.sh").open() as stream:
        written_script = stream.read()
    assert written_script == script_example_file
# -*- coding:utf-8 -*-
import data_repo_renderer
import textwrap
import pytest

def test_base_class_generates_empty_string():
    assert data_repo_renderer.Renderer(yaml_dict={}).__str__ == ""


def test_curl_single_file_rendering():
    yaml_dict = {"prefix": "pref", "file_name": "fn", "url": "http://url"}
    target_string = "curl -o \"pref/fn\" \"http://url\"\n"
    renderer = data_repo_renderer.CurlSingleFile(yaml_dict=yaml_dict)
    assert renderer.__str__ == target_string


def test_wget_recursive_rendering_with_excluded_dirs_and_accept_files():
    yaml_dict = {"prefix": "pref", "cut_dirs": 77, "url": "http://url",
                 "accept_files": "*.*", "exclude_directories": "/asdf,/zxcv/a"}
    target_string = ("wget -nv -r -c -np -nH --cut-dirs=77 "
                     "--accept \"*.*\" -X \"/asdf,/zxcv/a\" -P \"pref\" "
                     "\"http://url\"\n")
    renderer = data_repo_renderer.WgetRecursive(yaml_dict=yaml_dict)
    assert renderer.__str__ == target_string


def test_wget_recursive_rendering_without_excluded_dirs_and_accept_files():
    yaml_dict = {"prefix": "pref", "cut_dirs": 77, "url": "http://url"}
    target_string = ("wget -nv -r -c -np -nH --cut-dirs=77 "
                     "-P \"pref\" \"http://url\"\n")
    renderer = data_repo_renderer.WgetRecursive(yaml_dict=yaml_dict)
    assert renderer.__str__ == target_string


def test_loading_credentials():
    yaml_dict = {"credential_files": ["~/.data_repo_creds/SOMETHING.cred",
                                      "~/.data_repo_creds/SOMEOTHER.cred"]}
    target_string = textwrap.dedent("""
        source "~/.data_repo_creds/SOMETHING.cred"
        source "~/.data_repo_creds/SOMEOTHER.cred"
        """)
    renderer = data_repo_renderer.LoadCredentials(yaml_dict=yaml_dict)
    assert renderer.__str__ == target_string


def test_wget_recursive_cred_rendering():
    yaml_dict = {"prefix": "pref", "cut_dirs": 77, "url": "http://url",
                 "accept_files": "*.*", "username_var": "ASDF_USER",
                 "password_var": "ASDF_PWD"}
    target_string = ("wget -nv -r -c -np -nH --cut-dirs=77 "
                     "--user=\"$ASDF_USER\" --password=\"$ASDF_PWD\" "
                     "--accept \"*.*\" -P \"pref\" \"http://url\"\n")
    renderer = data_repo_renderer.WgetRecursiveCred(yaml_dict=yaml_dict)
    print(renderer.yaml_dict, renderer.template)
    assert renderer.__str__ == target_string