Commit 162e4193 authored by Claas Faber

changed boknis eck demo notebook

parent 0272f6c9
+79 −0
%% Cell type:markdown id: tags:

# Getting and processing a PANGAEA dataset
[PANGAEA](https://pangaea.de/) is a 'world data center', a long-term archive for environmental data. It is a joint venture of AWI and MARUM. The GEOMAR data management team collaborates closely with PANGAEA, and it is the recommended main data archive for 'our' projects.

In this example, we access the **Boknis Eck** dataset, a long-term monitoring campaign that has recorded various environmental variables at a station in the Baltic Sea close to Kiel since 1957.

![alt text](https://www.bokniseck.de/documents/895129/940374/map_BoknisEck.jpg/2ff8ed1b-1956-487c-8d1d-59b22b0633d6?t=1377032189033 "Location Boknis Eck")

%% Cell type:markdown id: tags:

## Import Python packages
A huge number of very useful software libraries, called *packages*, are available for Python. Whatever you want to do, chances are high that someone else has had a similar problem and solved it already. Before you start programming something yourself, it is highly recommended to search for a package that helps with your task.

Installing packages is usually easy if you are using conda: `conda install package_XYZ` will do the trick most of the time. This searches for packages in the default conda channel. For more variety, use the community channel *conda-forge*:
```bash
conda install -c conda-forge package_XYZ
```
For larger projects, or projects that you want to share with others, you should create a conda *environment file* like the one that comes with this project:
```yaml
# file dm-tools_py.yml
name: dm-tools_py3
channels:
  - conda-forge
dependencies:
  - python=3.6
  - anaconda-client
  - basemap
  - cmocean
  - ipython
  - jupyter
  - matplotlib
  - netCDF4
  - numpy
  - pandas
  - psycopg2
  - seaborn
  - xarray
```
This allows you to recreate the environment and install all necessary packages at once:
```bash
conda env create --file dm-tools_py.yml
```
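
To check from within Python whether the packages a notebook needs are actually importable in the active environment, something like the following sketch could be used. The list of module names is an illustrative assumption, and note that the import name of a package can differ from its conda package name.

```python
import importlib.util

# Import names to check. This list is an illustrative assumption;
# adapt it to the packages your own notebook uses.
REQUIRED_MODULES = ["pandas", "requests", "matplotlib", "seaborn"]


def find_missing(modules):
    """Return the names from `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If the check reports missing packages, the most likely cause is that the notebook is not running in the environment created from the environment file.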

%% Cell type:code id: tags:

``` python
# import python packages
import pandas as pd  # pandas for dealing with tabular data
import requests  # requests for getting data over the web
import matplotlib.pyplot as plt  # matplotlib and
import seaborn as sns            # seaborn for data visualisation
import pprint  # pretty_print for nicely formatted text output
# Jupyter has so-called *magic* commands. The command below sets up the notebook for plotting
%matplotlib inline
```

%% Cell type:markdown id: tags:

## Getting the data
The package [`requests`](http://docs.python-requests.org/en/master/) allows you to easily query services that provide data over the web. In this example, I searched PANGAEA for the [Boknis Eck dataset](https://doi.pangaea.de/10.1594/PANGAEA.855693) and copied the link behind
> [Download dataset as tab-delimited text](https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile)

at the bottom of the dataset's page.
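
Downloads over the web can fail or hang, so it may be worth adding a timeout and an explicit error check. The following is a minimal sketch under that assumption; `build_download_url` and `fetch_text` are hypothetical helper names, not part of `requests` or of any PANGAEA client, and the URL pattern is simply the one from the link above.

```python
from urllib.parse import urlencode

# Base of PANGAEA DOI links, taken from the dataset link above.
PANGAEA_DOI_BASE = "https://doi.pangaea.de/"


def build_download_url(doi, fmt="textfile"):
    """Build the download URL for a PANGAEA DOI (hypothetical helper)."""
    return PANGAEA_DOI_BASE + doi + "?" + urlencode({"format": fmt})


def fetch_text(url, timeout=30):
    """Download `url` and return its text, raising an error on HTTP failures."""
    import requests  # imported here so the URL helper works without requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text


url = build_download_url("10.1594/PANGAEA.855693")
print(url)
# data = fetch_text(url)  # uncomment to actually download the dataset
```

Without `raise_for_status()`, a failed request would go unnoticed and an error page would end up being saved as if it were data.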

%% Cell type:code id: tags:

``` python
import os

# data url found by searching on the PANGAEA website (same link as above)
data_url = 'https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile'

# use requests to get the data
r = requests.get(data_url)
r.raise_for_status()  # stop here if the download failed
data = r.text

# make sure the output directory exists before writing to it
os.makedirs('data', exist_ok=True)

# write data to disk. Note: there is a slight difference between python2 and python3
with open('data/boknis.txt', 'wb') as outfile:
    #outfile.write(data.encode(r.encoding))  # use this syntax with python 2
    outfile.write(data.encode('utf-8'))  # use this syntax with python3
```

%% Cell type:code id: tags:

``` python
!mkdir data
```

%% Output

    mkdir: data: File exists

%% Cell type:markdown id: tags:

## Evaluating the data
Now that the data is saved locally, you can open it in a text editor and have a look at it. Here, we will just print the first 4000 characters:

%% Cell type:code id: tags:

``` python
# use pprint instead of print to display line endings correctly here
pprint.pprint(data[:4000])
```

%% Output

    ('/* DATA DESCRIPTION:\n'
     'Citation:\tBange, Hermann W; Malien, Frank (2015): Hydrochemistry from time '
     'series station Boknis Eck from 1957 to 2014. doi:10.1594/PANGAEA.855693\n'
     'Related to:\tLennartz, Sinikka; Lehmann, Andreas; Herrford, Josefine; '
     'Malien, Frank; Hansen, Hans Peter; Biester, Harald; Bange, Hermann W (2014): '
     'Long-term trends at the Boknis Eck time series station (Baltic Sea), '
     '1957-2013: does climate change counteract the decline in eutrophication? '
     'Biogeosciences, 11(22), 6323-6339, doi:10.5194/bg-11-6323-2014\n'
     'Source data set:\tBange, Hermann W; Malien, Frank (2014): Boknis Eck '
     'Timeseries Database. http://www.bokniseck.de/\n'
     'Coverage:\tLATITUDE: 54.529500 * LONGITUDE: 10.039330\n'
     '\tDATE/TIME START: 1957-04-30T00:00:00 * DATE/TIME END: 2014-12-16T11:06:33\n'
     '\tMINIMUM DEPTH, water: 1 m * MAXIMUM DEPTH, water: 35 m\n'
     'Event(s):\tBoknis_Eck_1957 * LATITUDE: 54.529500 * LONGITUDE: 10.039330 * '
     'DATE/TIME: 1957-04-30T00:00:00 * DEVICE: CTD/Rosette (CTD-RO)\n'
     'Comment:\tFlags according to WOCE standard\n'
     'Parameter(s):\tDATE/TIME (Date/Time) * GEOCODE * PI: Bange, Hermann W '
     '(hbange@geomar.de, http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tLatitude of event (Latitude)\n'
     '\tLongitude of event (Longitude)\n'
     '\tDEPTH, water [m] (Depth water) * GEOCODE * PI: Bange, Hermann W '
     '(hbange@geomar.de, http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tCast number (Cast) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tSample code/label (Sample label) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tChlorophyll a [µg/l] (Chl a) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tNitrate [µmol/l] (NO3) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tFlag (Flag) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/) * COMMENT: NO3\n'
     '\tNitrite [µmol/l] ([NO2]-) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tFlag (Flag) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/) * COMMENT: NO2\n'
     '\tOxygen [µmol/kg] (OXYGEN) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tFlag (Flag) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/) * COMMENT: Oxygen\n'
     '\tPhosphate [µmol/l] (PO4) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tFlag (Flag) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/) * COMMENT: PO4\n'
     '\tSalinity (Sal) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tSilicon dioxide [µmol/l] (SiO2) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     '\tFlag (Flag) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/) * COMMENT: SiO2\n'
     '\tTemperature, water [°C] (Temp) * PI: Bange, Hermann W (hbange@geomar.de, '
     'http://www.geomar.de/mitarbeiter/fb2/ch/hbange/)\n'
     'License:\tCreative Commons Attribution 3.0 Unported (CC-BY)\n'
     'Size:\t44574 data points\n'
     '*/\n'
     'Date/Time\tLatitude\tLongitude\tDepth water [m]\tCast\tSample label\tChl a '
     '[µg/l]\tNO3 [µmol/l]\tFlag (NO3)\t[NO2]- [µmol/l]\tFlag (NO2)\tOXYGEN '
     '[µmol/kg]\tFlag (Oxygen)\tPO4 [µmol/l]\tFlag (PO4)\tSal\tSiO2 [µmol/l]\tFlag '
     '(SiO2)\tTemp [°C]\n'
     '1957-04-30T00:00:00\t54.5295\t10.0393\t1\t1\t1\t\t\t\t\t\t321.9\t\t0.000\t\t'
     '15.30\t\t\t7.70\n'
     '1957-04-30T00:00:00\t54.5295\t10.0393\t5\t1\t1\t\t\t\t\t\t325.0\t\t0.010\t\t'
     '15.30\t\t\t5.40\n'
     '1957-04-30T00:00:00\t54.5295\t10.0393\t10\t1\t1\t\t\t\t\t\t325.0\t\t0.020\t\t'
     '15.70\t\t\t6.10\n'
     '1957-04-30T00:00:00\t54.5295\t10.0393\t15\t1\t1\t\t\t\t\t\t318.8\t\t0.030\t\t'
     '16.40\t\t\t4.50\n'
     '1957-04-30T00:00:00\t54.5295\t10.0393\t20\t1\t1\t\t\t\t\t\t300.0\t\t0.060\t\t'
     '17.00\t\t\t4.30\n'
     '1957-04-30T00:00:00\t54.5295\t10.0393\t26\t1\t1\t\t\t\t\t\t281.3\t\t0.240\t\t'
     '17.40\t\t\t4.30\n'
     '1957-05-14T00:00:00\t54.5295\t10.0393\t1\t1\t2\t\t\t\t\t\t\t\t0.020\t\t'
     '15.40\t\t\t8.70\n'
     '1957-05-14T00:00:00\t54.5295\t10.0393\t5\t1\t2\t\t\t\t\t\t\t\t0.070\t\t'
     '15.40\t\t\t8.70\n'
     '19')

%% Cell type:markdown id: tags:

## Digesting the data
As we can see, the file has a header section marked by
```
/*
HEADER
*/
```
We need to ignore this header when reading the data with pandas later. To find out how many rows to skip, we could count them. Or we could write a small script to do it for us:

%% Cell type:code id: tags:

``` python
data_start = 0
lines = data.split('\n')
# go through all lines in the data and search for the end of the header section
for i, line in enumerate(lines):
    if line.startswith('*/'):  # */ marks the end of the header
        data_start = i  # remember line number
        break  # no need to look further
data_start
```

%% Output

    30

%% Cell type:markdown id: tags:
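
The header search above can also be wrapped in a small function, which makes it reusable for other files with the same layout and lets it fail loudly if no end marker is found. This is only a sketch: `split_header` is a hypothetical helper, and the sample text is made up to mimic the PANGAEA layout.

```python
def split_header(text, end_marker="*/"):
    """Split `text` into (header_lines, data_lines).

    The header ends with the first line starting with `end_marker`.
    Raises ValueError when no end marker is found, so a changed file
    format is noticed instead of silently skipping the wrong rows.
    """
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if line.startswith(end_marker):
            return lines[: i + 1], lines[i + 1 :]
    raise ValueError("no header end marker found")


# Tiny made-up sample in the same layout as the PANGAEA file.
sample = "/* DATA DESCRIPTION:\nSize:\t2 data points\n*/\nDate/Time\tTemp\n1957-04-30\t7.7"
header, body = split_header(sample)
print(len(header))  # number of rows to skip before the column names
print(body[0])      # the line with the column names
```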

[Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) is a package for handling tabular data. It is definitely among the five most useful Python libraries for data scientists. Although there is a learning curve, once you have understood the basics it is far more useful and practical than Excel.

Pandas can read and write a number of formats. Besides standard formats like `.xls` and `.csv`, you can for example connect directly to an SQL database. Try typing `pd.read<tab>` in a new code cell to see the formats it can read from.
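
`pd.read_csv` also accepts file-like objects, so a downloaded text can be parsed without saving it to disk first. The sketch below illustrates this with a made-up miniature of the PANGAEA layout; the sample values and the name `df_sample` are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up miniature of the PANGAEA layout: header block, then a tab-separated table.
sample = (
    "/* DATA DESCRIPTION:\n"
    "Size:\t2 data points\n"
    "*/\n"
    "Date/Time\tDepth water [m]\tTemp [°C]\n"
    "1957-04-30T00:00:00\t1\t7.70\n"
    "1957-04-30T00:00:00\t5\t5.40\n"
)

# The header in this sample is 3 lines long, so skip 3 rows before the column names.
df_sample = pd.read_csv(
    io.StringIO(sample),  # file-like object wrapping the string
    sep="\t",
    skiprows=3,
    index_col=0,
    parse_dates=True,  # parse the index column as dates
)
print(df_sample.dtypes)
print(df_sample.index)
```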

%% Cell type:code id: tags:

``` python
###
# pd.read_csv needs some parameters in order to understand our data:
# sep='\t': tells pandas that the columns are separated by tabs
# parse_dates=True: by setting this, pandas will try to parse the index column as dates
# index_col=0: tells pandas that the first column of the dataset (date) should be the index
# skiprows=data_start+1: ignore the header lines. We determined the value of data_start before
df = pd.read_csv('data/boknis.txt', sep='\t', parse_dates=True, index_col=0, skiprows=data_start+1)
# display the first few rows of the dataset
df.head()
```

%% Output

                Latitude  Longitude  Depth water [m]  Cast  Sample label  \
    Date/Time
    1957-04-30   54.5295    10.0393                1   1.0             1
    1957-04-30   54.5295    10.0393                5   1.0             1
    1957-04-30   54.5295    10.0393               10   1.0             1
    1957-04-30   54.5295    10.0393               15   1.0             1
    1957-04-30   54.5295    10.0393               20   1.0             1
    
                Chl a [µg/l]  NO3 [µmol/l]  Flag (NO3)  [NO2]- [µmol/l]  \
    Date/Time
    1957-04-30           NaN           NaN         NaN              NaN
    1957-04-30           NaN           NaN         NaN              NaN
    1957-04-30           NaN           NaN         NaN              NaN
    1957-04-30           NaN           NaN         NaN              NaN
    1957-04-30           NaN           NaN         NaN              NaN
    
                Flag (NO2)  OXYGEN [µmol/kg]  Flag (Oxygen)  PO4 [µmol/l]  \
    Date/Time
    1957-04-30         NaN             321.9            NaN          0.00
    1957-04-30         NaN             325.0            NaN          0.01
    1957-04-30         NaN             325.0            NaN          0.02
    1957-04-30         NaN             318.8            NaN          0.03
    1957-04-30         NaN             300.0            NaN          0.06
    
                Flag (PO4)   Sal  SiO2 [µmol/l]  Flag (SiO2)  Temp [°C]
    Date/Time
    1957-04-30         NaN  15.3            NaN          NaN        7.7
    1957-04-30         NaN  15.3            NaN          NaN        5.4
    1957-04-30         NaN  15.7            NaN          NaN        6.1
    1957-04-30         NaN  16.4            NaN          NaN        4.5
    1957-04-30         NaN  17.0            NaN          NaN        4.3

%% Cell type:markdown id: tags:

## Exploring and processing the data
We got a first look at the data by printing a few rows of the table. From that we can see that we have measurements of several hydrochemical parameters and that samples were taken at different depths on each sampling day.

To get a better impression of the data, it is a good idea to make some plots. Here, we use [matplotlib](http://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) for visualising the data. Matplotlib is a powerful plotting library. Seaborn is built on top of matplotlib and provides a quick way to produce statistical data visualisations.

So let's get started and make a boxplot of temperature at the measured depths:

%% Cell type:code id: tags:

``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
```

%% Output

    <matplotlib.axes._subplots.AxesSubplot at 0x11096b588>


%% Cell type:markdown id: tags:

Oops. The plot looks messy. There are a lot of measuring depths, and the distributions are skewed for many of them.

Let's have a look at the number of measurements for each depth. We use the pandas `groupby` method to rearrange the data:

%% Cell type:code id: tags:

``` python
# group data into bins by the values in column 'Depth water [m]'. Count the number of measurements for each depth.
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Output

    Depth water [m]
    1     839
    2      99
    3      94
    4       4
    5     831
    7       2
    8       3
    9      10
    10    832
    11      3
    12      5
    13      8
    14      8
    15    870
    17      8
    18      3
    19      6
    20    840
    21      4
    22      3
    23      3
    24     13
    25    561
    26    276
    27     52
    28     15
    29      1
    35      1
    Name: Depth water [m], dtype: int64

%% Cell type:markdown id: tags:

As we can see, there are a lot of measurements for 1, 5, 10, 15, 20, 25 and 26 meters. It is probably safe to ignore all other depths.
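
Instead of typing the list of depths by hand, the frequent depths could also be derived from the counts with a threshold. The sketch below uses a made-up depth column and an assumed threshold; applied to the counts shown above, a threshold of 100 would select exactly the seven depths just listed.

```python
import pandas as pd

# Made-up depth column standing in for df['Depth water [m]'].
depths = pd.Series(
    [1] * 5 + [2] * 1 + [5] * 4 + [7] * 1 + [10] * 6,
    name="Depth water [m]",
)

min_count = 3  # assumed threshold; pick one that suits the real data
counts = depths.value_counts()
# keep only depths with at least `min_count` measurements, in ascending order
frequent_depths = sorted(counts[counts >= min_count].index.tolist())
print(frequent_depths)  # [1, 5, 10]

# select the rows that belong to one of the frequent depths
filtered = depths[depths.isin(frequent_depths)]
print(len(filtered))  # 15
```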

%% Cell type:code id: tags:

``` python
# pandas ways of filtering and selecting data can be confusing at first...
df = df[df['Depth water [m]'].isin([1, 5, 10, 15, 20, 25, 26])]
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Output

    Depth water [m]
    1     839
    5     831
    10    832
    15    870
    20    840
    25    561
    26    276
    Name: Depth water [m], dtype: int64

%% Cell type:markdown id: tags:

OK, we've got that sorted. But something strange is going on for 25 and 26 meters. Let's plot some data for these depths.

%% Cell type:code id: tags:

``` python
# plot temp @25m depth
df[df['Depth water [m]']==25]['Temp [°C]'].plot(style='k>', label='Temp at depth 25m')
# plot temp @26m depth
df[df['Depth water [m]']==26]['Temp [°C]'].plot(style='g<', label='Temp at depth 26m')
# add a legend to the plot
plt.legend()
```

%% Output

    <matplotlib.legend.Legend at 0x111dfc128>


%% Cell type:markdown id: tags:

Hm, looks like they measured mainly at 26 m before the 1980s and at 25 m after that. Strange, we should look into this later. For now, let's just sort all 26 m measurements into the 25 m bin.

%% Cell type:code id: tags:

``` python
df.loc[df['Depth water [m]'] == 26, 'Depth water [m]'] = 25
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Output

    Depth water [m]
    1     839
    5     831
    10    832
    15    870
    20    840
    25    837
    Name: Depth water [m], dtype: int64

%% Cell type:markdown id: tags:

Great, this looks more reasonable. The number of measurements for each depth is pretty comparable now (note to self: might want to check the temporal distributions later to avoid bias).
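
One possible way to check the temporal distribution is to count the samples per decade and depth. The following sketch uses a small made-up table with a date index in place of the cleaned `df`; the `Decade` column is an assumption introduced for illustration.

```python
import pandas as pd

# Made-up sample with a date index, standing in for the cleaned df.
sample = pd.DataFrame(
    {"Depth water [m]": [1, 25, 1, 25, 1, 1]},
    index=pd.to_datetime(
        ["1958-05-01", "1958-05-01", "1975-06-01", "1975-06-01", "1992-07-01", "1999-08-01"]
    ),
)

# Add the decade of each sample as a column, e.g. 1958 -> 1950.
with_decade = sample.assign(Decade=(sample.index.year // 10) * 10)

# Rows: decade, columns: depth, values: number of samples (0 where none exist).
per_decade = (
    with_decade.groupby(["Decade", "Depth water [m]"]).size().unstack(fill_value=0)
)
print(per_decade)
```

Decades in which one depth has far fewer samples than the others would show up as small numbers or zeros in this table.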

Let's make some boxplots with the cleaned data.

%% Cell type:code id: tags:

``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
# issuing plt.show() tells matplotlib to make a new plot for the next data series
# instead of adding the series to the existing plot.
plt.show()
sns.boxplot(x='Depth water [m]', y='PO4 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='NO3 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='OXYGEN [µmol/kg]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='Chl a [µg/l]', data=df)


```

%% Output





    <matplotlib.axes._subplots.AxesSubplot at 0x1147180b8>


%% Cell type:markdown id: tags:

Nice. Interpret the plots or make some more. Have a look at the [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) to see some examples, or check out how pandas makes working with [time series](http://earthpy.org/pandas-basics.html) easier.
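
As a small taste of the time series features, a date index makes it easy to average all values from the same calendar month across years. The sketch below uses a made-up temperature series; with the real data, one would apply the same idea to a single depth of `df`.

```python
import pandas as pd

# Made-up temperature series with a date index, standing in for one depth of df.
temps = pd.Series(
    [2.0, 4.0, 16.0, 18.0],
    index=pd.to_datetime(["2000-01-15", "2001-01-15", "2000-07-15", "2001-07-15"]),
    name="Temp [°C]",
)

# Average all Januaries together, all Julys together, and so on.
climatology = temps.groupby(temps.index.month).mean()
print(climatology)
```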

%% Cell type:code id: tags:

``` python
sns.jointplot(x='NO3 [µmol/l]', y='OXYGEN [µmol/kg]', data=df[df['Depth water [m]']==25], kind="kde");
```

%% Output


%% Cell type:code id: tags:

``` python
# slice data to contain only one sampling depth
data_slice = df[df['Depth water [m]']==25]
# select a time period of interest (.loc slices the date index by label)
data_slice = data_slice.loc['1990':'2010']
# plot full resolution data
data_slice['NO3 [µmol/l]'].plot()
plt.show()
# plot monthly mean values ('M' = month end; newer pandas versions spell this 'ME')
data_slice['NO3 [µmol/l]'].resample('M').mean().plot()
plt.show()
# plot annual mean values ('A' = year end; newer pandas versions spell this 'YE')
data_slice['NO3 [µmol/l]'].resample('A').mean().plot()
plt.show()
```

%% Output




%% Cell type:code id: tags:

``` python
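import sqlite3
# open (or create) a SQLite database file
conn = sqlite3.connect('data/boknis.db')
# write the DataFrame to a table; replace the table if it already exists
df.to_sql(con=conn, name='boknis', if_exists='replace')
# read the table back into a new DataFrame
df_db = pd.read_sql_query(con=conn, sql='SELECT * FROM boknis')
# use the date column as index again
df_db.index = df_db['Date/Time']
df_db.head()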
```

%% Output

    /Users/cfaber/anaconda/envs/dm-tools_py3/lib/python3.6/site-packages/pandas/core/generic.py:1201: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
      chunksize=chunksize, dtype=dtype)

                                   Date/Time  Latitude  Longitude  Depth water  \
    Date/Time
    1957-04-30 00:00:00  1957-04-30 00:00:00   54.5295    10.0393            1
    1957-04-30 00:00:00  1957-04-30 00:00:00   54.5295    10.0393            5
    1957-04-30 00:00:00  1957-04-30 00:00:00   54.5295    10.0393           10
    1957-04-30 00:00:00  1957-04-30 00:00:00   54.5295    10.0393           15
    1957-04-30 00:00:00  1957-04-30 00:00:00   54.5295    10.0393           20
    
                         Cast  Sample label  Chl a  NO3  Flag (NO3)      \
    Date/Time
    1957-04-30 00:00:00   1.0             1    NaN  NaN         NaN NaN
    1957-04-30 00:00:00   1.0             1    NaN  NaN         NaN NaN
    1957-04-30 00:00:00   1.0             1    NaN  NaN         NaN NaN
    1957-04-30 00:00:00   1.0             1    NaN  NaN         NaN NaN
    1957-04-30 00:00:00   1.0             1    NaN  NaN         NaN NaN
    
                         Flag (NO2)  OXYGEN  Flag (Oxygen)   PO4  Flag (PO4)  \
    Date/Time
    1957-04-30 00:00:00         NaN   321.9            NaN  0.00         NaN
    1957-04-30 00:00:00         NaN   325.0            NaN  0.01         NaN
    1957-04-30 00:00:00         NaN   325.0            NaN  0.02         NaN
    1957-04-30 00:00:00         NaN   318.8            NaN  0.03         NaN
    1957-04-30 00:00:00         NaN   300.0            NaN  0.06         NaN
    
                          Sal  SiO2  Flag (SiO2)  Temp
    Date/Time
    1957-04-30 00:00:00  15.3   NaN          NaN   7.7
    1957-04-30 00:00:00  15.3   NaN          NaN   5.4
    1957-04-30 00:00:00  15.7   NaN          NaN   6.1
    1957-04-30 00:00:00  16.4   NaN          NaN   4.5
    1957-04-30 00:00:00  17.0   NaN          NaN   4.3

%% Cell type:markdown id: tags:

## Versioning of changes
If you made changes to this notebook that you want to keep track of, you can call git directly from within this notebook:

%% Cell type:code id: tags:

``` python
!git status
```

%% Output

    On branch master
    Your branch is ahead of 'origin/master' by 6 commits.
      (use "git push" to publish your local commits)
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)
    
    	modified:   boknis.ipynb
    	modified:   jupyter-notebook-for-OPeNDAP-data-access.ipynb
    	modified:   sql_db.ipynb
    
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
    	.ipynb_checkpoints/
    
    no changes added to commit (use "git add" and/or "git commit -a")

%% Cell type:code id: tags:

``` python
!git add boknis.ipynb
```

%% Cell type:code id: tags:

``` python
!git commit -m'data processing'
```