Commit e2d93b6b authored by Claas Testuser's avatar Claas Testuser

added Jupyter NBs

parent cbf6bc84

boknis.ipynb

+270 −0
%% Cell type:markdown id: tags:

# Getting and processing PANGAEA dataset
[PANGAEA](https://pangaea.de/) is a 'world data center': a long-term archive for environmental data run jointly by AWI and MARUM. The GEOMAR data management team collaborates closely with PANGAEA, and it is the recommended main data archive for 'our' projects.

In this example, we are accessing the **Boknis Eck** dataset, a long-term monitoring campaign running since 1954 that records various environmental variables at a station in the Baltic Sea close to Kiel.

![alt text](https://www.bokniseck.de/documents/895129/940374/map_BoknisEck.jpg/2ff8ed1b-1956-487c-8d1d-59b22b0633d6?t=1377032189033 "Location Boknis Eck")

%% Cell type:markdown id: tags:

## Import python packages
There is a huge number of very useful software libraries, called *packages*, available for Python. Whatever you want to do, chances are high that someone else has had a similar problem and solved it already. Searching for a package before starting to program something yourself is highly recommended.

Installing packages is easy if you are using conda: `conda install package_XYZ` will do the trick most of the time. This searches the official conda channel. For more variety, use the community channel *conda-forge*:
```bash
conda install -c conda-forge package_XYZ
```
For larger projects, or projects that you want to share with others, you should create a conda *environment file* like the one that comes with this project:
```yaml
# file dm-tools_py.yml
name: dm-tools_py3
channels:
  - conda-forge
dependencies:
  - python=3.6
  - anaconda-client
  - basemap
  - cmocean
  - ipython
  - jupyter
  - matplotlib
  - netCDF4
  - numpy
  - pandas
  - psycopg2
  - seaborn
  - xarray
```
This allows you to recreate the environment and install all necessary packages at once:
```bash
conda env create --file dm-tools_py.yml
```

%% Cell type:code id: tags:

``` python
# import python packages
import pandas as pd  # pandas for dealing with tabular data
import requests  # requests for getting data over the web
import matplotlib.pyplot as plt  # matplotlib and
import seaborn as sns            # seaborn for data visualisation
import pprint  # pretty_print for nicely formatted text output
# Jupyter has so-called *magic* commands. The command above sets up the notebook for plotting
%matplotlib inline
```

%% Cell type:markdown id: tags:

## Getting the data
The package [`requests`](http://docs.python-requests.org/en/master/) allows you to easily query services that provide data over the web. In this example, I searched PANGAEA for the [Boknis Eck dataset](https://doi.pangaea.de/10.1594/PANGAEA.855693) and copied the link behind
> [Download dataset as tab-delimited text](https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile)

at the bottom of the dataset's page.

%% Cell type:code id: tags:

``` python
# data url found by searching on PANGAEA website
data_url = 'https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile'

# use requests to get the data
r = requests.get(data_url)
data = r.text

# write data to disk. Note: the syntax differs slightly between Python 2 and Python 3
with open('data/boknis.txt', 'w+') as outfile:
    # outfile.write(data.encode(r.encoding))  # use this syntax with Python 2
    outfile.write(data)  # use this syntax with Python 3
```
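
%% Cell type:markdown id: tags:

Note that `requests` will happily return an error page as text if the download fails. A minimal sketch of a stricter download helper (the helper name `fetch_text` is ours, not part of the original notebook):

```python
import requests

def fetch_text(url):
    # hypothetical helper: fail loudly on HTTP errors instead of
    # silently writing an error page to disk
    r = requests.get(url)
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return r.text
```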

%% Cell type:markdown id: tags:

## Evaluating the data
Now that the data is saved locally, you can open it in a text editor and have a look at it. We will just print the first 4000 characters here.

%% Cell type:code id: tags:

``` python
# use pprint instead of print to display line endings correctly here
pprint.pprint(data[:4000])
```

%% Cell type:markdown id: tags:

## Digesting the data
As we can see, the data has a header section marked by
```
/*
HEADER
*/
```
We need to skip this header when digesting the data with pandas later. To find out how many rows to skip, we could count them by hand. Or we could write a small script to do it for us:

%% Cell type:code id: tags:

``` python
data_start = 0
lines = data.split('\n')
# go through all lines in the data and search for the end of the header section
for i, line in enumerate(lines):
    if line.startswith('*/'):  # */ marks the end of the header
        data_start = i  # remember line number
        break  # no need to look further
data_start
```

%% Cell type:markdown id: tags:

[Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) is a package for handling tabular data. It is definitely among the top 5 most useful Python libraries for data scientists. Although there is a learning curve, once you have understood the basics it is far more useful and practical than Excel.

Pandas can read and write a number of formats. Besides standard formats like `.xls` and `.csv`, you can for example connect directly to an SQL database. Try typing `pd.read<tab>` in a new code cell to see the available formats it can read from.
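
As a quick sanity check of what tab completion would show, you can also list the reader functions programmatically:

```python
import pandas as pd

# collect all reader functions pandas offers (same list as typing pd.read<tab>)
readers = [name for name in dir(pd) if name.startswith('read_')]
print(readers)
```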

%% Cell type:code id: tags:

``` python
###
# pd.read_csv needs some parameters in order to understand our data:
# sep='\t': tells pandas that the columns are separated by tabs
# parse_dates=True: together with index_col, tells pandas to parse the index column as dates
# index_col=0: tells pandas that the first column of the dataset (date) should be the index
# skiprows=data_start+1: ignore the header lines. We determined the value of data_start above
df = pd.read_csv('data/boknis.txt', sep='\t', parse_dates=True, index_col=0, skiprows=data_start+1)
# display the first few rows of the dataset
df.head()
```

%% Cell type:markdown id: tags:

## Exploring and processing the data
We got a first look at the data by printing a few rows of the table. From that we can see that we have measurements of several hydrobiochemical parameters, and that samples were taken at several depths on each sampling day.

To get a better impression of the data, it is a good idea to make some plots. Here, we use [matplotlib](http://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) for visualising the data. Matplotlib is a powerful plotting library. Seaborn is built on top of matplotlib and provides a quick way to produce statistical data visualisations.

So let's get started and make a boxplot of temperature at the measured depths:

%% Cell type:code id: tags:

``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
```

%% Cell type:markdown id: tags:

Oops, the plot looks messy. There are a lot of measuring depths, and the distributions are skewed for many of them.

Let's have a look at the number of measurements for each depth. We use pandas' `groupby` function to rearrange the data:

%% Cell type:code id: tags:

``` python
# group data into bins by the values in column 'Depth water [m]' and count the number of measurements per depth
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Cell type:markdown id: tags:

As we can see, there are a lot of measurements at 1, 5, 10, 15, 20, 25 and 26 meters. It is probably safe to ignore all other depths.

%% Cell type:code id: tags:

``` python
# pandas ways of filtering and selecting data can be confusing at first...
df = df[df['Depth water [m]'].isin([1, 5, 10, 15, 20, 25, 26])]
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Cell type:markdown id: tags:

OK, we've got that sorted. But something strange is going on at 25 and 26 meters. Let's plot some data for these depths.

%% Cell type:code id: tags:

``` python
# plot temp @25m depth
df[df['Depth water [m]']==25]['Temp [°C]'].plot(style='k>', label='Temp at depth 25m')
# plot temp @26m depth
df[df['Depth water [m]']==26]['Temp [°C]'].plot(style='g<', label='Temp at depth 26m')
# add a legend to the plot
plt.legend()
```

%% Cell type:markdown id: tags:

Hm, it looks like they measured mainly at 26 m before the 1980s and at 25 m after that. Strange; we should look into this later. For now, let's just sort all 26 m measurements into the 25 m bin.

%% Cell type:code id: tags:

``` python
df.loc[df['Depth water [m]'] == 26, 'Depth water [m]'] = 25
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Cell type:markdown id: tags:

Great, this looks more reasonable. The number of measurements per depth is pretty comparable now (note to self: we might want to check the temporal distributions later to avoid bias).
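
That temporal check could look something like this. This is only a sketch on a small synthetic table (counting samples per decade only needs the date index), not on the real `df`:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the monitoring table: one sample per month, 1960-1999
idx = pd.date_range('1960-01-01', '1999-12-01', freq='MS')
synth = pd.DataFrame({'Temp [°C]': np.random.uniform(2, 20, len(idx))}, index=idx)

# count samples per decade to spot gaps that could bias the statistics
per_decade = synth.groupby(synth.index.year // 10 * 10).size()
print(per_decade)
```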

Let's make some boxplots with the cleaned data.

%% Cell type:code id: tags:

``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
# issuing plt.show() tells matplotlib to make a new plot for the next data series
# instead of adding the series to the existing plot.
plt.show()
sns.boxplot(x='Depth water [m]', y='PO4 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='NO3 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='OXYGEN [µmol/kg]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='Chl a [µg/l]', data=df)


```

%% Cell type:markdown id: tags:

Nice. Interpret the plots or make some more. Have a look at the [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) to see some examples, or check out how pandas makes working with [time series](http://earthpy.org/pandas-basics.html) easier.
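
Rolling statistics are another pandas time-series tool worth knowing; below is a sketch on synthetic data (the values are made up), smoothing a noisy daily series with a 30-day rolling mean:

```python
import numpy as np
import pandas as pd

# synthetic noisy daily series for one year
idx = pd.date_range('2000-01-01', periods=365, freq='D')
ts = pd.Series(np.sin(np.arange(365) / 58.0) + np.random.normal(0, 0.2, 365), index=idx)

# 30-day centered rolling mean smooths out short-term noise
smooth = ts.rolling(window=30, center=True).mean()
print(smooth.dropna().head())
```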

%% Cell type:code id: tags:

``` python
sns.jointplot(x='NO3 [µmol/l]', y='OXYGEN [µmol/kg]', data=df[df['Depth water [m]']==25], kind="kde");
```

%% Cell type:code id: tags:

``` python
# slice data to contain only one sampling depth
data_slice = df[df['Depth water [m]']==25]
# select a time period of interest
data_slice = data_slice['1990':'2010']
# plot full resolution data
data_slice['NO3 [µmol/l]'].plot()
plt.show()
# plot monthly mean values
data_slice.resample('M').mean()['NO3 [µmol/l]'].plot()
plt.show()
# plot annual mean values
data_slice.resample('A').mean()['NO3 [µmol/l]'].plot()
plt.show()
```

%% Cell type:code id: tags:

``` python
# store the cleaned table in a local SQLite database
import sqlite3
conn = sqlite3.connect('data/boknis.db')
df.to_sql(con=conn, name='boknis')
# read it back with an SQL query
df_db = pd.read_sql_query(con=conn, sql='SELECT * FROM boknis')
# the datetime index is not restored automatically, so set it again
df_db.index = df_db['Date/Time']
df_db.head()
```

%% Cell type:code id: tags:

``` python
```

boknis_start.ipynb

+335 −0

File added.


+12 −0
%% Cell type:markdown id: tags:

## Links
* A gallery of interesting Jupyter Notebooks: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
* Can't stop using it… or Python for Excel: https://www.xlwings.org/

%% Cell type:markdown id: tags:

# Install requirements
* anaconda is installed
* git is installed
+202 −0
%% Cell type:markdown id: tags:

* Official website for OPeNDAP: http://www.opendap.org/
* Code site: https://github.com/opendap
* Notebook adapted from the Python-Intro course by Wili Rath

%% Cell type:markdown id: tags:

## Modules

Get a couple of modules needed for the task at hand:

- [netCDF4](http://unidata.github.io/netcdf4-python/) --- netCDF4 library
- [pyplot](http://matplotlib.org/api/pyplot_api.html) --- Standard plotting routines
- [basemap](http://matplotlib.org/basemap/) --- Map projections
- [cmocean](http://matplotlib.org/cmocean/) --- Very nice ocean colormaps
- [numpy](http://docs.scipy.org/doc/numpy/reference/) --- Numerical toolbox

We provide aliases for the imported modules.  In one case (`Basemap`), we only import one object.

%% Cell type:code id: tags:

``` python
import netCDF4 as nc
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import cmocean as co
import numpy as np
```

%% Cell type:markdown id: tags:

## Initialization

Initialize the notebook.  We want plots to be displayed inline. Also set default figure size.

%% Cell type:code id: tags:

``` python
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 9)
```

%% Cell type:markdown id: tags:

## Input Data
In this example, we are using a publicly available dataset which we get over the web.

- We use the netCDF4 library to open the netCDF data set using [nc.Dataset()](http://unidata.github.io/netcdf4-python/#netCDF4.Dataset.__init__).  Try
```python
nc.Dataset?
```
and in particular
```python
nc.Dataset.__init__?
```
to read the documentation shipped with the modules. You can also access data from local files or from other (private) repositories in a similar way.

%% Cell type:code id: tags:

``` python
# This is an example of sea surface temperature data for day 73 in 2017 retrieved from NASA's OPeNDAP enabled servers
url = 'https://oceandata.sci.gsfc.nasa.gov:443/opendap/MODIST/L3SMI/2017/073/T2017073.L3m_DAY_SST_sst_9km.nc'
data_set = nc.Dataset(url)
```

%% Cell type:code id: tags:

``` python
print(data_set)
```

%% Cell type:code id: tags:

``` python
# get (pointers to) variables
lon = data_set.variables['lon']
lat = data_set.variables['lat']
sst = data_set.variables['sst']

# understand the difference between lon and lon[:]
print(type(lon))
print(type(lon[:]))
```

%% Cell type:code id: tags:

``` python
# print the actual data
print("Meta-data:\n\n", lon)       # meta data
print("Actual data:\n\n", lon[:])  # actual numeric data
```

%% Cell type:markdown id: tags:

## Shapes of the variables

Inspect the shapes of `lon`, `lat`, and `sst`.

%% Cell type:code id: tags:

``` python
print(lon.shape, lat.shape, sst.shape)
```

%% Cell type:markdown id: tags:

## A simple plot

- Use `pyplot`'s `pcolormesh` to plot the 2-dimensional data. Print the documentation with:
```python
plt.pcolormesh?
```


- Try `plt.colorbar()` and `plt.title()`.

- There are also `plt.[x,y]label`, `plt.axis`, and `plt.grid`.

%% Cell type:code id: tags:

``` python
plt.pcolormesh(lon, lat, sst[:].squeeze(), cmap='inferno')
plt.colorbar()
plt.title('MODIS SST')
plt.axis('tight')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.grid()
```

%% Cell type:markdown id: tags:

## A nicer projection

- Use [Basemap](http://matplotlib.org/basemap/api/basemap_api.html#mpl_toolkits.basemap.Basemap) to create a Mercator projection covering the globe between `70S` and `70N`.

- With `m = Basemap(...)` you can later use `m.pcolormesh(...)` to plot the data on the map projection.

- We need 2-dimensional representations of the coordinates:
```python
mlon, mlat = np.meshgrid(lon, lat)
```

- To add coast lines, filled land masses, a grid, etc., you can play with the following:
```python
m.drawcoastlines()
m.fillcontinents()
m.drawparallels(np.arange(-70,90,20),labels=[1,0,0,0]);
m.drawmeridians(np.arange(0,420,60),labels=[0,0,0,1]);
m.colorbar()
plt.title("MODIS SST [°C]")
```
- Use the [cmocean](http://matplotlib.org/cmocean/) color maps:
```python
cmap=co.cm.<mapname goes here>
```

- Save a file using
```python
plt.savefig(plot_file_name)
```
where `plot_file_name` is a string giving the output file name; the file extension determines the format.

%% Cell type:code id: tags:

``` python
# set up projection:
m = Basemap(projection='merc',
    llcrnrlat=-70.0, llcrnrlon=-180.0,
    urcrnrlat=+70.0, urcrnrlon=+180.0)

# 2d coordinates
mlon, mlat = np.meshgrid(lon, lat)

# 2d plot in the projection:
m.pcolormesh(mlon, mlat, sst[:].squeeze(),
             rasterized=True,
             cmap=co.cm.balance,
             latlon=True)

# grid etc. (see above)
m.drawcoastlines()
m.fillcontinents()
m.drawcountries()
m.drawparallels(np.arange(-70,90,20),labels=[1,0,0,0]);
m.drawmeridians(np.arange(0,420,60),labels=[0,0,0,1]);
m.colorbar()

# add formatted title
title = 'MODIS SST [°C]'
plt.gca().set_title(title)

# save file
plot_file_name = 'modis_sst.pdf'
plt.savefig(plot_file_name)
```

%% Cell type:code id: tags:

``` python
```

sql_db.ipynb

+176 −0

File added.
