[PANGAEA](https://pangaea.de/) is a 'world data center', a long-term archive for environmental data. It is a joint venture of AWI and MARUM. The GEOMAR data management team collaborates closely with PANGAEA, and it is the recommended main data archive for 'our' projects.
In this example, we are accessing the **Boknis Eck** dataset, a long-term monitoring campaign running since 1957 that records various environmental variables at a station in the Baltic Sea close to Kiel.
There is a huge number of very useful software libraries, called *packages*, available for Python. Whatever you want to do, chances are high that someone else had a similar problem and solved it already. Searching for a package that helps with your task before starting to program something yourself is highly recommended.
Installing packages is easy if you are using conda: `conda install package_XYZ` will do the trick most of the time. This searches the official conda channel. For more variety, use the community channel *conda-forge*:
```bash
conda install -c conda-forge package_XYZ
```
For larger projects, or projects that you want to share with others, you should create a conda *environment file* like the one that comes with this project:
```yaml
# file dm-tools_py.yml
name: dm-tools_py3
channels:
  - conda-forge
dependencies:
  - python=3.6
  - anaconda-client
  - basemap
  - cmocean
  - ipython
  - jupyter
  - matplotlib
  - netCDF4
  - numpy
  - pandas
  - psycopg2
  - seaborn
  - xarray
```
This allows you to recreate the environment and install all necessary packages at once:
```bash
conda env create --file dm-tools_py.yml
```
%% Cell type:code id: tags:
``` python
# import python packages
import pandas as pd              # pandas for dealing with tabular data
import requests                  # requests for getting data over the web
import matplotlib.pyplot as plt  # matplotlib and
import seaborn as sns            # seaborn for data visualisation
import pprint                    # pretty print for nicely formatted text output

# Jupyter has so-called *magic* commands. The command below sets up the notebook for inline plotting.
%matplotlib inline
```
%% Cell type:markdown id: tags:
## Getting the data
The package [`requests`](http://docs.python-requests.org/en/master/) allows you to easily query services providing data over the web. In this example, I searched PANGAEA for the [Boknis Eck dataset](https://doi.pangaea.de/10.1594/PANGAEA.855693) and copied the link behind
> [Download dataset as tab-delimited text](https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile)
The downloaded file starts with a metadata header that we need to skip when digesting the data with pandas later. To find out how many rows to skip, we could count them. Or we could write a small script to do it for us:
%% Cell type:code id: tags:
``` python
# download the dataset as tab-delimited text (the link from above)
response = requests.get('https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile')
data = response.text

data_start = 0
lines = data.split('\n')
# go through all lines in the data and search for the end of the header section
for i, line in enumerate(lines):
    if line.startswith('*/'):  # */ marks the end of the header
        data_start = i         # remember the line number
        break                  # no need to look further
data_start
```
%% Output
30
%% Cell type:markdown id: tags:
[Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) is a package for handling tabular data. It is definitely among the top five most useful Python libraries for data scientists. Although there is a learning curve, once you have understood the basics it is far more useful and practical than Excel.
Pandas can read and write a number of formats. Besides standard formats like `.xls` and `.csv`, you can for example connect directly to a SQL database. Try typing `pd.read<tab>` in a new code cell to see the available formats it can read from.
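The same list of readers can also be produced programmatically instead of via tab completion; a quick sketch:

%% Cell type:code id: tags:
``` python
import pandas as pd

# list all reader functions exposed at the top level of pandas
readers = [name for name in dir(pd) if name.startswith('read_')]
print(readers)
```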
%% Cell type:code id: tags:
``` python
from io import StringIO

# pd.read_csv needs some parameters in order to understand our data:
# sep='\t': tells pandas that the columns are separated by tabs
# parse_dates=True: by setting this, pandas will try to interpret text columns as dates where possible
# index_col=0: tells pandas that the first column of the dataset (date) should be the index
# skiprows=data_start+1: ignore the header lines; we determined the value of data_start before
# StringIO wraps the downloaded text so pandas can read it like a file
df = pd.read_csv(StringIO(data), sep='\t', parse_dates=True,
                 index_col=0, skiprows=data_start + 1)
df.head()
```
%% Output
    Latitude  Longitude  Depth water [m]  Cast  Sample label  \
    Date/Time
    1957-04-30  54.5295  10.0393  1  1.0  1
    1957-04-30  54.5295  10.0393  5  1.0  1
    1957-04-30  54.5295  10.0393  10  1.0  1
    1957-04-30  54.5295  10.0393  15  1.0  1
    1957-04-30  54.5295  10.0393  20  1.0  1
    Chl a [µg/l]  NO3 [µmol/l]  Flag (NO3)  [NO2]- [µmol/l]  \
    Date/Time
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    Flag (NO2)  OXYGEN [µmol/kg]  Flag (Oxygen)  PO4 [µmol/l]  \
    Date/Time
    1957-04-30  NaN  321.9  NaN  0.00
    1957-04-30  NaN  325.0  NaN  0.01
    1957-04-30  NaN  325.0  NaN  0.02
    1957-04-30  NaN  318.8  NaN  0.03
    1957-04-30  NaN  300.0  NaN  0.06
    Flag (PO4)  Sal  SiO2 [µmol/l]  Flag (SiO2)  Temp [°C]
    Date/Time
    1957-04-30  NaN  15.3  NaN  NaN  7.7
    1957-04-30  NaN  15.3  NaN  NaN  5.4
    1957-04-30  NaN  15.7  NaN  NaN  6.1
    1957-04-30  NaN  16.4  NaN  NaN  4.5
    1957-04-30  NaN  17.0  NaN  NaN  4.3
%% Cell type:markdown id: tags:
## Exploring and processing the data
We got a first look at the data by printing a few rows of the table. From that we can see that we have measurements of several hydrobiochemical parameters and that samples were taken at different depths on each sampling day.
To get a better impression of the data, it is a good idea to make some plots. Here, we use [matplotlib](http://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) for visualising the data. Matplotlib is a powerful plotting library. Seaborn is built on top of matplotlib and provides a quick way to produce statistical data visualisations.
So let's get started and make a boxplot of temperature at the measured depths:
%% Cell type:code id: tags:
``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x11096b588>
%% Cell type:markdown id: tags:
Oops, the plot looks messy. There are a lot of measuring depths, and the distributions are skewed for many of them.
Let's have a look at the number of measurements for each depth. We use pandas' `groupby` function to rearrange the data:
%% Cell type:code id: tags:
``` python
# group data into bins by the values in column 'Depth water [m]' and count the number of measurements per depth
df.groupby('Depth water [m]')['Depth water [m]'].count()
```
%% Output
Depth water [m]
1 839
2 99
3 94
4 4
5 831
7 2
8 3
9 10
10 832
11 3
12 5
13 8
14 8
15 870
17 8
18 3
19 6
20 840
21 4
22 3
23 3
24 13
25 561
26 276
27 52
28 15
29 1
35 1
Name: Depth water [m], dtype: int64
%% Cell type:markdown id: tags:
As we can see, there are a lot of measurements at 1, 5, 10, 15, 20, 25 and 26 meters. It is probably safe to ignore all other depths.
%% Cell type:code id: tags:
``` python
# pandas ways of filtering and selecting data can be confusing at first...
df = df[df['Depth water [m]'].isin([1, 5, 10, 15, 20, 25, 26])]
df.groupby('Depth water [m]')['Depth water [m]'].count()
```
%% Output
Depth water [m]
1 839
5 831
10 832
15 870
20 840
25 561
26 276
Name: Depth water [m], dtype: int64
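%% Cell type:markdown id: tags:
Since this boolean-mask filtering idiom is indeed confusing at first, here is a minimal sketch on a toy frame (the column name mirrors the real data, the values are made up):
%% Cell type:code id: tags:
``` python
import pandas as pd

toy = pd.DataFrame({'Depth water [m]': [1, 4, 5, 7, 10]})

# isin() builds a boolean mask; indexing the frame with it keeps only matching rows
mask = toy['Depth water [m]'].isin([1, 5, 10])
filtered = toy[mask]
print(filtered['Depth water [m]'].tolist())  # → [1, 5, 10]
```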
%% Cell type:markdown id: tags:
OK, we've got that sorted. But something strange is going on at 25 and 26 meters. Let's plot some data for these depths.
%% Cell type:code id: tags:
``` python
# plot temp @25m depth
df[df['Depth water [m]'] == 25]['Temp [°C]'].plot(style='k>', label='Temp at depth 25m')
# plot temp @26m depth
df[df['Depth water [m]'] == 26]['Temp [°C]'].plot(style='g<', label='Temp at depth 26m')
# add a legend to the plot
plt.legend()
```
%% Output
<matplotlib.legend.Legend at 0x111dfc128>
%% Cell type:markdown id: tags:
Hm, it looks like they measured mainly at 26 m before the 1980s and at 25 m after that. That is strange; we should look into it later. For now, let's just sort all 26 m measurements into the 25 m bin.
%% Cell type:code id: tags:
``` python
df.loc[df['Depth water [m]'] == 26, 'Depth water [m]'] = 25
df.groupby('Depth water [m]')['Depth water [m]'].count()
```
%% Output
Depth water [m]
1 839
5 831
10 832
15 870
20 840
25 837
Name: Depth water [m], dtype: int64
%% Cell type:markdown id: tags:
Great, this looks more reasonable. The numbers of measurements per depth are pretty comparable now (note to self: we might want to check the temporal distributions later to avoid bias).
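One way to check for such temporal bias is to count measurements per decade and depth. The sketch below uses a synthetic stand-in frame (random dates and depths, but with the real column names), since the idea, not the data, is the point here:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# synthetic stand-in for the Boknis Eck table: random sampling dates and depths
rng = np.random.default_rng(0)
dates = pd.to_datetime(rng.integers(0, 60 * 365, 300), unit='D', origin='1957-01-01')
toy = pd.DataFrame({'Depth water [m]': rng.choice([1, 5, 10, 15, 20, 25], 300)},
                   index=pd.DatetimeIndex(dates, name='Date/Time'))

# measurements per decade and depth: strongly uneven counts would indicate temporal bias
counts = (toy.groupby([toy.index.year // 10 * 10, 'Depth water [m]'])
             .size().unstack(fill_value=0))
print(counts)
```
%% Cell type:markdown id: tags: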
Let's make some boxplots with the cleaned data.
%% Cell type:code id: tags:
``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
# issuing plt.show() tells matplotlib to make a new plot for the next data series
# instead of adding the series to the existing plot
plt.show()
sns.boxplot(x='Depth water [m]', y='PO4 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='NO3 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='OXYGEN [µmol/kg]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='Chl a [µg/l]', data=df)
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x1147180b8>
%% Cell type:markdown id: tags:
Nice. Interpret the plots or make some more. Have a look at the [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) to see some examples, or check out how pandas makes working with [time series](http://earthpy.org/pandas-basics.html) easier.
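For instance, with a datetime index pandas can compute annual means in one line; a sketch on a synthetic monthly series (not the real data):
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# synthetic monthly temperature series with a datetime index (10 years of data)
idx = pd.date_range('1957-01-01', periods=120, freq='MS')
temps = pd.Series(10 + 5 * np.sin(2 * np.pi * idx.month / 12), index=idx)

# annual mean temperature: group the monthly values by year and average
annual = temps.groupby(temps.index.year).mean()
print(annual)
```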
%% Cell type:code id: tags:
``` python
# read the same data from a SQL database (assumes conn is an open database connection)
df_db = pd.read_sql_query(con=conn, sql='SELECT * FROM boknis')
df_db.index = df_db['Date/Time']
df_db.head()
```
%% Output
/Users/cfaber/anaconda/envs/dm-tools_py3/lib/python3.6/site-packages/pandas/core/generic.py:1201: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.