[PANGAEA](https://pangaea.de/) is a 'world data center', a long-term archive for environmental data. It is a joint venture of AWI and MARUM. The GEOMAR data management team collaborates closely with PANGAEA, and it is the recommended main data archive for 'our' projects.
In this example, we are accessing the **Boknis Eck** dataset, a long-term monitoring campaign running since 1957 that records various environmental variables at a station in the Baltic Sea close to Kiel.
There is a huge number of very useful software libraries, called *packages*, available for Python. Whatever you want to do, chances are high that someone else had a similar problem and solved it already. Searching for a package that helps with your task before starting to program something yourself is highly recommended.
Installing packages is easy if you are using conda: `conda install package_XYZ` will do the trick most of the time. This searches the official conda channel. For more variety, use the community channel *conda-forge*:
```bash
conda install -c conda-forge package_XYZ
```
For larger projects, or projects that you want to share with others, you should create a conda *environment file* like the one that comes with this project:
```yaml
# file dm-tools_py.yml
name: dm-tools_py3
channels:
  - conda-forge
dependencies:
  - python=3.6
  - anaconda-client
  - basemap
  - cmocean
  - ipython
  - jupyter
  - matplotlib
  - netCDF4
  - numpy
  - pandas
  - psycopg2
  - seaborn
  - xarray
```
This allows you to recreate the environment and install all necessary packages at once:
```bash
conda env create --file dm-tools_py.yml
```
%% Cell type:code id: tags:
``` python
# import python packages
import pandas as pd              # pandas for dealing with tabular data
import requests                  # requests for getting data over the web
import matplotlib.pyplot as plt  # matplotlib and
import seaborn as sns            # seaborn for data visualisation
import pprint                    # pretty print for nicely formatted text output

# Jupyter has so-called *magic* commands. The command below sets up the notebook for inline plotting.
%matplotlib inline
```
%% Cell type:markdown id: tags:
## Getting the data
The package [`requests`](http://docs.python-requests.org/en/master/) allows you to easily query services providing data over the web. In this example, I searched PANGAEA for the [Boknis Eck dataset](https://doi.pangaea.de/10.1594/PANGAEA.855693) and copied the link behind
> [Download dataset as tab-delimited text](https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile)
The downloaded file starts with a metadata header that we need to skip when digesting the data with pandas later. To find out how many rows to skip, we could count them. Or we could write a small script to do it for us:
%% Cell type:code id: tags:
``` python
# download the dataset as tab-delimited text (the link from above)
response = requests.get('https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile')
data = response.text

data_start = 0
lines = data.split('\n')
# go through all lines in the data and search for the end of the header section
for i, line in enumerate(lines):
    if line.startswith('*/'):  # */ marks the end of the header
        data_start = i         # remember the line number
        break                  # no need to look further
data_start
```
%% Output
30
%% Cell type:markdown id: tags:
[Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) is a package for handling tabular data. It is definitely among the top five most useful Python libraries for data scientists. Although there is a learning curve, once you have understood the basics it is far more useful and practical than Excel.
Pandas can read and write a number of formats. Besides standard formats like `.xls` and `.csv`, you can for example connect directly to a SQL database. Try typing `pd.read<tab>` in a new code cell to see the available formats it can read from.
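The same list of readers can also be produced programmatically instead of via tab completion; a quick sketch:

%% Cell type:code id: tags:
``` python
import pandas as pd

# list all reader functions exposed at the top level of pandas
readers = [name for name in dir(pd) if name.startswith('read_')]
print(readers)
```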
%% Cell type:code id: tags:
``` python
from io import StringIO

# pd.read_csv needs some parameters in order to understand our data:
# sep='\t': tells pandas that the columns are separated by tabs
# parse_dates=True: by setting this, pandas will try to interpret text columns as dates where possible
# index_col=0: tells pandas that the first column of the dataset (date) should be the index
# skiprows=data_start+1: ignore the header lines; we determined the value of data_start before
# StringIO wraps the downloaded text so pandas can read it like a file
df = pd.read_csv(StringIO(data), sep='\t', parse_dates=True,
                 index_col=0, skiprows=data_start + 1)
df.head()
```
%% Output
    Latitude  Longitude  Depth water [m]  Cast  Sample label  \
    Date/Time
    1957-04-30  54.5295  10.0393  1  1.0  1
    1957-04-30  54.5295  10.0393  5  1.0  1
    1957-04-30  54.5295  10.0393  10  1.0  1
    1957-04-30  54.5295  10.0393  15  1.0  1
    1957-04-30  54.5295  10.0393  20  1.0  1
    Chl a [µg/l]  NO3 [µmol/l]  Flag (NO3)  [NO2]- [µmol/l]  \
    Date/Time
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    1957-04-30  NaN  NaN  NaN  NaN
    Flag (NO2)  OXYGEN [µmol/kg]  Flag (Oxygen)  PO4 [µmol/l]  \
    Date/Time
    1957-04-30  NaN  321.9  NaN  0.00
    1957-04-30  NaN  325.0  NaN  0.01
    1957-04-30  NaN  325.0  NaN  0.02
    1957-04-30  NaN  318.8  NaN  0.03
    1957-04-30  NaN  300.0  NaN  0.06
    Flag (PO4)  Sal  SiO2 [µmol/l]  Flag (SiO2)  Temp [°C]
    Date/Time
    1957-04-30  NaN  15.3  NaN  NaN  7.7
    1957-04-30  NaN  15.3  NaN  NaN  5.4
    1957-04-30  NaN  15.7  NaN  NaN  6.1
    1957-04-30  NaN  16.4  NaN  NaN  4.5
    1957-04-30  NaN  17.0  NaN  NaN  4.3
%% Cell type:markdown id: tags:
## Exploring and processing the data
We got a first look at the data by printing a few rows of the table. From that we can see that we have measurements of several hydrobiochemical parameters and that samples were taken at different depths on each sampling day.
To get a better impression of the data, it is a good idea to make some plots. Here, we use [matplotlib](http://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) for visualising the data. Matplotlib is a powerful plotting library. Seaborn is built on top of matplotlib and provides a quick way to produce statistical data visualisations.
So let's get started and make a boxplot of temperature at the measured depths:
%% Cell type:code id: tags:
``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x11096b588>
%% Cell type:markdown id: tags:
Oops, the plot looks messy. There are a lot of measuring depths, and the distributions are skewed for many of them.
Let's have a look at the number of measurements for each depth. We use pandas' `groupby` function to rearrange the data:
%% Cell type:code id: tags:
``` python
# group data into bins by the values in column 'Depth water [m]' and count the number of measurements per depth
df.groupby('Depth water [m]')['Depth water [m]'].count()
```
%% Output
Depth water [m]
1 839
2 99
3 94
4 4
5 831
7 2
8 3
9 10
10 832
11 3
12 5
13 8
14 8
15 870
17 8
18 3
19 6
20 840
21 4
22 3
23 3
24 13
25 561
26 276
27 52
28 15
29 1
35 1
Name: Depth water [m], dtype: int64
%% Cell type:markdown id: tags:
As we can see, there are a lot of measurements at 1, 5, 10, 15, 20, 25 and 26 meters. It is probably safe to ignore all other depths.
%% Cell type:code id: tags:
``` python
# pandas ways of filtering and selecting data can be confusing at first...
df = df[df['Depth water [m]'].isin([1, 5, 10, 15, 20, 25, 26])]
df.groupby('Depth water [m]')['Depth water [m]'].count()
```
%% Output
Depth water [m]
1 839
5 831
10 832
15 870
20 840
25 561
26 276
Name: Depth water [m], dtype: int64
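%% Cell type:markdown id: tags:
Since this boolean-mask filtering idiom is indeed confusing at first, here is a minimal sketch on a toy frame (the column name mirrors the real data, the values are made up):
%% Cell type:code id: tags:
``` python
import pandas as pd

toy = pd.DataFrame({'Depth water [m]': [1, 4, 5, 7, 10]})

# isin() builds a boolean mask; indexing the frame with it keeps only matching rows
mask = toy['Depth water [m]'].isin([1, 5, 10])
filtered = toy[mask]
print(filtered['Depth water [m]'].tolist())  # → [1, 5, 10]
```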
%% Cell type:markdown id: tags:
OK, we've got that sorted. But something strange is going on at 25 and 26 meters. Let's plot some data for these depths.
%% Cell type:code id: tags:
``` python
# plot temp @25m depth
df[df['Depth water [m]'] == 25]['Temp [°C]'].plot(style='k>', label='Temp at depth 25m')
# plot temp @26m depth
df[df['Depth water [m]'] == 26]['Temp [°C]'].plot(style='g<', label='Temp at depth 26m')
# add a legend to the plot
plt.legend()
```
%% Output
<matplotlib.legend.Legend at 0x111dfc128>
%% Cell type:markdown id: tags:
Hm, it looks like they measured mainly at 26 m before the 1980s and at 25 m after that. That is strange; we should look into it later. For now, let's just sort all 26 m measurements into the 25 m bin.
%% Cell type:code id: tags:
``` python
df.loc[df['Depth water [m]'] == 26, 'Depth water [m]'] = 25
df.groupby('Depth water [m]')['Depth water [m]'].count()
```
%% Output
Depth water [m]
1 839
5 831
10 832
15 870
20 840
25 837
Name: Depth water [m], dtype: int64
%% Cell type:markdown id: tags:
Great, this looks more reasonable. The numbers of measurements per depth are pretty comparable now (note to self: we might want to check the temporal distributions later to avoid bias).
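One way to check for such temporal bias is to count measurements per decade and depth. The sketch below uses a synthetic stand-in frame (random dates and depths, but with the real column names), since the idea, not the data, is the point here:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# synthetic stand-in for the Boknis Eck table: random sampling dates and depths
rng = np.random.default_rng(0)
dates = pd.to_datetime(rng.integers(0, 60 * 365, 300), unit='D', origin='1957-01-01')
toy = pd.DataFrame({'Depth water [m]': rng.choice([1, 5, 10, 15, 20, 25], 300)},
                   index=pd.DatetimeIndex(dates, name='Date/Time'))

# measurements per decade and depth: strongly uneven counts would indicate temporal bias
counts = (toy.groupby([toy.index.year // 10 * 10, 'Depth water [m]'])
             .size().unstack(fill_value=0))
print(counts)
```
%% Cell type:markdown id: tags: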
Let's make some boxplots with the cleaned data.
%% Cell type:code id: tags:
``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
# issuing plt.show() tells matplotlib to make a new plot for the next data series
# instead of adding the series to the existing plot
plt.show()
sns.boxplot(x='Depth water [m]', y='PO4 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='NO3 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='OXYGEN [µmol/kg]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='Chl a [µg/l]', data=df)
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x1147180b8>
%% Cell type:markdown id: tags:
Nice. Interpret the plots or make some more. Have a look at the [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) to see some examples, or check out how pandas makes working with [time series](http://earthpy.org/pandas-basics.html) easier.
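For instance, with a datetime index pandas can compute annual means in one line; a sketch on a synthetic monthly series (not the real data):
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# synthetic monthly temperature series with a datetime index (10 years of data)
idx = pd.date_range('1957-01-01', periods=120, freq='MS')
temps = pd.Series(10 + 5 * np.sin(2 * np.pi * idx.month / 12), index=idx)

# annual mean temperature: group the monthly values by year and average
annual = temps.groupby(temps.index.year).mean()
print(annual)
```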
%% Cell type:code id: tags:
``` python
# read the same data from a SQL database (assumes conn is an open database connection)
df_db = pd.read_sql_query(con=conn, sql='SELECT * FROM boknis')
df_db.index = df_db['Date/Time']
df_db.head()
```
%% Output
/Users/cfaber/anaconda/envs/dm-tools_py3/lib/python3.6/site-packages/pandas/core/generic.py:1201: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.