Commit e2d93b6b authored by Claas Testuser's avatar Claas Testuser

added Jupyter NBs

parent cbf6bc84

boknis.ipynb

+270 −0
%% Cell type:markdown id: tags:

# Getting and processing PANGAEA dataset
[PANGAEA](https://pangaea.de/) is a 'world data center': a long-term archive for environmental data run jointly by AWI and MARUM. The GEOMAR data management team collaborates closely with PANGAEA, and it is the recommended main data archive for 'our' projects.

In this example, we are accessing the **Boknis Eck** dataset, a long-term monitoring campaign running since 1954 that records various environmental variables at a station in the Baltic Sea close to Kiel.

![alt text](https://www.bokniseck.de/documents/895129/940374/map_BoknisEck.jpg/2ff8ed1b-1956-487c-8d1d-59b22b0633d6?t=1377032189033 "Location Boknis Eck")

%% Cell type:markdown id: tags:

## Import python packages
There is a huge number of very useful software libraries, called *packages*, available for Python. Whatever you want to do, chances are high that someone else has had a similar problem and solved it already. Searching for a package before starting to program something yourself is highly recommended.

Installing packages is easy if you are using conda: `conda install package_XYZ` will do the trick most of the time. This searches the official conda channel. For more variety, use the community channel *conda-forge*:
```bash
conda install -c conda-forge package_XYZ
```
For larger projects, or projects that you want to share with others, you should create a conda *environment file* like the one that comes with this project:
```yaml
# file dm-tools_py.yml
name: dm-tools_py3
channels:
  - conda-forge
dependencies:
  - python=3.6
  - anaconda-client
  - basemap
  - cmocean
  - ipython
  - jupyter
  - matplotlib
  - netCDF4
  - numpy
  - pandas
  - psycopg2
  - seaborn
  - xarray
```
This allows you to recreate the environment and install all necessary packages at once:
```bash
conda env create --file dm-tools_py.yml
```

%% Cell type:code id: tags:

``` python
# import python packages
import pandas as pd  # pandas for dealing with tabular data
import requests  # requests for getting data over the web
import matplotlib.pyplot as plt  # matplotlib and
import seaborn as sns            # seaborn for data visualisation
import pprint  # pretty_print for nicely formatted text output
# Jupyter has so-called *magic* commands. The command above sets up the notebook for plotting
%matplotlib inline
```

%% Cell type:markdown id: tags:

## Getting the data
The package [`requests`](http://docs.python-requests.org/en/master/) allows you to easily query services that provide data over the web. In this example, I searched PANGAEA for the [Boknis Eck dataset](https://doi.pangaea.de/10.1594/PANGAEA.855693) and copied the link behind
> [Download dataset as tab-delimited text](https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile)

at the bottom of the dataset's page.

%% Cell type:code id: tags:

``` python
# data url found by searching on PANGAEA website
data_url = 'https://doi.pangaea.de/10.1594/PANGAEA.855693?format=textfile'

# use requests to get the data
r = requests.get(data_url)
data = r.text

# write data to disk. Note: the syntax differs slightly between Python 2 and Python 3
with open('data/boknis.txt', 'w+') as outfile:
    # outfile.write(data.encode(r.encoding))  # use this syntax with Python 2
    outfile.write(data)  # use this syntax with Python 3
```
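
%% Cell type:markdown id: tags:

Note that `requests` will happily return an error page as text if the download fails. A minimal sketch of a stricter download helper (the helper name `fetch_text` is ours, not part of the original notebook):

```python
import requests

def fetch_text(url):
    # hypothetical helper: fail loudly on HTTP errors instead of
    # silently writing an error page to disk
    r = requests.get(url)
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return r.text
```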

%% Cell type:markdown id: tags:

## Evaluating the data
Now that the data is saved locally, you can open it in a text editor and have a look at it. We will just print the first 4000 characters here.

%% Cell type:code id: tags:

``` python
# use pprint instead of print to display line endings correctly here
pprint.pprint(data[:4000])
```

%% Cell type:markdown id: tags:

## Digesting the data
As we can see, the data has a header section marked by
```
/*
HEADER
*/
```
We need to skip this header when digesting the data with pandas later. To find out how many rows to skip, we could count them by hand. Or we could write a small script to do it for us:

%% Cell type:code id: tags:

``` python
data_start = 0
lines = data.split('\n')
# go through all lines in the data and search for the end of the header section
for i, line in enumerate(lines):
    if line.startswith('*/'):  # */ marks the end of the header
        data_start = i  # remember line number
        break  # no need to look further
data_start
```

%% Cell type:markdown id: tags:

[Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) is a package for handling tabular data. It is definitely among the top 5 most useful Python libraries for data scientists. Although there is a learning curve, once you have understood the basics it is far more useful and practical than Excel.

Pandas can read and write a number of formats. Besides standard formats like `.xls` and `.csv`, you can for example connect directly to an SQL database. Try typing `pd.read<tab>` in a new code cell to see the available formats it can read from.
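
As a quick sanity check of what tab completion would show, you can also list the reader functions programmatically:

```python
import pandas as pd

# collect all reader functions pandas offers (same list as typing pd.read<tab>)
readers = [name for name in dir(pd) if name.startswith('read_')]
print(readers)
```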

%% Cell type:code id: tags:

``` python
###
# pd.read_csv needs some parameters in order to understand our data:
# sep='\t': tells pandas that the columns are separated by tabs
# parse_dates=True: together with index_col, tells pandas to parse the index column as dates
# index_col=0: tells pandas that the first column of the dataset (date) should be the index
# skiprows=data_start+1: ignore the header lines. We determined the value of data_start above
df = pd.read_csv('data/boknis.txt', sep='\t', parse_dates=True, index_col=0, skiprows=data_start+1)
# display the first few rows of the dataset
df.head()
```

%% Cell type:markdown id: tags:

## Exploring and processing the data
We got a first look at the data by printing a few rows of the table. From that we can see that we have measurements of several hydrobiochemical parameters, and that samples were taken at several depths on each sampling day.

To get a better impression of the data, it is a good idea to make some plots. Here, we use [matplotlib](http://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) for visualising the data. Matplotlib is a powerful plotting library. Seaborn is built on top of matplotlib and provides a quick way to produce statistical data visualisations.

So let's get started and make a boxplot of temperature at the measured depths:

%% Cell type:code id: tags:

``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
```

%% Cell type:markdown id: tags:

Oops, the plot looks messy. There are a lot of measuring depths, and the distributions are skewed for many of them.

Let's have a look at the number of measurements for each depth. We use pandas' `groupby` function to rearrange the data:

%% Cell type:code id: tags:

``` python
# group data into bins by the values in column 'Depth water [m]' and count the number of measurements per depth
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Cell type:markdown id: tags:

As we can see, there are a lot of measurements at 1, 5, 10, 15, 20, 25 and 26 meters. It is probably safe to ignore all other depths.

%% Cell type:code id: tags:

``` python
# pandas ways of filtering and selecting data can be confusing at first...
df = df[df['Depth water [m]'].isin([1, 5, 10, 15, 20, 25, 26])]
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Cell type:markdown id: tags:

OK, we've got that sorted. But something strange is going on at 25 and 26 meters. Let's plot some data for these depths.

%% Cell type:code id: tags:

``` python
# plot temp @25m depth
df[df['Depth water [m]']==25]['Temp [°C]'].plot(style='k>', label='Temp at depth 25m')
# plot temp @26m depth
df[df['Depth water [m]']==26]['Temp [°C]'].plot(style='g<', label='Temp at depth 26m')
# add a legend to the plot
plt.legend()
```

%% Cell type:markdown id: tags:

Hm, it looks like they measured mainly at 26 m before the 1980s and at 25 m after that. Strange; we should look into this later. For now, let's just sort all 26 m measurements into the 25 m bin.

%% Cell type:code id: tags:

``` python
df.loc[df['Depth water [m]'] == 26, 'Depth water [m]'] = 25
df.groupby('Depth water [m]')['Depth water [m]'].count()
```

%% Cell type:markdown id: tags:

Great, this looks more reasonable. The number of measurements per depth is pretty comparable now (note to self: we might want to check the temporal distributions later to avoid bias).
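
That temporal check could look something like this. This is only a sketch on a small synthetic table (counting samples per decade only needs the date index), not on the real `df`:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the monitoring table: one sample per month, 1960-1999
idx = pd.date_range('1960-01-01', '1999-12-01', freq='MS')
synth = pd.DataFrame({'Temp [°C]': np.random.uniform(2, 20, len(idx))}, index=idx)

# count samples per decade to spot gaps that could bias the statistics
per_decade = synth.groupby(synth.index.year // 10 * 10).size()
print(per_decade)
```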

Let's make some boxplots with the cleaned data.

%% Cell type:code id: tags:

``` python
sns.boxplot(x='Depth water [m]', y='Temp [°C]', data=df)
# issuing plt.show() tells matplotlib to make a new plot for the next data series
# instead of adding the series to the existing plot.
plt.show()
sns.boxplot(x='Depth water [m]', y='PO4 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='NO3 [µmol/l]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='OXYGEN [µmol/kg]', data=df)
plt.show()
sns.boxplot(x='Depth water [m]', y='Chl a [µg/l]', data=df)


```

%% Cell type:markdown id: tags:

Nice. Interpret the plots or make some more. Have a look at the [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) to see some examples, or check out how pandas makes working with [time series](http://earthpy.org/pandas-basics.html) easier.
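
Rolling statistics are another pandas time-series tool worth knowing; below is a sketch on synthetic data (the values are made up), smoothing a noisy daily series with a 30-day rolling mean:

```python
import numpy as np
import pandas as pd

# synthetic noisy daily series for one year
idx = pd.date_range('2000-01-01', periods=365, freq='D')
ts = pd.Series(np.sin(np.arange(365) / 58.0) + np.random.normal(0, 0.2, 365), index=idx)

# 30-day centered rolling mean smooths out short-term noise
smooth = ts.rolling(window=30, center=True).mean()
print(smooth.dropna().head())
```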

%% Cell type:code id: tags:

``` python
sns.jointplot(x='NO3 [µmol/l]', y='OXYGEN [µmol/kg]', data=df[df['Depth water [m]']==25], kind="kde");
```

%% Cell type:code id: tags:

``` python
# slice data to contain only one sampling depth
data_slice = df[df['Depth water [m]']==25]
# select a time period of interest
data_slice = data_slice['1990':'2010']
# plot full resolution data
data_slice['NO3 [µmol/l]'].plot()
plt.show()
# plot monthly mean values
data_slice.resample('M').mean()['NO3 [µmol/l]'].plot()
plt.show()
# plot annual mean values
data_slice.resample('A').mean()['NO3 [µmol/l]'].plot()
plt.show()
```

%% Cell type:code id: tags:

``` python
# store the cleaned table in a local SQLite database
import sqlite3
conn = sqlite3.connect('data/boknis.db')
df.to_sql(con=conn, name='boknis')
# read it back with an SQL query
df_db = pd.read_sql_query(con=conn, sql='SELECT * FROM boknis')
# the datetime index is not restored automatically, so set it again
df_db.index = df_db['Date/Time']
df_db.head()
```

%% Cell type:code id: tags:

``` python
```

boknis_start.ipynb

+335 −0

File added.


+12 −0
%% Cell type:markdown id: tags:

## Links
* A gallery of interesting Jupyter Notebooks: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
* Can't stop using it… or Python for Excel: https://www.xlwings.org/

%% Cell type:markdown id: tags:

# Install requirements
* anaconda is installed
* git is installed
+202 −0
%% Cell type:markdown id: tags:

* Official website for OPeNDAP: http://www.opendap.org/
* Code site: https://github.com/opendap
* Notebook adapted from the Python-Intro course by Wili Rath

%% Cell type:markdown id: tags:

## Modules

Get a couple of modules needed for the task at hand:

- [netCDF4](http://unidata.github.io/netcdf4-python/) --- netCDF4 library
- [pyplot](http://matplotlib.org/api/pyplot_api.html) --- Standard plotting routines
- [basemap](http://matplotlib.org/basemap/) --- Map projections
- [cmocean](http://matplotlib.org/cmocean/) --- Very nice ocean colormaps
- [numpy](http://docs.scipy.org/doc/numpy/reference/) --- Numerical toolbox

We provide aliases for the imported modules.  In one case (`Basemap`), we only import one object.

%% Cell type:code id: tags:

``` python
import netCDF4 as nc
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import cmocean as co
import numpy as np
```

%% Cell type:markdown id: tags:

## Initialization

Initialize the notebook.  We want plots to be displayed inline. Also set default figure size.

%% Cell type:code id: tags:

``` python
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 9)
```

%% Cell type:markdown id: tags:

## Input Data
In this example, we are using a publicly available dataset which we get over the web.

- We use the netCDF4 library to open the netCDF data set using [nc.Dataset()](http://unidata.github.io/netcdf4-python/#netCDF4.Dataset.__init__).  Try
```python
nc.Dataset?
```
and in particular
```python
nc.Dataset.__init__?
```
to read the documentation shipped with the modules. You can also access data from local files or from other (private) repositories in a similar way.

%% Cell type:code id: tags:

``` python
# This is an example of sea surface temperature data for day 73 in 2017 retrieved from NASA's OPeNDAP enabled servers
url = 'https://oceandata.sci.gsfc.nasa.gov:443/opendap/MODIST/L3SMI/2017/073/T2017073.L3m_DAY_SST_sst_9km.nc'
data_set = nc.Dataset(url)
```

%% Cell type:code id: tags:

``` python
print(data_set)
```

%% Cell type:code id: tags:

``` python
# get (pointers to) variables
lon = data_set.variables['lon']
lat = data_set.variables['lat']
sst = data_set.variables['sst']

# understand the difference between lon and lon[:]
print(type(lon))
print(type(lon[:]))
```

%% Cell type:code id: tags:

``` python
# print the actual data
print("Meta-data:\n\n", lon)       # meta data
print("Actual data:\n\n", lon[:])  # actual numeric data
```

%% Cell type:markdown id: tags:

## Shapes of the variables

Inspect the shapes of `lon`, `lat`, and `sst`.

%% Cell type:code id: tags:

``` python
print(lon.shape, lat.shape, sst.shape)
```

%% Cell type:markdown id: tags:

## A simple plot

- Use `pyplot`'s `pcolormesh` to plot the 2-dimensional data. Print the documentation with:
```python
plt.pcolormesh?
```


- Try `plt.colorbar()` and `plt.title()`.

- There are also `plt.[x,y]label`, `plt.axis`, and `plt.grid`.

%% Cell type:code id: tags:

``` python
plt.pcolormesh(lon, lat, sst[:].squeeze(), cmap='inferno')
plt.colorbar()
plt.title('MODIS SST')
plt.axis('tight')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.grid()
```

%% Cell type:markdown id: tags:

## A nicer projection

- Use [Basemap](http://matplotlib.org/basemap/api/basemap_api.html#mpl_toolkits.basemap.Basemap) to create a Mercator projection covering the globe between `70S` and `70N`.

- With `m = Basemap(...)` you can later use `m.pcolormesh(...)` to plot the data on the map projection.

- We need 2-dimensional representations of the coordinates:
```python
mlon, mlat = np.meshgrid(lon, lat)
```

- To add coast lines, filled land masses, a grid, etc., you can play with the following:
```python
m.drawcoastlines()
m.fillcontinents()
m.drawparallels(np.arange(-70,90,20),labels=[1,0,0,0]);
m.drawmeridians(np.arange(0,420,60),labels=[0,0,0,1]);
m.colorbar()
plt.title("MODIS SST [°C]")
```
- Use the [cmocean](http://matplotlib.org/cmocean/) color maps:
```python
cmap=co.cm.<mapname goes here>
```

- Save a file using
```python
plt.savefig(plot_file_name)
```
where `plot_file_name` is a string giving the output file name; the file extension determines the format.

%% Cell type:code id: tags:

``` python
# set up projection:
m = Basemap(projection='merc',
    llcrnrlat=-70.0, llcrnrlon=-180.0,
    urcrnrlat=+70.0, urcrnrlon=+180.0)

# 2d coordinates
mlon, mlat = np.meshgrid(lon, lat)

# 2d plot in the projection:
m.pcolormesh(mlon, mlat, sst[:].squeeze(),
             rasterized=True,
             cmap=co.cm.balance,
             latlon=True)

# grid etc. (see above)
m.drawcoastlines()
m.fillcontinents()
m.drawcountries()
m.drawparallels(np.arange(-70,90,20),labels=[1,0,0,0]);
m.drawmeridians(np.arange(0,420,60),labels=[0,0,0,1]);
m.colorbar()

# add formatted title
title = 'MODIS SST [°C]'
plt.gca().set_title(title)

# save file
plot_file_name = 'modis_sst.pdf'
plt.savefig(plot_file_name)
```

%% Cell type:code id: tags:

``` python
```

sql_db.ipynb

+176 −0

File added.
