Commit 3b3e9f0a authored by Willi Rath

Clean-up and reset talk title

parent 22dc4b48
pages:
  stage: deploy
  script:
    - mkdir public
    - cp towards_reproducible_science.html public/index.html
    - cp towards_reproducible_science_remark_included.html public/index_remark_included.html
    - cp towards_reproducible_science.md public/
    - cp slides.pdf public/.
    - cp -r images/ public/.
    - cp -r data/ public/.
    - cp -r notebooks/ public/.
  artifacts:
    paths:
      - public
  only:
    - master
# Barnes, Publish your computer code: it is good enough, 2010
Central quote:
> That the code is a little raw is one of the main reasons scientists give for
> not sharing it with others. Yet, software in all trades is written to be good
> enough for the job intended. So if your code is good enough to do the job,
> then it is good enough to release — and releasing it will help your research
> and your field.
And:
> **It is not common practice.** As explained above, this must change in climate science and should do so across all fields. Some disciplines, such as bioinformatics, are already changing.
> **People will pick holes and demand support and bug fixes.** Publishing code may see you accused of sloppiness. Not publishing can draw allegations of fraud. Which is worse? Nobody is entitled to demand technical support for freely provided code: if the feedback is unhelpful, ignore it.
> **The code is valuable intellectual property that belongs to my institution.** Really, that little MATLAB routine to calculate a two-part fit is worth money? Frankly, I doubt it. Some code may have long-term commercial potential, but almost all the value lies in your expertise. My industry has a name for code not backed by skilled experts: abandonware. Institutions should support publishing; those who refuse are blocking progress.
> **It is too much work to polish the code.** For scientists, the word publication is totemic, and signifies perfectionism. But your papers need not include meticulous pages of Fortran; the original code can be published as supplementary information, available from an institutional or journal website.
[Barnes2010]: https://www.nature.com/news/2010/101013/full/467753a.html
# Bhardwaj, DataHub: Collaborative Data Science & Dataset Version Management at Scale, 2014
> Inspired by software version control systems like git, we propose (a) a
> dataset version control system, giving users the ability to create, branch,
> merge, difference and search large, divergent collections of datasets, and
> (b) a platform, DataHub, that gives users the ability to perform
> collaborative data analysis building on this version control system.
See also <https://github.com/datahuborg/datahub> and links given in the README.
[Bhadrwaj2014]: https://arxiv.org/abs/1409.0798
# Easterbrook, Open code for open science?, 2014
- "Technical debt":
> why invest time writing beautifully engineered code from the outset, if
> you're not sure that what you're trying to do is even possible?
- Managing technical debt:
> such debts have to be managed carefully, to prevent them spiralling out of
> control
- Central thesis:
> I argue that open source policies are unlikely to usher in an era of much
> greater sharing and reproducibility, because there are many barriers beyond
> the basic requirement of being able to read the code (see Box 1). Instead,
> such policies have an important role to play in improving the quality of
> scientific software by nudging scientists to manage their technical debt
> more carefully.
- Barriers to sharing:
  - (Lack of) Portability
  - (Lack of) Configurability
  - Entrenchment
  - Model-data blur
  - Provenance
- Repeatability vs. reproducibility:
  - repeatability: obtain the same or different results with the same code
  - reproducibility: obtain the same results with the same or different code
  - repeatable and reproducible: obtain the same results with the same code
  - all others
![Venn diagram reproducibility vs. repeatability.](Easterbrook2014_ngeo2283-f1.jpg)
- "Myth of many eyes"
- Majority of published code will not be looked at / used at all.
- "On a different note, in the polarized context of climate research, making
code available to public scrutiny holds the potential to improve trust."
- "denial-of-service attacks"
- "Making code available can therefore only work on the understanding that it
does not involve the obligation to support others in repeating the
computations."
> Building on such a culture of openness, an environment may eventually develop
> where small data sets and new software tools can be more readily discovered,
> and where reproducibility is achieved more easily.
[Easterbrook2014]: http://www.nature.com/ngeo/journal/v7/n11/full/ngeo2283.html
# Hinsen, Why bitwise reproducibility matters, 2015.
- Distinction: replicability vs. reproducibility
- **Replicability** aims at "getting the same result from running the same program on the same data"
- **Reproducibility** aims at "verifying a result with similar but not identical methods and tools"
- Bitwise *replicability* is not unrealistic: "[It]’s a lot of work, but not a technical challenge. We know how to do it, but we are not (yet) willing to invest the effort to make it happen."
- In software testing, testing often means checking against an expected result, "at the bit level".
- "Most scientific programmers are unaware that [floating point arithmetics] is an approximation that they should understand and control. [...] Compiler writers and language specification authors take advantage of this ignorance and declare this step their business, profiting from the many optimization possibilities it offers."
- Gives a nice example `a+b+c=(a+b)+c=...` to illustrate why we don't have bitwise replicability.
- No call for bitwise replicability across platforms in *production*, "but it should be possible to reproduce one unique result identically on all platforms".
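Hinsen's associativity example can be checked in a few lines of Python (my illustration, not from the post):

```python
# Floating-point addition is not associative: (a + b) + c can differ
# from a + (b + c), so summation order changes results at the bit
# level. This is exactly why bitwise replicability is hard.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left == right)   # False with IEEE-754 doubles
print(left, right)     # the two sums differ in the last bit
```

The same effect appears at scale whenever a compiler, a vectorized library, or a parallel reduction reorders a sum.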
[Hinsen2015]: https://khinsen.wordpress.com/2015/01/07/why-bitwise-reproducibility-matters/
# Hinsen, Which mistakes do we actually make in scientific code?, 2017
Experiment:
> Have two scientists, or two teams of scientists, write code for the same
> task, described in plain English as it would appear in a paper, and then
> compare the results produced by the two programs.
Differences:
1. "discrepancies between the informal description for human readers and the executable implementation"
2. "typos in numerical constants and in variable names"
3. "off-by-one-or-two errors in loops and in array indices"
[Hinsen2017]: http://blog.khinsen.net/posts/2017/05/04/which-mistakes-do-we-actually-make-in-scientific-code/
# Hutton, Most computational hydrology is not reproducible, so is it really science?, 2016
## Original Paper
[Hutton2016]: http://onlinelibrary.wiley.com/doi/10.1002/2016WR019285/full
[Anel2016]: http://onlinelibrary.wiley.com/doi/10.1002/2016WR020190/full
[Hutton2017a]: http://onlinelibrary.wiley.com/doi/10.1002/2017WR020480/full
[Melsen2017]: http://onlinelibrary.wiley.com/doi/10.1002/2016WR020208/full
[Hutton2017b]: http://onlinelibrary.wiley.com/doi/10.1002/2017WR020476/full
# Irving, A Minimum Standard for Publishing Computational Results in the Weather and Climate Sciences, 2015
Central point is adding the following to the author guidelines of any
publication in the field:
> If computer code is central to any of the paper’s major conclusions, then the
> following is required as a minimum standard:
>
> 1. A statement describing whether (and where) that code is available and
> setting out any restrictions on accessibility.
>
> 2. A high-level description of the software used to execute that code
> (including citations for any academic papers written to describe that
> software).
>
> 3. A supplementary file outlining the precise version of the software
> packages and operating system used. This information should be presented in
> the following format: name, version number, release date, institution, and
> DOI or URL.
>
> 4. A supplementary log file for each major result (including key figures)
> listing all computational steps taken from the initial download/attainment of
> the data to the final result (i.e., the log files describe how the code and
> software were used to produce the major results).
- Provides a good and comprehensive overview of the meaning of reproducibility
  in the field.
- Stresses that this is only a starting point and a basic requirement (which
  would be "a big improvement of the current state of affairs").
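Item 3 of the proposed standard can be largely automated. A minimal Python sketch (mine, not from the paper; the package list and file name are illustrative):

```python
# Write a minimal software-environment log: interpreter version plus
# the version of each third-party package the analysis imports.
import sys
import platform
from importlib import metadata

packages = ["numpy"]  # list whatever your analysis actually imports

lines = [f"Python {platform.python_version()} ({sys.platform})"]
for name in packages:
    try:
        lines.append(f"{name} {metadata.version(name)}")
    except metadata.PackageNotFoundError:
        lines.append(f"{name} (not installed)")

with open("software_environment.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```

A real supplementary file would also record release dates and DOIs/URLs as the standard asks; those have to be filled in by hand or from package metadata where available.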
[Irving2015]: http://journals.ametsoc.org/doi/full/10.1175/BAMS-D-15-00010.1
# Irving, Data Management in the Ocean, Weather, and Climate Sciences, Data Provenance
An example of how to (semi-)automate adding full data-provenance information to
a netCDF file produced by a script.
Essentials:
- Include full call to the tool
- Include Git commit of the library
Main problem IMHO: This adds boilerplate to everything the user does and hence
will have a hard time sticking, at least for scripts in actual analyses. A good
thing for libraries, though.
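The pattern can be sketched in plain Python (illustrative: the function name is mine, and actually writing the entry into a netCDF `history` attribute would additionally need a netCDF library):

```python
# Build a provenance record for an output file: the exact command that
# was run plus the Git commit of the code, in the spirit of the netCDF
# "history" attribute convention.
import sys
import subprocess
from datetime import datetime, timezone

def provenance_entry():
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not in a git repo, or git not installed
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    command = " ".join([sys.executable] + sys.argv)
    return f"{timestamp}: {command} (git commit {commit})"

print(provenance_entry())
```

Putting this in a small shared helper, rather than in every script, is one way around the boilerplate problem noted above.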
[Irving_carpentry]: http://damienirving.github.io/capstone-oceanography/03-data-provenance.html
# Merali, ...ERROR ...why scientific programming does not compute, 2010
Relatively pessimistic and focused on (lack of) programming skills rather than
documentation and availability of code.
> "Bringing industrial software-development
> practices into the lab cannot come too soon"
[Merali2010]: https://www.nature.com/doifinder/10.1038/467775a
# Nature - Code share, 2014.
At Nature, they "want to encourage as much sharing as possible."
Points to `doi:10.5194/gmd-6-1233-2013` for a more rigorous standard.
Points to <http://www.nature.com/ngeo/journal/v7/n11/full/ngeo2283.html?foxtrotcallback=true> for a critical review of expectations towards code and data access.
[Nature_CodeShare]: https://www.nature.com/news/code-share-1.16232
# Sandve, Ten Simple Rules for Reproducible Computational Research, 2013
Very concise list!
> - **Rule 1:** For Every Result, Keep Track of How It Was Produced
> - **Rule 2:** Avoid Manual Data Manipulation Steps
> - **Rule 3:** Archive the Exact Versions of All External Programs Used
> - **Rule 4:** Version Control All Custom Scripts
> - **Rule 5:** Record All Intermediate Results, When Possible in Standardized
> Formats
> - **Rule 6:** For Analyses That Include Randomness, Note Underlying Random
> Seeds
> - **Rule 7:** Always Store Raw Data behind Plots
> - **Rule 8:** Generate Hierarchical Analysis Output, Allowing Layers of
> Increasing Detail to Be Inspected
> - **Rule 9:** Connect Textual Statements to Underlying Results
> - **Rule 10:** Provide Public Access to Scripts, Runs, and Results
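Rule 6 in particular costs almost nothing. A minimal Python illustration (mine, not from the paper):

```python
# Rule 6: fix and record the random seed so a stochastic analysis can
# be rerun exactly.
import random

SEED = 20131024  # record this value alongside the results
random.seed(SEED)

sample = [random.random() for _ in range(3)]

# Rerunning with the same seed reproduces the same draws:
random.seed(SEED)
assert sample == [random.random() for _ in range(3)]
```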
[Sandve2013]: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
# Wilson, Best Practices for Scientific Computing, 2012
> 1. Write programs for people, not computers.
> - (a) A program should not require its readers to hold more than a handful
> of facts in memory at once.
> - (b) Make names consistent, distinctive, and meaningful.
> - (c) Make code style and formatting consistent.
> 2. Let the computer do the work.
> - (a) Make the computer repeat tasks.
> - (b) Save recent commands in a file for re-use.
> - (c) Use a build tool to automate workflows.
> 3. Make incremental changes.
> - (a) Work in small steps with frequent feedback and course correction.
> - (b) Use a version control system.
> - (c) Put everything that has been created manually in version control.
> 4. Don’t repeat yourself (or others).
> - (a) Every piece of data must have a single authoritative representation
> in the system.
> - (b) Modularize code rather than copying and pasting.
> - (c) Re-use code instead of rewriting it.
> 5. Plan for mistakes.
> - (a) Add assertions to programs to check their operation.
> - (b) Use an off-the-shelf unit testing library.
> - (c) Turn bugs into test cases.
> - (d) Use a symbolic debugger.
> 6. Optimize software only after it works correctly.
> - (a) Use a profiler to identify bottlenecks.
> - (b) Write code in the highest-level language possible.
> 7. Document design and purpose, not mechanics.
> - (a) Document interfaces and reasons, not implementations.
> - (b) Refactor code in preference to explaining how it works.
> - (c) Embed the documentation for a piece of software in that software.
> 8. Collaborate.
> - (a) Use pre-merge code reviews.
> - (b) Use pair programming when bringing someone new up to speed and when
> tackling particularly tricky problems.
> - (c) Use an issue tracking tool.
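Point 5(c), turning bugs into test cases, in a minimal (made-up) Python sketch:

```python
# Once a bug is found (say, a mean that crashed on an empty list),
# encode it as a regression test so it cannot silently return.
def mean(values):
    if not values:  # the fixed edge case
        raise ValueError("mean() of empty sequence")
    return sum(values) / len(values)

def test_mean_of_known_values():
    assert mean([1.0, 2.0, 3.0]) == 2.0

def test_mean_rejects_empty_input():  # regression test for the old bug
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("empty input must raise ValueError")

test_mean_of_known_values()
test_mean_rejects_empty_input()
```

With an off-the-shelf test runner (point 5(b)), functions named `test_*` like these are collected and run automatically.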
[Wilson2012]: https://arxiv.org/abs/1210.0530
## References / Reading list
- [x] [ Barnes2010 ][Barnes2010]
- [x] [ Bhadrwaj2014 ][Bhadrwaj2014]
- [ ] [ Chavan2015 ][Chavan2015]
- [x] [ Easterbrook2014 ][Easterbrook2014]
- [x] [ Hinsen2015 ][Hinsen2015]
- [x] [ Hinsen2017 ][Hinsen2017]
- [x] [ Irving_carpentry ][Irving_carpentry]
- [x] [ Irving2015 ][Irving2015]
- [x] [ Merali2010 ][Merali2010]
- [x] [ MIAME ][MIAME]
- [x] [ MPI_good_scientific_practice ][MPI_good_scientific_practice]
- [x] [ Nature_CodeShare ][Nature_CodeShare]
- [x] [ Sandve2013 ][Sandve2013]
- [ ] [ Stodden2010 ][Stodden2010]
- [x] [ Wilson2012 ][Wilson2012]
- [ ] [ XSEDE2014_repro ][XSEDE2014_repro]
- [ ] [ Hutton2016 ][Hutton2016]
- [ ] [ Anel2016 ][Anel2016]
- [ ] [ Hutton2017a ][Hutton2017a]
- [ ] [ Melsen2017 ][Melsen2017]
- [ ] [ Hutton2017b ][Hutton2017b]
- [ ] [ Atmanspacher2016 ][Atmanspacher2016]
[Barnes2010]: https://www.nature.com/news/2010/101013/full/467753a.html
[Bhadrwaj2014]: https://arxiv.org/abs/1409.0798
[Chavan2015]: https://arxiv.org/abs/1506.04815
[Easterbrook2014]: http://www.nature.com/ngeo/journal/v7/n11/full/ngeo2283.html
[Hinsen2015]: https://khinsen.wordpress.com/2015/01/07/why-bitwise-reproducibility-matters/
[Hinsen2017]: http://blog.khinsen.net/posts/2017/05/04/which-mistakes-do-we-actually-make-in-scientific-code/
[Irving_carpentry]: http://damienirving.github.io/capstone-oceanography/03-data-provenance.html
[Irving2015]: http://journals.ametsoc.org/doi/full/10.1175/BAMS-D-15-00010.1
[Merali2010]: https://www.nature.com/doifinder/10.1038/467775a
[MIAME]: http://fged.org/projects/miame/
[MPI_good_scientific_practice]: http://www.mpimet.mpg.de/en/science/publications/good-scientific-practice.html
[Nature_CodeShare]: https://www.nature.com/news/code-share-1.16232
[Sandve2013]: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
[Stodden2010]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1550193
[Wilson2012]: https://arxiv.org/abs/1210.0530
[XSEDE2014_repro]: https://www.xsede.org/documents/659353/d90df1cb-62b5-47c7-9936-2de11113a40f
[Hutton2016]: http://onlinelibrary.wiley.com/doi/10.1002/2016WR019285/full
[Anel2016]: http://onlinelibrary.wiley.com/doi/10.1002/2016WR020190/full
[Hutton2017a]: http://onlinelibrary.wiley.com/doi/10.1002/2017WR020480/full
[Melsen2017]: http://onlinelibrary.wiley.com/doi/10.1002/2016WR020208/full
[Hutton2017b]: http://onlinelibrary.wiley.com/doi/10.1002/2017WR020476/full
---
# Towards _Efficient & Reproducible_ Science
[![pipeline status](https://gitlab.com/willirath/towards_reproducible_science/badges/master/pipeline.svg)](https://gitlab.com/willirath/towards_reproducible_science/commits/master)
Latest version of the slides: <https://willirath.gitlab.io/towards_reproducible_science/>
# Towards Climate-Data Analysis on Large Distributed Systems
> ## Abstract
>
> This presentation is about the every-day benefits of reproducible science:
> making the scientific workflow more efficient by facilitating communication
> and collaboration between individual scientists or among small groups. This
> is often overlooked as scientists are more and more forced to enact
> reproducibility by a growing public debate on the “reproducibility crisis”, by
> journals demanding data published along with the manuscript, or by funding
> agencies more vigorously enforcing open-data policies.
>
> After demonstrating how reproducibility is often undermined, the talk seeks
> to provide a simple framework for assessing the reproducibility of scientific
> workflows and to give an overview of existing building blocks for
> reproducible science at GEOMAR and beyond.
>
> This talks addresses scientists at any stage in their career and in any
> organisational role, and is also suited for students and technical support
> staff.
<!DOCTYPE html>
<html>
<head>
<title>Towards Efficient &amp; Reproducible Science | Willi Rath</title>
<title>Towards Climate-Data Analysis on Large Distributed Systems | Willi Rath</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
@import url(https://fonts.googleapis.com/css?family=Vollkorn);
</script>
<script type="text/javascript">
var slideshow = remark.create({
sourceUrl: 'towards_reproducible_science.md'
sourceUrl: 'climate-data-analysis-on-large-distributed-systems.md'
});
</script>
</body>
<!DOCTYPE html>
<html>
<head>
<title>Towards Efficient &amp; Reproducible Science | Willi Rath</title>
<title>Towards Climate-Data Analysis on Large Distributed Systems | Willi Rath</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
@import url(https://fonts.googleapis.com/css?family=Vollkorn);
</script>
<script type="text/javascript">
var slideshow = remark.create({
sourceUrl: 'towards_reproducible_science.md'
sourceUrl: 'climate-data-analysis-on-large-distributed-systems.md'
});
</script>
<script type="text/javascript">
Subject: Thank you for your order with RightsLink / Nature Publishing Group
From: <no-reply@copyright.com>
Date: 10/17/2017 08:23 PM
To: <wrath@geomar.de>
Copyright Clearance Center
Thank you for your order!
Dear Dr. Willi Rath,
Thank you for placing your order through Copyright Clearance Center’s RightsLink® service.
Order Summary
Licensee: Geomar
Order Date: Oct 17, 2017
Order Number: 4211480858139
Publication: Nature News
Title: 1,500 scientists lift the lid on reproducibility
Type of Use: post on a website
Order Total: 0.00 USD
View or print complete details of your order and the publisher's terms and conditions.
Sincerely,
Copyright Clearance Center
How was your experience? Fill out this survey to let us know.
Tel: +1-855-239-3415 / +1-978-646-2777
customercare@copyright.com
https://myaccount.copyright.com
Copyright Clearance Center
Subject: Thank you for your order with RightsLink / Nature Publishing Group
From: <no-reply@copyright.com>
Date: 10/17/2017 08:25 PM
To: <wrath@geomar.de>
Copyright Clearance Center
Thank you for your order!
Dear Dr. Willi Rath,
Thank you for placing your order through Copyright Clearance Center’s RightsLink® service.
Order Summary
Licensee: Geomar
Order Date: Oct 17, 2017
Order Number: 4211480970187
Publication: Nature Geoscience
Title: Open code for open science?
Type of Use: post on a website
Order Total: 0.00 USD
View or print complete details of your order and the publisher's terms and conditions.
Sincerely,
Copyright Clearance Center
How was your experience? Fill out this survey to let us know.
Tel: +1-855-239-3415 / +1-978-646-2777
customercare@copyright.com
https://myaccount.copyright.com
Copyright Clearance Center
CC0-licensed from <https://pixabay.com/en/checkout-retro-antique-590358/>
CC0-licensed from <https://pixabay.com/en/building-blocks-stones-colorful-1563961/>
CC0-licensed from <https://pixabay.com/en/cooking-ingredient-cuisine-kitchen-1013455/>
CC0-licensed from <https://pixabay.com/en/wintry-mountain-snow-snow-landscape-2068298/>
CC0-licensed from <https://www.pexels.com/photo/black-metal-tools-hanged-on-a-rack-near-table-162631/>
date,reproducibility crisis,replicability crisis,replication crisis,Sum: rep... crisis
2004-12-31,0.0,0.0,0.0,0.0
2005-12-31,0.0,0.0,13.25,13.25
2006-12-31,0.0,0.0,0.0,0.0
2007-12-31,0.0,0.0,0.0,0.0
2008-12-31,0.0,0.0,2.25,2.25
2009-12-31,0.0,0.0,5.166666666666667,5.166666666666667
2010-12-31,0.6666666666666666,0.5833333333333334,2.0,3.25
2011-12-31,0.4166666666666667,0.0,3.3333333333333335,3.75
2012-12-31,0.3333333333333333,0.0,4.333333333333333,4.666666666666666
2013-12-31,0.25,1.0,5.166666666666667,6.416666666666667
2014-12-31,1.3333333333333333,1.5833333333333333,7.0,9.916666666666666
2015-12-31,5.25,1.9166666666666667,7.833333333333333,15.0
2016-12-31,9.916666666666666,5.916666666666667,23.166666666666668,39.0
2017-12-31,13.2,7.0,31.3,51.5