Reproducible research & Docker

Discussion:

John Stanton-Geddes

2014-09-10 13:07:46 UTC

Hi Carl and rOpenSci,

Apologies for jumping in late here (and let me know if this should be asked
elsewhere or a new topic) but I've also recently discovered and become
intrigued by Docker for facilitating reproducible research.

My question: what's the advantage of Docker over an Amazon EC2 machine
image?

I've moved my analyses to EC2, which works better for me than my local
university cluster. Doesn't my machine image achieve Carl's acid test of
allowing others to build and extend on work? What do I gain by making a
Dockerfile on my already existing EC2 image? Being new to all this, the
only clear advantage I see is that a Dockerfile is much smaller than a
machine image, but this seems like a rather trivial concern in comparison
to the 100s of gigs of sequence data associated with my project.

thanks,
John

Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to support
the little guy. But as with anything, there is a huge diversity of AMIs and
greater discoverability on EC2, at least for now.

Hmm, looks like DO is planning on it, but not possible yet. Do go upvote
this feature
https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/3249642-share-an-image-w-another-account
Nice, we could work on this working privately, then when sharing is
available, boom.

Great idea. Yeah, should be possible. Does the DO API support a way to
launch a job on the instance, or otherwise a way to share a custom machine
image publicly? (e.g. the way Amazon EC2 lets you make an AMI public from
an S3 bucket?)
I suspect we can just droplets_new() with the ubuntu_docker image they
have, but that we would then need a wrapper to ssh into the DO machine and
execute the single command needed to bring up the RStudio instance in the
browser.

Carl,
Awesome, nice work.
Thoughts on whether we could wrap the docker workflow into my Digital
Ocean client so that a user never needs to leave R?
https://github.com/sckott/analogsea
Scott

Hi folks,
Just thought I'd share an update on this thread -- I've gotten RStudio
Server working in the ropensci-docker
<https://github.com/ropensci/docker-ubuntu-r/blob/master/add-r-ropensci/Dockerfile>
image.
docker run -d -p 8787:8787 cboettig/ropensci-docker
will make an RStudio server instance available to you in your browser
at localhost:8787. (Change the first number after the -p to use a
different port.) You can log in with username:pw rstudio:rstudio and
have fun.
One thing I like about this is the ease with which I can now get an
RStudio server up and running in the cloud (e.g. I took this for a sail on
DigitalOcean.com today). This means in a few minutes and for 1 penny you
have a URL that you and any collaborators could use to interact with R
using the familiar RStudio interface, already provisioned with your data
and dependencies in place.
To keep this brief-ish, I've restricted further commentary to my blog
notebook (today's post should be up shortly):
http://www.carlboettiger.info/lab-notebook.html
Cheers,
Carl

Thanks Rich! some further thoughts / questions below


Hi Carl,
Thanks for this!
I think that docker is always going to be for the "crazies", at least
in its current form. It requires running on Linux for starters -
I've got it running on a virtual machine on OSX via virtualbox, but
the amount of faffing about there is pretty intimidating. I believe
it's possible to get it running via vagrant (which is in theory going
to be easier to distribute) but at that point it's all getting a bit
silly. It's enlightening to ask a random ecologist to go to the
website for docker (or heroku or vagrant or chef or any of these
newfangled tools) and ask them to guess what they do. We're down a
rabbit hole here.

Completely agree here. Anything that cannot be installed by downloading and
clicking on something is dead in the water. It looks like Docker is just
download and click on Macs or Windows. (Haven't tested, I have only linux
boxes handy). So I'm not sure that the regular user needs to know that it's
running a linux virtual machine under the hood when they aren't on a linux
box.
So I'm optimistic that the installation faffing will largely go away, if it
hasn't yet. I'm more worried about the faffing after it is installed.

I've been getting drone (https://github.com/drone/drone ) up and
running here for one of our currently-closed projects. It uses docker
as a way of insulating the build/tests from the rest of the system,
but it's still far from ready to recommend for general use. The
advantages I see there are: our test suite can run for several hours
without worrying about running up against allowed times, and working
for projects that are not yet open source. It also simplifies getting
things off the container, but I think there are a bunch of ways of
doing that easily enough. However, I'm on the lookout for something
much simpler to set up, especially for local use and/or behind NAT. I
can post the dockerfile at some point (it's apparently not on this
computer!) but it's similarly simple to yours.

Very cool! Yeah, I think there's great promise that we'll see more
easy-to-use tools being built on docker. Is Drone ubuntu-only at the moment
then?

As I see it, the great advantage of all these types of approaches,
independent of the technology, is the recipe-based approach to
documenting dependencies. With travis, drone, docker, etc, you
document your dependencies and if it works for you it will probably
work for someone else.

Definitely. I guess this is the heart of the "DevOps" approach (at least
according to the BCE paper I linked -- they have nice examples that use these
tools, but also include case studies of big collaborative science projects
that do more-or-less the same thing with Makefiles).
I think the devil is still in the details though. One thing I like about
Docker is the versioned images. If you re-run my build scripts even 5 days
from now, you'll get a different image due to ubuntu repo updates, etc. But
it's easy to pull any of the earlier images and compare.
Contrast this to other approaches, where you're stuck with locking in
particular versions in the build script itself (a la packrat) or just hoping
the most recent version is good enough (a la CRAN).

I'm OK with this being nerd only for a bit, because (like travis etc)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry over into
less nerdy activities. One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds to work
with large data sets that are tedious to download: pull the raw data
as one RUN command, wrangle as another. Then a separate wrangle step
will reuse the intermediate container (I believe). This is sort of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do. There's some overlap here with make, but in a
way that would let you jump in at a point in the analysis in a fresh
environment.
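
A minimal sketch of the incremental-build idea above, under stated assumptions. Docker stores the result of each Dockerfile instruction as a cached layer, and a rebuild reuses cached layers up to the first instruction whose inputs changed. The data URL, the file paths, and the wrangle.R script are placeholders invented for illustration; only the base image name comes from this thread. The script below just writes the Dockerfile to a scratch directory and builds nothing.

```shell
# Write a hypothetical Dockerfile to a scratch directory (no build happens here).
workdir="$(mktemp -d)"
cat > "$workdir/Dockerfile" <<'EOF'
# Base image mentioned in this thread; it already provides R
FROM cboettig/ropensci-docker

# Step 1: the slow download (placeholder URL and path)
RUN mkdir -p /data && Rscript -e 'download.file("https://example.org/raw.csv", "/data/raw.csv")'

# Step 2: the wrangling script (hypothetical file in the build directory).
# Editing wrangle.R invalidates the cache from this line down only,
# so the download layer from step 1 is reused on rebuild.
COPY wrangle.R /analysis/wrangle.R
RUN Rscript /analysis/wrangle.R /data/raw.csv /data/clean.csv
EOF
echo "Dockerfile written to $workdir/Dockerfile"
```

With a wrangle.R placed next to the Dockerfile, running `docker build` on that directory a second time after editing only wrangle.R should skip the download step and rerun just the wrangling.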

Great point, hadn't thought about that.

I don't think that people will jump to using virtual environments for
the sake of it - there has to be some pay off. Isolating the build
from the rest of your machine or digging into a 5 year old project
probably does not have widespread appeal to non-desk types either!

Definitely agree with that. I'd like to hear more about your perspective on
CI tools though -- of course we love them, but do you think that CI has a
larger appeal to the average ecologist than other potential 'benefits'? I
think the tangible payoffs are: (Cribbing heavily from that Berkeley
1) For instructors: having students in a consistent and optimized
environment with little effort. That environment can become a resource
maintained and enhanced by a larger community.
2) For researchers: easier to scale to the cloud (assuming the tool is as
easy to use on the desktop as whatever they currently do -- clearly we're
not there yet).
3) Easier to get collaborators / readers to use & re-use. (I think that
only happens if lots of people are performing research and/or teaching using
these environments -- just like sharing code written in Go just isn't that
useful among ecologists. Clearly we may never get here.)

I think that the biggest potential draws are the CI-type tools, but
there are probably other tools that require isolation/virtualisation
that will appeal broadly. Then people will accidentally end up with
reproducible work :)
Cheers,
Rich

Hi rOpenSci list + friends [^1],
Yay, the ropensci-discuss list is revived!
Some of you might recall a discussion about reproducible research in the
comments of Rich et al's recent post on the rOpenSci blog, where quite a
few people mentioned the potential for Docker as a way to facilitate this.
I've only just started playing around with Docker, and though I'm quite
impressed, I'm still rather skeptical that non-crazies would ever use it
productively. Nevertheless, I've worked up some Dockerfiles to explore how
one might use this approach to transparently document and manage a
computational environment, and I was hoping to get some feedback from all
of you.
For those of you who are already much more familiar with Docker than me (or
are looking for an excuse to explore!), I'd love to get your feedback on
some of the particulars. For everyone, I'd be curious what you think about
the general concept.
So far I've created a dockerfile and image.
If you have docker up and running, perhaps you can give it a test:
docker run -it cboettig/ropensci-docker /bin/bash
You should find R installed with some common packages. This image builds on
Dirk Eddelbuettel's R docker images and serves as a starting point to test
individual R packages or projects.
For instance, my RNeXML manuscript draft is a bit more of a bear than usual
to run, since it needs rJava (requires external libs), Sxslt (only available
on Omegahat and requires extra libs) and latest phytools (a tar.gz file from
Liam's website), along with the usual mess of pandoc/latex environment to
compile the manuscript itself. By building on ropensci-docker, we need a
docker run -it cboettig/rnexml /bin/bash
Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This
will recompile the manuscript from cache and leave you to interactively
explore any of the R code shown.
Advantages / Goals
Being able to download a precompiled image means a user can run the code
without dependency hell (often not as much an R problem as it is in Python,
but nevertheless one that I hit frequently, particularly as my projects
age), and also without altering their personal R environment. Third (in
principle) this makes it easy to run the code on a cloud server, scaling the
computing resources appropriately.
I think the real acid test for this is not merely that it recreates the
results, but that others can build and extend on the work (with fewer rather
than more barriers than usual). I believe most of that has nothing to do
with this whole software image thing -- providing the methods you use as
general-purpose functions in an R package, or publishing the raw (&
processed) data to Dryad with good documentation will always make work more
modular and easier to re-use than cracking open someone's virtual machine.
But that is really a separate issue.
In this context, we look for an easy way to package up whatever a researcher
or group is already doing into something portable and extensible. So, is
this really portable and extensible?
This presupposes someone can run docker on their OS -- and from the command
line at that. Perhaps that's the biggest barrier to entry right now (though
given docker's virulent popularity, maybe something smart people with big
money might soon solve).
The only way to interact with the thing is through a bash shell running on
the container. An RStudio server might be much nicer, but I haven't been
able to get that running. Anyone know how to run RStudio server from docker?
(I tried & failed: https://github.com/mingfang/docker-druid/issues/2)
I don't see how users can move local files on and off the docker container.
In some ways this is a great virtue -- forcing all code to use fully
resolved paths like pulling data from Dryad instead of their hard-drive, and
pushing results to a (possibly private) online site to view them. But
obviously a barrier to entry. Is there a better way to do this?
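
One possible answer to the file-moving question, sketched under assumptions: the `docker cp` subcommand copies a file out of a container, running or stopped, onto the host. The helper name, the container name, and the paths below are hypothetical and chosen only for illustration.

```shell
# Copy one file from a container to the host.
# Usage: copy_from_container CONTAINER PATH_IN_CONTAINER DEST_ON_HOST
# CONTAINER is the name or ID shown by `docker ps -a`.
copy_from_container() {
  container="$1"
  path_in_container="$2"
  dest="$3"
  docker cp "$container":"$path_in_container" "$dest"
}
```

For example, `copy_from_container my-rnexml-run /manuscript.pdf .` would fetch a compiled manuscript into the current directory, assuming a container of that name exists and the file sits at that path.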
Alternative strategies
1) Docker is just one of many ways to do this (particularly if you're not
concerned about maximum performance speed), and quite probably not the
easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual
machine instead, built with Packer and run in Virtualbox, after their
experience proved that students were much more comfortable with the
mouse-driven installation and a pixel-identical environment to the
instructor's (see their excellent paper on this).
2) Will/should researchers be willing to work and develop in virtual
environments? In some cases, the virtual environment can be closely coupled
to the native one -- you use your own editors etc to do all the writing, and
then execute in the virtual environment (seems this is easier in the
docker/vagrant approach than in the BCE).
[^1]: friends cc'd above: We're reviving this ropensci-discuss list to chat
about various issues related to our packages, our goals, and more broad
scientific workflow issues. I'd encourage you to sign up for the list:
https://groups.google.com/forum/#!forum/ropensci-discuss
--
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/

--
You received this message because you are subscribed to the Google Groups "ropensci-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ropensci-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit https://groups.google.com/d/optout.


Carl Boettiger

2014-09-10 16:40:45 UTC


Hi John,

Nice to hear from you and thanks for joining the discussion. You ask
a key question that ties into a much more general discussion
about reproducibility and virtual machines. Below I try to summarize
what I think are the promising features of Docker. I don't think this
means it is *the solution*, but I do think it illustrates some very
useful steps forward to important issues in reproducibility and
virtualization. Remember, Docker is still a very young and rapidly
evolving platform.

1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1] , essentially pointing out that "you
can't install an image for every pipeline you want...". In contrast,
Docker containers are designed to work exactly like that -- reusable
building blocks you can link together with very little overhead in
disk space or computation. This more than anything else sets Docker
apart from the standard VM approach.

2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the
software necessary to re-create ("provision") the image. This has
many advantages: (a) The script is much smaller and more portable than
the image. (b) the script can be version managed (c) the script gives
a human readable (instead of binary) description of what software is
installed and how. This also avoids pitfalls of traditional
documentation of dependencies that may be too vague or out-of-sync.
(d) Other users can build on, modify, or extend the script for their
own needs. All of this is what we call the "DevOps" approach to
provisioning, and can be done with AMIs or other virtual machines
using tools like Ansible, Chef, or Puppet coupled with things like
Packer or Vagrant (or clever use of make and shell scripts).

For a much better overview of this "DevOps" approach in the
reproducible research context and a gentle introduction to these
tools, I highly recommend taking a look at Clark et al [2].
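
To make point 2 concrete, here is a minimal sketch of what such a provisioning script can look like when it extends an existing image. The base image is the one named in this thread; the added R package (ape) and the CRAN mirror are illustrative assumptions, not taken from any actual project Dockerfile. The script only writes the file and prints it.

```shell
# Write a small, hypothetical Dockerfile that extends an existing image.
proj="$(mktemp -d)"
cat > "$proj/Dockerfile" <<'EOF'
# Build on someone else's recipe instead of starting from a bare OS
FROM cboettig/ropensci-docker

# Add one more R package on top (package and mirror chosen for illustration)
RUN Rscript -e 'install.packages("ape", repos = "https://cloud.r-project.org")'
EOF
cat "$proj/Dockerfile"
```

The whole recipe is a few lines of text that can live in version control next to the analysis, which is the contrast with sharing a multi-gigabyte machine image.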

3) You can run the docker container *locally*. I think this is huge.
In my experience, most researchers do their primary development
locally. By running RStudio-server on your laptop, it isn't necessary
for me to spin up an EC2 instance (with all the knowledge & potential
cost that requires). By sharing directories between Docker and the
host OS, a user can still use everything they already know -- their
favorite editor, moving files around with the native OS
finder/browser, using all local configurations, etc, while still
having the code execution occur in the container where the software is
precisely specified and portable. Whenever you need more power, you
can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever
else you want your code to run. [On Mac & Windows, this uses
something called boot2docker, and was not very seamless early on. It
has gotten much better and continues to improve.]
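
A sketch of the local workflow in point 3, assuming the image starts RStudio Server on container port 8787 as described earlier in the thread. The `-v` flag shares a host directory with the container and `-p` publishes the port. The helper name and the in-container path /home/rstudio/work are assumptions made for this example.

```shell
# Launch RStudio Server in the background with a host directory shared in.
# Usage: start_rstudio HOST_DIR [HOST_PORT]
# HOST_DIR should be an absolute path; HOST_PORT defaults to 8787.
start_rstudio() {
  host_dir="$1"
  host_port="${2:-8787}"
  # -p HOST_PORT:CONTAINER_PORT publishes the server to the host;
  # -v HOST_DIR:CONTAINER_DIR makes the host files visible in the container.
  docker run -d -p "$host_port":8787 -v "$host_dir":/home/rstudio/work \
    cboettig/ropensci-docker
}
```

Calling `start_rstudio "$PWD"` and browsing to localhost:8787 should then show the project files under the assumed work directory, while editing continues with whatever tools are already on the host.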

4) Versioned images. In addition to version managing the Dockerfile,
the images themselves are versioned using a git-like hash system
(check out: docker commit, docker push/pull, docker history, docker
diff, etc). They have metadata specifying the date, author, parent
image, etc. We can roll back an image through the layers of history
of its construction, then build off an earlier layer. This also
allows docker to do all sorts of clever things, like avoiding
downloading redundant software layers from the docker hub. (If you
pull a bunch of images that all build on ubuntu, you don't get n
copies of ubuntu you have to download and store). Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which acts as a kind of CI for your environment.
Versioning and diffing images is a rather nice reproducibility
feature.
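
Two of the commands behind point 4, wrapped as small helpers. This is a sketch rather than a recommended workflow: the helper names and the dated-tag convention are assumptions made for this example, while `docker history` and `docker tag` are standard subcommands.

```shell
# List the layers an image was built from, newest first.
show_layers() {
  docker history "$1"
}

# Give an image an additional dated tag so that this exact state can be
# pushed and pulled again later. The date-stamp format is only a convention.
# Usage: snapshot_image IMAGE STAMP
snapshot_image() {
  image="$1"
  stamp="$2"
  docker tag "$image" "$image:$stamp"
}
```

After `snapshot_image cboettig/ropensci-docker 2014-09-10` and a `docker push`, a collaborator could pull that tagged image rather than whatever a fresh build produces next week.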

[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl

If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker

Cheers,

Carl

On Wed, Sep 10, 2014 at 6:07 AM, John Stanton-Geddes

Post by John Stanton-Geddes
Hi Carl and rOpenSci,
Apologies for jumping in late here (and let me know if this should be asked elsewhere or a new topic) but I've also recently discovered and become intrigued by Docker for facilitating reproducible research.
My question: what's the advantage of Docker over an amazon EC2 machine image?
I've moved my analyses to EC2 for better than my local university cluster. Doesn't my machine image achieve Carl's acid test of allowing others to build and extend on work? What do I gain by making a Dockerfile on my already existing EC2 image? Being new to all this, the only clear advantage I see is a Dockerfile is much smaller than a machine image, but this seems like a rather trivial concern in comparison to 100s of gigs of sequence data associated with my project.
thanks,
John

Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to support the little guy. But as with anything, there is a huge diversity of AMIs and greater discoverability on EC2, at least for now.

Hmm, looks like DO is planning on it, but not possible yet. Do go upvote this feature https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/3249642-share-an-image-w-another-account
Nice, we could work on this working privately, then when sharing is available, boom.

Great idea. Yeah, should be possible. Does the DO API support a way to launch a job on the instance, or otherwise a way to share a custom machine image publicly? (e.g. the way Amazon EC2 lets you make an AMI public from an S3 bucket?)
I suspect we can just droplets_new() with the ubuntu_docker image they have, but that we would then need a wrapper to ssh into the DO machine and execute the single command needed to bring up the RStudio instance in the browser.

Carl,
Awesome, nice work.
Thoughts on whether we could wrap the docker workflow into my Digital Ocean client so that a user never needs to leave R? https://github.com/sckott/analogsea
Scott

Hi folks,
Just thought I'd share an update on this thread -- I've gotten RStudio Server working in the ropensci-docker image.
docker -d -p 8787:8787 cboettig/ropensci-docker
will make an RStudio server instance available to you in your browser at localhost:8787. (Change the first number after the -p to have a different address). You can log in with username:pw rstudio:rstudio and have fun.
One thing I like about this is the ease with which I can now get an RStudio server up and running in the cloud (e.g. I took this for sail on DigitalOcean.com today). This means in few minutes and 1 penny you have a URL that you and any collaborators could use to interact with R using the familiar RStudio interface, already provisioned with your data and dependencies in place.
To keep this brief-ish, I've restricted further commentary to my blog notebook (today's post should be up shortly): http://www.carlboettiger.info/lab-notebook.html
Cheers,
Carl

Thanks Rich! some further thoughts / questions below

Hi Carl,
Thanks for this!
I think that docker is always going to be for the "crazies", at least
in it's current form. It requires running on Linux for starters -
I've got it running on a virtual machine on OSX via virtualbox, but
the amount of faffing about there is pretty intimidating. I believe
it's possible to get it running via vagrant (which is in theory going
to be easier to distribute) but at that point it's all getting a bit
silly. It's enlightening to ask a random ecologist to go to the
website for docker (or heroku or vagrant or chef or any of these
newfangled tools) and ask them to guess what they do. We're down a
rabbit hole here.

Completely agree here. Anything that cannot be installed by downloading and
clicking on something is dead in the water. It looks like Docker is just
download and click on Macs or Windows. (Haven't tested, I have only linux
boxes handy). So I'm not sure that the regular user needs to know that it's
running a linux virtual machine under the hood when they aren't on a linux
box.
So I'm optimistic think the installation faffing will largely go away, if it
hasn't yet. I'm more worried about the faffing after it is installed.

I've been getting drone (https://github.com/drone/drone ) up and
running here for one of our currently-closed projects. It uses docker
as a way of insulating the build/tests from the rest of the system,
but it's still far from ready to recommend for general use. The
advantages I see there are: our test suite can run for several hours
without worrying about running up against allowed times, and working
for projects that are not yet open source. It also simplifies getting
things off the container, but I think there are a bunch of ways of
doing that easily enough. However, I'm on the lookout for something
much simpler to set up, especially for local use and/or behind NAT. I
can post the dockerfile at some point (it's apparently not on this
computer!) but it's similarly simple to yours.

Very cool! Yeah, I think there's great promise that we'll see more
easy-to-use tools being built on docker. Is Drone ubuntu-only at the moment
then?

Definitely. I guess this is the heart of the "DevOpts" approach (at least
according the BCE paper I linked -- they have nice examples that use these
tools, but also include case studies of big collaborative science projects
that do more-or-less the same thing with Makefiles.
I think the devil is still in the details though. One thing I like about
Docker is the versioned images. If you re-run my build scripts even 5 days
from now, you'll get a different image due to ubuntu repo updates, etc. But
it's easy to pull any of the earlier images and compare.
Contrast this to other approaches, where you're stuck with locking in
particular versions in the build script itself (a la packrat) or just hoping
the most recent version is good enough (a la CRAN).

I'm OK with this being nerd only for a bit, because (like travis etc)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry over into
less nerdy activities. One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds to work
with large data sets that are tedious to download: pull the raw data
as one RUN command, wrangle as another. Then a separate wrangle step
will reuse the intermediate container (I believe). This is sort of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do. There's some overlap here with make, but in a
way that would let you jump in at a point in the analysis in a fresh
environment.

Great point, hadn't thought about that.

I don't think that people will jump to using virtual environments for
the sake of it - there has to be some pay off. Isolating the build
from the rest of your machine or digging into a 5 year old project
probably does not have widespread appeal to non-desk types either!

Definitely agree with that. I'd like to hear more about your perspective on
CI tools though -- of course we love them, but do you think that CI has a
larger appeal to the average ecologist than other potential 'benefits'? I
think the tangible payoffs are: (Cribbing heavily from that Berkeley
1) For instructors: having students in a consistent and optimized
environment with little effort. That environment can become a resource
maintained and enhanced by a larger community.
2) For researchers: easier to scale to the cloud (assuming the tool is as
easy to use on the desktop as whatever they currently do -- clearly we're
not there yet).
3) Easier to get collaborators / readers to use & re-use. (I think that
only happens if lots of people are performing research and/or teaching using
these environments -- just like sharing code written in Go just isn't that
useful among ecologists. Clearly we may never get here.)

Hi rOpenSci list + friends [^1],
Yay, the ropensci-discuss list is revived!
Some of you might recall a discussion about reproducible research in the
comments of Rich et al’s recent post on the rOpenSci blog where quite a
few
of people mentioned the potential for Docker as a way to facilitate
this.
I’ve only just started playing around with Docker, and though I’m quite
impressed, I’m still rather skeptical that non-crazies would ever use it
productively. Nevertheless, I’ve worked up some Dockerfiles to explore
how
one might use this approach to transparently document and manage a
computational environment, and I was hoping to get some feedback from
all of
you.
For those of you who are already much more familiar with Docker than me
(or
are looking for an excuse to explore!), I’d love to get your feedback on
some of the particulars. For everyone, I’d be curious what you think
about
the general concept.
So far I’ve created a dockerfile and image
docker run -it cboettig/ropensci-docker /bin/bash
You should find R installed with some common packages. This image builds
on
Dirk Eddelbuettel’s R docker images and serves as a starting point to
test
individual R packages or projects.
For instance, my RNeXML manuscript draft is a bit more of a bear then
usual
to run, since it needs rJava (requires external libs), Sxslt (only
available
on Omegahat and requires extra libs) and latest phytools (a tar.gz file
from
Liam’s website), along with the usual mess of pandoc/latex environment
to
compile the manuscript itself. By building on ropensci-docker, we need a
docker run -it cboettig/rnexml /bin/bash
Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This
will recompile the manuscript from cache and leave you to interactively
explore any of the R code shown.
Advantages / Goals
Being able to download a precompiled image means a user can run the code
without dependency hell (often not as much an R problem as it is in
Python,
but nevertheless one that I hit frequently, particularly as my projects
age), and also without altering their personal R environment. Third (in
principle) this makes it easy to run the code on a cloud server, scaling
the
computing resources appropriately.
I think the real acid test for this is not merely that it recreates the
results, but that others can build and extend on the work (with fewer
rather
than more barriers than usual). I believe most of that has nothing to do
with this whole software image thing — providing the methods you use as
general-purpose functions in an R package, or publishing the raw (&
processed) data to Dryad with good documentation will always make work
more
modular and easier to re-use than cracking open someone’s virtual
machine.
But that is really a separate issue.
In this context, we look for an easy way to package up whatever a
researcher
or group is already doing into something portable and extensible. So, is
this really portable and extensible?
This presupposes someone can run docker on their OS — and from the
command
line at that. Perhaps that’s the biggest barrier to entry right now,
(though
given docker’s virulent popularity, maybe something smart people with
big
money might soon solve).
The only way to interact with thing is through a bash shell running on
the
container. An RStudio server might be much nicer, but I haven’t been
able to
get that running. Anyone know how to run RStudio server from docker?
(I tried & failed: https://github.com/mingfang/docker-druid/issues/2)
I don’t see how users can move local files on and off the docker
container.
In some ways this is a great virtue — forcing all code to use fully
resolved
paths like pulling data from Dryad instead of their hard-drive, and
pushing
results to a (possibly private) online site to view them. But obviously
a
barrier to entry. Is there a better way to do this?
Alternative strategies
1) Docker is just one of many ways to do this (particularly if you’re
not
concerned about maximum performance speed), and quite probably not the
easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual
machine instead, built with Packer and run in Virtualbox, after their
experience proved that students were much more comfortable with the
mouse-driven installation and a pixel-identical environment to the
instructor’s (see their excellen paper on this).
2) Will/should researchers be willing to work and develop in virtual
environments? In some cases, the virtual environment can be closely
coupled
to the native one — you use your own editors etc to do all the writing,
and
then execute in the virtual environment (seems this is easier in
docker/vagrant approach than in the BCE.
[^1]: friends cc’d above: We’re reviving this ropensci-discuss list to
chat
about various issues related to our packages, our goals, and more broad
scientific workflow issues. I’d encourage you to sign up for the
https://groups.google.com/forum/#!forum/ropensci-discuss
—
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/

--
You received this message because you are subscribed to the Google Groups "ropensci-discuss" group.
For more options, visit https://groups.google.com/d/optout.


John Stanton-Geddes

2014-09-10 17:49:47 UTC


Thanks for clarifying, Carl. I'm now fairly convinced that Docker is worth
the extra cost (in time, etc) as it provides explicit instructions
('recipe') for the container.

My one residual concern, which is more practical/technological than (open
sci) philosophical, is that I still have to be using a system that I can
install Docker on to get Docker to work. This is relevant as I can't
(easily) install Docker on my 32-bit laptop as it's only supported for
64-bit. If I go through the (not always necessary) effort of spinning up an
AMI, I can access it through anything with ssh. The easy solution is to run
Docker on the AMI.

Titus also responded directly to me with the following:

the argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging
by the hostile reactions over the years ;) -- is that it doesn't really
matter *which* approach you choose, so much as whether or not the
approach you do choose permits understanding and remixing. So I would
argue that neither an AMI nor a fully-baked Docker image is sufficient;
what I really want is a *recipe*. In that sense the Docker community
seems to be doing a better job of setting cultural expectations than the
VM community: for Docker, typically you provide some sort of install
recipe for the whole thing, which is the recipe I'm looking for.
tl;dr? No technical advantage, but maybe different cultural expectations.
Hi John,
Nice to hear from you and thanks for joining the discussion. You ask
a very key question that ties into a much more general discussion
about reproducibility and virtual machines. Below I try and summarize
what I think are the promising features of Docker. I don't think this
means it is *the solution*, but I do think it illustrates some very
useful steps forward to important issues in reproducibility and
virtualization. Remember, Docker is still a very young and rapidly
evolving platform.
1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1] , essentially pointing out that "you
can't install an image for every pipeline you want...". In contrast,
Docker containers are designed to work exactly like that -- reusable
building blocks you can link together with very little overhead in
disk space or computation. This more than anything else sets Docker
apart from the standard VM approach.
2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the
software necessary to re-create ("provision") the image. This has
many advantages: (a) The script is much smaller and more portable than
the image. (b) the script can be version managed (c) the script gives
a human readable (instead of binary) description of what software is
installed and how. This also avoids pitfalls of traditional
documentation of dependencies that may be too vague or out-of-sync.
(d) Other users can build on, modify, or extend the script for their
own needs. All of this is what we call the "DevOps" approach to
provisioning, and can be done with AMIs or other virtual machines
using tools like Ansible, Chef, or Puppet coupled with things like
Packer or Vagrant (or clever use of make and shell scripts).
For a much better overview of this "DevOps" approach in the
reproducible research context and a gentle introduction to these
tools, I highly recommend taking a look at Clark et al [2].
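To make the "recipe" idea in point 2 concrete, here is a minimal sketch
in Python that writes out the text of a Dockerfile from a short list of
dependencies. The function name, the r-base base image, and the package
names are assumptions chosen for illustration, not the recipe used in
this thread; FROM and RUN are standard Dockerfile instructions.

```python
# Template for one R package install step (the CRAN mirror is an assumption).
R_INSTALL = (
    "RUN Rscript -e \"install.packages('{pkg}', "
    "repos='https://cloud.r-project.org')\""
)


def render_dockerfile(base_image, apt_packages, r_packages):
    """Return Dockerfile text that provisions an R environment.

    base_image, apt_packages and r_packages are illustrative inputs;
    swap in whatever your own analysis actually depends on.
    """
    lines = ["FROM {}".format(base_image)]
    if apt_packages:
        # One RUN step for all system libraries keeps the recipe short.
        lines.append(
            "RUN apt-get update && apt-get install -y "
            + " ".join(sorted(apt_packages))
        )
    for pkg in r_packages:
        lines.append(R_INSTALL.format(pkg=pkg))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("r-base", ["libxml2-dev"], ["knitr", "XML"]))
```

The resulting text is only a few lines long, can be kept under version
control next to the analysis code, and reads as a plain list of what gets
installed and in what order.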
3) You can run the docker container *locally*. I think this is huge.
In my experience, most researchers do their primary development
locally. By running RStudio-server on your laptop, it isn't necessary
for me to spin up an EC2 instance (with all the knowledge & potential
cost that requires). By sharing directories between Docker and the
host OS, a user can still use everything they already know -- their
favorite editor, moving files around with the native OS
finder/browser, using all local configurations, etc, while still
having the code execution occur in the container where the software is
precisely specified and portable. Whenever you need more power, you
can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever
else you want your code to run. [On Mac & Windows, this uses
something called boot2docker, and was not very seamless early on. It
has gotten much better and continues to improve.]
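As a rough sketch of the local workflow in point 3, the Python snippet
below assembles (but does not execute) a docker run command that shares
a host directory with the container and publishes the RStudio Server
port. The image name example/rstudio-image and the container path are
placeholders, not names taken from this thread; -d, -p and -v are
standard docker run options, and 8787 is RStudio Server's default port.

```python
import os


def rstudio_run_command(host_dir,
                        image="example/rstudio-image",  # placeholder name
                        host_port=8787,
                        container_dir="/home/rstudio/project"):
    """Build the argument list for a `docker run` that shares host_dir.

    Nothing is executed here; pass the list to subprocess.run() on a
    machine where Docker is installed.
    """
    host_dir = os.path.abspath(host_dir)
    return [
        "docker", "run", "-d",                   # run in the background
        "-p", "{}:8787".format(host_port),       # host port -> RStudio port
        "-v", "{}:{}".format(host_dir, container_dir),  # shared directory
        image,
    ]


if __name__ == "__main__":
    print(" ".join(rstudio_run_command(".")))
```

Files edited on the host under the shared directory are then visible
inside the container at the container path, so the editor and file
browser stay native while the code runs in the container.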
4) Versioned images. In addition to version managing the Dockerfile,
the images themselves are versioned using a git-like hash system
(check out: docker commit, docker push/pull, docker history, docker
diff, etc). They have metadata specifying the date, author, parent
image, etc. We can roll back an image through the layers of history
of its construction, then build off an earlier layer. This also
allows docker to do all sorts of clever things, like avoiding
downloading redundant software layers from the docker hub. (If you
pull a bunch of images that all build on ubuntu, you don't get n
copies of ubuntu you have to download and store). Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which acts as a kind of CI for your environment.
Versioning and diffing images is a rather nice reproducibility
feature.
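The layer sharing described in point 4 can be illustrated with a toy
model in Python. This is not Docker's actual ID scheme; it only assumes
that a layer is identified by its parent layer plus the build step that
produced it, so images that start from the same steps share those
layers. The image names and build steps are made up for the example.

```python
import hashlib


def layer_id(parent_id, instruction):
    """Derive a toy layer ID from the parent layer and the build step."""
    data = (parent_id + "\n" + instruction).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]


def build_layers(instructions):
    """Return the chain of layer IDs produced by a list of build steps."""
    ids, parent = [], ""
    for instruction in instructions:
        parent = layer_id(parent, instruction)
        ids.append(parent)
    return ids


# Three hypothetical images that all build on the same base step.
images = {
    "r-env": ["FROM ubuntu", "RUN install r"],
    "rstudio-env": ["FROM ubuntu", "RUN install r", "RUN install rstudio"],
    "python-env": ["FROM ubuntu", "RUN install python"],
}

layers = {name: build_layers(steps) for name, steps in images.items()}
stored = set().union(*layers.values())
total_steps = sum(len(steps) for steps in images.values())

print("{} build steps, {} unique layers stored".format(total_steps, len(stored)))
# prints "7 build steps, 4 unique layers stored"
```

In this model the ubuntu layer is stored once even though three images
use it, and "rstudio-env" reuses both layers of "r-env".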
[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl
If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker
Cheers,
Carl
On Wed, Sep 10, 2014 at 6:07 AM, John Stanton-Geddes

Hi Carl and rOpenSci,
Apologies for jumping in late here (and let me know if this should be
asked elsewhere or a new topic) but I've also recently discovered and
become intrigued by Docker for facilitating reproducible research.

My question: what's the advantage of Docker over an Amazon EC2 machine
image?

I've moved my analyses to EC2 for better than my local university
cluster. Doesn't my machine image achieve Carl's acid test of allowing
others to build and extend on work? What do I gain by making a Dockerfile
on my already existing EC2 image? Being new to all this, the only clear
advantage I see is a Dockerfile is much smaller than a machine image, but
this seems like a rather trivial concern in comparison to 100s of gigs of
sequence data associated with my project.

thanks,
John

Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to
support the little guy. But as with anything, there is a huge diversity of
AMIs and greater discoverability on EC2, at least for now.