Discussion:
Reproducible research & Docker
John Stanton-Geddes
2014-09-10 13:07:46 UTC
Hi Carl and rOpenSci,

Apologies for jumping in late here (and let me know if this should be asked
elsewhere or a new topic) but I've also recently discovered and become
intrigued by Docker for facilitating reproducible research.

My question: what's the advantage of Docker over an Amazon EC2 machine
image?

I've moved my analyses to EC2 for better performance than my local university
cluster. Doesn't my machine image achieve Carl's acid test of allowing others
to build and extend on the work? What do I gain by making a Dockerfile for my
already existing EC2 image? Being new to all this, the only clear
advantage I see is that a Dockerfile is much smaller than a machine image, but
that seems like a rather trivial concern in comparison to the 100s of gigs of
sequence data associated with my project.

thanks,
John
Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to support
the little guy. But as with anything, there is a huge diversity of AMIs and
greater discoverability on EC2, at least for now.
Hmm, looks like DO is planning on it, but not possible yet. Do go upvote
this feature
https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/3249642-share-an-image-w-another-account
Nice, we could work on this privately, then when sharing is
available, boom.
Great idea. Yeah, should be possible. Does the DO API support a way to
launch a job on the instance, or otherwise a way to share a custom machine
image publicly? (e.g. the way Amazon EC2 lets you make an AMI public from
an S3 bucket?)
I suspect we can just droplets_new() with the ubuntu_docker image they
have, but that we would then need a wrapper to ssh into the DO machine and
execute the single command needed to bring up the RStudio instance in the
browser.
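For what it's worth, that wrapper might not need to be much more than this
rough sketch (the droplet IP is a placeholder, and it assumes a droplet
already created from DO's ubuntu_docker image):

    # ssh into the droplet and start the RStudio container there
    ssh root@203.0.113.10 'docker run -d -p 8787:8787 cboettig/ropensci-docker'
    # then point a browser at http://203.0.113.10:8787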
Carl,
Awesome, nice work.
Thoughts on whether we could wrap the docker workflow into my Digital
Ocean client so that a user never needs to leave R?
https://github.com/sckott/analogsea
Scott
Hi folks,
Just thought I'd share an update on this thread -- I've gotten RStudio
Server working in the ropensci-docker
<https://github.com/ropensci/docker-ubuntu-r/blob/master/add-r-ropensci/Dockerfile>
image.
docker run -d -p 8787:8787 cboettig/ropensci-docker
will make an RStudio Server instance available to you in your browser
at localhost:8787. (Change the first number after the -p to serve it on a
different port.) You can log in with username:pw rstudio:rstudio and
have fun.
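For example, to serve it on a different host port (8080 here is just an
illustration):

    docker run -d -p 8080:8787 cboettig/ropensci-docker
    # then browse to localhost:8080 and log in as rstudio / rstudio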
One thing I like about this is the ease with which I can now get an
RStudio Server up and running in the cloud (e.g. I took this for a sail on
DigitalOcean.com today). This means that in a few minutes and for about a
penny you have a URL that you and any collaborators could use to interact
with R using the familiar RStudio interface, already provisioned with your
data and dependencies in place.
To keep this brief-ish, I've restricted further commentary to my blog
notebook (today's post should be up shortly):
http://www.carlboettiger.info/lab-notebook.html
Cheers,
Carl
Thanks Rich! some further thoughts / questions below
Hi Carl,
Thanks for this!
I think that docker is always going to be for the "crazies", at least
in its current form. It requires running on Linux for starters --
I've got it running on a virtual machine on OSX via virtualbox, but
the amount of faffing about there is pretty intimidating. I believe
it's possible to get it running via vagrant (which is in theory going
to be easier to distribute) but at that point it's all getting a bit
silly. It's enlightening to ask a random ecologist to go to the
website for docker (or heroku or vagrant or chef or any of these
newfangled tools) and ask them to guess what they do. We're down a
rabbit hole here.
Completely agree here. Anything that cannot be installed by downloading and
clicking on something is dead in the water. It looks like Docker is just
download-and-click on Macs or Windows. (Haven't tested; I have only Linux
boxes handy.) So I'm not sure that the regular user needs to know that it's
running a Linux virtual machine under the hood when they aren't on a Linux
box.
So I'm optimistic that the installation faffing will largely go away, if it
hasn't already. I'm more worried about the faffing after it is installed.
I've been getting drone (https://github.com/drone/drone) up and
running here for one of our currently-closed projects. It uses docker
as a way of insulating the build/tests from the rest of the system,
but it's still far from ready to recommend for general use. The
advantages I see there are: our test suite can run for several hours
without worrying about running up against allowed times, and working
for projects that are not yet open source. It also simplifies getting
things off the container, but I think there are a bunch of ways of
doing that easily enough. However, I'm on the lookout for something
much simpler to set up, especially for local use and/or behind NAT. I
can post the dockerfile at some point (it's apparently not on this
computer!) but it's similarly simple to yours.
Very cool! Yeah, I think there's great promise that we'll see more
easy-to-use tools being built on docker. Is Drone ubuntu-only at the moment
then?
As I see it, the great advantage of all these types of approaches,
independent of the technology, is the recipe-based approach to
documenting dependencies. With travis, drone, docker, etc., you
document your dependencies, and if it works for you it will probably
work for someone else.
Definitely. I guess this is the heart of the "DevOps" approach (at least
according to the BCE paper I linked -- they have nice examples that use these
tools, but also include case studies of big collaborative science projects
that do more-or-less the same thing with Makefiles).
I think the devil is still in the details though. One thing I like about
Docker is the versioned images. If you re-run my build scripts even 5 days
from now, you'll get a different image due to ubuntu repo updates, etc. But
it's easy to pull any of the earlier images and compare.
Contrast this to other approaches, where you're stuck with locking in
particular versions in the build script itself (a la packrat) or just hoping
the most recent version is good enough (a la CRAN).
I'm OK with this being nerd-only for a bit, because (like travis etc.)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry over into
less nerdy activities. One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds to work
with large data sets that are tedious to download: pull the raw data
as one RUN command, wrangle as another. Then a separate wrangle step
will reuse the intermediate container (I believe). This is sort of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do. There's some overlap here with make, but in a
way that would let you jump in at a point in the analysis in a fresh
environment.
Great point, hadn't thought about that.
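A rough sketch of what that layered build might look like, written as a
shell snippet that generates and builds a hypothetical Dockerfile (the data
URL and the wrangle.R script are placeholders, not from any actual project):

    cat > Dockerfile <<'EOF'
    FROM cboettig/ropensci-docker
    # layer 1: pull the raw data; slow, but cached after the first successful build
    RUN mkdir -p /data && \
        Rscript -e 'download.file("http://example.org/raw.csv.gz", "/data/raw.csv.gz")'
    # layer 2: wrangle; editing wrangle.R invalidates only the layers from here on,
    # so the cached download above is reused
    COPY wrangle.R /data/wrangle.R
    RUN Rscript /data/wrangle.R
    EOF
    docker build -t myproject/wrangled-data .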
I don't think that people will jump to using virtual environments for
the sake of it - there has to be some pay off. Isolating the build
from the rest of your machine or digging into a 5 year old project
probably does not have widespread appeal to non-desk types either!
Definitely agree with that. I'd like to hear more about your perspective on
CI tools though -- of course we love them, but do you think that CI has a
larger appeal to the average ecologist than other potential 'benefits'? I
think the tangible payoffs are (cribbing heavily from that Berkeley paper):
1) For instructors: having students in a consistent and optimized
environment with little effort. That environment can become a resource
maintained and enhanced by a larger community.
2) For researchers: easier to scale to the cloud (assuming the tool is as
easy to use on the desktop as whatever they currently do -- clearly we're
not there yet).
3) Easier to get collaborators / readers to use & re-use. (I think that
only happens if lots of people are performing research and/or teaching using
these environments -- just like sharing code written in Go just isn't that
useful among ecologists. Clearly we may never get here.)
I think that the biggest potential draws are the CI-type tools, but
there are probably other tools that require isolation/virtualisation
that will appeal broadly. Then people will accidentally end up with
reproducible work :)
Cheers,
Rich
Hi rOpenSci list + friends [^1],
Yay, the ropensci-discuss list is revived!
Some of you might recall a discussion about reproducible research in the
comments of Rich et al’s recent post on the rOpenSci blog, where quite a
few people mentioned the potential for Docker as a way to facilitate this.
I’ve only just started playing around with Docker, and though I’m quite
impressed, I’m still rather skeptical that non-crazies would ever use it
productively. Nevertheless, I’ve worked up some Dockerfiles to explore how
one might use this approach to transparently document and manage a
computational environment, and I was hoping to get some feedback from all
of you.
For those of you who are already much more familiar with Docker than me (or
are looking for an excuse to explore!), I’d love to get your feedback on
some of the particulars. For everyone, I’d be curious what you think about
the general concept.
So far I’ve created a dockerfile and image.
If you have docker up and running, perhaps you can give it a test:
docker run -it cboettig/ropensci-docker /bin/bash
You should find R installed with some common packages. This image builds on
Dirk Eddelbuettel’s R docker images and serves as a starting point to test
individual R packages or projects.
For instance, my RNeXML manuscript draft is a bit more of a bear than usual
to run, since it needs rJava (requires external libs), Sxslt (only available
on Omegahat and requires extra libs) and the latest phytools (a tar.gz file
from Liam’s website), along with the usual mess of pandoc/latex environment
to compile the manuscript itself. By building on ropensci-docker, we need a
comparatively small Dockerfile to add just these extra pieces. To try it:
docker run -it cboettig/rnexml /bin/bash
Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This
will recompile the manuscript from cache and leave you to interactively
explore any of the R code shown.
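(For reference, a non-interactive variant of the same thing, assuming
manuscript.Rmd sits in the image's default working directory, which may not
hold, would be:

    docker run cboettig/rnexml Rscript -e 'rmarkdown::render("manuscript.Rmd")'

but the interactive route above is better for exploring.)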
Advantages / Goals
Being able to download a precompiled image means a user can run the code
without dependency hell (often not as much an R problem as it is in Python,
but nevertheless one that I hit frequently, particularly as my projects
age), and also without altering their personal R environment. Third (in
principle) this makes it easy to run the code on a cloud server, scaling
the computing resources appropriately.
I think the real acid test for this is not merely that it recreates the
results, but that others can build and extend on the work (with fewer rather
than more barriers than usual). I believe most of that has nothing to do
with this whole software image thing — providing the methods you use as
general-purpose functions in an R package, or publishing the raw (&
processed) data to Dryad with good documentation, will always make work more
modular and easier to re-use than cracking open someone’s virtual machine.
But that is really a separate issue.
In this context, we look for an easy way to package up whatever a researcher
or group is already doing into something portable and extensible. So, is
this really portable and extensible?
This presupposes someone can run docker on their OS — and from the command
line at that. Perhaps that’s the biggest barrier to entry right now (though
given docker’s virulent popularity, maybe that’s something smart people with
big money might soon solve).
The only way to interact with the thing is through a bash shell running on
the container. An RStudio server might be much nicer, but I haven’t been
able to get that running. Anyone know how to run RStudio server from docker?
(I tried & failed: https://github.com/mingfang/docker-druid/issues/2)
I don’t see how users can move local files on and off the docker container.
In some ways this is a great virtue — forcing all code to use fully resolved
paths, like pulling data from Dryad instead of their hard drive, and pushing
results to a (possibly private) online site to view them. But it is
obviously a barrier to entry. Is there a better way to do this?
Alternative strategies
1) Docker is just one of many ways to do this (particularly if you’re not
concerned about maximum performance speed), and quite probably not the
easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual
machine instead, built with Packer and run in Virtualbox, after their
experience proved that students were much more comfortable with the
mouse-driven installation and a pixel-identical environment to the
instructor’s (see their excellent paper on this).
2) Will/should researchers be willing to work and develop in virtual
environments? In some cases, the virtual environment can be closely coupled
to the native one — you use your own editors etc. to do all the writing, and
then execute in the virtual environment (this seems easier in the
docker/vagrant approach than in the BCE).
[^1]: friends cc’d above: We’re reviving this ropensci-discuss list to chat
about various issues related to our packages, our goals, and broader
scientific workflow issues. I’d encourage you to sign up for the list:
https://groups.google.com/forum/#!forum/ropensci-discuss
—
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
Carl Boettiger
2014-09-10 16:40:45 UTC
Hi John,

Nice to hear from you and thanks for joining the discussion. You ask
a key question that ties into a much more general discussion
about reproducibility and virtual machines. Below I try to summarize
what I think are the promising features of Docker. I don't think this
means it is *the solution*, but I do think it illustrates some very
useful steps forward on important issues in reproducibility and
virtualization. Remember, Docker is still a very young and rapidly
evolving platform.

1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1] , essentially pointing out that "you
can't install an image for every pipeline you want...". In contrast,
Docker containers are designed to work exactly like that -- reusable
building blocks you can link together with very little overhead in
disk space or computation. This more than anything else sets Docker
apart from the standard VM approach.

2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the
software necessary to re-create ("provision") the image. This has
many advantages: (a) the script is much smaller and more portable than
the image; (b) the script can be version managed; (c) the script gives
a human-readable (instead of binary) description of what software is
installed and how, which also avoids the pitfalls of traditional
documentation of dependencies that may be too vague or out-of-sync;
(d) other users can build on, modify, or extend the script for their
own needs. All of this is what we call the "DevOps" approach to
provisioning, and it can be done with AMIs or other virtual machines
using tools like Ansible, Chef, or Puppet coupled with things like
Packer or Vagrant (or clever use of make and shell scripts).
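As a minimal sketch of what such a recipe looks like in practice (the
package choices below are arbitrary illustrations, not taken from the actual
ropensci Dockerfiles):

    # a Dockerfile is just a short, human-readable text file...
    cat > Dockerfile <<'EOF'
    FROM ubuntu:14.04
    RUN apt-get update && apt-get install -y r-base-dev pandoc
    RUN Rscript -e 'install.packages("digest", repos = "http://cran.r-project.org")'
    EOF
    # ...that can live under version control and be rebuilt into an image anywhere
    docker build -t myuser/myproject .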

For a much better overview of this "DevOps" approach in the
reproducible research context and a gentle introduction to these
tools, I highly recommend taking a look at Clark et al. [2].

3) You can run the docker container *locally*. I think this is huge.
In my experience, most researchers do their primary development
locally. By running RStudio Server in a container on your laptop, it isn't
necessary to spin up an EC2 instance (with all the knowledge & potential
cost that requires). By sharing directories between Docker and the
host OS, a user can still use everything they already know -- their
favorite editor, moving files around with the native OS
finder/browser, using all local configurations, etc. -- while still
having the code execution occur in the container where the software is
precisely specified and portable. Whenever you need more power, you
can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever
else you want your code to run. [On Mac & Windows, this uses
something called boot2docker, and was not very seamless early on. It
has gotten much better and continues to improve.]
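A sketch of the directory-sharing idea (assuming the rstudio user's home
inside the image is /home/rstudio; adjust the paths to taste):

    # mount the current project directory into the container, so files edited
    # with your local tools are immediately visible to code running inside
    docker run -d -p 8787:8787 -v "$(pwd)":/home/rstudio cboettig/ropensci-docker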

4) Versioned images. In addition to version managing the Dockerfile,
the images themselves are versioned using a git-like hash system
(check out: docker commit, docker push/pull, docker history, docker
diff, etc). They have metadata specifying the date, author, parent
image, etc. We can roll back an image through the layers of history
of its construction, then build off an earlier layer. This also
allows docker to do all sorts of clever things, like avoiding
downloading redundant software layers from the docker hub. (If you
pull a bunch of images that all build on ubuntu, you don't get n
copies of ubuntu you have to download and store). Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which acts as a kind of CI for your environment.
Versioning and diffing images is a rather nice reproducibility
feature.
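For example (the container and tag names here are placeholders):

    docker history cboettig/ropensci-docker     # how each layer of the image was built
    docker diff my_container                    # files changed since the container started
    docker commit my_container myuser/snapshot  # save the current state as a new image
    docker pull cboettig/ropensci-docker        # fetches only the layers you don't have yet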

[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl

If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker


Cheers,

Carl



John Stanton-Geddes
2014-09-10 17:49:47 UTC
Thanks for clarifying, Carl. I'm now fairly convinced that Docker is worth
the extra cost (in time, etc.) as it provides explicit instructions (a
'recipe') for the container.

My one residual concern, which is more practical/technological than (open
sci) philosophical, is that I still have to be using a system that I can
install Docker on to get Docker to work. This is relevant because I can't
(easily) install Docker on my 32-bit laptop, as it's only supported on
64-bit. If I go through the (not always necessary) effort of spinning up an
AMI, I can access it through anything with ssh. The easy solution is to run
Docker on the AMI.

Titus also responded directly to me with the following:

the argument here -- http://ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging by the
hostile reactions over the years ;) -- is that it doesn't really matter
*which* approach you choose, so much as whether or not the approach you do
choose permits understanding and remixing. So I would argue that neither an
AMI nor a fully-baked Docker image is sufficient; what I really want is a
*recipe*. In that sense the Docker community seems to be doing a better job
of setting cultural expectations than the VM community: for Docker,
typically you provide some sort of install recipe for the whole thing, which
is the recipe I'm looking for.

tl;dr? No technical advantage, but maybe different cultural expectations.
Carl Boettiger
2014-09-10 19:14:01 UTC
Permalink
John,

Thanks again for your input. Yeah, lack of support for 32-bit hosts is a
problem; though since you were speaking about AMIs I imagine you were
already used to not working locally, so you can of course try it out on an
Amazon image or DigitalOcean droplet.

Yeah, Titus makes a great point. If we only distributed docker images as 2
GB binary tar files, we'd not be doing much better on the open / remixable
side than a binary VM image. And docker isn't the only way to provide this
kind of script, as I mentioned earlier.

Nevertheless, I believe there is a technical difference and not just a
cultural one. Docker is not a virtual machine; containers are designed
expressly to be remixable blocks. You can put an R engine in one container
and a MySQL database in another and connect them. The Docker philosophy aims
at one function per container to maximize this reuse. Of course it's up to
you to build this way rather than as a single monolithic Dockerfile, but the
idea of linking containers is a technical concept at the heart of docker
that offers a second and very different way to address the 'remix' problem
of VMs.
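
To make that concrete, here is a minimal sketch of linking two containers
(the container names are just placeholders, and the MySQL side is only for
illustration):

docker run -d --name mydb -e MYSQL_ROOT_PASSWORD=secret mysql
docker run -it --link mydb:db cboettig/ropensci-docker /bin/bash
# inside the R container the database is reachable under the alias 'db';
# docker adds an /etc/hosts entry and DB_PORT_* environment variables for the link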

---
Carl Boettiger
http://carlboettiger.info

sent from mobile device; my apologies for any terseness or typos
Carsten Behring
2014-09-27 22:22:50 UTC
Permalink
Hi everybody,

Some of you mentioned in previous posts that the hurdle of using docker is
still too high for a lot of researchers.

One way to lower the hurdle is to use the cloud.

So instead of asking collaborators to install and use docker on their PCs
in order to reproduce something, we do one of two things:

1. We install our docker image to share in the "cloud" (DigitalOcean /
Amazon EC2 / others), with the help of a little automatic tool that uses the
cloud services' APIs. Then we only send the IP address and port to others,
and they can start immediately in the RStudio browser environment. In this
case the distributor pays for the cloud service.

2. We send a collaborator the name of our image (and the registry) and they
use the same little tool to generate a cloud server containing the RStudio
environment.
-> There are costs involved with the cloud provider.

In both cases the same tool gets as input:

- the docker image name (based on RStudio, containing all the data / R code
  of the analysis)
- cloud provider credentials (for billing ...)

and it returns:

- the IP address and port of RStudio, ready to use
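
A purely hypothetical invocation of such a tool (the tool name, flags, and
output here are made up just to illustrate the intended user experience)
might look like:

deploy-rstudio --image mgymrek/docker-reproducibility-example --provider digitalocean --token $DO_TOKEN
# prints something like: RStudio ready at http://<droplet-ip>:49000 (log in with rstudio / rstudio)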

I did a proof of concept for this with digitalocean and the docker
image mgymrek/docker-reproducibility-example.


With a simple call to the DigitalOcean API, like this:

(create-droplet "my token" {:name "core1" :region "ams3" :size "512mb"
                            :image 6373176 :ssh_keys [42550] :user_data user-data})

where the "user-data" contains some info telling the CoreOS operating system
to start a certain docker image on boot:

#cloud-config

coreos:
  units:
    - name: docker-rs.service
      command: start
      content: |
        [Unit]
        Description=RStudio service container
        Author=Me
        After=docker.service

        [Service]
        Restart=always
        ExecStart=/usr/bin/docker run -p 49000:8787 --name "rstudio" mgymrek/docker-reproducibility-example
        ExecStop=/usr/bin/docker stop rstudio


and, voila, on boot it starts RStudio on a new DigitalOcean server, ready to
use.

It should work the same way for Amazon EC2 or others, so the tool could
allow the user to select the cloud provider.


I am pretty sure that a similar tool could also be built to do the same for
an installation on a local PC (Windows, Linux, OS X).

I will start some development / info
here: https://github.com/behrica/ropensciCloud

The big added value of docker compared to classical virtual machines is that
it solves the distribution problem for the images. By just specifying the
image name, tag, and registry (if Docker Hub is not used), each docker
client knows how to get the image.
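
For example, any docker client can fetch the same image with a single pull
(the second line shows what a hypothetical self-hosted registry would look
like):

docker pull mgymrek/docker-reproducibility-example
docker pull registry.ropensci.org:5000/ropensci-docker   # hypothetical private registry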


By using common base images, it would even be very fast to download the
images (after the first download has happened).

Maybe ropensci could host its own image registry somewhere...



Carsten
Carl Boettiger
2014-09-28 21:49:16 UTC
Permalink
Carsten,

Thanks for joining the discussion and sharing your experiences, it's
really nice to see how others are using these approaches.

I agree entirely that cloud platforms like DigitalOcean give a really
nice user experience coupled with docker and RStudio-server.
Certainly rOpenSci could host its own hub, but the Docker Hub works
rather nicely already, so I'm not sure what the advantage of that would
be. I also agree that docker has many advantages for reproducible
research, though it is worth noting that other 'classical' virtual
machines can offer a very similar solution to the distribution problem
you describe -- e.g. Vagrant Hub.

Nonetheless, I still see running docker locally as an important part
of the equation. In my experience, most researchers still do some
(most/all) of their development locally. If we are to have a
containerized, portable, reproducible development environment, that
means running docker locally (as well as in the cloud).
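As a rough sketch of what I mean by local use (the host path and the mount
point inside the container below are purely illustrative, not a convention
of the image):

  # run the RStudio image locally, sharing a project directory between the
  # host and the container; RStudio is then available at localhost:8787
  docker run -d -p 8787:8787 -v /path/to/my/project:/home/rstudio cboettig/ropensci-docker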

The reproducible research angle has the most to gain when people can
both build on existing Dockerfiles / docker images, as you mention,
but also when they can write their own Dockerfiles particular to their
packages. I don't think that's too high a barrier -- it's easier than
writing a successful .travis.yml or other CI file for sure -- but it
does mean being able to do more than just access an RStudio server
instance that just happens to be running in a docker container.
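For a particular package or analysis, the whole cycle is not much more than
the following (the image name is just a placeholder for whatever you would
call your own build):

  # build an image from the Dockerfile in the current directory, try it out
  # locally, then share it via the Docker Hub (assumes a prior docker login)
  docker build -t myname/myanalysis .
  docker run -d -p 8787:8787 myname/myanalysis
  docker push myname/myanalysis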

On a linux machine, this kind of containerized workflow is remarkably
seamless. On Mac and Windows, I think boot2docker has come a long way in
making this easier, but it's not quite there yet.

To that end, we're still doing lots of work on defining some useful
base Dockerfiles and images that others could build from, including
the rstudio example. Current development is at
https://github.com/eddelbuettel/rocker

Thanks for sharing the API call example, that's awesome! (btw, Scott
has an R package in the works for the digitalocean API that may also
be of interest: https://github.com/sckott/analogsea).




On Sat, Sep 27, 2014 at 3:22 PM, Carsten Behring wrote:
Hi everybody,
some of you mentioned in the previous posts that the hurdle of using docker
is still too high for a lot of researchers.
One way to lower the hurdle is to use the cloud.
So instead of asking collaborators to install and use docker on their PCs,
we could do one of the following:
1. We install (with the help of a little automatic tool using the cloud
services' APIs) our docker image to share in the "cloud" (digitalocean /
Amazon EC2 / others).
Then we only send the ip address and port to others and they can start
immediately in the RStudio browser environment.
In this case the distributor pays ... for the cloud service.
2. We send a collaborator the name of our image (and the registry) and they
use the same little tool to generate a cloud server containing the RStudio
environment.
-> There are costs involved for the cloud providers.
What's involved in both cases:
- docker image name (based on RStudio, containing all data / R code of the
analysis)
- cloud provider credentials (for billing ...)
- IP address and port of RStudio, ready to use
I did a proof of concept for this with digitalocean and the docker image
mgymrek/docker-reproducibility-example:

  (create-droplet "my token" {:name "core1" :region "ams3" :size "512mb"
                              :image 6373176 :ssh_keys [42550] :user_data user-data})

where the "user-data" contains some info for the CoreOS operating system to
start the RStudio container on boot:

  #cloud-config
  - name: docker-rs.service
    command: start
    content: |
      [Unit]
      Description=RStudio service container
      Author=Me
      After=docker.service
      [Service]
      Restart=always
      ExecStart=/usr/bin/docker run -p 49000:8787 --name "rstudio" mgymrek/docker-reproducibility-example
      ExecStop=/usr/bin/docker stop rstudio

and, voila, on boot it starts RStudio on a new digitalocean server, ready to
use.
It should work the same way for Amazon EC2 or others, so the tool could
allow selecting the cloud provider.
I am pretty sure that a similar tool could be built which does the same for
an installation on a local PC (Windows, Linux, OS X).
https://github.com/behrica/ropensciCloud
The big added value of docker compared to classical virtual machines is
that it solves the distribution problem for the images.
By just specifying "image name", "tag" and "registry" (if the docker hub is
not used), each docker client knows how to get the image.
By using common base images it is even very fast to download the images
(after the first download has happened).
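For example, a collaborator needs nothing more than something like the
following (the private registry host is made up; without a host prefix the
image comes from the docker hub):

  docker pull mgymrek/docker-reproducibility-example
  docker pull myregistry.example.org:5000/mylab/my-analysis:v1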
Maybe ropensci could host its own image registry somewhere...
Carsten
--
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
Carsten Behring
2014-09-29 08:12:09 UTC
Permalink
Dear Carl,

thanks for your support and comments.

I would like to reply to some of your points.

In a lot of organisations, getting Linux, or even docker on Windows, is
impossible to achieve. Security concerns mean a lot of admins will not touch
the corporate PCs to install "strange" applications.
This will get worse in the coming years, in my view.
That is why I would also like to envision a reproducible research workflow
which is completely independent of any locally installed software and needs
only a web browser.

I believe that in certain organisations it is easier to get a credit card
and budget to pay for cloud computing services than to get "special"
virtualisation software like VMware/Docker onto the user's standard
corporate Windows PC.
This means we need to come to solutions which include "RStudio in the
cloud" as one possible computing engine.
The same argument is true for getting access to fast hardware with a lot of
memory.

Regarding the usage of the Docker Hub:

The Docker Hub is for sure the best place for the kind of base images and
their Dockerfiles you are working on.

I was thinking that we could envision "each individual study / analysis"
being published as a docker image.
In that case, I would say the Docker Hub is not the right place to store all
of those.
It would be nice to have a specific registry just for sharing "docker images
of statistical analyses", which could offer different features for searching
and so on.

So my ideal scenario would be this:

1. There would be one (or even several) docker registries dedicated to
"RStudio-based docker images containing individual analysis projects".
2. Having the images there means a user with a local docker installation
can use them as usual with their local Docker client.
3. A user without a docker installation can "press a button" and
automatically get a new cloud server (DigitalOcean, Amazon EC2, others)
containing the RStudio-based image, by giving their authentication data (so
they pay for it). They can then log in immediately and look at (and change)
the analysis directly in the cloud.

What would be missing is that in case 3) the user cannot easily republish
their changes as a new Docker image. But this is solvable. It would need an
R package which can interact with the running cloud server (over ssh ...)
and re-create and re-publish a new version of the image on request of the
user.
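Roughly, such a package would only have to drive something like the
following over ssh (the droplet address, container name and image name are
just examples):

  # commit the current state of the running "rstudio" container as a new
  # image version, then push it back to a registry
  ssh root@<droplet-ip> 'docker commit rstudio mylab/my-analysis:v2 && docker push mylab/my-analysis:v2'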

So in this scenario, setting up a "docker hub"-like registry for
RStudio-based reproducible research images somewhere would be the starting
point. The code is here: https://github.com/docker/docker-registry
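Running such a registry could be as simple as starting the docker-registry
image on some server and pushing tagged images to it (the host name and
image names below are only placeholders):

  # run the registry itself as a container, listening on port 5000
  docker run -d -p 5000:5000 registry
  # tag an analysis image with the registry address and push it there
  docker tag mylab/my-analysis myregistry.example.org:5000/mylab/my-analysis
  docker push myregistry.example.org:5000/mylab/my-analysis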

Please provide me with any comments you might have.
Post by Carl Boettiger
Casten,
Thanks for joining the discussion and sharing your experiences, it's
really nice to see how others are using these approaches.
I agree entirely that cloud platforms like digitalocean give a really
nice user experience coupled with docker and RStudio-server.
Certainly ropensci could host it's own hub, but the Docker Hub works
rather nicely already so I'm not sure what the advantage of that might
be? I also agree that docker has many advantages for reproducible
research, though it is worth noting that other 'classical' virtual
machines can offer a very similar solution to the distribution problem
you describe -- e.g. Vagrant Hub.
Nonetheless, I still see running docker locally as an important part
of the equation. In my experience, most researchers still do some
(most/all) of their development locally. If we are to have a
containerized, portable, reproducible development environment, that
means running docker locally (as well as in the cloud).
The reproducible research angle has the most to gain when people can
both build on existing Dockerfiles / docker images, as you mention,
but also when they can write their own Dockerfiles particular to their
packages. I don't think that's too high a barrier -- it's easier than
writing a successful .travis.yml or other CI file for sure -- but it
does mean being able to do more than just access an RStudio server
instance that just happens to be running in a docker container.
On a linux machine, this kind of containerized workflow is remarkably
seamless. I think boot2docker has come a long way in making this
easier, but not there yet.
To that end, we're still doing lots of work on defining some useful
base Dockerfiles and images that others could build from, including
the rstudio example. Current development is at
https://github.com/eddelbuettel/rocker
Thanks for sharing the API call example, that's awesome! (btw, Scott
has an R package in the works for the digitalocean API that may also
be of interest: https://github.com/sckott/analogsea).
On Sat, Sep 27, 2014 at 3:22 PM, Carsten Behring
Post by Carsten Behring
Hi everybody,
some of you mentioned in the previous posts, that the hurdle of using
docker
Post by Carsten Behring
is still to high for a lot of researchers.
One way to lower the hurdle is to use the cloud.
So instead of asking collaborators to install and use docker on their
PCs in
Post by Carsten Behring
1. We install (with the help of an little automatic tool using cloud
services API's) our docker image to share in the "cloud" (digitalocean
/
Post by Carsten Behring
Amazon EC2 others).
The we only send ip address and port to others and then they can start
immediately in Rstudio Browser environment.
In this case the distributor pays .... for the cloud service
2. We send to a collaborator the name of our image (and the registry)
and he
Post by Carsten Behring
uses the same little tool to generate a cloud server containing the
RStudio
Post by Carsten Behring
environment.
-> Their are costs involved for the cloud providers.
- docker image name (based on RStudio, containing all data / R code of
the
Post by Carsten Behring
analysis)
- cloud provider credentials (for billing ...)
- IP address and port of RStudio, ready to use
I did a proof of concept for this with digitalocean and the docker image
mgymrek/docker-reproducibility-example.
(create-droplet "my token" {:name "core1":region "ams3" :size "512mb"
:image 6373176 :ssh_keys [42550] :user_data user-data})
where the "user-data" contains some info for coreos operating system to
#cloud-config
- name: docker-rs.service
command: start
content: |
[Unit]
Description=RStudio service container
Author=Me
After=docker.service
[Service]
Restart=always
ExecStart=/usr/bin/docker run -p 49000:8787 --name "rstudio"
mgymrek/docker-reproducibility-example
ExecStop=/usr/bin/docker stop rstudio
and, voila, on boot it starts RStudio on a new digitalocean server,
ready to
Post by Carsten Behring
use.
It should work the same way for Amazon EC2 or others.
So the tool could allow to select the cloud provider.
I am pretty sure as well, that a tool could be done as well which does
the
Post by Carsten Behring
same for the installation on local PC. (Windows, Linux, OS2)
https://github.com/behrica/ropensciCloud
The big added value of docker compared to classical virtual machines is,
that it solved the distribution problem of the images.
By just specifying "image name", "tag" and "registry" (if not docker hub
is
Post by Carsten Behring
used) each docker client knows how to get the image.
By using common base images it would be even very fast to download the
images (after the first download happened)
Maybe ropensci could host its own image registry somewhere...
Casten
Post by Carl Boettiger
John,
Thanks again for your input. Yeah, lack of support for 32 bit hosts is
a
Post by Carsten Behring
Post by Carl Boettiger
problem; though since you were speaking about AMIs I imagine you were
already used to not working locally, so you can of course try it out on
an
Post by Carsten Behring
Post by Carl Boettiger
amazon image or digitalocean droplet.
Yeah, Titus makes a great point. If we only distributed docker images
as
Post by Carsten Behring
Post by Carl Boettiger
2 GB binary tar files, we'd not be doing much better on the open /
remixable
Post by Carsten Behring
Post by Carl Boettiger
side than a binary VM image. And docker isn't the only way to provide a
this
Post by Carsten Behring
Post by Carl Boettiger
kind of script, as I mentioned earlier.
Nevertheless, I believe there is a technical difference and not just a
cultural one. Docker is not a virtual machine; containers are
designed
Post by Carsten Behring
Post by Carl Boettiger
expressly to be remixable blocks. You can put an R engine on one
container
Post by Carsten Behring
Post by Carl Boettiger
and a mysql database on another and connect them. Docker philosophy
aims at
Post by Carsten Behring
Post by Carl Boettiger
one function per container to maximize this reuse. of course it's up
to you
Post by Carsten Behring
Post by Carl Boettiger
to build this way rather than a single monolithic dockerfile, but the
idea
Post by Carsten Behring
Post by Carl Boettiger
of linking containers is a technical concept at the heart of docker
that
Post by Carsten Behring
Post by Carl Boettiger
offers a second and very different way to address the 'remix' problem
of
Post by Carsten Behring
Post by Carl Boettiger
VMs.
---
Carl Boettiger
http://carlboettiger.info
sent from mobile device; my apologies for any terseness or typos
On Sep 10, 2014 10:49 AM, "John Stanton-Geddes"
Thanks for clarifying, Carl. I'm now fairly convinced that Docker is
worth
Post by Carsten Behring
Post by Carl Boettiger
the extra cost (in time, etc) as it provides explicit instructions
('recipe') for the container.
My one residual concern, which is more practical/technological than
(open
Post by Carsten Behring
Post by Carl Boettiger
sci) philosophical is that I still have to be using a system that I can
install Docker on to get Docker to work. This is relevant as I can't
(easily) install Docker on my 32-bit laptop as it's only supported for
64-bit. If I go through the (not always necessary) effort of spinning
up an
Post by Carsten Behring
Post by Carl Boettiger
AMI, I can access it through anything with ssh. The easy solution is to
run
Post by Carsten Behring
Post by Carl Boettiger
Docker on the AMI.
the argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging
by the hostile reactions over the years ;) -- is that it doesn't really
matter
*which* approach you choose, so much as whether or not the approach you
do
Post by Carsten Behring
Post by Carl Boettiger
choose permits understanding and remixing. So I would argue that
neither
Post by Carsten Behring
Post by Carl Boettiger
an
AMI nor a fully-baked Docker image is sufficient; what I really want is
a
Post by Carsten Behring
Post by Carl Boettiger
*recipe*. In that sense the Docker community seems to be doing a
better
Post by Carsten Behring
Post by Carl Boettiger
job of
setting cultural expectations than the VM community: for Docker,
typically
Post by Carsten Behring
Post by Carl Boettiger
you provide some sort of install recipe for the whole thing, which is
the
Post by Carsten Behring
Post by Carl Boettiger
recipe I'm looking for.
tl; dr? No technical advantage, but maybe different cultural
expectations.
Post by Carsten Behring
Post by Carl Boettiger
On Wednesday, September 10, 2014 12:40:46 PM UTC-4, Carl Boettiger
Hi John,
Nice to hear from you and thanks for joining the discussion. You ask
a very key question that ties into a much more general discussion
about reproducibility and virtual machines. Below I try and summarize
what I think are the promising features of Docker. I don't think this
means it is *the solution*, but I do think it illustrates some very
useful steps forward to important issues in reproducibility and
virtualization. Remember, Docker is still a very young and rapidly
evolving platform.
1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1] , essentially pointing out that "you
can't install an image for every pipeline you want...". In contrast,
Docker containers are designed to work exactly like that -- reusable
building blocks you can link together with very little overhead in
disk space or computation. This more than anything else sets Docker
apart from the standard VM approach.
2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the
software necessary to re-create ("provision") the image. This has
many advantages: (a) The script is much smaller and more portable than
the image. (b) the script can be version managed (c) the script gives
a human readable (instead of binary) description of what software is
installed and how. This also avoids pitfalls of traditional
documentation of dependencies that may be too vague or out-of-sync.
(d) Other users can build on, modify, or extend the script for their
own needs. All of this is what we call the he "DevOpts" approach to
provisioning, and can be done with AMIs or other virtual machines
using tools like Ansible, Chef, or Puppet coupled with things like
Packer or Vagrant (or clever use of make and shell scripts).
For a much better overview of this "DevOpts" approach in the
reproducible research context and a gentle introduction to these
tools, I highly recommend taking a look at Clark et al [2].
3) You can run the docker container *locally*. I think this is huge.
In my experience, most researchers do their primary development
locally. By running RStudio-server on your laptop, it isn't necessary
for me to spin up an EC2 instance (with all the knowledge & potential
cost that requires). By sharing directories between Docker and the
host OS, a user can still use everything they already know -- their
favorite editor, moving files around with the native OS
finder/browser, using all local configurations, etc, while still
having the code execution occur in the container where the software is
precisely specified and portable. Whenever you need more power, you
can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever
else you want your code to run. [On Mac & Windows, this uses
something called boot2docker, and was not very seamless early on. It
has gotten much better and continues to improve.
4) Versioned images. In addition to version managing the Dockerfile,
the images themselves are versioned using a git-like hash system
(check out: docker commit, docker push/pull, docker history, docker
diff, etc). They have metadata specifying the date, author, parent
image, etc. We can roll back an image through the layers of history
of its construction, then build off an earlier layer. This also
allows docker to do all sorts of clever things, like avoiding
downloading redundant software layers from the docker hub. (If you
pull a bunch of images that all build on ubuntu, you don't get n
copies of ubuntu you have to download and store). Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which acts as a kind of CI for your environment.
Versioning and diff\ing images is a rather nice reproducibility
feature.
[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl
If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker
Cheers,
Carl
On Wed, Sep 10, 2014 at 6:07 AM, John Stanton-Geddes
Post by John Stanton-Geddes
Hi Carl and rOpenSci,
Apologies for jumping in late here (and let me know if this should be
asked elsewhere or a new topic) but I've also recently discovered and
become
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
intrigued by Docker for facilitating reproducible research.
My question: what's the advantage of Docker over an amazon EC2
machine
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
image?
I've moved my analyses to EC2 for better than my local university
cluster. Doesn't my machine image achieve Carl's acid test of
allowing
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
others to build and extend on work? What do I gain by making a
Dockerfile on
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
my already existing EC2 image? Being new to all this, the only clear
advantage I see is a Dockerfile is much smaller than a machine image,
but
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
this seems like a rather trivial concern in comparison to 100s of
gigs of
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
sequence data associated with my project.
thanks,
John
Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to
support the little guy. But as with anything, there is a huge
diversity of
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
AMIs and greater discoverability on EC2, at least for now.
On Tue, Aug 12, 2014 at 11:01 AM, Scott Chamberlain
Hmm, looks like DO is planning on it, but not possible yet. Do go
upvote this feature
https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/3249642-share-an-image-w-another-account
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
Nice, we could work on this working privately, then when sharing is
available, boom.
Great idea. Yeah, should be possible. Does the DO API support a
way
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
to launch a job on the instance, or otherwise a way to share a
custom
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
machine image publicly? (e.g. the way Amazon EC2 lets you make an
AMI public
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
from an S3 bucket?)
I suspect we can just droplets_new() with the ubuntu_docker image
they have, but that we would then need a wrapper to ssh into the
DO machine
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
and execute the single command needed to bring up the RStudio
instance in
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
the browser.
On Tue, Aug 12, 2014 at 10:35 AM, Scott Chamberlain
Carl,
Awesome, nice work.
Thoughts on whether we could wrap the docker workflow into my
Digital Ocean client so that a user never needs to leave R?
https://github.com/sckott/analogsea
Scott
Hi folks,
Just thought I'd share an update on this thread -- I've gotten
RStudio Server working in the ropensci-docker image.
docker -d -p 8787:8787 cboettig/ropensci-docker
will make an RStudio server instance available to you in your
browser at localhost:8787. (Change the first number after the
-p to have a
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
different address). You can log in with username:pw
rstudio:rstudio and
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
have fun.
One thing I like about this is the ease with which I can now get
an
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
RStudio server up and running in the cloud (e.g. I took this for
sail on
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
DigitalOcean.com today). This means in few minutes and 1 penny
you have a
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
URL that you and any collaborators could use to interact with R
using the
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
familiar RStudio interface, already provisioned with your data
and
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
dependencies in place.
To keep this brief-ish, I've restricted further commentary to my
http://www.carlboettiger.info/lab-notebook.html
Cheers,
Carl
On Thu, Aug 7, 2014 at 3:44 PM, Carl Boettiger <
Thanks Rich! some further thoughts / questions below
On Thu, Aug 7, 2014 at 3:11 PM, Rich FitzJohn
Hi Carl,
Thanks for this!
I think that docker is always going to be for the "crazies",
at
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
least
in it's current form. It requires running on Linux for
starters
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
-
I've got it running on a virtual machine on OSX via
virtualbox,
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
but
the amount of faffing about there is pretty intimidating. I
believe
it's possible to get it running via vagrant (which is in
theory
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
going
to be easier to distribute) but at that point it's all
getting a
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
bit
silly. It's enlightening to ask a random ecologist to go to
the
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
website for docker (or heroku or vagrant or chef or any of
these
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
newfangled tools) and ask them to guess what they do. We're
down a
rabbit hole here.
Completely agree here. Anything that cannot be installed by
downloading and
clicking on something is dead in the water. It looks like
Docker
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
is just
download and click on Macs or Windows. (Haven't tested, I
have
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
only linux
boxes handy). So I'm not sure that the regular user needs to
know that it's
running a linux virtual machine under the hood when they
aren't
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
on a linux
box.
So I'm optimistic think the installation faffing will largely
go
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
away, if it
hasn't yet. I'm more worried about the faffing after it is
installed.
I've been getting drone (https://github.com/drone/drone ) up
and
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
running here for one of our currently-closed projects. It
uses
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
docker
as a way of insulating the build/tests from the rest of the
system,
but it's still far from ready to recommend for general use.
The
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
advantages I see there are: our test suite can run for
several
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
hours
without worrying about running up against allowed times, and
working
for projects that are not yet open source. It also
simplifies
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
getting
things off the container, but I think there are a bunch of
ways
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
of
doing that easily enough. However, I'm on the lookout for
something
much simpler to set up, especially for local use and/or
behind
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
NAT. I
can post the dockerfile at some point (it's apparently not on
this
computer!) but it's similarly simple to yours.
Very cool! Yeah, I think there's great promise that we'll see more
easy-to-use tools being built on docker. Is Drone ubuntu-only
at
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
the moment
then?
As I see it, the great advantage of all these types of
approaches,
independent of the technology, is the recipe-based approach
to
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
documenting dependencies. With travis, drone, docker, etc,
you
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
document your dependencies and if it works for you it will
probably
work for someone else.
Definitely. I guess this is the heart of the "DevOpts"
approach
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
(at least
according the BCE paper I linked -- they have nice examples
that
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
use these
tools, but also include case studies of big collaborative
science
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
projects
that do more-or-less the same thing with Makefiles.
I think the devil is still in the details though. One thing I
like about
Docker is the versioned images. If you re-run my build
scripts
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
even 5 days
from now, you'll get a different image due to ubuntu repo
updates, etc. But
it's easy to pull any of the earlier images and compare.
Contrast this to other approaches, where you're stuck with
locking in
particular versions in the build script itself (a la packrat)
or
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
just hoping
the most recent version is good enough (a la CRAN).
I'm OK with this being nerd only for a bit, because (like
travis
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
etc)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry
over
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
into
less nerdy activities. One that appeals to me would be to
take
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
advantage of the fancy way that Docker does incremental
builds
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
to work
with large data sets that are tedious to download: pull the
raw
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
data
as one RUN command, wrangle as another. Then a separate
wrangle
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
step
will reuse the intermediate container (I believe). This is
sort
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do. There's some overlap here with make,
but
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
in a
way that would let you jump in at a point in the analysis in
a
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
fresh
environment.
Great point, hadn't thought about that.
I don't think that people will jump to using virtual
environments for
the sake of it - there has to be some pay off. Isolating the
build
from the rest of your machine or digging into a 5 year old
project
probably does not have widespread appeal to non-desk types
either!
Definitely agree with that. I'd like to hear more about your
perspective on
CI tools though -- of course we love them, but do you think
that
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
CI has a
larger appeal to the average ecologist than other potential
'benefits'? I
think the tangible payoffs are: (Cribbing heavily from that
Berkeley
1) For instructors: having students in a consistent and
optimized
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
environment with little effort. That environment can become a
resource
maintained and enhanced by a larger community.
2) For researchers: easier to scale to the cloud (assuming the
tool is as
easy to use on the desktop as whatever they currently do --
clearly we're
not there yet).
3) Easier to get collaborators / readers to use & re-use. (I
think that
only happens if lots of people are performing research and/or
teaching using
these environments -- just like sharing code written in Go
just
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
isn't that
useful among ecologists. Clearly we may never get here.)
I
think that the biggest potential draws are the CI-type tools, but
there are probably other tools that require
isolation/virtualisation
that will appeal broadly. Then people will accidentally end
up
Post by Carsten Behring
Post by Carl Boettiger
Post by John Stanton-Geddes
with
reproducible work :)
Cheers,
Rich
Carsten Behring
2014-09-29 15:29:33 UTC
Permalink
Dear all,

I deployed the very first version of my docker-to-the-cloud application.
It basically takes the name of an existing docker image and starts a new
DigitalOcean server, on which the image gets automatically installed and
started.

If you use an image based on this Dockerfile:
cboettig/ropensci <https://registry.hub.docker.com/u/cboettig/ropensci/>

it should make RStudio available on port 49000 (mapped from 8787) of
the newly created droplet.

For the billing to work, you need to provide your DigitalOcean token.
A valid ssh_id is needed as well. In theory we do not need ssh (nothing
gets done via ssh), but the CoreOS DigitalOcean image I use requires an
ssh_id to be specified.

The application is here: https://blooming-lake-3277.herokuapp.com

Example parameters could be:

Token: basidbasiucbauiocbaobca (long string)
image name: cboettig/ropensci
ssh_id: 12345 (it is only visible via the API ... to be replaced by "key
name" soon)

After submitting, within a few minutes RStudio should be available at
http://<droplet IP>:49000

The code of the app is available here:

https://github.com/behrica/ropensciCloud

Please try it out and provide me with any comments you might have.
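For anyone curious what happens under the hood, here is a rough sketch of an
equivalent raw call to the DigitalOcean v2 API from the shell. This is only an
illustration, not the app's actual code: the token and droplet name are
placeholders, and the region, size, CoreOS image id and ssh key id are simply
copied from the proof-of-concept example further down the thread (yours will
differ).

TOKEN="your-digitalocean-api-token"   # placeholder

cat > droplet.json <<'EOF'
{
  "name": "rstudio-demo",
  "region": "ams3",
  "size": "512mb",
  "image": 6373176,
  "ssh_keys": [12345],
  "user_data": "#cloud-config\ncoreos:\n  units:\n    - name: rstudio.service\n      command: start\n      content: |\n        [Unit]\n        Description=RStudio service container\n        After=docker.service\n\n        [Service]\n        Restart=always\n        ExecStart=/usr/bin/docker run -p 49000:8787 --name rstudio cboettig/ropensci\n        ExecStop=/usr/bin/docker stop rstudio\n"
}
EOF

# Create the droplet; the API response (and the control panel) shows the new
# droplet's IP address, where RStudio then appears on port 49000.
curl -X POST "https://api.digitalocean.com/v2/droplets" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $TOKEN" \
     -d @droplet.json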
Carl Boettiger
2014-09-29 16:22:57 UTC
Permalink
Carsten,

Thanks for your reply; you bring up really great points about the
realities of PC environments that I'm not really in touch with. It's
easy for me to get stuck thinking only about the small academic lab
context, so it's great that you can help us think more big picture
here.

Also, good points about an archival repository. For those following
along, one of the nice things about Docker is that the software that
runs the Docker Hub, docker-registry, is open source, so anyone can
host their own public or private hub. Easy sharing is a key feature
that has helped make Docker successful, and it is a compelling
element for reproducible research.
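As a rough illustration of what hosting one's own registry can look like
(a sketch only -- the image names and example host are placeholders, not an
actual rOpenSci service):

# Run the open-source docker-registry as a container.
docker run -d -p 5000:5000 --name registry registry

# Re-tag an existing analysis image against that registry and push it.
docker tag cboettig/ropensci localhost:5000/cboettig/ropensci
docker push localhost:5000/cboettig/ropensci

# Collaborators who can reach the host can then pull the exact same image.
docker pull registry.example.org:5000/cboettig/ropensci

(Depending on the docker version, pushing to a registry that is not served
over TLS may need extra configuration.)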
While I see your point that the Docker Hub might not be ideal for all
cases, I think the most important attribute of a repository is
longevity. Certainly the Docker Hub won't be around forever, but at this
stage, with 60 million in its latest VC round, it's likely to be more
long-lasting than anything a small organization like rOpenSci would
host. It would be great to see existing scientific repositories show an
interest in archiving images in this way, though, since organizations
like DataONE and Dryad already have recognition in the scientific
community and better discoverability / search / metadata features.
Building off the docker-registry technology would of course make a lot
more sense than just archiving static binary docker images, which lack
both the space-saving features and the ease of download / integration
that a registry provides.
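For contrast, the "static binary image" route mentioned above looks roughly
like this (a sketch; the filename is arbitrary):

docker save cboettig/ropensci > ropensci-image.tar   # export the image and its layers as one tar file
docker load < ropensci-image.tar                     # re-import it elsewhere

That is the sort of artifact a repository could store as-is, but it gives up
the layer sharing and push/pull convenience that a registry provides.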
I think it would be interesting to bring that discussion to some of
the scientific data repositories and see what they say. Of course
there's the chicken & egg problem in that most researchers have never
heard of Docker. I'd be curious what others think of this.

Cheers,

Carl

p.s. Nice proof of principle with the herokuapp, by the way, but I'm
missing something about the logic here. If I just go to
https://blooming-lake-3277.herokuapp.com with the parameters you
provide, I'm told it can't authenticate. Am I supposed to be running
the app on my own droplet instead? Sorry, a bit over my head.

On Mon, Sep 29, 2014 at 1:12 AM, Carsten Behring
Post by Carsten Behring
Dear Carl,
Thanks for your support and comments.
I would like to reply to some of your points.
In a lot of organisations, getting Linux or even Docker on Windows is
impossible to achieve. Security concerns mean that many admins will not
touch the corporate PCs to install "strange" applications.
In my view this will get worse in the coming years.
That is why I would also like to envision a reproducible research
workflow which is completely independent of any locally installed
software and only needs a web browser.
I believe that in certain organisations it is easier to get a credit card and
budget to pay for cloud computing services than to get "special"
virtualisation software like VMware/Docker onto the user's standard
corporate Windows PC.
This means we need to come to solutions which include "RStudio in the
cloud" as one possible computing engine.
The same argument applies to getting access to fast hardware with a
lot of memory.
The Docker Hub is for sure the best place for the kind of base images and
Dockerfiles you are working on.
I was thinking that we could envision that "each individual study /
analysis" could be published as a docker image.
In this case, I would say the Docker Hub is not the right place to store
all of those.
It would be nice to have a specific registry just for sharing "docker
images of statistical analyses", which could offer additional features for
searching and so on.
1. There would be one (or even several) docker registries dedicated to
RStudio-based docker images containing individual analysis projects.
2. Having the images there means a user with a local docker installation
can use them as usual with his local Docker client.
3. A user without a docker installation can "press a button" and
automatically get a new cloud server (DigitalOcean, Amazon EC2, others)
containing the RStudio-based image, by giving his authentication data
(so he pays for it). He can then log in immediately and look at (and
change) the analysis directly in the cloud.
What would be missing is that in case 3 the user cannot easily republish
his changes as a new Docker image. But this is solvable: it would need an R
package which can interact with the running cloud server (over ssh ...) and
re-create and re-publish a new version of the image at the user's request.
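A rough sketch of what that republish step could look like (the host,
container name and image names are placeholders; it assumes ssh access to
the droplet and a docker login there for the push):

ssh core@<droplet-ip> \
  "docker commit rstudio myuser/my-analysis:v2 && docker push myuser/my-analysis:v2"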
So in this scenario, setting up a "docker hub"-style registry for
"RStudio-based reproducible research images" somewhere would be the starting point.
The code is here: https://github.com/docker/docker-registry
Please provide me with any comments you might have.
Post by Carl Boettiger
Carsten,
Thanks for joining the discussion and sharing your experiences, it's
really nice to see how others are using these approaches.
I agree entirely that cloud platforms like digitalocean give a really
nice user experience coupled with docker and RStudio-server.
Certainly rOpenSci could host its own hub, but the Docker Hub works
rather nicely already, so I'm not sure what the advantage of that might
be? I also agree that docker has many advantages for reproducible
research, though it is worth noting that other 'classical' virtual
machines can offer a very similar solution to the distribution problem
you describe -- e.g. Vagrant Hub.
Nonetheless, I still see running docker locally as an important part
of the equation. In my experience, most researchers still do some
(most/all) of their development locally. If we are to have a
containerized, portable, reproducible development environment, that
means running docker locally (as well as in the cloud).
The reproducible research angle has the most to gain when people can
both build on existing Dockerfiles / docker images, as you mention,
but also when they can write their own Dockerfiles particular to their
packages. I don't think that's too high a barrier -- it's easier than
writing a successful .travis.yml or other CI file for sure -- but it
does mean being able to do more than just access an RStudio server
instance that just happens to be running in a docker container.
On a linux machine, this kind of containerized workflow is remarkably
seamless. I think boot2docker has come a long way in making this
easier, but it's not there yet.
To that end, we're still doing lots of work on defining some useful
base Dockerfiles and images that others could build from, including
the rstudio example. Current development is at
https://github.com/eddelbuettel/rocker
Thanks for sharing the API call example, that's awesome! (btw, Scott
has an R package in the works for the digitalocean API that may also
be of interest: https://github.com/sckott/analogsea).
On Sat, Sep 27, 2014 at 3:22 PM, Carsten Behring
Post by Carsten Behring
Hi everybody,
some of you mentioned in the previous posts that the hurdle of using docker
is still too high for a lot of researchers.
One way to lower the hurdle is to use the cloud.
So instead of asking collaborators to install and use docker on their own
PCs, we could:
1. We install (with the help of a little automatic tool using the cloud
services' APIs) the docker image we want to share in the "cloud"
(DigitalOcean / Amazon EC2 / others).
Then we only send the IP address and port to others, and they can start
immediately in the RStudio browser environment.
In this case the distributor pays for the cloud service.
2. We send a collaborator the name of our image (and the registry) and he
uses the same little tool to generate a cloud server containing the RStudio
environment.
-> There are costs involved for the cloud provider.
- docker image name (based on RStudio, containing all data / R code of the
analysis)
- cloud provider credentials (for billing ...)
- IP address and port of RStudio, ready to use
I did a proof of concept for this with DigitalOcean and the docker image
mgymrek/docker-reproducibility-example:

(create-droplet "my token" {:name "core1" :region "ams3" :size "512mb"
                            :image 6373176 :ssh_keys [42550] :user_data user-data})
where "user-data" contains some info for the CoreOS operating system to
start the container on boot:

#cloud-config
coreos:
  units:
    - name: docker-rs.service
      command: start
      content: |
        [Unit]
        Description=RStudio service container
        Author=Me
        After=docker.service

        [Service]
        Restart=always
        ExecStart=/usr/bin/docker run -p 49000:8787 --name "rstudio" mgymrek/docker-reproducibility-example
        ExecStop=/usr/bin/docker stop rstudio
and, voila, on boot it starts RStudio on a new digitalocean server, ready to
use.
It should work the same way for Amazon EC2 or others.
So the tool could allow to select the cloud provider.
I am pretty sure as well, that a tool could be done as well which does the
same for the installation on local PC. (Windows, Linux, OS2)
https://github.com/behrica/ropensciCloud
The big added value of docker compared to classical virtual machines is
that it solves the distribution problem for images.
By just specifying "image name", "tag" and "registry" (if the Docker Hub is
not used), each docker client knows how to get the image.
By using common base images it is even very fast to download the
images (after the first download has happened).
Maybe ropensci could host its own image registry somewhere...
Carsten
Post by Carl Boettiger
John,
Thanks again for your input. Yeah, lack of support for 32 bit hosts is a
problem; though since you were speaking about AMIs I imagine you were
already used to not working locally, so you can of course try it out on an
amazon image or digitalocean droplet.
Yeah, Titus makes a great point. If we only distributed docker images as
2 GB binary tar files, we'd not be doing much better on the open / remixable
side than a binary VM image. And docker isn't the only way to provide this
kind of script, as I mentioned earlier.
Nevertheless, I believe there is a technical difference and not just a
cultural one. Docker is not a virtual machine; containers are designed
expressly to be remixable blocks. You can put an R engine on one container
and a mysql database on another and connect them. Docker philosophy aims at
one function per container to maximize this reuse. Of course it's up to you
to build this way rather than a single monolithic dockerfile, but the idea
of linking containers is a technical concept at the heart of docker that
offers a second and very different way to address the 'remix' problem of
VMs.
---
Carl Boettiger
http://carlboettiger.info
sent from mobile device; my apologies for any terseness or typos
On Sep 10, 2014 10:49 AM, "John Stanton-Geddes"
Thanks for clarifying, Carl. I'm now fairly convinced that Docker is worth
the extra cost (in time, etc) as it provides explicit instructions
('recipe') for the container.
My one residual concern, which is more practical/technological than (open
sci) philosophical is that I still have to be using a system that I can
install Docker on to get Docker to work. This is relevant as I can't
(easily) install Docker on my 32-bit laptop as it's only supported for
64-bit. If I go through the (not always necessary) effort of spinning up an
AMI, I can access it through anything with ssh. The easy solution is to run
Docker on the AMI.
the argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging
by the hostile reactions over the years ;) -- is that it doesn't really
matter
*which* approach you choose, so much as whether or not the approach you do
choose permits understanding and remixing. So I would argue that neither
an
AMI nor a fully-baked Docker image is sufficient; what I really want is a
*recipe*. In that sense the Docker community seems to be doing a better
job of
setting cultural expectations than the VM community: for Docker, typically
you provide some sort of install recipe for the whole thing, which is the
recipe I'm looking for.
tl; dr? No technical advantage, but maybe different cultural expectations.
Hi John,
Nice to hear from you and thanks for joining the discussion. You ask
a very key question that ties into a much more general discussion
about reproducibility and virtual machines. Below I try and summarize
what I think are the promising features of Docker. I don't think this
means it is *the solution*, but I do think it illustrates some very
useful steps forward to important issues in reproducibility and
virtualization. Remember, Docker is still a very young and rapidly
evolving platform.
1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1] , essentially pointing out that "you
can't install an image for every pipeline you want...". In contrast,
Docker containers are designed to work exactly like that -- reusable
building blocks you can link together with very little overhead in
disk space or computation. This more than anything else sets Docker
apart from the standard VM approach.
2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the
software necessary to re-create ("provision") the image. This has
many advantages: (a) The script is much smaller and more portable than
the image. (b) the script can be version managed (c) the script gives
a human readable (instead of binary) description of what software is
installed and how. This also avoids pitfalls of traditional
documentation of dependencies that may be too vague or out-of-sync.
(d) Other users can build on, modify, or extend the script for their
own needs. All of this is what we call the "DevOps" approach to
provisioning, and can be done with AMIs or other virtual machines
using tools like Ansible, Chef, or Puppet coupled with things like
Packer or Vagrant (or clever use of make and shell scripts).
For a much better overview of this "DevOps" approach in the
reproducible research context and a gentle introduction to these
tools, I highly recommend taking a look at Clark et al [2].
3) You can run the docker container *locally*. I think this is huge.
In my experience, most researchers do their primary development
locally. By running RStudio-server on your laptop, it isn't necessary
for me to spin up an EC2 instance (with all the knowledge & potential
cost that requires). By sharing directories between Docker and the
host OS, a user can still use everything they already know -- their
favorite editor, moving files around with the native OS
finder/browser, using all local configurations, etc, while still
having the code execution occur in the container where the software is
precisely specified and portable. Whenever you need more power, you
can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever
else you want your code to run. [On Mac & Windows, this uses
something called boot2docker, and was not very seamless early on. It
has gotten much better and continues to improve.]
4) Versioned images. In addition to version managing the Dockerfile,
the images themselves are versioned using a git-like hash system
(check out: docker commit, docker push/pull, docker history, docker
diff, etc). They have metadata specifying the date, author, parent
image, etc. We can roll back an image through the layers of history
of its construction, then build off an earlier layer. This also
allows docker to do all sorts of clever things, like avoiding
downloading redundant software layers from the docker hub. (If you
pull a bunch of images that all build on ubuntu, you don't get n
copies of ubuntu you have to download and store). Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which acts as a kind of CI for your environment.
Versioning and diff'ing images is a rather nice reproducibility
feature.
[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl
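Two minimal sketches of points 3 and 4 above; the paths, image names and IDs
are placeholders, and the container-side path is only an assumption about
the image layout:

# (3) Run RStudio server locally while sharing a project directory with the host.
docker run -d -p 8787:8787 \
  -v /path/to/my/project:/home/rstudio/project \
  cboettig/ropensci-docker

# (4) Inspect and extend an image's version history.
docker history cboettig/ropensci-docker        # layers, parent images, dates, authors
docker diff <container-id>                     # filesystem changes made in a container
docker commit <container-id> myuser/myimage    # snapshot the container as a new image
docker push myuser/myimage                     # publish it (and its layers) to the Docker Hub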
If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker
Cheers,
Carl
On Wed, Sep 10, 2014 at 6:07 AM, John Stanton-Geddes
Post by John Stanton-Geddes
Hi Carl and rOpenSci,
Apologies for jumping in late here (and let me know if this should be
asked elsewhere or a new topic) but I've also recently discovered and become
intrigued by Docker for facilitating reproducible research.
My question: what's the advantage of Docker over an amazon EC2 machine
image?
I've moved my analyses to EC2 for better than my local university
cluster. Doesn't my machine image achieve Carl's acid test of allowing
others to build and extend on work? What do I gain by making a Dockerfile on
my already existing EC2 image? Being new to all this, the only clear
advantage I see is a Dockerfile is much smaller than a machine image, but
this seems like a rather trivial concern in comparison to 100s of gigs of
sequence data associated with my project.
thanks,
John
Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to
support the little guy. But as with anything, there is a huge diversity of
AMIs and greater discoverability on EC2, at least for now.
On Tue, Aug 12, 2014 at 11:01 AM, Scott Chamberlain
Hmm, looks like DO is planning on it, but not possible yet. Do go
upvote this feature
https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/3249642-share-an-image-w-another-account
Nice, we could work on this working privately, then when sharing is
available, boom.
On Tue, Aug 12, 2014 at 10:43 AM, Carl Boettiger
Great idea. Yeah, should be possible. Does the DO API support a way
to launch a job on the instance, or otherwise a way to share a custom
machine image publicly? (e.g. the way Amazon EC2 lets you make an
AMI public
from an S3 bucket?)
I suspect we can just droplets_new() with the ubuntu_docker image
they have, but that we would then need a wrapper to ssh into the
DO machine
and execute the single command needed to bring up the RStudio instance in
the browser.
On Tue, Aug 12, 2014 at 10:35 AM, Scott Chamberlain
Carl,
Awesome, nice work.
Thoughts on whether we could wrap the docker workflow into my
Digital Ocean client so that a user never needs to leave R?
https://github.com/sckott/analogsea
Scott
On Fri, Aug 8, 2014 at 5:54 PM, Carl Boettiger
Hi folks,
Just thought I'd share an update on this thread -- I've gotten
RStudio Server working in the ropensci-docker image.
docker run -d -p 8787:8787 cboettig/ropensci-docker
will make an RStudio server instance available to you in your
browser at localhost:8787. (Change the first number after the
-p to have a
different address). You can log in with username:pw
rstudio:rstudio and
have fun.
One thing I like about this is the ease with which I can now get an
RStudio server up and running in the cloud (e.g. I took this for
sail on
DigitalOcean.com today). This means in few minutes and 1 penny
you have a
URL that you and any collaborators could use to interact with R
using the
familiar RStudio interface, already provisioned with your data and
dependencies in place.
To keep this brief-ish, I've restricted further commentary to my
http://www.carlboettiger.info/lab-notebook.html
Cheers,
Carl
On Thu, Aug 7, 2014 at 3:44 PM, Carl Boettiger
Thanks Rich! some further thoughts / questions below
On Thu, Aug 7, 2014 at 3:11 PM, Rich FitzJohn
Hi Carl,
Thanks for this!
I think that docker is always going to be for the "crazies", at
least
in it's current form. It requires running on Linux for
starters
-
I've got it running on a virtual machine on OSX via virtualbox,
but
the amount of faffing about there is pretty intimidating. I
believe
it's possible to get it running via vagrant (which is in theory
going
to be easier to distribute) but at that point it's all
getting a
bit
silly. It's enlightening to ask a random ecologist to go to
the
website for docker (or heroku or vagrant or chef or any of
these
newfangled tools) and ask them to guess what they do. We're
down a
rabbit hole here.
Completely agree here. Anything that cannot be installed by
downloading and
clicking on something is dead in the water. It looks like Docker
is just
download and click on Macs or Windows. (Haven't tested, I have
only linux
boxes handy). So I'm not sure that the regular user needs to
know that it's
running a linux virtual machine under the hood when they aren't
on a linux
box.
So I'm optimistic that the installation faffing will largely go
away, if it
hasn't yet. I'm more worried about the faffing after it is
installed.
I've been getting drone (https://github.com/drone/drone ) up
and
running here for one of our currently-closed projects. It uses
docker
as a way of insulating the build/tests from the rest of the
system,
but it's still far from ready to recommend for general use.
The
advantages I see there are: our test suite can run for several
hours
without worrying about running up against allowed times, and
working
for projects that are not yet open source. It also simplifies
getting
things off the container, but I think there are a bunch of ways
of
doing that easily enough. However, I'm on the lookout for
something
much simpler to set up, especially for local use and/or behind
NAT. I
can post the dockerfile at some point (it's apparently not on
this
computer!) but it's similarly simple to yours.
Very cool! Yeah, I think there's great promise that we'll see
more
easy-to-use tools being built on docker. Is Drone ubuntu-only at
the moment
then?
As I see it, the great advantage of all these types of
approaches,
independent of the technology, is the recipe-based approach to
documenting dependencies. With travis, drone, docker, etc, you
document your dependencies and if it works for you it will
probably
work for someone else.
Definitely. I guess this is the heart of the "DevOps" approach
(at least according to the BCE paper I linked -- they have nice examples
that use these tools, but also include case studies of big collaborative
science projects that do more-or-less the same thing with Makefiles).
I think the devil is still in the details though. One thing I
like about
Docker is the versioned images. If you re-run my build scripts
even 5 days
from now, you'll get a different image due to ubuntu repo
updates, etc. But
it's easy to pull any of the earlier images and compare.
Contrast this to other approaches, where you're stuck with
locking in
particular versions in the build script itself (a la packrat) or
just hoping
the most recent version is good enough (a la CRAN).
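As a quick sketch of what that comparison can look like in practice (the
image id is a placeholder you would read off the docker images output):

docker pull cboettig/ropensci-docker           # today's build
docker images cboettig/ropensci-docker         # versions available locally
docker run -it <earlier-image-id> /bin/bash    # step back into the older environment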
I'm OK with this being nerd only for a bit, because (like
travis
etc)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry over
into
less nerdy activities. One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds
to work
with large data sets that are tedious to download: pull the raw
data
as one RUN command, wrangle as another. Then a separate
wrangle
step
will reuse the intermediate container (I believe). This is
sort
of a
different way of doing the types of things that Ethan's "eco
data
retriever" aims to do. There's some overlap here with make,
but
in a
way that would let you jump in at a point in the analysis in a
fresh
environment.
Great point, hadn't thought about that.
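A minimal Dockerfile sketch of that caching idea -- the data URL and the
wrangle.R script are hypothetical, and it assumes wget is available in the
base image:

FROM cboettig/ropensci-docker

# Step 1: fetch the raw data. This layer is cached, so the download is not
# repeated when later steps change.
RUN mkdir -p /data && wget -O /data/raw.csv http://example.org/big-dataset.csv

# Step 2: wrangle. Editing wrangle.R rebuilds only from this layer onward,
# reusing the cached download above.
COPY wrangle.R /data/wrangle.R
RUN Rscript /data/wrangle.R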
I don't think that people will jump to using virtual environments for the
sake of it - there has to be some pay off. Isolating the build from the rest
of your machine or digging into a 5 year old project probably does not have
widespread appeal to non-desk types either!

Definitely agree with that. I'd like to hear more about your perspective on
CI tools though -- of course we love them, but do you think that CI has a
larger appeal to the average ecologist than other potential 'benefits'? I
think the tangible payoffs are (cribbing heavily from that Berkeley BCE
paper):

1) For instructors: having students in a consistent and optimized
environment with little effort. That environment can become a resource
maintained and enhanced by a larger community.

2) For researchers: easier to scale to the cloud (assuming the tool is as
easy to use on the desktop as whatever they currently do -- clearly we're
not there yet).

3) Easier to get collaborators / readers to use & re-use. (I think that only
happens if lots of people are performing research and/or teaching using
these environments -- just like sharing code written in Go just isn't that
useful among ecologists. Clearly we may never get here.)
I think that the biggest potential draws are the CI-type tools, but there
are probably other tools that require isolation/virtualisation that will
appeal broadly. Then people will accidentally end up with reproducible
work :)

Cheers,
Rich
Hi rOpenSci list + friends [^1],
Yay, the ropensci-discuss list is revived!
Some of you might recall a discussion about reproducible research in the
comments of Rich et al’s recent post on the rOpenSci blog, where quite a few
people mentioned the potential for Docker as a way to facilitate this.

I’ve only just started playing around with Docker, and though I’m quite
impressed, I’m still rather skeptical that non-crazies would ever use it
productively. Nevertheless, I’ve worked up some Dockerfiles to explore how
one might use this approach to transparently document and manage a
computational environment, and I was hoping to get some feedback from all of
you.

For those of you who are already much more familiar with Docker than me (or
are looking for an excuse to explore!), I’d love to get your feedback on
some of the particulars. For everyone, I’d be curious what you think about
the general concept.

So far I’ve created a dockerfile and image. If you have docker up and
running, perhaps you can give it a try:
docker run -it cboettig/ropensci-docker /bin/bash
You should find R installed with some common packages. This image builds on
Dirk Eddelbuettel’s R docker images and serves as a starting point to test
individual R packages or projects.
For instance, my RNeXML manuscript draft is a bit more of a bear than usual
to run, since it needs rJava (requires external libs), Sxslt (only available
on Omegahat and requires extra libs) and latest phytools (a tar.gz file from
Liam’s website), along with the usual mess of pandoc/latex environment to
compile the manuscript itself. By building on ropensci-docker, we need a
much smaller Dockerfile that just adds those extra pieces. To try it:
docker run -it cboettig/rnexml /bin/bash
Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This
will recompile the manuscript from cache and leave you to interactively
explore any of the R code shown.
Advantages / Goals
Being able to download a precompiled image means a user can run the code
without dependency hell (often not as much an R problem as it is in Python,
but nevertheless one that I hit frequently, particularly as my projects
age), and also without altering their personal R environment. Third (in
principle) this makes it easy to run the code on a cloud server, scaling the
computing resources appropriately.
I think the real acid test for this is not merely that it recreates the
results, but that others can build and extend on the work (with fewer rather
than more barriers than usual). I believe most of that has nothing to do
with this whole software image thing — providing the methods you use as
general-purpose functions in an R package, or publishing the raw (&
processed) data to Dryad with good documentation, will always make work more
modular and easier to re-use than cracking open someone’s virtual machine.
But that is really a separate issue.
In this context, we look for an easy way to package up whatever a researcher
or group is already doing into something portable and extensible. So, is
this really portable and extensible?
This presupposes someone can run docker on their OS — and from the command
line at that. Perhaps that’s the biggest barrier to entry right now (though
given docker’s virulent popularity, maybe something smart people with big
money might soon solve).
The only way to interact with the thing is through a bash shell running on
the container. An RStudio server might be much nicer, but I haven’t been
able to get that running. Anyone know how to run RStudio server from docker?
(See https://github.com/mingfang/docker-druid/issues/2.)
I don’t see how users can move local files on and off the docker container.
In some ways this is a great virtue — forcing all code to use fully resolved
paths, like pulling data from Dryad instead of their hard-drive, and pushing
results to a (possibly private) online site to view them. But obviously a
barrier to entry. Is there a better way to do this?
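One mechanism that may help here, sketched with placeholder paths and
container names: docker can bind-mount a host directory into the container
at run time, and docker cp can copy a file back out of a named container.

  # share the current working directory with the container
  docker run -it -v $(pwd):/work cboettig/ropensci-docker /bin/bash
  # or, without a mount, copy a result file out of a named container
  docker cp myanalysis:/work/results.csv .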
Alternative strategies
1) Docker is just one of many ways to do this (particularly if you’re not
concerned about maximum performance speed), and quite probably not the
easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual
machine instead, built with Packer and run in Virtualbox, after their
experience proved that students were much more comfortable with the
mouse-driven installation and a pixel-identical environment to the
instructor’s (see their excellent paper on this).
2) Will/should researchers be willing to work and develop in virtual
environments? In some cases, the virtual environment can be closely coupled
to the native one — you use your own editors etc. to do all the writing, and
then execute in the virtual environment (seems this is easier in the
docker/vagrant approach than in the BCE).
[^1]: friends cc’d above: We’re reviving this ropensci-discuss list to chat
about various issues related to our packages, our goals, and more broad
scientific workflow issues. I’d encourage you to sign up for the list:
https://groups.google.com/forum/#!forum/ropensci-discuss

--
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
Carsten Behring
2014-09-30 07:36:56 UTC
Permalink
Carl,

regarding the authentication for the app, I will check.

It's the first time I have used heroku myself. I thought by default an
application on heroku is public... (It works for me, probably because I am
logged into heroku.)

Can you send me a screenshot of the error message?

Carsten
Post by Carl Boettiger
Carsten,
Thanks for your reply, you bring up really great points about the
realities of PC environments that I'm not really in touch with. It's
easy for me to get stuck just thinking about the small academic lab
context so it's great that you can help us think more big picture
here.
Also good points about an archival repository. For those following
along, one of the nice things about Docker is that the software that
runs the Docker Hub, docker-registry, is open source, so anyone can
host their own public or private hub. Easy sharing is a key feature
that I think has helped make Docker successful and a compelling
element for reproducible research.
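For reference, standing up such a self-hosted registry and pushing an image
to it takes only a couple of commands (the hostname below is a placeholder,
and depending on the docker version the daemon may need to be configured to
allow a non-TLS registry):

  # run the open-source docker-registry on a server you control
  docker run -d -p 5000:5000 --name registry registry
  # tag an existing image against that registry and push it there
  docker tag cboettig/ropensci-docker myregistry.example.org:5000/ropensci-docker
  docker push myregistry.example.org:5000/ropensci-docker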
While I see your point that the Docker Hub might not be ideal for all
cases, I think the most important attribute of a repository should be
longevity. Certainly Docker Hub won't be around forever, but at this stage,
with 60 million in its latest VC round, it's likely to be more long-lasting
than anything that a small organization like rOpenSci would host. It would
be great to see existing scientific repositories show an interest in
archiving images in this way though, since organizations like DataONE and
Dryad already have recognition in the scientific community and better
discoverability / search / metadata features. Building off the docker
registry technology would of course make a lot more sense than just
archiving static binary docker images, which lack both the space-saving
features and the ease of download / integration that examples like the
docker registry have.
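(The "static binary docker image" option mentioned here corresponds roughly
to docker save / docker load, which produces a plain tarball you could
deposit in any archive, at the cost of the registry's incremental pulls:)

  # export an image to a single tarball for archiving
  docker save -o ropensci-docker.tar cboettig/ropensci-docker
  # re-import it on another machine
  docker load -i ropensci-docker.tar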
I think it would be interesting to bring that discussion to some of
the scientific data repositories and see what they say. Of course
there's the chicken & egg problem in that most researchers have never
heard of docker. Would be curious what others think of this.
Cheers,
Carl
p.s. Nice proof of principle with the herokuapp by the way, but I'm missing
something about the logic here. If I just go to
https://blooming-lake-3277.herokuapp.com with the parameters you provide,
I'm told it can't authenticate. Am I supposed to be running the app on my
own droplet instead? Sorry, a bit over my head.
On Mon, Sep 29, 2014 at 1:12 AM, Carsten Behring
Post by Carsten Behring
Dear Carl,

thanks for your support and comments. I would like to reply to some of your
points.

In a lot of organisations, getting Linux or even docker on windows is
impossible to achieve. Security concerns mean a lot of admins won't touch
the corporate PCs to install "strange" applications. This will get worse in
the coming years, in my view. That is why I would like to envision as well a
reproducible research workflow which is completely independent of any
locally installed software and only needs a web browser.

I believe that in certain organisations it is easier to get a credit card
and budget to pay for cloud computing services than to get "special
virtualisation software like VMware/Docker" on the standard corporate
Windows PC of the user. This means we need to come to solutions which
contain "RStudio in the cloud" as one possible computing engine. The same
argument is true for getting access to "fast hardware with a lot of memory".

Docker hub is for sure the best place for the kind of base images and their
Dockerfiles you are working on. I was thinking that we could envision that
"each individual study / analysis" could be published as a docker image. In
this case, I would say docker hub is not the right place to store all of
those. It would be nice to have a specific registry only to share "docker
images of statistical analysis", which could contain different features for
searching and so on.

1. There would be one (or even several) docker registries dedicated to
"docker images containing RStudio based images with individual analysis
projects".
2. Having the images there means a user with a local docker installation can
use them as usual with his local Docker client.
3. A user without a docker installation can "press a button" and he gets
automatically a new cloud server (DigitalOcean, Amazon EC2, others)
containing the RStudio based image, by giving his authentication data (so he
pays for it). So he can log in immediately and look at (and change) the
analysis directly in the cloud.

What would be missing is that in case 3) the user cannot easily republish
his changes as a new Docker image. But this is solvable. It would need an R
package which can interact with the running cloud server (over ssh ...) and
re-creates and re-publishes a new version of the image on request of the
user.
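Concretely, that republish step would presumably boil down to a couple of
docker commands run on the cloud server over ssh (image, container and tag
names below are placeholders):

  # snapshot the container in which the analysis was edited
  docker commit rstudio myuser/my-analysis:v2
  # push the new version to Docker Hub or to a self-hosted registry
  docker push myuser/my-analysis:v2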
So in this scenario, setting up somewhere a "docker hub based" registry for
"RStudio based reproducible research images" would be the starting point.
The code is here: https://github.com/docker/docker-registry

Please provide me with any comments you might have.
Post by Carl Boettiger
Carsten,

Thanks for joining the discussion and sharing your experiences, it's really
nice to see how others are using these approaches.

I agree entirely that cloud platforms like digitalocean give a really nice
user experience coupled with docker and RStudio-server. Certainly ropensci
could host its own hub, but the Docker Hub works rather nicely already, so
I'm not sure what the advantage of that might be? I also agree that docker
has many advantages for reproducible research, though it is worth noting
that other 'classical' virtual machines can offer a very similar solution to
the distribution problem you describe -- e.g. Vagrant Hub.

Nonetheless, I still see running docker locally as an important part of the
equation. In my experience, most researchers still do some (most/all) of
their development locally. If we are to have a containerized, portable,
reproducible development environment, that means running docker locally (as
well as in the cloud).

The reproducible research angle has the most to gain when people can both
build on existing Dockerfiles / docker images, as you mention, but also when
they can write their own Dockerfiles particular to their packages. I don't
think that's too high a barrier -- it's easier than writing a successful
.travis.yml or other CI file for sure -- but it does mean being able to do
more than just access an RStudio server instance that just happens to be
running in a docker container.
On a linux machine, this kind of containerized workflow is remarkably
seamless. I think boot2docker has come a long way in making this easier, but
it's not there yet.

To that end, we're still doing lots of work on defining some useful base
Dockerfiles and images that others could build from, including the rstudio
example. Current development is at https://github.com/eddelbuettel/rocker

Thanks for sharing the API call example, that's awesome! (btw, Scott has an
R package in the works for the digitalocean API that may also be of
interest: https://github.com/sckott/analogsea).
On Sat, Sep 27, 2014 at 3:22 PM, Carsten Behring
Post by Carsten Behring
Hi everybody,

some of you mentioned in the previous posts that the hurdle of using docker
is still too high for a lot of researchers. One way to lower the hurdle is
to use the cloud. So instead of asking collaborators to install and use
docker on their PCs:

1. We install (with the help of a little automatic tool using cloud service
APIs) our docker image to share in the "cloud" (digitalocean / Amazon EC2 /
others). Then we only send the IP address and port to others and they can
start immediately in the RStudio browser environment. In this case the
distributor pays for the cloud service.

2. We send to a collaborator the name of our image (and the registry) and he
uses the same little tool to generate a cloud server containing the RStudio
environment. -> There are costs involved for the cloud providers.

- docker image name (based on RStudio, containing all data / R code of the
  analysis)
- cloud provider credentials (for billing ...)
- IP address and port of RStudio, ready to use

I did a proof of concept for this with digitalocean and the docker image
mgymrek/docker-reproducibility-example:

  (create-droplet "my token" {:name "core1" :region "ams3" :size "512mb"
                              :image 6373176 :ssh_keys [42550]
                              :user_data user-data})

where the "user-data" contains some info for the coreos operating system to
start RStudio on boot:

  #cloud-config
  coreos:
    units:
      - name: docker-rs.service
        command: start
        content: |
          [Unit]
          Description=RStudio service container
          Author=Me
          After=docker.service

          [Service]
          Restart=always
          ExecStart=/usr/bin/docker run -p 49000:8787 --name "rstudio" mgymrek/docker-reproducibility-example
          ExecStop=/usr/bin/docker stop rstudio

and, voila, on boot it starts RStudio on a new digitalocean server, ready to
use. It should work the same way for Amazon EC2 or others. So the tool could
allow to select the cloud provider.
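The create-droplet call above is presumably a thin wrapper around
DigitalOcean's v2 HTTP API; roughly (the token, ssh key id and the escaped
cloud-config payload are placeholders):

  curl -X POST "https://api.digitalocean.com/v2/droplets" \
    -H "Authorization: Bearer $DO_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"name": "core1", "region": "ams3", "size": "512mb",
         "image": 6373176, "ssh_keys": [42550],
         "user_data": "#cloud-config\n..."}'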
I am pretty sure as well that a tool could be made which does the same for
the installation on a local PC (Windows, Linux, OS X).
https://github.com/behrica/ropensciCloud

The big added value of docker compared to classical virtual machines is that
it solved the distribution problem of the images. By just specifying "image
name", "tag" and "registry" (if not docker hub is used), each docker client
knows how to get the image. By using common base images it would be even
very fast to download the images (after the first download happened).

Maybe ropensci could host its own image registry somewhere...

Carsten
On Wednesday, September 10, 2014 9:14:03 PM UTC+2, Carl Boettiger
Post by Carl Boettiger
John,

Thanks again for your input. Yeah, lack of support for 32 bit hosts is a
problem; though since you were speaking about AMIs I imagine you were
already used to not working locally, so you can of course try it out on an
amazon image or digitalocean droplet.

Yeah, Titus makes a great point. If we only distributed docker images as
2 GB binary tar files, we'd not be doing much better on the open / remixable
side than a binary VM image. And docker isn't the only way to provide this
kind of script, as I mentioned earlier.

Nevertheless, I believe there is a technical difference and not just a
cultural one. Docker is not a virtual machine; containers are designed
expressly to be remixable blocks. You can put an R engine on one container
and a mysql database on another and connect them. Docker philosophy aims at
one function per container to maximize this reuse. Of course it's up to you
to build this way rather than a single monolithic dockerfile, but the idea
of linking containers is a technical concept at the heart of docker that
offers a second and very different way to address the 'remix' problem of
VMs.
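A minimal sketch of that linking idea (container names and the mysql image's
password variable are placeholders):

  # one container runs the database...
  docker run -d --name db -e MYSQL_ROOT_PASSWORD=secret mysql
  # ...and the R container links to it; inside, the database is reachable
  # under the alias "db"
  docker run -it --link db:db cboettig/ropensci-docker /bin/bash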
---
Carl Boettiger
http://carlboettiger.info
sent from mobile device; my apologies for any terseness or typos
On Sep 10, 2014 10:49 AM, "John Stanton-Geddes"
Thanks for clarifying, Carl. I'm now fairly convinced that Docker is worth
the extra cost (in time, etc) as it provides explicit instructions
('recipe') for the container.

My one residual concern, which is more practical/technological than (open
sci) philosophical, is that I still have to be using a system that I can
install Docker on to get Docker to work. This is relevant as I can't
(easily) install Docker on my 32-bit laptop as it's only supported for
64-bit. If I go through the (not always necessary) effort of spinning up an
AMI, I can access it through anything with ssh. The easy solution is to run
Docker on the AMI.

the argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging by the
hostile reactions over the years ;) -- is that it doesn't really matter
*which* approach you choose, so much as whether or not the approach you do
choose permits understanding and remixing. So I would argue that neither an
AMI nor a fully-baked Docker image is sufficient; what I really want is a
*recipe*. In that sense the Docker community seems to be doing a better job
of setting cultural expectations than the VM community: for Docker,
typically you provide some sort of install recipe for the whole thing, which
is the recipe I'm looking for.

tl;dr? No technical advantage, but maybe different cultural expectations.
Hi John,

Nice to hear from you and thanks for joining the discussion. You ask a very
key question that ties into a much more general discussion about
reproducibility and virtual machines. Below I try and summarize what I think
are the promising features of Docker. I don't think this means it is *the
solution*, but I do think it illustrates some very useful steps forward on
important issues in reproducibility and virtualization. Remember, Docker is
still a very young and rapidly evolving platform.

1) Remix. Titus has an excellent post, "Virtual Machines Considered Harmful
for reproducibility" [1], essentially pointing out that "you can't install
an image for every pipeline you want...". In contrast, Docker containers are
designed to work exactly like that -- reusable building blocks you can link
together with very little overhead in disk space or computation. This more
than anything else sets Docker apart from the standard VM approach.

2) Provisioning scripts. Docker images are not 'black boxes'. A "Dockerfile"
is a simple make-like script which installs all the software necessary to
re-create ("provision") the image. This has many advantages: (a) the script
is much smaller and more portable than the image. (b) The script can be
version managed. (c) The script gives a human readable (instead of binary)
description of what software is installed and how. This also avoids pitfalls
of traditional documentation of dependencies that may be too vague or
out-of-sync. (d) Other users can build on, modify, or extend the script for
their own needs. All of this is what we call the "DevOps" approach to
provisioning, and it can be done with AMIs or other virtual machines using
tools like Ansible, Chef, or Puppet coupled with things like Packer or
Vagrant (or clever use of make and shell scripts).

For a much better overview of this "DevOps" approach in the reproducible
research context and a gentle introduction to these tools, I highly
recommend taking a look at Clark et al [2].
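To make (2) concrete: a project Dockerfile is often only a handful of lines,
for example (the system library, R packages and file names below are purely
illustrative):

  cat > Dockerfile <<'EOF'
  # build on the image discussed earlier in this thread
  FROM cboettig/ropensci-docker
  # system libraries the analysis needs
  RUN apt-get update && apt-get install -y libxml2-dev
  # R packages, recorded explicitly rather than in prose documentation
  RUN Rscript -e 'install.packages(c("rmarkdown", "phytools"), repos = "http://cran.r-project.org")'
  # the analysis itself
  COPY manuscript.Rmd /home/analysis/
  EOF
  docker build -t myuser/my-analysis .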
3) You can run the docker container *locally*. I think this is huge. In my
experience, most researchers do their primary development locally. By
running RStudio-server on your laptop, it isn't necessary for me to spin up
an EC2 instance (with all the knowledge & potential cost that requires). By
sharing directories between Docker and the host OS, a user can still use
everything they already know -- their favorite editor, moving files around
with the native OS finder/browser, using all local configurations, etc,
while still having the code execution occur in the container where the
software is precisely specified and portable. Whenever you need more power,
you can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever else you
want your code to run. [On Mac & Windows, this uses something called
boot2docker, and was not very seamless early on. It has gotten much better
and continues to improve.]

4) Versioned images. In addition to version managing the Dockerfile, the
images themselves are versioned using a git-like hash system (check out:
docker commit, docker push/pull, docker history, docker diff, etc). They
have metadata specifying the date, author, parent image, etc. We can roll
back an image through the layers of history of its construction, then build
off an earlier layer. This also allows docker to do all sorts of clever
things, like avoiding downloading redundant software layers from the docker
hub. (If you pull a bunch of images that all build on ubuntu, you don't get
n copies of ubuntu you have to download and store.) Oh yeah, and hosting
your images on Docker hub is free (no need to pay for an S3 bucket... for
now?) and supports automated builds based on your dockerfiles, which acts as
a kind of CI for your environment. Versioning and diffing images is a rather
nice reproducibility feature.

[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl
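For instance, assuming an image pulled earlier and a running container named
rstudio (the container name is just an example):

  # show the stack of layers an image was built from, and when
  docker history cboettig/ropensci-docker
  # show which files a running container has changed relative to its image
  docker diff rstudio
  # pull a specific earlier tag rather than whatever :latest points to today
  docker pull ubuntu:14.04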
If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker

Cheers,
Carl
Carsten Behring
2014-10-02 19:48:26 UTC
Permalink
Ciao Carl,

the application is public, so maybe you just put in wrong data. You need to
put in your digitalocean api token and the name of one of your ssh keys from
DigitalOcean.

Carsten
Post by Carsten Behring
Carl,
regarding the authentication for the app, I will check.
Its the first time, I used heroku myself.
I thought by default an application on heroku is public....
(It works for me, probably because I am logged into heroku)
Can you send me a screenshot of the error message ?
Carsten
Carsten,
Thanks for your reply, you bring up really great points about the
realities of PC environments that I'm not really in touch with. It's
easy for me to get stuck just thinking about the small academic lab
context so it's great that you can help us think more big picture
here.
Also good points about an archival repository. For those following
along, one of the nice things about Docker is that the software that
runs the Docker Hub, docker-registry, is open source, so anyone can
host their own public or private hub. Easy sharing is a key feature
that I think has helped make Docker successful and a compelling
element for reproducible research.
While I see your point that the Docker Hub might not be ideal for all
cases, I think the most important attribute of a repository should be
longevity. Certainly Docker Hub won't be around forever, but at this
stage with 60 million in it's latest VC round it's likely to be more
long-lasting than anything that a small organization like rOpenSci
would host. It would be great to see existing scientific repositories
show an interest in archiving images in this way though, since
organizations like DataONE and Dryad already have recognition in the
scientific community and better discoverability / search / metadata
features. Building of the docker registry technology would of course
make a lot more sense than just archiving static binary docker images,
which lack both the space saving features and the ease of download /
integration that examples like the docker registry have.
I think it would be interesting to bring that discussion to some of
the scientific data repositories and see what they say. Of course
there's the chicken & egg problem in that most researchers have never
heard of docker. Would be curious what others think of this.
Cheers,
Carl
p.s. Nice proof of principle with the herokuapp by the way, but I'm
missing something about the logic here. if I just go to
https://blooming-lake-3277.herokuapp.com with the parameters you
provide, I'm told it can't authenticate. Am I supposed to be running
the app on my own droplet instead? Sorry, a bit over my head.
On Mon, Sep 29, 2014 at 1:12 AM, Carsten Behring
Post by Carsten Behring
Dear Carl,
thanks for your support and comments.
I would like to reply to some of you points.
In a lot of organisations, getting Linux or even docker on windows is
impossible to achieve. Security concerns let a lot of admins not touch
the
Post by Carsten Behring
corporate PCs to install "strange" applications.
This will get worse in the coming years in my view.
That is why I would like to envision as well a reproducible research
workflow, which is completely independent of any local installed
software,
Post by Carsten Behring
and it only needs a web browser.
I believe that in certain organisations it is easier to get a Creditcard
and
Post by Carsten Behring
budget to pay cloud computing services, then get "special virtualisation
software like VMWare/Docker" on the standard corporate windows PC of the
user.
This means, we need to come to solutions which contains "RStudio in the
cloud" as one possible computing engine.
The same argumentation is true for getting access to "fast hardware
with a
Post by Carsten Behring
lot of memory".
Docker hub is for sure the best place for the kind of base images and
its
Post by Carsten Behring
Dockerfiles you are working on.
I was thinking that we could envision that "each individual study /
analysis" could be published as a docker image.
In this case, I would say docker hub is not the right place to store all
of
Post by Carsten Behring
those.
It would be nice to have a specific registry only to share "docker
images of
Post by Carsten Behring
statistical analysis" which could contain different features for
searching
Post by Carsten Behring
and son on.
1. There would be one (or even several) docker registries dedicated to
"docker images containing Rstudio based images with individual analysis
projects"
2. Having the images there, means a user with a local docker
installation
Post by Carsten Behring
can use them as usual with his local Docker client
3. A user without a docker installation can "press a button" and he gets
automatically a new cloud server (Digitalocean, Amazon Ec2, others)
containing the Rstudio based image,
by giving his authentication data (so he pays for it). So he can login
immediately and look (and change) the analysis directly in the cloud.
What would be missing is, that in case 3), the user can not easily
republish
Post by Carsten Behring
his changes as a new Docker image. But this is solvable. It would need
a R
Post by Carsten Behring
package which can interact with the running cloud server (over ssh ...)
and
Post by Carsten Behring
re-creates and re-publishes a new version of the image on request of the
user.
So in this scenario, setting up somewhere a "docker hub based" registry
for
Post by Carsten Behring
"RStudio based reproducible research images" would be the starting
point.
Post by Carsten Behring
The code is here: https://github.com/docker/docker-registry
Please provide me with any comments you might have.
Post by Carl Boettiger
Casten,
Thanks for joining the discussion and sharing your experiences, it's
really nice to see how others are using these approaches.
I agree entirely that cloud platforms like digitalocean give a really
nice user experience coupled with docker and RStudio-server.
Certainly ropensci could host it's own hub, but the Docker Hub works
rather nicely already so I'm not sure what the advantage of that might
be? I also agree that docker has many advantages for reproducible
research, though it is worth noting that other 'classical' virtual
machines can offer a very similar solution to the distribution problem
you describe -- e.g. Vagrant Hub.
Nonetheless, I still see running docker locally as an important part
of the equation. In my experience, most researchers still do some
(most/all) of their development locally. If we are to have a
containerized, portable, reproducible development environment, that
means running docker locally (as well as in the cloud).
The reproducible research angle has the most to gain when people can
both build on existing Dockerfiles / docker images, as you mention,
but also when they can write their own Dockerfiles particular to their
packages. I don't think that's too high a barrier -- it's easier than
writing a successful .travis.yml or other CI file for sure -- but it
does mean being able to do more than just access an RStudio server
instance that just happens to be running in a docker container.
On a linux machine, this kind of containerized workflow is remarkably
seamless. I think boot2docker has come a long way in making this
easier, but not there yet.
To that end, we're still doing lots of work on defining some useful
base Dockerfiles and images that others could build from, including
the rstudio example. Current development is at
https://github.com/eddelbuettel/rocker
Thanks for sharing the API call example, that's awesome! (btw, Scott
has an R package in the works for the digitalocean API that may also
be of interest: https://github.com/sckott/analogsea).
On Sat, Sep 27, 2014 at 3:22 PM, Carsten Behring
Post by Carsten Behring
Hi everybody,
some of you mentioned in the previous posts that the hurdle of using docker
is still too high for a lot of researchers.
One way to lower the hurdle is to use the cloud.
So instead of asking collaborators to install and use docker on their own
PCs, we could proceed like this:
1. We install (with the help of a little automatic tool using the cloud
services' APIs) our docker image to share in the "cloud" (digitalocean /
Amazon EC2 / others).
Then we only send the IP address and port to others and they can start
immediately in the RStudio browser environment.
In this case the distributor pays .... for the cloud service.
2. We send a collaborator the name of our image (and the registry) and he
uses the same little tool to generate a cloud server containing the RStudio
environment.
-> There are costs involved for the cloud providers.
- docker image name (based on RStudio, containing all data / R code of the
analysis)
- cloud provider credentials (for billing ...)
- IP address and port of RStudio, ready to use
I did a proof of concept for this with digitalocean and the docker image
mgymrek/docker-reproducibility-example:

(create-droplet "my token" {:name "core1" :region "ams3" :size "512mb"
                            :image 6373176 :ssh_keys [42550] :user_data user-data})

where the "user-data" contains some info for the coreos operating system:

#cloud-config
- name: docker-rs.service
  command: start
  content: |
    [Unit]
    Description=RStudio service container
    Author=Me
    After=docker.service
    [Service]
    Restart=always
    ExecStart=/usr/bin/docker run -p 49000:8787 --name "rstudio" mgymrek/docker-reproducibility-example
    ExecStop=/usr/bin/docker stop rstudio

and, voila, on boot it starts RStudio on a new digitalocean server, ready to
use.
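The same droplet creation can be done with a plain HTTP call, so the
"little tool" would not need a particular client library. A minimal sketch
against what I understand to be the DigitalOcean v2 API (endpoint, token,
ssh key id and the inlined user-data are assumptions / placeholders):

# create a CoreOS droplet that boots straight into the RStudio container
# ($DO_TOKEN is a placeholder; user_data holds the cloud-config above,
#  newline-escaped, abbreviated here as "...")
curl -X POST "https://api.digitalocean.com/v2/droplets" \
  -H "Authorization: Bearer $DO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name":"core1","region":"ams3","size":"512mb",
       "image":6373176,"ssh_keys":[42550],
       "user_data":"#cloud-config\n..."}'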
It should work the same way for Amazon EC2 or others.
So the tool could allow selecting the cloud provider.
I am pretty sure as well that a tool could be made which does the same for
the installation on a local PC (Windows, Linux, OS X).
https://github.com/behrica/ropensciCloud
The big added value of docker compared to classical virtual machines is
that it solves the distribution problem of the images.
By just specifying "image name", "tag" and "registry" (if not docker hub
is used), each docker client knows how to get the image.
By using common base images it would even be very fast to download the
images (after the first download has happened).
Maybe ropensci could host its own image registry somewhere...
Carsten
On Wednesday, September 10, 2014 9:14:03 PM UTC+2, Carl Boettiger
Post by Carl Boettiger
John,
Thanks again for your input. Yeah, lack of support for 32 bit hosts is a
problem; though since you were speaking about AMIs I imagine you were
already used to not working locally, so you can of course try it out on an
amazon image or digitalocean droplet.
Yeah, Titus makes a great point. If we only distributed docker images as
2 GB binary tar files, we'd not be doing much better on the open / remixable
side than a binary VM image. And docker isn't the only way to provide this
kind of script, as I mentioned earlier.
Nevertheless, I believe there is a technical difference and not just a
cultural one. Docker is not a virtual machine; containers are designed
expressly to be remixable blocks. You can put an R engine on one container
and a mysql database on another and connect them. Docker philosophy aims at
one function per container to maximize this reuse. Of course it's up to you
to build this way rather than a single monolithic dockerfile, but the idea
of linking containers is a technical concept at the heart of docker that
offers a second and very different way to address the 'remix' problem of
VMs.
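A minimal sketch of what that linking looks like in practice (the image
names, container names and password are illustrative assumptions):

# run a database in one container and an RStudio engine in another,
# linked so the R container can reach the database under the hostname "db"
docker run -d --name db -e MYSQL_ROOT_PASSWORD=secret mysql
docker run -d --name rstudio --link db:db -p 8787:8787 cboettig/ropensci-docker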
---
Carl Boettiger
http://carlboettiger.info
sent from mobile device; my apologies for any terseness or typos
On Sep 10, 2014 10:49 AM, "John Stanton-Geddes"
Thanks for clarifying, Carl. I'm now fairly convinced that Docker is worth
the extra cost (in time, etc) as it provides explicit instructions
('recipe') for the container.
My one residual concern, which is more practical/technological than (open
sci) philosophical, is that I still have to be using a system that I can
install Docker on to get Docker to work. This is relevant as I can't
(easily) install Docker on my 32-bit laptop as it's only supported for
64-bit. If I go through the (not always necessary) effort of spinning up an
AMI, I can access it through anything with ssh. The easy solution is to run
Docker on the AMI.
The argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging by
the hostile reactions over the years ;) -- is that it doesn't really matter
*which* approach you choose, so much as whether or not the approach you do
choose permits understanding and remixing. So I would argue that neither an
AMI nor a fully-baked Docker image is sufficient; what I really want is a
*recipe*. In that sense the Docker community seems to be doing a better job
of setting cultural expectations than the VM community: for Docker,
typically you provide some sort of install recipe for the whole thing,
which is the recipe I'm looking for.
tl;dr? No technical advantage, but maybe different cultural expectations.
Hi John,
Nice to hear from you and thanks for joining the discussion. You ask a very
key question that ties into a much more general discussion about
reproducibility and virtual machines. Below I try to summarize what I think
are the promising features of Docker. I don't think this means it is *the
solution*, but I do think it illustrates some very useful steps forward on
important issues in reproducibility and virtualization. Remember, Docker is
still a very young and rapidly evolving platform.
1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1], essentially pointing out that "you can't
install an image for every pipeline you want...". In contrast, Docker
containers are designed to work exactly like that -- reusable building
blocks you can link together with very little overhead in disk space or
computation. This more than anything else sets Docker apart from the
standard VM approach.
2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the software
necessary to re-create ("provision") the image. This has many advantages:
(a) The script is much smaller and more portable than the image. (b) The
script can be version managed. (c) The script gives a human readable
(instead of binary) description of what software is installed and how.
This also avoids pitfalls of traditional documentation of dependencies
that may be too vague or out-of-sync. (d) Other users can build on, modify,
or extend the script for their own needs. All of this is what we call the
"DevOps" approach to provisioning, and can be done with AMIs or other
virtual machines using tools like Ansible, Chef, or Puppet coupled with
things like Packer or Vagrant (or clever use of make and shell scripts).
For a much better overview of this "DevOps" approach in the reproducible
research context and a gentle introduction to these tools, I highly
recommend taking a look at Clark et al [2].
3) You can run the docker container *locally*. I think this is huge. In my
experience, most researchers do their primary development locally. By
running RStudio-server on your laptop, it isn't necessary for me to spin up
an EC2 instance (with all the knowledge & potential cost that requires). By
sharing directories between Docker and the host OS, a user can still use
everything they already know -- their favorite editor, moving files around
with the native OS finder/browser, using all local configurations, etc --
while still having the code execution occur in the container where the
software is precisely specified and portable. Whenever you need more power,
you can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever else
you want your code to run. [On Mac & Windows, this uses something called
boot2docker, and was not very seamless early on. It has gotten much better
and continues to improve.]
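For example, sharing the current project directory into the container
looks something like this (the image name and container path are
illustrative assumptions):

# mount the local working directory into the container so edits made with
# your usual editor are visible to the containerized R session
docker run -d -p 8787:8787 -v $(pwd):/home/rstudio cboettig/ropensci-docker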
4) Versioned images. In addition to version managing the Dockerfile, the
images themselves are versioned using a git-like hash system (check out:
docker commit, docker push/pull, docker history, docker diff, etc). They
have metadata specifying the date, author, parent image, etc. We can roll
back an image through the layers of history of its construction, then
build off an earlier layer. This also allows docker to do all sorts of
clever things, like avoiding downloading redundant software layers from the
docker hub. (If you pull a bunch of images that all build on ubuntu, you
don't get n copies of ubuntu you have to download and store.) Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which act as a kind of CI for your environment. Versioning and
diffing images is a rather nice reproducibility feature.
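A quick sketch of that workflow, using the commands named above (the
container and image names are illustrative):

# snapshot a running container as a new image layer, inspect its history,
# see what changed on its filesystem, and share it
docker commit rstudio myuser/my-analysis:v2
docker history myuser/my-analysis:v2
docker diff rstudio
docker push myuser/my-analysis:v2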
[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl
If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker
Cheers,
Carl
On Wed, Sep 10, 2014 at 6:07 AM, John Stanton-Geddes
Post by John Stanton-Geddes
...
Carsten Behring
2014-10-06 21:13:33 UTC
Permalink
Dear all,

my use case of "simple use of docker, cloud + rstudio" is largely satisfied
by the digitalocean R package under development here:

https://github.com/sckott/analogsea

It is digitalocean specific, but that's good enough for a start.

A simple cloud-hosted (shiny) web application using this library might even
ease further usage.

Regards
Post by Carsten Behring
Ciao Carl,
the application is public, so maybe you just put in the wrong data.
You need to put in your digitalocean api token and the name of one of your
ssh keys from digital ocean.
Carsten
Carl,
regarding the authentication for the app, I will check.
It's the first time I have used heroku myself.
I thought by default an application on heroku is public....
(It works for me, probably because I am logged into heroku.)
Can you send me a screenshot of the error message?
Carsten
Carsten,
Thanks for your reply, you bring up really great points about the
realities of PC environments that I'm not really in touch with. It's
easy for me to get stuck just thinking about the small academic lab
context so it's great that you can help us think more big picture
here.
Also good points about an archival repository. For those following
along, one of the nice things about Docker is that the software that
runs the Docker Hub, docker-registry, is open source, so anyone can
host their own public or private hub. Easy sharing is a key feature
that I think has helped make Docker successful and a compelling
element for reproducible research.
While I see your point that the Docker Hub might not be ideal for all
cases, I think the most important attribute of a repository should be
longevity. Certainly Docker Hub won't be around forever, but at this
stage, with $60 million in its latest VC round, it's likely to be more
long-lasting than anything that a small organization like rOpenSci
would host. It would be great to see existing scientific repositories
show an interest in archiving images in this way though, since
organizations like DataONE and Dryad already have recognition in the
scientific community and better discoverability / search / metadata
features. Building off the docker registry technology would of course
make a lot more sense than just archiving static binary docker images,
which lack both the space saving features and the ease of download /
integration that examples like the docker registry have.
I think it would be interesting to bring that discussion to some of
the scientific data repositories and see what they say. Of course
there's the chicken & egg problem in that most researchers have never
heard of docker. Would be curious what others think of this.
Cheers,
Carl
p.s. Nice proof of principle with the herokuapp by the way, but I'm
missing something about the logic here. If I just go to
https://blooming-lake-3277.herokuapp.com with the parameters you
provide, I'm told it can't authenticate. Am I supposed to be running
the app on my own droplet instead? Sorry, a bit over my head.
On Mon, Sep 29, 2014 at 1:12 AM, Carsten Behring
Post by Carsten Behring
Dear Carl,
thanks for your support and comments.
I would like to reply to some of your points.
In a lot of organisations, getting Linux or even docker on windows is
impossible to achieve. Security concerns mean a lot of admins will not
touch the corporate PCs to install "strange" applications.
This will get worse in the coming years, in my view.
That is why I would also like to envision a reproducible research workflow
which is completely independent of any locally installed software and only
needs a web browser.
I believe that in certain organisations it is easier to get a credit card
and budget to pay for cloud computing services than to get "special
virtualisation software like VMWare/Docker" on the user's standard
corporate windows PC.
...