Discussion:
Using docker to share/distribute all scientific work, even simple papers?
Carsten Behring
2014-10-06 21:47:21 UTC
Dear all,

I have a question for people who do reproducible research in practice, which
is not my case. I have an IT background, but I am very interested in the
subject.

I took a data science course on Coursera (very good, by the way) where we
did our exercises with reproducible research in mind, and at a certain point
I asked myself whether Docker could be the single distribution format for all
types of research, even the smallest publication (if written in Rmd format).

So, does it make sense to distribute even a single Rmd file as a Docker
image (one that contains the OS plus an RStudio environment and renders the
Rmd to PDF or HTML)?
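
To make it concrete, I imagine a Dockerfile roughly like the following (just
a sketch; the rocker/rstudio base image, the file name, and the assumption
that rmarkdown is available in the base are mine):

    # Sketch only: base image and file names are illustrative assumptions
    FROM rocker/rstudio
    # The paper itself adds only a tiny extra layer on top of the shared base
    COPY paper.Rmd /home/rstudio/
    # Render the document when the container is run (assumes rmarkdown is installed)
    CMD ["Rscript", "-e", "rmarkdown::render('/home/rstudio/paper.Rmd')"]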

It might sound crazy, as a Docker image including the OS and RStudio has a
size of hundreds of megabytes.

But Docker has a built-in, very smart storage and distribution system for
its images. If everybody used the same base image (or a small number of
them), then every reader of scientific papers distributed as Docker images
would only need to download the base images once.
Later downloads by the same person of different papers (again as Docker
images) would only fetch the layer containing the document (very small,
comparable to a Word or PDF file), as the base images get cached by the
Docker client.
(This feature does not exist for classical virtual machine images.)
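
For example (image names are hypothetical), the first pull would fetch the
shared base layers, while the second would re-use them from the local cache
and fetch only the small paper layer:

    # First paper: downloads base OS + RStudio layers plus the paper layer
    docker pull registry.my-host.my-domain/alice/paper-one
    # Second paper: base layers are already cached, only the paper layer is fetched
    docker pull registry.my-host.my-domain/alice/paper-two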

The same is true for storing those "docker-fied" papers: only the
"difference" relative to the base images needs to be stored each time.
This also means it is feasible for a university or institute to run its
own private Docker registry (the code is open source) and host all of its
research papers as Docker images.
Such a registry could hold Docker images built with whatever technology
(R, Python, Java, Julia, whatever runs on Linux). Everything could be
distributed in the same format.

Concretely, distributing those images then just becomes a matter of telling
people the URL of the Docker image (e.g.
http://registry.my-host.my-domain/user-name/paper-name ).
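
Publishing a paper image to such a registry would look something like this
(a sketch, re-using the illustrative registry host from above):

    # Name the local image after its location in the private registry
    docker tag paper-image registry.my-host.my-domain/user-name/paper-name
    # Upload it; only layers the registry does not already have are transferred
    docker push registry.my-host.my-domain/user-name/paper-name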

Everybody with access to a Docker installation (local PC or cloud) could
then use those images and reproduce the analysis or paper in the "same"
environment as the original author.

Please provide me with any comments you might have.

Carsten
Carl Boettiger
2014-10-08 16:56:28 UTC
Hi Carsten,

Before getting to your question, note that the first challenge here is
really getting researchers to share software at all, or even to work
in a scripted manner that makes it possible to have something to
share. In almost all academic cases, distributing an .Rmd is already
a crazy and rather inaccessible idea. I suspect that only when
technology offers short-term rewards will researchers adopt it in
significant numbers. (I do believe Docker brings us closer to those
short-term rewards.)

You've posed the issue in terms of distributing images. How about
just distributing the Dockerfile along with the .Rmd? If your
computational environment is simple enough, chances are I can just
ignore the Dockerfile and run your Rmd in RStudio. If things don't
work, then the Dockerfile is at once both an automated recipe and
human-readable documentation I can use either to troubleshoot or to
deploy directly as a container. If the image is available somewhere,
so much the better, but it would not be strictly necessary. A few
hundred MB is still large compared to a few KB for a Dockerfile and a
code script. Technically the Dockerfile doesn't guarantee a bitwise
identical environment, but its transparency and ease of remixing are
probably more important anyway.
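
In that case the reader's whole workflow would be something like (image name
illustrative):

    # Turn the recipe into an image, then render the paper inside it
    docker build -t paper .
    docker run paper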

Just my perspective here, of course; I'd be happy to hear others
push back, etc.

Cheers,

Carl



--
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
Carsten Behring
2014-10-09 09:59:30 UTC
Hi Carl,

thanks for giving me some insight into the reality of research.

I know that the Docker community promotes Dockerfiles over "keeping
images", and I can understand the reasoning behind this.

But in the case of R, mainly keeping the Dockerfile seems to me a problem
for long-term reproducibility: re-building the Docker image from the
Dockerfile in the future will change the package versions...

Something like Packrat could solve this as well.
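
What I have in mind is roughly the following (a sketch; packrat's workflow
may differ in detail):

    # In R, inside the project directory in the container:
    install.packages("packrat")
    packrat::init()      # start tracking the project's package versions
    packrat::snapshot()  # record the exact versions in the lock file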

I also thought about the problem from an IT organisation's point of view.

I think the IT department of a research organisation would love to hear
that storage and distribution of research artifacts (papers, code, data),
done with whatever technology, could be achieved by just having to manage
(install, upgrade, back up) a single application: a private Docker registry.

Cheers,

Carsten
Carl Boettiger
2014-10-09 22:40:02 UTC
Hi Carsten,

Thanks, no doubt you're right about that -- keeping the binary image
is the best way to make sure one has access to the correct versions
(ideally along with the Dockerfile). It's particularly cool that you
could always diff the images to see which layers had changed, at least
in a crude way.
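
For instance (image and container names illustrative), Docker already gives
a crude view of this:

    # List the layers of an image and the commands that produced them
    docker history user-name/paper-name
    # Show which files a running container has changed relative to its image
    docker diff some-container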

As you say, the registry goes a long way towards this, though I think some
care would have to be taken to prevent updates from overwriting
previous images. The Docker registry design isn't really concerned
with preserving archival states -- most users will be more worried
about working with an image that has *not* been rebuilt recently, and
has thus missed security updates and so forth. You'd have to do
something to 'lock' a registry entry, or otherwise end up basically
storing just image tarballs.
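
Storing such tarballs is at least straightforward (file and image names
illustrative):

    # Archive the exact image, all layers included
    docker save -o paper-2014-10.tar user-name/paper-name
    # Restore it later, unchanged
    docker load -i paper-2014-10.tar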

I suspect the concern about archiving Dockerfiles alone is much less of
an issue if the Dockerfile installs entirely from Linux distribution
binaries (where all the software is packaged into specific releases
that are maintained and archived) than, say, a Dockerfile that installs
R packages from CRAN or other less persistent locations (of course
CRAN 'archives' packages that it removes, but installing them then
becomes rather manual). It's an interesting question how one might
write a Dockerfile so that it is most likely to remain stable in the
future.
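
To illustrate the contrast (a sketch; the base image and package choices
are arbitrary):

    FROM debian:wheezy
    # Route 1: distribution binaries; versions are pinned to the wheezy release
    RUN apt-get update && apt-get install -y r-base r-cran-ggplot2
    # Route 2: CRAN; installs whatever version is current at build time
    RUN Rscript -e "install.packages('dplyr', repos='http://cran.r-project.org')"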

Yeah, packrat is clearly trying to solve that problem, but I haven't
been thrilled with it so far. It doesn't capture dependencies on
external libraries like libxml2, etc., though most of those tend to be
more stable. More to the point, in my experience it has always felt
rather invasive in my workflow -- I spend time 'managing packrat',
having packrat scripts running in the background all the time, or
manually telling packrat that using the newer version of some software
is going to be okay. Perhaps that's gotten better; maybe others can
comment.


Speaking of running old versions of code, I think the GRANbase/switchr
approach is pretty promising:
http://blog.revolutionanalytics.com/2014/08/gran-and-switchr-cant-send-you-back-in-time-but-they-can-send-r-sort-of.html
It's nice to be able to run on the old systems when necessary, rather
than the packrat model of simply locking you into a particular version
of everything.
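
If I read the linked post correctly, the idea is roughly the following
(function and library names as I recall them from the post; treat this as
a sketch):

    # Switch the R session into a named, self-contained package library
    library(switchr)
    switchTo("paper-2014-env")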



--
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/