Invertible Reproducible Documents

Eric Lim (jlim062@aucklanduni.ac.nz) [1], Paul Murrell (paul@stat.auckland.ac.nz) [1], and Finlay Thompson (finlay@dragonfly.co.nz) [2]

[1] Department of Statistics, University of Auckland
[2] Dragonfly Science

Abstract

Reproducible documents provide an efficient way to produce reports by automatically generating content from code chunks within the report. The processing of a source document that contains code chunks into a final document that contains automatically generated content is typically one-way, with the resulting report being read-only. This report describes an experiment that attempts to make the final document modifiable and to invert the process from final document back to source document, so that modifications to the final document can be efficiently conveyed back to the original author of the report.


Motivating problem

This article addresses the following standard scenario: a consultant needs to prepare a report for a client where, in addition to normal text, the report contains tables and graphs based on the results of data analysis.

Following best practice, the consultant produces the report as a reproducible document, using tools such as 'Sweave' and 'knitr'. This means that the consultant creates and edits a source document that combines normal text content with sections of computer code - code chunks - where the code chunks describe data analysis tasks. The figure A source document presents a minimal example of a reproducible document.


      

A source document: This LaTeX-based reproducible document contains a combination of normal LaTeX content and a chunk of R code (the code chunk has been highlighted in yellow [will appear underlined when printed]).

In the normal workflow, the source document is processed to produce a final document for the client. As part of this processing, the code chunks are automatically run and the results from running the code, e.g., tables and graphs, are automatically inserted in the final document. The figure A final document presents the result from processing the document in the figure A source document.

A final document: This PDF document is the result of processing the LaTeX-based reproducible document in figure A source document. The code chunk in the source document has been processed to produce a date in the final document.

This workflow benefits the consultant because regenerating the final document is very simple, fast, and reliable. There is only one source document where all changes are made and all changes to the source document are automatically incorporated into the final document during processing.

However, a problem arises when the client is asked to provide comments on the report. Typically, the source document and the final document have different formats. For example, the source document is a LaTeX file and the final document is a PDF file, or the source document is a markdown file and the final document is an HTML file. Depending on the format of the final document and the software available to the client, it may not be possible for the client to make direct modifications to the final document, in which case the comments that the consultant receives may only be hand-written on a hard-copy of the report. Where the final document can be edited, changes to the final document format will not usually map in a simple way to changes to the source document. In either case, the task of merging comments from the final document back into the source document is very laborious and error-prone.

This article describes an experiment to set up a reproducible document workflow that allows changes made by a client to the final document to be merged accurately and efficiently back into the source document.

In practice, the content that is generated by a code chunk in a source document is more likely to be a table of results or a graph of some data and the workflow described in this report has been tested on a range of different content. The simple generation of a date value is used in the examples in this report as a placeholder for more complex code; the purpose of using this simplified example is so that we can present and discuss the entire content of source and final documents at all stages of the workflow.

The figure 'knitr' workflow provides a bird's-eye view of the standard reproducible document workflow: the consultant works on a source document and then processes it with 'knitr' to produce a final document for the client.

[Figure: diagram-knitr.svg]

'knitr' workflow: The standard uni-directional workflow from source document to final document using a reproducible document tool like 'knitr'.

The figure Invertible workflow (high level) shows the basic outline of the inverted workflow that is described in this article. The consultant works on a source document and then processes it with 'knitr' to produce a final document for the client. In addition, the client is able to make changes to the final document, and it is possible to process the final document to get back to the source document format, with the client's changes incorporated.

[Figure: diagram-invert.svg]

Invertible workflow (high level): An invertible workflow from source document to final document and back again.

The advantage of the invertible workflow is that changes by the client are embedded directly within a new source document. This means that the consultant is saved the work of merging client changes. It also means that client changes can be treated as just a new version of the source document, which has positive implications for version control. The section Discussion explores the advantages and disadvantages of the inverted workflow in much greater detail.

Solution overview

This section describes the main problems that must be solved by the inverted workflow and the approaches taken to solve them. The section Solution details provides more detailed descriptions of the solution.

HTML as the document format

A major obstacle to the goal of inverting the final document is the difficulty of converting a final document format back into the source document format. For example, if the source document is a LaTeX-based reproducible document (as in the figure A source document), powerful tools exist to convert that document to a PDF document, but the inverse, from a PDF document to a LaTeX document, is very much harder to achieve (especially if we want to get back to a LaTeX document that is identical to the original source document, apart from explicit changes made by the client to the PDF document).

The solution chosen in this experiment is to focus on HTML-based reproducible documents. The reason is that an HTML-based source document gets processed to an HTML final document; the final document format is the same as the source document format. This improves our chances of being able to convert from final document to source document because at least the format of the two documents is the same. Because HTML is an XML-like markup language, we will also be able to use the powerful XML tool set (e.g., XPath) to manipulate the documents.

The figure An HTML-based source document shows a concrete, though extremely simplified, example of the sort of source documents we will work with and the figure An HTML final document shows the final document that is produced by processing that source document. The important point is that much of the content of the final document is very similar to the content of the source document, which is what we need to have any hope of getting from the final document back to the source document.


      

An HTML-based source document: This HTML-based reproducible document contains a combination of normal HTML content and a chunk of R code (the code chunk has been highlighted in yellow [will appear underlined when printed]).



      

An HTML final document: This HTML document is the result of processing the HTML-based reproducible document in the figure An HTML-based source document. The raw HTML code of the final document is shown as well as an image of how the final document appears to a client in a web browser. The code chunk in the source document has been processed to produce a date in the final document (the HTML code that was produced to show the date has been highlighted in yellow [will appear underlined when printed]).

Allowing the client to modify the final document

In a standard workflow, the final document is often treated as a static document that the client can view, but cannot modify. In other words, the final document is usually a dead end (see the figure 'knitr' workflow).

The choice of HTML, rather than, say, PDF, as the final document format does not initially improve this situation because an HTML-only web page is designed just for viewing in a web browser, not for editing. Although, technically, it would be possible for the client to modify the raw HTML, that is not something that clients are typically asked or expected to do.

The solution chosen in this experiment is to add the ability to modify the HTML final document by adding javascript code to the final HTML document, specifically, the CKEditor javascript library, which provides a relatively familiar word-processing GUI style of editing for the client.

The figure An editable HTML final document provides a simple demonstration; the paragraph of normal text on that page can be edited by clicking on it.

An editable HTML final document: This is an editable version of the HTML final document in figure An HTML final document. A click on the highlighted paragraph of normal text brings up a toolbar that allows simple editing of the text.

Loss of information

The processing of a source document to a final document can lose information from the source document. For example, if we compare the figure An HTML-based source document with the figure An HTML final document we see that the code chunk in the source document ...


      

... does not appear at all in the final document.

The solution chosen in this experiment is to introduce a pre-processing step that makes a copy of all code chunks in such a way that those code chunk copies are retained in the final document. The figure A pre-processed HTML-based source document shows how this works for the simple example in the figure An HTML-based source document.


      

A pre-processed HTML-based source document: This HTML-based reproducible document has been pre-processed to produce copies of all code chunks (the copy is highlighted in yellow [will appear underlined when printed]).

The figure A knitted pre-processed HTML-based source document shows the result of processing that pre-processed source document with 'knitr', to show that the code chunk copies are retained in the processed document.


      

A knitted pre-processed HTML-based source document: This HTML-based reproducible document has been pre-processed to produce copies of all code chunks and then knitted to produce a processed document that retains copies of the code chunks (the retained copy is highlighted in yellow [will appear underlined when printed]).

Gain of information

The processing of a source document to a final document can also add information. For example, if we compare the figure An HTML-based source document with the figure An HTML final document we see that the output from processing the code chunk in the source document ...


      

... is new content in the final document; that content was not part of the source document.

This gain of information during processing from source document to final document creates two problems, one simple and one hard. The simple problem is that conversion from final document back to source document will have to remove all of this gained information. That can be handled as just a text-processing task (as long as we can easily identify which content has been added).

The hard problem arises because the client can see some of the gained information. And if the client can see it, the client may want to modify it. And if the client modifies content that is not contained in the source document, how do we convert that change into a change to the source document?

The solution chosen in this experiment is to disallow editing of content that has been added during processing from source document to final document. Instead, the client is only allowed to annotate that content.

Annotation is enabled by adding javascript code to the final HTML document, specifically, the Annotator javascript library, which allows annotations to be added to selected regions of an HTML document.

The figure An annotatable HTML final document provides a simple demonstration; the date text (the output from the code chunk in the source document) can be annotated by selecting any part of the date text with the mouse.

An annotatable HTML final document: This is an annotatable version of the HTML final document in figure An HTML final document. Select (click and drag) any part of the highlighted date text to try out the annotation interface.

Solution details

This section provides a more detailed description of the solution that was implemented in this experiment. The solution was implemented as a series of R functions that can be used to transform a source document into an editable and annotatable final document and then transform it back to a source document, incorporating any changes made to the final document. The figure Invertible workflow (low level) shows a diagram of the functions involved.

[Figure: diagram-invert-detail.svg]

Invertible workflow (low level): An invertible workflow from source document to final document and back again. This is an expanded view of the figure Invertible workflow (high level).

Each of the functions is described below in more detail:

sew()

This function implements the pre-processing step (before the source document is processed with 'knitr'). A copy is made of all code chunks in the source document, with each copy delimited by <!--begin.keepcode and end.keepcode--> markup. These copies are then seen as normal HTML comments by 'knitr', which means that they are ignored during processing so that they are retained in the final document. This makes it easy to restore the original code chunks when converting from final document back to source document in the rip() function (see below).

Inline code chunks are also copied, using <!--keep.rinline and --> markup.

By default, the original source file with a .Rhtml suffix is converted to a file with a .post.Rhtml suffix in this step.
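The copying step can be illustrated with a minimal base R sketch. This is not the authors' sew() function: the helper name keep_chunks() is hypothetical, and placing each copy immediately before its chunk is an assumption made only for illustration.

```r
# Hypothetical helper (not the authors' sew()): place a "keepcode" copy of
# each knitr chunk immediately before the original chunk.
keep_chunks <- function(lines) {
  starts <- grep("<!--begin.rcode", lines, fixed = TRUE)
  ends <- grep("end.rcode-->", lines, fixed = TRUE)
  stopifnot(length(starts) == length(ends), all(starts <= ends))
  out <- character(0)
  pos <- 1  # first line not yet copied to the output
  for (i in seq_along(starts)) {
    chunk <- lines[starts[i]:ends[i]]
    copy <- chunk
    n <- length(copy)
    # Rename only the delimiters, so the code inside the copy is untouched
    copy[1] <- sub("begin.rcode", "begin.keepcode", copy[1], fixed = TRUE)
    copy[n] <- sub("end.rcode", "end.keepcode", copy[n], fixed = TRUE)
    before <- lines[seq_len(starts[i] - pos) + pos - 1]
    out <- c(out, before, copy, chunk)
    pos <- ends[i] + 1
  }
  # Append whatever follows the last chunk
  c(out, lines[seq_len(length(lines) - pos + 1) + pos - 1])
}

src <- c("<p>Today is:</p>",
         "<!--begin.rcode",
         "format(Sys.Date())",
         "end.rcode-->")
sewn <- keep_chunks(src)
```

Because the copy no longer uses the rcode delimiters, 'knitr' treats it as an ordinary HTML comment, while the original chunk that follows it is still processed as usual.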

knit()

This is the normal 'knitr' processing step that replaces code chunks with the results of running the code chunk.

By default, the result of this step is a file with a .post.html suffix.

snap()

This step inserts additional javascript and HTML code into the HTML file to allow the client to edit and annotate the final document in a web browser.

Javascript code is added to the head element of the HTML file to load the CKEditor javascript library and the Annotator javascript library and other javascript libraries upon which they depend (e.g., jQuery). Javascript code is also added to enable annotation in all div elements with class="chunk" (this is the markup that 'knitr' adds to the content that was generated by 'knitr' as a result of running code chunks in the source document).

Changes are also made directly to top-level elements within the body of the HTML file to enable editing. All top-level elements that are not div elements with class="chunk" have the attribute contenteditable="true" added to allow editing with CKEditor. These elements also have an automatically-generated id attribute added (if none exists) so that each piece of editable content can be uniquely identified.
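As a rough illustration of this rule, the following base R sketch marks a set of opening tags as editable. The function name make_editable() and the "editable" id prefix are hypothetical, and the sketch assumes that the opening tags of the top-level elements have already been located and that each ends with ">".

```r
# Hypothetical helper: add contenteditable (and an id, if missing) to the
# opening tags of top-level elements, leaving knitr output divs alone.
make_editable <- function(open_tags, prefix = "editable") {
  is_chunk <- grepl("^<div[^>]*class=\"chunk\"", open_tags)
  has_id <- grepl("[[:space:]]id=", open_tags)
  new_id <- paste0(" id=\"", prefix, seq_along(open_tags), "\"")
  # Only generate an id when the element does not already have one
  extra <- paste0(ifelse(has_id, "", new_id), " contenteditable=\"true\"")
  edited <- paste0(sub(">$", "", open_tags), extra, ">")
  ifelse(is_chunk, open_tags, edited)
}

tags <- c("<h1>",
          "<p id=\"intro\">",
          "<div class=\"chunk\" id=\"today\">")
editable <- make_editable(tags)
```

In this sketch the heading gains a generated id, the paragraph keeps its existing id, and the div that holds 'knitr' output is left unchanged so that it can only be annotated.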

Finally, some extra HTML code is added to the document body to provide a "save" button (and javascript code is included in the head to implement an action for this button). A click on the "save" button saves all direct modifications of the document to a file called test-changes.txt and all annotations to a file called test-annotations.txt.

In order to save the changes from CKEditor and the annotations from Annotator, the editable final document must be hosted on a web server, so the code is designed to work on a test server that has been set up at http://stat220.stat.auckland.ac.nz/cke/ . Follow the instructions on that site to upload a file that has been prepared using sew(), knit(), and snap(). Access to the server is restricted via the username and password "cke".

By default, the result of this step is a file with a .edit.html suffix, which is ready to be uploaded to the test web server.

Once that file has been uploaded to the web server, edits and annotations have been made, and the "save" button has been clicked, the final document consists of the editable HTML document (a file with a .edit.html suffix) plus a file of changes and a file of annotations (both sitting on the web server).

The figure An annotations file shows an example of the information that is saved for annotations and the figure A changes file shows an example of the information that is saved for changes.

        
      

An annotations file: An example of the information that is saved for annotations. This includes information about what text was annotated as well as the annotation message itself. In this case, the text "25 July, 2014" in the final document has been annotated with the message "Can we use ISO 8601 format for the date?".

        
      

A changes file: An example of the information that is saved for direct edits of the final document. In this case, there were two top-level elements that could be edited (a heading and a paragraph of text); the heading was not modified, but the paragraph of text was, so the new content for the paragraph of text has been saved.

annotations()

This step merges annotations from the file test-annotations.txt into the final document as paragraphs of normal text. These annotations are comments that will require manual changes to be made to the source document, but they are at least placed in a relevant location within the document.

This step adds new content to the final document that is designed to survive the inversion back to a new source document. By default, the result of this step is a file with a .anns.html suffix.
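A minimal sketch of the merge is given here in base R. It is a simplification: the real annotation data records XPath expressions and character offsets, whereas this hypothetical insert_annotation() helper simply searches for the annotated text and assumes that the 'knitr' output is wrapped in a div with class="chunk". The wording of the inserted paragraph is also an assumption.

```r
# Hypothetical helper: insert an annotation as a paragraph just above the
# knitr output div that contains the annotated text.
insert_annotation <- function(lines, quote, text) {
  chunk_starts <- grep("<div class=\"chunk\"", lines, fixed = TRUE)
  hit <- grep(quote, lines, fixed = TRUE)
  stopifnot(length(hit) >= 1)
  # Nearest chunk opening at or before the first line with the quoted text
  before <- chunk_starts[chunk_starts <= hit[1]]
  stopifnot(length(before) >= 1)
  target <- max(before)
  note <- paste0("<p>Annotation on \"", quote, "\": ", text, "</p>")
  append(lines, note, after = target - 1)
}

final <- c("<h1>Report</h1>",
           "<div class=\"chunk\" id=\"today\"><div class=\"rcode\">",
           "25 July, 2014",
           "</div></div>")
annotated <- insert_annotation(final, "25 July, 2014",
                               "Can we use ISO 8601 format for the date?")
```

Placing the paragraph outside the div matters because everything inside the div is generated content that is discarded when the document is inverted.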

The figure A final document with annotations shows the result of merging the example annotation from the figure An annotations file into a final document.



      

A final document with annotations: The HTML final document has had annotation information merged into it in the form of a new paragraph just above the code chunk where the annotation was made (the merged annotation is highlighted in yellow [will appear underlined when printed]).

changes()

This step merges changes from the file test-changes.txt into the final document.

This step involves replacing old content (that has been edited) with new content, and this new content is designed to survive the inversion back to a new source document. By default, the result of this step is a file with a .save.html suffix.
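The replacement of edited content can be sketched as follows in base R. The apply_changes() helper is hypothetical, as is the representation of the saved changes as a named character vector (element id mapped to new content); the sketch also assumes that each editable element sits on a single line.

```r
# Hypothetical helper: replace the content of elements identified by id.
# 'changes' is a named character vector: names are ids, values are new content.
apply_changes <- function(lines, changes) {
  pattern <- "^([[:space:]]*<[^>]+>)(.*)(</[^>]+>[[:space:]]*)$"
  for (id in names(changes)) {
    idx <- grep(paste0("id=\"", id, "\""), lines, fixed = TRUE)
    if (length(idx) != 1) next  # skip ids that are missing or ambiguous
    parts <- regmatches(lines[idx], regexec(pattern, lines[idx]))[[1]]
    # Keep the opening and closing tags, swap only what lies between them
    if (length(parts) == 4) {
      lines[idx] <- paste0(parts[2], changes[[id]], parts[4])
    }
  }
  lines
}

doc <- c("<h1 id=\"editable1\" contenteditable=\"true\">Report</h1>",
         "<p id=\"editable2\" contenteditable=\"true\">Old text.</p>")
changed <- apply_changes(doc, c(editable2 = "New text."))
```

Only the line whose id appears in the changes is touched; every other line, including its white space, passes through unmodified.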

The figure A final document with changes shows the result of merging the example changes from the figure A changes file into a final document.



      

A final document with changes: The HTML final document has had changes merged into it where content was modified by CKEditor (the merged change is highlighted in yellow [will appear underlined when printed]).

rip()

The last step converts the final document back into a source document by removing all unwanted content that was added in previous steps. This involves removing all content added by 'knitr' in the knit() step, removing all content added in the snap() step (that made the final document editable), and restoring the original code chunks from the copies that were made in the sew() step.

Any new content from annotations or from changes made directly with CKEditor is retained.

The result of this step is a .Rhtml file, which is a source document, so it is designed to be viewed and modified in a text editor (not viewed in a web browser). By default, the result of this step is a file with a .return.Rhtml suffix.
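The inversion can be sketched in base R under some strong assumptions. The unknit() helper is hypothetical: it assumes that each block of 'knitr' output starts on a line containing <div class="chunk" and ends on the next line containing </div></div>, and it ignores the javascript and generated id attributes that a full inversion would also have to remove.

```r
# Hypothetical helper: drop knitr output, strip editing attributes, and
# turn the retained "keepcode" copies back into real code chunks.
unknit <- function(lines) {
  starts <- grep("<div class=\"chunk\"", lines, fixed = TRUE)
  closes <- grep("</div></div>", lines, fixed = TRUE)
  drop <- integer(0)
  for (s in starts) {
    e <- closes[closes >= s][1]  # first closing line at or after the start
    if (!is.na(e)) drop <- c(drop, s:e)
  }
  if (length(drop) > 0) lines <- lines[-drop]
  lines <- gsub(" contenteditable=\"true\"", "", lines, fixed = TRUE)
  lines <- sub("<!--begin.keepcode", "<!--begin.rcode", lines, fixed = TRUE)
  sub("end.keepcode-->", "end.rcode-->", lines, fixed = TRUE)
}

final <- c("<p contenteditable=\"true\">Today is:</p>",
           "<!--begin.keepcode",
           "format(Sys.Date())",
           "end.keepcode-->",
           "<div class=\"chunk\" id=\"today\"><div class=\"rcode\">",
           "25 July, 2014",
           "</div></div>")
restored <- unknit(final)
```

The result contains the paragraph (with any client edits) followed by the original code chunk, which is the shape of a source document that 'knitr' can process again.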

The figure An inverted final document shows the result of converting the final document, with all annotations and changes merged, back into a source document, with all unwanted content removed. The important point is that this inverted document is very similar to the original source document from the figure An HTML-based source document.


      

An inverted final document: The HTML final document has been converted back into a source document.

The remaining sub-sections in this section describe some general features of the solution.

Text processing versus XML tools

Most of the steps involved in the inverted workflow (other than the 'knitr' step) could be characterised as "text processing" - searching for patterns within a document and then modifying, replacing, or removing the matching sections. For example, the sew() step involves searching for code chunks using the patterns <!--begin.rcode and end.rcode-->, making a copy and, in the copy, changing rcode to keepcode.

Because the documents that we are working with are HTML-based, another way to view the steps in the inverted workflow is as manipulations of XML content. For example, when we want to find all top-level elements within the body of the document, it is simpler to use an XPath expression like //body/* rather than attempt to search the document for opening and closing tags using text processing.

Both of these approaches are employed within the solution. XML tools are used to help locate important places within the document, using functions from the 'XML' package for R: a combination of getNodeSet() to identify fragments of the XML using XPath expressions and getLineNumber() to map a location within the XML to a location in the text document. Text processing is then employed to actually make most changes.

The reason for this mixture of approaches is that text processing has a lower impact on the document. The round trip from text document, to XML objects within R, and back to text document introduces side-effects (changes to both white space and special entities) that interfere with our desire to be able to convert from source document to final document and back again.
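The locating step might look something like the following sketch, which assumes that the 'XML' package is installed. The example document is invented for illustration, and the line numbers reported by getLineNumber() depend on how the parser records positions, so they are shown without expected values.

```r
library(XML)

# An invented final document, one element per line
html <- c("<html>",
          "<body>",
          "<h1>Report</h1>",
          "<p>Today is:</p>",
          "<div class=\"chunk\">25 July, 2014</div>",
          "</body>",
          "</html>")

# Parse only to locate elements; the text itself is never rewritten from XML
doc <- htmlParse(paste(html, collapse = "\n"), asText = TRUE)
top <- getNodeSet(doc, "//body/*")

located <- data.frame(element = sapply(top, xmlName),
                      line = sapply(top, getLineNumber))
```

The parsed document is used purely as an index into the text. Any modification would then be applied to the corresponding elements of the character vector html, which avoids serialising the XML objects back to text.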

The importance of 'diff'

The R functions in the inverted workflow work quite hard to make sure that the only differences between the original source document and an inverted final document are differences that have been deliberately introduced by editing or annotating the final document. In particular, the functions work hard to avoid any spurious changes in white space, including changes in indenting and changes in where line breaks occur.

The reason for this emphasis is that it makes it possible to use simple, standard, text processing tools, e.g., 'diff', to easily identify changes. This has major flow-on benefits for version control because tools like 'git' and 'svn' can then reliably track the changes. Complex situations, for example when the consultant makes changes to the source document at the same time as the client is making changes to the final document, which require a merge of different versions of the document, can also be handled with existing version control tools.
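To show why exact white space matters, here is a crude base R stand-in for 'diff'. The line_changes() helper is hypothetical and much weaker than the real tool: it compares sets of lines, so it ignores line order and repeated lines. The two example documents are invented for illustration.

```r
# Hypothetical helper: a crude, order-insensitive stand-in for 'diff'
line_changes <- function(original, inverted) {
  list(removed = setdiff(original, inverted),
       added = setdiff(inverted, original))
}

original <- c("<h1>Report</h1>",
              "<p>Today is:</p>",
              "<!--begin.rcode",
              "format(Sys.Date())",
              "end.rcode-->")
inverted <- c("<h1>Report</h1>",
              "<p>The date today is:</p>",
              "<p>Can we use ISO 8601 format for the date?</p>",
              "<!--begin.rcode",
              "format(Sys.Date())",
              "end.rcode-->")
delta <- line_changes(original, inverted)

# A single stray space is enough to register as a change
spurious <- line_changes("<p>Today is:</p>", "<p>Today is:</p> ")
```

Here delta reports one removed line and two added lines, all of which are deliberate client changes, whereas spurious reports a change even though nothing meaningful differs. That is the kind of noise the functions try to avoid.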

The figure A 'diff' of the round trip shows output from the 'diff' program comparing the original source document, from the figure An HTML-based source document, with the inverted final document, from the figure An inverted final document, to show that the only changes that have been made to the original source document are either direct modifications using CKEditor or new annotations.



      

A 'diff' of the round trip: The 'diff' program shows the differences between the original source document and the inverted final document, with changes and annotations merged.

Summary

A summary of the steps involved in the inverted workflow, along with example code, is shown below:

  • Create a source document, e.g., source.Rhtml.
  • Edit the source document until it is ready to send to the client.
  • Pre-process the source document ...
              sew("source.Rhtml")
            
  • Process the source document with 'knitr' to create the final document ...
              knit("source.post.Rhtml")
            
  • Post-process the final document to enable editing ...
              snap("source.post.html")
            
  • Upload the editable final document for the client to edit on the web server via http://stat220.stat.auckland.ac.nz/cke/upload.html
  • Client edits and annotates the final document via the web browser and clicks the "save" button to create files test-annotations.txt and test-changes.txt on the web server.
  • Merge changes and annotations into the final document ...
              annotations("source.edit.html")
              changes("source.anns.html")
            
  • Invert the final document to produce the new source document ...
              rip("source.save.html")
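The default file names produced by the steps above can be traced with a small base R sketch. The workflow_files() helper is hypothetical; it only derives the names from the suffixes described in this article and does not run any of the steps.

```r
# Hypothetical helper: default output file name of each step in the workflow
workflow_files <- function(source) {
  stopifnot(grepl("\\.Rhtml$", source))
  base <- sub("\\.Rhtml$", "", source)
  c(sew = paste0(base, ".post.Rhtml"),
    knit = paste0(base, ".post.html"),
    snap = paste0(base, ".edit.html"),
    annotations = paste0(base, ".anns.html"),
    changes = paste0(base, ".save.html"),
    rip = paste0(base, ".return.Rhtml"))
}

files <- workflow_files("source.Rhtml")
```

Each element of files is the input to the next step, so the vector reads as the path of a document from source.Rhtml through to source.return.Rhtml.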
            

Discussion

We have described a set of R functions that, together with 'knitr', allow a consultant to write a report as a reproducible HTML-based source document, convert that report into an HTML final document that can be directly edited and annotated by a client, merge the client's changes into the final document, and then convert the final document back into a reproducible HTML-based source document that only differs from the original source document where the client has made direct edits or the client has added an annotation.

The functions described do not represent a production-quality workflow. Their main value is to demonstrate that an inverted workflow is possible and to serve as a starting point for further exploration. This section discusses the limitations of this inverted workflow, compares it to other existing workflows, and suggests ways to generalise the workflow.

Limitations

The main limitations arise from the fact that the R functions described in this experiment are tightly coupled with the software tools chosen to solve the various issues: 'knitr', CKEditor, and Annotator. For example, the sew() step looks for code chunks based on the markup used by the 'knitr' package for HTML-based reproducible documents.

One consequence of this tight coupling is that the functions are only of use for this particular combination of tools; the functions are not very general. Another consequence is that the functions are vulnerable to changes in the underlying tools. For example, if the 'knitr' package changed its markup for code chunks, the functions would fail. So the functions are also fragile.

The choice of HTML as the source document format also imposes some limitations. For example, some components of a report, such as tables and equations, may be harder to produce in HTML compared to using LaTeX. On the other hand, HTML provides some possibilities, such as interactivity, beyond what LaTeX can offer.

Other workflows

This article has focused on the 'knitr' package as a tool for reproducible documents, but many other tools exist for this purpose. This section considers the advantages and disadvantages that other tools might offer as a basis for an inverted workflow.

Beyond 'knitr'

Within the R ecosystem, the 'Sweave' package, which predates 'knitr', can also be used, with the help of the 'R2HTML' package, as the basis for an HTML-based workflow. Assuming that the HTML produced by 'R2HTML' is as regular and predictable as the HTML produced by 'knitr', it should not be difficult to modify the functions to work with 'Sweave' instead of 'knitr'.

Some other packages provide alternative document formats, e.g., the 'odfWeave' package allows a workflow based on Open Office documents. The idea of generalising the inverted workflow to other document formats will be discussed in more depth later, but one package deserves attention here: the 'SWord' package. The interesting aspect of this package is that the source document and final document are not only both Word documents, they are the same document. The figure The SWord workflow shows a conceptual diagram of the workflow when source and final document are the same thing.

[Figure: diagram-sword.svg]

The SWord workflow: A reproducible workflow based on the 'SWord' R package. Both the consultant and the client can work on exactly the same document because when 'SWord' processes the source document, it simply modifies the document; the source and final documents are the same thing.

This approach completely avoids the issue that the inverted workflow is trying to solve because there is no document transformation to invert. In addition, the document is not only editable, it is in a format that clients will be comfortable editing. However, one downside is that it forces the consultant to work in Word, which will not be ideal if the consultant prefers simpler, text-based document formats in order to work programmatically with the source document and to obtain full benefit from version control tools such as 'svn' and 'git'.

Another related tool that needs mentioning is the 'patchDVI' package. This is relevant to a LaTeX-to-PDF reproducible workflow and allows a client who is viewing a PDF final document to link a location in the PDF document back to the corresponding location in a LaTeX-based source document. This tool is more aimed at enabling the source document author (the consultant rather than the client) to map PDF locations back to the LaTeX source, so it has a different aim compared to the inverted workflow.

Beyond R

The discussion of an inverted workflow has been centred around reproducible tools in the R environment, but reproducible documents and shared documents exist in many other forms as well. Two notable examples are IPython notebooks and the R Extension for MediaWiki. Both of these are similar to 'SWord' in that they work with a single document that both consultant and client can edit, so there is no transformation to invert, but the consultant and client both have to work in the same environment, which is likely to make one or the other unhappy.

Generalisations

The inverted workflow described in this article is limited to an HTML-based source document that is processed by 'knitr' to an HTML final document, and that is made editable via CKEditor and annotatable via Annotator. This section considers how many of these constraints could be relaxed in future work.

Beyond CKEditor and Annotator

While it would be a simple matter to include a different javascript library than CKEditor, one important feature of CKEditor that might be difficult to replicate is the ability to enable editing via CKEditor in only a subset of the elements of an HTML document (and to save changes per element).

The most useful feature of the Annotator library was that it provides quite detailed information, in the form of both XPath expressions and character offsets, about exactly which part of the document was annotated. It was also useful that the Annotator library allowed annotation on only a subset of the elements of an HTML document.

Another avenue for future experimentation would be to explore the lifecycle of annotations in a more sophisticated fashion. Having received information from the client in the form of an annotation, it might be desirable for the consultant to round-trip the annotation back to the client with information about whether the annotation has been resolved. In other words, the inversion from final document back to source document in the inverted workflow is currently too one-directional and does not consider the possibility of recycling information from the client, in the form of a modified source document, back to the client in a new final document.

Beyond HTML

The choice of HTML as the document format for the inverted workflow proved successful, but it is an open question whether that success could be replicated with other document formats. It appears unlikely that the common transformation from a LaTeX source document to a PDF final document could be inverted, but there may be some hope for a workflow that goes from an R Markdown source document to an HTML final document. Future work is planned to explore this further.

Beyond consultant/client mode

This experiment was motivated by the scenario where a consultant produces a report for a client, but reproducible documents are also being used in several other important contexts, including scientific reproducibility and literate code development.

It is less likely that an inverted workflow would be of use in literate code development because both the producer and consumer of source and final documents are developers with technical skills to work on the source document (the final document can safely be read-only because everyone is capable of working directly on the source document).

However, an inverted workflow may have some relevance in the context of scientific reproducibility. In this case, the technical skills of the producer and consumer of source and final documents can vary widely and there is potential value in a consumer being able to transmit comments to the producer via changes to the final document. This applies not only to the relationship between one researcher as author and another researcher as reader, but also to the peer-review process where the reader is a reviewer.

Reusable documents versus reproducible documents

The standard reproducible document workflow involves the consultant working on a source document and then processing that document to produce a read-only final document for the client.

The inverted workflow described in this report allows the client to modify, as well as view, the final document and allows the final document to be processed back into a source document.

A further generalisation of the idea of invertible documents is the idea of reusable documents. The point of making a document reusable is that we might want to perform other processing tasks with the final document. For example, we might want to extract all source code from the final document (a process sometimes described as "tangling"). More generally, some third party might want to process the final (or source) document in some way that the original author has not predicted.
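To make the tangling example concrete, the following sketch extracts the code chunks from an HTML-based source document. It assumes the standard 'knitr' chunk syntax for HTML, in which a chunk is an HTML comment that opens with `<!--begin.rcode` and closes with `end.rcode-->`. The `tangle` function and the sample document are hypothetical, and the sketch deliberately works on the source document rather than guessing at how the final document stores its retained code.

```javascript
// Extract every knitr code chunk from an HTML source document.
// Each result holds the chunk header (label and options) and the chunk code.
function tangle(source) {
  // Header is the rest of the opening line; code is everything up to end.rcode-->
  const chunkPattern = /<!--begin\.rcode([^\n]*)\n([\s\S]*?)end\.rcode-->/g;
  const chunks = [];
  for (const match of source.matchAll(chunkPattern)) {
    chunks.push({ header: match[1].trim(), code: match[2].trim() });
  }
  return chunks;
}

// Hypothetical source document with one code chunk between two paragraphs.
const source = [
  "<p>Some text.</p>",
  "<!--begin.rcode summary-table, echo=FALSE",
  "x <- c(1, 2, 3)",
  "mean(x)",
  "end.rcode-->",
  "<p>More text.</p>"
].join("\n");

console.log(tangle(source));
```

Because the chunk delimiters are predictable markup, a third party can perform this extraction without any knowledge of the tools that produced the document, which is the sense in which the document is reusable.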

There are several reasons why the workflow described in this report allows the final document to be inverted: information from the source document is never thrown away (e.g., all source code is kept even when it is not visible in the final document); processing of the source document to the final document happens in a predictable way, with structure and labelling (we know or we can identify what additional content gets added during processing); and the format of the final document is a markup format (HTML).

These features - retention of content, predictable processing output, and use of markup - are desirable features for any workflow where a document author wishes to produce a final document that can not only be used for a specific, predetermined purpose, but can also be reused in ways that the author might not have anticipated.

Availability

The functions described in this article are available from GitHub.

This work is licensed under a Creative Commons Attribution 4.0 International License.

All functions described in this report and all code samples embedded in the source document for this report are released under the GPLv3 licence.

The examples in this report were tested with CKEditor version 4.4.3 and Annotator version 1.2.9.

The files required to create the test web server are: upload.html and upload.php, for uploading the editable final document; and save-changes.php and save-annotations.php, for saving changes and annotations to the files "test-changes.txt" and "test-annotations.txt".

Contributions

All authors were involved in the design of the software described in this article. Finlay Thompson provided the original problem statement. Paul Murrell wrote the JavaScript code to incorporate CKEditor and Annotator in an HTML document. Eric Lim developed all of the R code for the sew(), snap(), annotations(), changes(), and rip() functions.

Acknowledgements

This article was processed from a source document using 'knitr' and 'knitcitations'.

References