Consider the following short R script that shows the development of a collapse() function to produce a single object from two input objects.
The scenario that we want to focus on in this document is one where we are developing this R code in a somewhat ad hoc fashion. We record our work in a text file, and then cut-and-paste or otherwise submit sections of the code to R. For example, first we define the collapse() function itself ...
... then we define a special operator for convenience ...
... and finally we run some tests to check that the function and operator work as expected ...
This approach of submitting code piecemeal from a text file
is not the safest way to develop code; for example, it would be safer to
Submitting code piecemeal is not best practice because it makes
it very easy
to shoot ourselves in the foot. For example, consider a scenario
where we decide to modify the definition of the
After modifying our script, we resubmit the
... and rerun the tests ...
... which now give different results because the special operator,
That was a very stupid thing to do; it would have been avoided if we
had
The purpose of this document is to explore an idea for protecting ourselves against this sort of stupidity.
The essence of the problem is that it is possible to submit
R expressions from a script in the wrong order (in an
order that does not correspond to the order of the expressions
within the script file). In the example above, our mistake was
to run the
tests before redefining the
The proposed solution is to monitor every R expression as it is submitted and detect when expressions are evaluated in the wrong order. We do this by recording a time stamp whenever a value is assigned to a symbol; this gives every symbol an "age". We also record which other symbols are involved in every assignment; this gives every symbol a list of "dependents" (the symbols that its value was computed from). This allows us to determine whether an expression involves a "stale" symbol: a symbol that is older than at least one of its dependents.
In the example above, when we tried to run the expression ...
... we would get a warning that the
It turns out not to be too hard to produce a simple demonstration of the solution. However, to do so, we need to simplify the example problem even further. We will work with the following R script.
The first thing we need is a place to record time stamps and dependencies.
Now, we define a new function,
Here is this function in action and the resulting time stamps and dependencies that are recorded.
We can see that the assignments happened in the order 'a', then 'b', then 'c'.
We can see that 'a' has no dependencies, 'b' is dependent on 'a', and 'c' is dependent on 'b'.
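As a minimal sketch of the recording step described above, the following uses only base R. The names `track`, `tracked_assign()` and `work` are hypothetical (they are not taken from the original code), and an integer counter stands in for a wall-clock time stamp so that the ordering is deterministic.

```r
# Hypothetical store (an assumption, not the original code):
# a logical clock plus two lists.
track <- new.env()
track$clock <- 0L       # incremented on every tracked assignment
track$age   <- list()   # symbol name -> time stamp of its last assignment
track$deps  <- list()   # symbol name -> tracked symbols used to compute it

# Evaluate `expr`, assign the result to `name`, and record age and dependencies.
tracked_assign <- function(name, expr, envir = parent.frame()) {
  expr <- substitute(expr)   # capture the unevaluated right-hand side
  # Only symbols that have themselves been tracked count as dependencies
  used <- as.character(intersect(all.vars(expr), names(track$age)))
  value <- eval(expr, envir)
  assign(name, value, envir = envir)
  track$clock <- track$clock + 1L
  track$age[[name]]  <- track$clock
  track$deps[[name]] <- setdiff(used, name)   # a symbol does not depend on itself
  invisible(value)
}

work <- new.env()   # scratch environment for the example
tracked_assign("a", 1, envir = work)
tracked_assign("b", a + 1, envir = work)
tracked_assign("c", b * 2, envir = work)

unlist(track$age)   # time stamps in the order 'a', then 'b', then 'c'
track$deps          # 'a': none; 'b': "a"; 'c': "b"
```

Using `all.vars()` to find dependencies is a deliberate simplification; it only works for very simple right-hand sides like the ones in this example.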
With that information recorded, we can write a function that determines whether a symbol is "stale" (whether it is older than any of its dependents, or any of its dependents is itself stale).
None of our symbols are currently stale, either because they have no dependents, or because they are younger than their dependents ...
... but if we assign a new value to 'a' (using
We can work this
Now if we try to do something stupid, like reassigning 'c' without first reassigning 'b', we get a warning.
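The staleness rule can be sketched as a small recursive function. This is an illustration under stated assumptions rather than the original implementation: `is_stale()` is a hypothetical name, and the ages and dependencies are written out by hand as plain lists (integer time stamps, larger meaning more recent).

```r
# A symbol is stale if any symbol it depends on was assigned more recently,
# or is itself stale. `seen` guards against dependency loops.
is_stale <- function(name, age, deps, seen = character(0)) {
  if (name %in% seen) return(FALSE)
  for (d in deps[[name]]) {
    if (age[[d]] > age[[name]] || is_stale(d, age, deps, c(seen, name))) {
      return(TRUE)
    }
  }
  FALSE
}

# Hand-written records for the three symbols: a, then b (from a), then c (from b)
age  <- list(a = 1L, b = 2L, c = 3L)
deps <- list(a = character(0), b = "a", c = "b")

is_stale("c", age, deps)   # nothing is stale yet

age$a <- 4L                # pretend 'a' has just been reassigned

is_stale("a", age, deps)   # 'a' has no dependents, so it cannot be stale
is_stale("b", age, deps)   # 'b' is now older than 'a'
is_stale("c", age, deps)   # 'c' is newer than 'b', but 'b' is itself stale
```

A tracked assignment function could call `is_stale()` on every symbol used on the right-hand side and issue a warning if any of them is stale.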
There are several major problems with the naive solution presented in the previous section.
For a start, it is unlikely that users are going to want to change
to using a
Even if users were prepared to do this, assignments of the form ...
... or ...
... would be hard to support.
An alternative could be to define a special operator, say
... which would require of users only a simple search-and-replace. However, this quickly runs into problems of its own, including the fact that operator precedence places special operators ahead of common arithmetic operators, so we get problems like this ...
... and the solution, bracketing the right-hand side of the assignment, is again not something users are likely to embrace willingly.
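The precedence problem is easy to reproduce with a stand-in operator. The operator name `%<-%` below is hypothetical (chosen only for this illustration) and it performs a plain assignment with no tracking; the point is purely how R parses the expression.

```r
# Hypothetical assignment operator: assigns the value of `rhs` to the
# symbol given as `lhs` in the calling environment.
`%<-%` <- function(lhs, rhs) {
  name <- as.character(substitute(lhs))
  assign(name, rhs, envir = parent.frame())
  invisible(rhs)
}

# Special operators bind more tightly than `+`, so this is parsed as
# (x %<-% 1) + 2: 'x' is assigned 1 and the sum is simply discarded.
x %<-% 1 + 2
x

# The top-level call of the parsed expression is `+`, not `%<-%`
quote(x %<-% 1 + 2)[[1]]

# Bracketing the right-hand side gives the intended assignment
x %<-% (1 + 2)
x
```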
A further problem is that this naive solution only detects problems on assignment, not on each use of a symbol. For example, if we reassign 'a', just using 'b' is not enough to trigger a warning; we must make an assignment involving 'b'.
The script we are working with is also extremely simple and identifying the dependent symbols will be much harder for more complex code.
In this section, we look at a more comprehensive attempt at a solution.
The same basic approach is used to record time stamps and dependencies so that we can determine whether a symbol is stale, but more sophisticated tools are used to determine the dependencies between symbols.
Firstly, the
This information allows us to determine whether an assignment took
place and which symbols were the target(s) of the assignment.
Importantly,
In addition to
This function is useful because it will not identify symbols that are involved in a formula (or are otherwise quoted), which helps us to avoid creating unnecessary dependencies ...
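To see why formulas matter, consider what a naive symbol extractor reports. The sketch below uses base R's `all.vars()` purely as a baseline for comparison; it is not necessarily one of the functions discussed above.

```r
# A model-fitting call in which 'mpg' and 'wt' are column names inside a
# formula, not symbols that the user has assigned.
expr <- quote(lm(mpg ~ wt, data = mtcars))

# all.vars() descends into the formula and reports the column names as if
# they were ordinary variables.
all.vars(expr)
```

If the user also happened to have their own symbol called 'wt', a tool based on this output would record a dependency that does not really exist.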
Another major difference with our more complete solution
is that, instead of defining a function or a special
operator, we create a "safe mode" sub-session to handle all input from the
user.
This approach is based on
Ross Ihaka's
A complete explanation of the code for the 'safemode' package is available in the literate document that is installed with the package. This section now just focuses on a demonstration of some of the useful features of this "safe mode."
We start a "safe mode" session by calling the
There now follows a set of examples:
We can reproduce our original problem to show that "safe mode"
can detect that the
We can detect problems on any use of a stale symbol, not just on assignments. Furthermore, "safe mode" works with normal R expressions, so an arithmetic expression on the right-hand side of an assignment is not a problem.
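The idea of checking every submitted expression, rather than only assignments, can be illustrated without the package. The following is a toy stand-in and not the 'safemode' implementation: `make_session()` and `submit()` are hypothetical names, only the first expression in the text is evaluated, only simple `name <- value` assignments are recognised, and dependencies are found with base R's `all.vars()` instead of proper code analysis.

```r
# Toy evaluator (an assumption for illustration only): every expression is
# checked for stale symbols before it is evaluated.
make_session <- function() {
  env   <- new.env(parent = globalenv())  # where submitted code is evaluated
  age   <- list()                         # symbol -> logical time stamp
  deps  <- list()                         # symbol -> symbols it was computed from
  clock <- 0L

  stale <- function(name, seen = character(0)) {
    if (name %in% seen) return(FALSE)     # guard against dependency loops
    for (d in deps[[name]]) {
      if (age[[d]] > age[[name]] || stale(d, c(seen, name))) return(TRUE)
    }
    FALSE
  }

  submit <- function(text) {
    expr <- parse(text = text)[[1]]
    is_assign <- is.call(expr) && identical(expr[[1]], as.name("<-")) &&
      is.name(expr[[2]])
    rhs  <- if (is_assign) expr[[3]] else expr
    used <- as.character(intersect(all.vars(rhs), names(age)))
    for (s in used) {
      if (stale(s)) warning("symbol '", s, "' is stale", call. = FALSE)
    }
    value <- eval(expr, env)
    if (is_assign) {
      target <- as.character(expr[[2]])
      clock <<- clock + 1L
      age[[target]]  <<- clock
      deps[[target]] <<- setdiff(used, target)
    }
    invisible(value)
  }

  list(submit = submit, stale = stale)
}

s <- make_session()
s$submit("a <- 1")
s$submit("b <- a + 1")
s$submit("a <- 10")       # 'b' is now older than 'a'
s$submit("b * 2")         # a mere use of 'b' triggers the warning
s$submit("b <- a + 1")    # re-running the assignment refreshes 'b'
s$submit("b * 2")         # no warning this time
```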
We can handle assignments that only "update" the left-hand side of an assignment, for example, assigning to a subset and assigning new names to an object ...
An important detail is that "safe mode" only records dependencies between symbols that we have assigned in the current session (which reflects the fact that we are interested in preventing users from evaluating expressions within their own code in an inappropriate order) ...
We can detect that symbol 'z' is stale when it depends on 'y', which depends on 'x', and 'y' is stale ...
We can report that more than one dependency is stale at once ...
We can handle a function with a global variable ...
We can handle a multiple assignment ...
We can handle a compound expression ...
We can cope with the right-assignment operator ...
We can work with a for loop ...
It is possible to create dependency loops, but "safe mode" reports those without itself getting stuck in an infinite loop (it is up to the user to break the dependency loop) ...
Finally, a demonstration that "safe mode" works properly in a negative-result sense (it does not report problems where there are none) ...
To exit "safe mode", we type
The 'safemode' package provides a
The idea of "safe mode" solves a problem that should not exist (in an ideal world). If people developed R scripts in a strictly disciplined fashion, "stale" symbols would not arise. However, the reality is that this sort of problem can occur, probably does occur, and the 'safemode' package demonstrates an idea for protecting users from harming themselves in this way.
This is not a tool that would be useful for developers of R packages, but it might help to prevent some nasty slip-ups when developing scripts in a more casual fashion for small one-off projects.
The code analysis involved in determining dependencies between symbols is based on heuristics (and some special cases), so it can be defeated by things like non-standard evaluation. In the following example, "safe mode" incorrectly creates a dependency between 'mpg' and 'x' ...
Because
There are several R packages that provide code analysis. The 'safemode' package has similar aims to the 'CodeDepends' package and 'safemode' relies heavily on sophisticated functions from 'CodeDepends' (and 'codetools'). The main difference is that 'safemode' is focused on a more dynamic scenario and acts as the user enters expressions, rather than being aimed at static analysis of script files.
The 'lintr' package is integrated with several editors and IDEs for R to provide dynamic code checking, but it has more of an emphasis on syntax checking and code style.
The next step will be to try out "safe mode" in a more realistic setting, rather than just on toy examples. For example, we will get students in an undergraduate course to use it in computer labs to see if it does help to catch any problems.
If "safe mode" proves to have some worth, a number of improvements immediately suggest themselves for future development:
There could be some global options to control "safe mode"
behaviour, such as customising the
It might be useful to be able to deliberately clear the time stamp and dependency databases (though this happens automatically on exit from "safe mode").
It might be useful to have a function that produces a summary of the status of all tracked symbols in "safe mode" and possibly a function that reports, if there are stale symbols, which symbols need updating (and in which order).
Something else to look at is whether the time spent monitoring and checking for "stale" symbols can become burdensome and ways to make that more efficient.
Finally, an alternative approach to this problem, rather than monitoring expressions as they are entered, might be to monitor and check for staleness entirely within the script file (within an IDE environment). For example, whenever a line is edited, highlight every other line in the file that needs to be re-evaluated as a result of that change; put another way, highlight all lines within a script that have become "stale".
The 'safemode' package is available from GitHub. It can be installed with the following ...
library(devtools)
install_github("pmur002/safemode", subdir="pkg")
There is also a tarball for installing on Linux and a zip file for installing on Windows, for version 0.1-0.
Ross Ihaka's
Luke Tierney's 'codetools' package is distributed with R, but is also available from CRAN.
Duncan Temple Lang's 'CodeDepends' package is available from the Omegahat repository. However, to get all of the examples in this document to run (particularly the ones involving a character value on the left-hand side of an assignment), you will need a fork of 'CodeDepends' from GitHub. Hopefully, this will eventually turn up in Duncan's GitHub repo. There is a zip file for the forked 'CodeDepends' for installing on Windows.
Jim Hester's 'lintr' package is available from CRAN or GitHub.
The source and additional resources used to build this document are available from the parent page for this document.