by Paul Murrell http://orcid.org/0000-0002-3224-8858
This document by Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.
This report describes the 'gdiff' package for R, which provides tools for performing visual tests on graphical output. Visual testing is essential for checking that code is producing the correct graphical output. The goal of the 'gdiff' package is to support a wide range of testing scenarios, including testing graphical output between different versions of R.
NOTE: there are several paths and computing environments used in code examples in this report that rely on the code being run within a specific Docker container; see the Resources Section for details.
When we test code, we are interested in establishing not only that the code runs without producing errors, but also that the code produces the desired output.
Testing output requires a comparison between two sets of output, which we will label "control output" and "test output". When the output from code is graphical, one way that we can compare a control image with a test image is by comparing raster versions of the two images to see whether any pixels differ.
The following code shows a manual example of this process. We draw a control image consisting of the text "same" and a test image consisting of the words "same" and "different".
The image_compare function from the 'magick' package (Ooms, 2019) produces an image of the difference between the test and control images.
The images are shown below, with the control image on the left, the test image in the middle, and an image of the differences on the right, with different pixels highlighted in red (and identical pixels light grey).
The image_compare function also gives us a count of the number of different pixels.
This provides us with a test for differences (is the number of different pixels greater than zero?), plus an image to show where differences have occurred, which can help us to track down the source of the difference within our code.
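A minimal sketch of this manual process is given below. It assumes that the 'magick' package is installed; the file names, the image size, the drawPNG helper, and the use of the "distortion" attribute (with metric="AE") to read off the pixel count are assumptions made for illustration, not necessarily the code that was used to produce the images in this report.

```r
library(grid)

# Hypothetical helper: draw one page of grid output to a small PNG file
drawPNG <- function(file, draw) {
    png(file, width=200, height=100)
    on.exit(dev.off())
    grid.newpage()
    draw()
    invisible(file)
}

# Control image: the text "same"
controlFile <- drawPNG(file.path(tempdir(), "manual-control.png"),
                       function() grid.text("same", y=.7))
# Test image: the words "same" and "different"
testFile <- drawPNG(file.path(tempdir(), "manual-test.png"),
                    function() {
                        grid.text("same", y=.7)
                        grid.text("different", y=.3)
                    })

if (requireNamespace("magick", quietly=TRUE)) {
    control <- magick::image_read(controlFile)
    test <- magick::image_read(testFile)
    # metric="AE" (absolute error) counts the pixels that differ
    difference <- magick::image_compare(test, control, metric="AE")
    numDiff <- attr(difference, "distortion")
    # Keep the difference image so that it can be inspected visually
    magick::image_write(difference,
                        file.path(tempdir(), "manual-diff.png"))
    # The test itself: is the number of different pixels greater than zero?
    print(numDiff > 0)
}
```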
It is also possible to compare images in other ways. For example, SVG images are text-based, so a text-based comparison using a tool like GNU diff can be performed. Another possibility is to reduce image files to a unique "fingerprint", through the use of a hashing function, and then compare image fingerprints.
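As a rough illustration of the fingerprint approach, the following sketch uses only base R: it draws two small PNG files and compares their MD5 hashes with tools::md5sum. The drawLabels helper and the file names are hypothetical. Note that a fingerprint can only tell us whether two files differ, not where the difference lies.

```r
# Hypothetical helper: draw some text labels to a small PNG file
drawLabels <- function(file, labels) {
    png(file, width=200, height=100)
    on.exit(dev.off())
    par(mar=rep(0, 4))
    plot.new()
    text(.5, seq(.8, .2, length.out=length(labels)), labels)
    invisible(file)
}

control <- drawLabels(file.path(tempdir(), "hash-control.png"), "same")
test <- drawLabels(file.path(tempdir(), "hash-test.png"),
                   c("same", "different"))

# One MD5 fingerprint per file
fingerprints <- tools::md5sum(c(control, test))
# The files differ if their fingerprints differ
differs <- fingerprints[[1]] != fingerprints[[2]]
print(differs)
```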
This report focuses on pixel-based comparisons.
This report describes the 'gdiff' package (Murrell, 2019a) for visual testing of graphics code. The 'gdiff' package provides functions for generating control output and test output and for comparing the two. The following code shows a simple use of the package: we generate control output from one function, test output from another function, and compare the two outputs, with the result being a difference that is represented by a comparison image and a pixel difference of 248.
There are other R packages that perform visual testing, such as 'graphicsQC' (Murrell and Gardiner, 2009), 'vdiffr' (Henry et al., 2019), 'vtest' (Chang, 2019), and 'visualTest' (Campbell and Csardi, 2019), but the aim of the 'gdiff' package is to support a broader range of testing scenarios than any of those packages. A detailed comparison of 'gdiff' with other R packages will be made later in the Discussion Section, once we have seen how the 'gdiff' package works and once we have seen some examples of its use.
The 'gdiff' package supports a wide variety of testing scenarios in two ways: it provides flexibility in the ways that graphical output can be generated; and it has low expectations about the type of graphical output that it will perform comparisons on. The 'gdiff' package also provides the result of comparisons in a very simple format, which makes it easy to build upon.
The following sections demonstrate these three aspects of the 'gdiff' package and give examples of use for the main functions in the package.
The gdiffOutput function can be used to generate graphical output, for either control images or test images. The first argument to this function is a function (with no arguments) that will generate graphical output and the second argument is a directory where image files will be generated. The following code defines a simple function, f, that generates a single plot, and the output from that function is generated in a single file in a directory called "Control". The result of the function call is a character vector of the files that were generated.
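A minimal sketch of such a call is given below, assuming that the 'gdiff' package is installed. The use of a temporary directory (rather than a "Control" directory in the working directory) is a choice made for this sketch only.

```r
# A function with no arguments that draws a single plot
f <- function() plot(1)

if (requireNamespace("gdiff", quietly=TRUE)) {
    library(gdiff)
    # First argument: the function that draws;
    # second argument: the directory in which to generate image files
    controlFiles <- gdiffOutput(f, file.path(tempdir(), "Control"))
    # The result should be the names of the files that were generated
    print(controlFiles)
}
```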
By default the name of the output file is generated by deparsing the first argument (in this case, "f"), but it is also possible to specify a name for the output file in the call to gdiffOutput. In the code below, we change the name used for the output file to "fun". A numeric suffix is also appended to the output file name automatically.
If the function that we give to gdiffOutput generates several plots, the result will be multiple files. The example below generates two plots in two separate files. This demonstrates why the numeric suffix is always appended to the output file names.
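The naming behaviour and the multiple-page behaviour can be sketched together as follows, again assuming that 'gdiff' is installed. The function g and the temporary directory are hypothetical, and passing the file name through an argument called name is an assumption based on the description above.

```r
# A function that draws two plots, so two pages of output
g <- function() {
    plot(1)
    plot(2)
}

if (requireNamespace("gdiff", quietly=TRUE)) {
    library(gdiff)
    # name="fun" replaces the default name (the deparsed "g");
    # one file per page is expected, each with a numeric suffix
    funFiles <- gdiffOutput(g, file.path(tempdir(), "Control"),
                            name="fun")
    print(funFiles)
}
```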
The function that draws plots is called with no arguments, so
the function must use data that is always available. In the
example below, the function
h makes use of the
datasets::mtcars data set.
The example in the code below uses a closure to define a function with the required data stored in the function's environment.
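One way to write such a closure in base R is sketched below. The names makePlotFun and h2 are hypothetical; the point is that the returned function takes no arguments, but can still find the data in its enclosing environment.

```r
# A function factory: the function that it returns takes no arguments,
# but 'data' remains available in the enclosing environment
makePlotFun <- function(data) {
    force(data)
    function() plot(mpg ~ disp, data=data)
}

# Store a subset of mtcars in the closure
h2 <- makePlotFun(mtcars[mtcars$cyl == 4, ])

# The closure can be called with no arguments
# (here on a null PDF device, so that no file is produced)
pdf(NULL)
h2()
dev.off()
```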
By default, the directory in which to generate output files
is created if it does not exist and emptied if it does exist.
clean=FALSE means that an existing
output directory will not be emptied. However, this is not allowed
if the directory already contains 'gdiff' output.
The code below attempts to add output to the "Control" directory
without cleaning it first, but the directory already contains
'gdiff' output, so we get an error.
We will see an example later where clean=FALSE can be useful.
By default, output is generated on a PNG graphics device, but the device argument to the gdiffOutput function can be used to select a different graphics device. This expands the range of testing scenarios to include different graphics devices. The predefined devices include postscriptDevice. The following code generates output on a PDF device.
The value of the device argument can also be a list of devices, in which case output is generated for each device. The following code generates each page of output twice: once on a PNG device and once on a PDF device.
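The two uses of the device argument can be sketched as follows, assuming that 'gdiff' is installed. The device function names pngDevice and pdfDevice are assumptions, chosen by analogy with postscriptDevice, and the output directories are hypothetical.

```r
f <- function() plot(1)

if (requireNamespace("gdiff", quietly=TRUE)) {
    library(gdiff)
    # A single device: output is generated on a PDF device
    # (pdfDevice is an assumed name)
    pdfFiles <- gdiffOutput(f, file.path(tempdir(), "ControlPDF"),
                            device=pdfDevice())
    print(pdfFiles)
    # A list of devices: each page is generated once per device
    # (pngDevice is also an assumed name)
    bothFiles <- gdiffOutput(f, file.path(tempdir(), "ControlBoth"),
                             device=list(pngDevice(), pdfDevice()))
    print(bothFiles)
}
```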
The device functions allow arguments to be passed to the graphics device, for example to control the size or resolution of the device. We will see some examples later.
Output is generated within a "session" of R, which by default is the current R session. The following code makes that explicit.
The libPaths argument to the session function can be used to specify a directory from which packages will be loaded when generating output. This expands the range of testing scenarios to allow the same code to be tested on different versions of a package (and its dependencies).
The following code installs two different versions of the 'grImport' package (Murrell, 2009) into two different directories.
We can now generate control output based on one version of 'grImport' and test output based on the other version of 'grImport'. The resulting images are displayed in each case to show that they are different.
We can generate output within a separate R session with the
localSession function. The following code
repeats the 'grImport' example, but uses a separate R session
to generate both control and test output.
Notice that the test function now no longer needs to load the 'grImport' package (because the package is loaded within a separate R session).
The localSession function also has an Rpath argument so that we can generate output within a separate R session using a different version of R. This expands the range of testing scenarios to include comparisons between different versions of R.
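A hedged sketch of this idea follows, assuming that 'gdiff' is installed and that a second R installation exists. The path to that installation is purely hypothetical, as is the assumption that Rpath should point at an Rscript executable; the session argument is assumed to be accepted by gdiffOutput in the same way as it is by gdiff.

```r
f <- function() plot(1)

# Hypothetical location of a second R installation; adjust to suit
oldR <- "/opt/R/3.5.3/bin/Rscript"

if (requireNamespace("gdiff", quietly=TRUE) && file.exists(oldR)) {
    library(gdiff)
    # Control output from a separate R session running the other R version
    gdiffOutput(f, file.path(tempdir(), "Control"),
                session=localSession(Rpath=oldR))
    # Test output from the current R session (the default)
    gdiffOutput(f, file.path(tempdir(), "Test"))
}
```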
The following code generates output from the axis help page examples using a separate R session running R version 3.5.3 for control output and the current R session, which is R version 3.6.1, for test output. The test output generates an extra file of output because an extra example was added to the axis help page in R version 3.6.0.
There is also a remoteSession function and a dockerSession function so that we can generate output on different platforms. The remoteSession function requires a host name or a cluster (as generated by parallel::makeCluster) and the dockerSession function requires the name of a docker image. In both cases, the Rpath can also be set. This expands the range of testing scenarios to include comparisons between different operating system platforms and interactions between package versions, R versions, and operating system platforms.
The following code repeats the previous example, but this time we generate the test output in a docker container that has R version 3.6.1 installed.
This sort of "remote" output generation requires some set up. For example, the 'gdiff' package at least must be installed on the remote system or the docker image, and of course R itself must also be installed.
There are two other high-level functions for generating output. The first generates output from the examples section of the specified function (either a function name or a function object). The following code generates 4 files of output from the examples on a single help page. The second generates output from all examples sections of all exported functions in a package. The following code generates 10 files of output by running all of the examples in the 'gridBezier' package.
The previous section demonstrated the flexibility that the 'gdiff' package provides for generating graphical output. This section looks at how we can compare two sets of output. We will see that the low expectation that 'gdiff' has for graphical output files further expands the range of testing scenarios that are supported.
The 'gdiff' package performs comparisons of control and test images based on the following assumptions:
Given existing test and control output, the gdiffCompare function can be used to compare two sets of output. We just have to name the directory that contains the control output, the directory that contains the test output, and the directory to use for any comparison output.
The following code creates a single control output file and a single test output file, both based on the same function, but with control output generated by R version 3.5.3 and test output generated by R version 3.6.1.
The following call to
gdiffCompare compares the
two output files and reports that they are identical
(so the output from
plot(1) has not changed from
R version 3.5.3 to R version 3.6.1).
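The overall pattern can be sketched as follows, assuming that 'gdiff' is installed. To keep the sketch self-contained, both sets of output are generated in the current R session (rather than in two different versions of R), and the three directories are created within a temporary directory.

```r
f <- function() plot(1)

if (requireNamespace("gdiff", quietly=TRUE)) {
    library(gdiff)
    controlDir <- file.path(tempdir(), "Control")
    testDir <- file.path(tempdir(), "Test")
    compareDir <- file.path(tempdir(), "Compare")
    # Generate the two sets of output
    gdiffOutput(f, controlDir)
    gdiffOutput(f, testDir)
    # Name the control, test, and comparison directories, in that order
    result <- gdiffCompare(controlDir, testDir, compareDir)
    print(result)
}
```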
Comparisons are performed on PNG output, so if the output files are not PNG files, they are converted to PNG format for comparison. There is also a gdiff function, which can be used to generate output and perform the comparison in one go. By default, this function uses predefined names, such as "Compare", for the three directories. Many of the examples in this section will use gdiff because it makes it easier to set up the output that we want to compare.
The following code uses the same function to generate both control output and test output and compares the resulting output files.
The example above shows one of the possible outcomes of a comparison: the output files are identical. Another very simple case is when the files differ. This example demonstrates that the session argument to gdiff can be a named list of sessions, one for generating control output and one for generating test output.
A third (and fourth) possible case can arise when a control output file
has no corresponding test output file (or vice versa).
The following code demonstrates the comparison result in this
situation. This example shows that the first argument to
gdiff can also be a list of functions, one for
generating control output and one for test output. It is also
possible to specify
NULL, in which case no output
is generated (and the output directory is not cleaned).
We have to specify a name in the call to gdiff in this case because we have provided a list of functions to generate output (so gdiff cannot generate output file names by itself).
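A sketch of this situation is given below, assuming that 'gdiff' is installed. The list names control and test are an assumption about how the two functions are distinguished, and the functions f1 and f2 are hypothetical. Because f2 draws one more page than f1, the second test file should have no corresponding control file.

```r
# Control function: one page of output
f1 <- function() plot(1)
# Test function: two pages of output
f2 <- function() {
    plot(1)
    plot(2)
}

if (requireNamespace("gdiff", quietly=TRUE)) {
    library(gdiff)
    # gdiff() uses default directory names, so work in a temporary directory
    owd <- setwd(tempdir())
    # A name is required because the first argument is a list of functions
    result <- gdiff(list(control=f1, test=f2), name="f")
    setwd(owd)
    print(result)
}
```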
The following code shows a more realistic scenario: we check the output from a set of help page examples between R versions 3.5 and 3.6 to show that all output remained the same, but there is an additional example in R 3.6.
The above examples demonstrate that gdiff (and gdiffCompare) support all of the testing scenarios that were described in the previous section on generating graphical output. The next examples make use of the fact that files to be compared just have to have the same name (but reside in different directories) to support some additional testing scenarios.
The following code shows that we can use different R functions to generate output, but use the same name for the output files. In this case, we compare two functions that draw Bezier curves.
The images that were generated by the code above are shown below,
in the order control, test, and comparison.
This demonstrates the slight difference between the X-Spline approximation to a Bezier curve that one function draws and the true Bezier curve that the other function draws.
The following code uses the same approach (controlling the names of output files) to compare output from two different graphics devices, the standard PDF device and the Cairo PDF device.
The small differences that show up here come from small differences in fonts (Helvetica for the standard PDF device versus TeXGyreHeros for the Cairo PDF device, plus the fact that the standard PDF device actually draws the data symbol using the ZapfDingbats font).
The next example demonstrates that we do not even require the control or test output to be generated by R. Here we compare the output generated by MetaPost (Hobby and the MetaPost development team, 2018) with the output generated by the 'metapost' R package.
This example involves multiple steps: we generate MetaPost output manually, with a specific name in a specific directory; then we call gdiff with no function for the control output, so that we do not generate any control output; with name="scurve", so that the test file has the same name as the control file; with the control directory given explicitly, to specify where the control output is located; with device=postscriptDevice, so that the test output is generated in the same format as the control output; and with the clean argument set (including compare=TRUE) so that we do not clean the control directory. The comparison then occurs between the manually-generated PostScript file and the PostScript file that gets generated by R.
The results show that the default line width in R is not the same as the default line width in MetaPost (in fact the default line width in MetaPost is hard for R to match because it is effectively zero, or the thinnest line that PostScript can produce).
In this final comparison example, we show that, although it is not possible to add output to a directory that already contains 'gdiff' output, we can generate output in more than one directory and perform a comparison based on multiple control and/or test directories. In this case, the output files are all (deliberately) identical.
The following code shows that the control directory argument and the testDir argument can each accept multiple directories.
This section describes the structure of the results from a 'gdiff' comparison.
The return value from gdiff (and gdiffCompare) is a list with class "gdiffComparison". The 'gdiff' package provides a method for displaying objects of this class. The details argument can be used to simplify the output even further. This may be useful when a large number of images are being compared.
A "gdiffComparison" object is a simple list, so it is easy to present the results of a comparison in other ways.
For example, the following code embeds the control, test, and
difference images from a comparison within HTML tags for
inclusion in this report.
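The general idea can be sketched in base R as follows. The comparisonHTML helper and the three file paths are hypothetical; in practice the paths would be taken from the comparison result, and the exact structure of that result is not assumed here.

```r
# Hypothetical helper: build a row of <img> tags for the
# control, test, and difference images from one comparison
comparisonHTML <- function(controlFile, testFile, diffFile) {
    files <- c(controlFile, testFile, diffFile)
    labels <- c("control", "test", "difference")
    paste0('<img src="', files, '" alt="', labels, '" width="30%"/>',
           collapse="\n")
}

# Hypothetical file paths, standing in for values from a comparison result
html <- comparisonHTML("Control/f-001.png",
                       "Test/f-001.png",
                       "Compare/f-001.png")
cat(html, "\n")
```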
When we generate output, the 'gdiff' package creates a
.gdiffSession file in the output directory.
This file contains information about the R session that was
used to generate the output.
When we perform a comparison, this information is read in
and included in the comparison result.
In the following code, we perform a comparison using two different sessions and the resulting session information shows the different R versions being used in those two sessions.
We can generate control and test output on a range of devices by specifying different values for the device argument to the various 'gdiff' functions. A few common graphics devices are already supported (e.g., PNG and PDF), but the range can easily be expanded by defining support for a new graphics device. A new device must be a "gdiffDevice" object. The function that creates such an object takes a name for the device, a suffix (which has a default value) to be used on output files generated by the device, and two functions, open and close. The open function takes an argument that should be used to generate file names for output files; this function should open a graphics device. The close function has a default value.
The code below shows the definition of a very basic SVG device.
The predefined devices provide convenience functions that allow for parameters to be passed to the graphics device. The code below shows the definition of an SVG device convenience function. The predefined output devices in the 'gdiff' package provide other examples.
The following code shows the new
svgDevice in action.
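A sketch of what such a device definition might look like is given below. It assumes that 'gdiff' is installed, that the device constructor is called gdiffDevice, that its open function receives a single value from which file names can be built, and that the open function is responsible for the numeric suffix on file names; all of these are assumptions based on the description above, not confirmed details of the package.

```r
f <- function() plot(1)

if (requireNamespace("gdiff", quietly=TRUE) && capabilities("cairo")) {
    library(gdiff)
    # Convenience function: extra arguments are passed on to svg()
    svgDevice <- function(...) {
        gdiffDevice("svg",
                    open=function(name) {
                        # "%03d" gives each page its own numbered file
                        svg(paste0(name, "-%03d.svg"), ...)
                    })
    }
    # Use the new device in the same way as a predefined device
    svgFiles <- gdiffOutput(f, file.path(tempdir(), "ControlSVG"),
                            device=svgDevice())
    print(svgFiles)
}
```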
It is also possible to define a new type of 'gdiff' session. A session must be a "gdiffSession" object, as generated by the gdiffSession function. The only argument to that function is a class for the new type of session. A generateOutput method should then be written for that new session class. This function is responsible for generating output files; it will typically perform some set up steps and clean up steps and call gdiffGenerateOutput to generate the output files. The predefined sessions in 'gdiff' provide a range of examples.
This section describes some important internal design details of the 'gdiff' package. This is to help future authors with further development of the package, possibly including the original package author's future self.
The public functions in 'gdiff' that generate output files and compare output files are all based on a core function, gdiffCore. This function can generate control output, generate test output, and perform comparisons, or any subset of those operations. The argument list to gdiffCore is basically a superset of the arguments to the public functions that generate output or perform comparisons, except for the first argument.
The first argument to gdiffCore is a "function generator". This is a function that creates a named list of functions. This function generator is passed, via the generateOutput method for the 'gdiff' session, all the way in to the code that generates the output files. The function generator is called to generate a named list of functions and then each function in that list is called on each graphics device to generate output files.
A simple example of a function generator is the one used by the function methods of functions such as gdiffOutput. These just take the function argument that they are provided and store it in a list, with the appropriate name.
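The function generator idea can be illustrated in base R as follows. This is only a sketch of the concept, not the implementation inside 'gdiff'; the names simpleGenerator and runGenerator, the output directory, and the file naming scheme are all hypothetical.

```r
# A function generator: a function that returns a named list of functions
simpleGenerator <- function(fun, name) {
    function() {
        funs <- list(fun)
        names(funs) <- name
        funs
    }
}

# Call the generator, then call each function on a graphics device,
# using the names in the list to build the output file names
runGenerator <- function(generator, dir) {
    funs <- generator()
    dir.create(dir, showWarnings=FALSE, recursive=TRUE)
    for (name in names(funs)) {
        png(file.path(dir, paste0(name, "-%03d.png")))
        funs[[name]]()
        dev.off()
    }
    list.files(dir)
}

files <- runGenerator(simpleGenerator(function() plot(1), "f"),
                      file.path(tempdir(), "generator-demo"))
print(files)
```

Delaying the creation of the list of functions until the generator is called means that the list can be built wherever the output is actually generated, for example within a separate R session.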
A more complex function generator example is the one used to generate output for a whole package. This involves generating a list of functions based on the list of exported symbols for a given package. For each exported symbol we generate a function that essentially runs the help page examples for that symbol.
The 'gdiff' package provides tools for detecting changes in graphical output, but this sort of testing is also provided by several other R packages, including 'graphicsQC', 'vdiffr', and 'vtest'.
One motivation for the 'gdiff' package was to create a simpler version of 'graphicsQC'. In particular, 'gdiff' has a much simpler design for keeping track of the list of output files that have been generated and the results of comparisons. This allows the 'gdiff' package to be more flexible in terms of the testing scenarios that are supported and makes 'gdiff' easier to extend.
The 'vdiffr' package integrates with the 'testthat' package (Wickham, 2011) to add visual testing. The 'gdiff' package supports a wider range of testing scenarios than 'vdiffr', including comparing between R versions, and 'gdiff' is more independent of the testing framework. This likely makes it harder to integrate with 'testthat', but makes it easier to use in other contexts. The 'gdiff' package also supports multiple graphics devices, while 'vdiffr' is focused on the SVG format.
The 'gdiff' package is very much focused on detecting differences between output files, but has nothing to say about which of two different images is the correct one. The 'vdiffr' package provides support for validating images and keeping track of "model" images that have been verified as correct. The 'vdiffr' package also provides some useful details like platform-independent fonts and provides multiple ways to view the differences between images.
The 'vtest' package is similar to 'gdiff' in that it compares raster versions of images, though it takes a very different approach to tracking output files and performing comparisons. The main limitation of the 'vtest' package compared to 'gdiff' is that it currently appears to only work with 'ggplot2' images and PDF output.
The 'gdiff' package could be used with R CMD check by generating a set of control images within the package, then running tests that generate test images and compare them with the control images. However, R CMD check is just one testing scenario (where we compare results from one version of R and one version of a package against output that was generated with a previous version of the package and some unknown version of R). The 'gdiff' package aims to also support a wider range of testing scenarios.
It would be useful to support some of the nice features from 'vdiffr' within 'gdiff'. In particular, it would be useful to make use of cross-platform fonts to avoid spurious differences arising between different operating system platforms.
It would also be useful to expand the convenience functions
for generating output. For example, we could add a function
to generate output from the
within a package.
The examples and discussion in this report relate to version 0.1-0 of the 'gdiff' package.
This report was generated within a Docker container (see Resources section below).
The dockerSession example made use of a Docker image that is available from Docker Hub or via its Dockerfile.
Murrell, P. (2020). "Visual Testing for Graphics in R" Technical Report 2020-01, Department of Statistics, The University of Auckland. Version 1.