Visual Testing for Graphics in R

by Paul Murrell http://orcid.org/0000-0002-3224-8858

Version 1:

This document by Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.

This report describes the 'gdiff' package for R, which provides tools for performing visual tests on graphical output. Visual testing is essential for checking that code is producing the correct graphical output. The goal of the 'gdiff' package is to support a wide range of testing scenarios, including testing graphical output between different versions of R.

1. Introduction

When we test code, we are interested in establishing not only that the code runs without producing errors, but also that the code produces the desired output.

Testing output requires a comparison between two sets of output, which we will label "control output" and "test output". When the output from code is graphical, one way that we can compare a control image with a test image is by comparing raster versions of the two images to see whether any pixels differ.

The following code shows a manual example of this process. We draw a control image consisting of the text "same" and a test image consisting of the words "same" and "different".

The image_compare function from the 'magick' package (Ooms, 2019) produces an image of the difference between the test and control images.

The images are shown below, with the control image on the left, the test image in the middle, and an image of the differences on the right, with different pixels highlighted in red (and identical pixels light grey).

The image_compare function also gives us a count of the number of different pixels.

This provides us with a test for differences (is the number of different pixels greater than zero?), plus an image to show where differences have occurred, which can help us to track down the source of the difference within our code.

It is also possible to compare images in other ways. For example, SVG images are text-based, so a text-based comparison using a tool like GNU diff can be performed. Another possibility is to reduce image files to a unique "fingerprint", through the use of a hashing function, and then compare image fingerprints. This report focuses on pixel-based comparisons.

This report describes the 'gdiff' package (Murrell, 2019a) for visual testing of graphics code. The 'gdiff' package provides functions for generating control output and test output and for comparing the two. The following code shows a simple use of the package: we generate control output from the function image1, test output from the function image2, and compare the two outputs, with the result being a difference that is represented by a comparison image and a pixel difference of 248.

There are other R packages that perform visual testing, such as 'graphicsQC' (Murrell and Gardiner, 2009), 'vdiffr' (Henry et al., 2019), 'vtest' (Chang, 2019), and 'visualTest' (Campbell and Csardi, 2019), but the aim of the 'gdiff' package is to support a broader range of testing scenarios than any of those packages. A detailed comparison of 'gdiff' with other R packages will be made later in the Discussion Section, once we have seen how the 'gdiff' package works and once we have seen some examples of its use.

The 'gdiff' package supports a wide variety of testing scenarios in two ways: it provides flexibility in the ways that graphical output can be generated; and it has low expectations about the type of graphical output that it will perform comparisons on. The 'gdiff' package also provides the result of comparisons in a very simple format, which makes it easy to build upon.

The following sections demonstrate these three aspects of the 'gdiff' package and give examples of use for the main functions in the package.

2. Flexible generation of graphical output

The gdiffOutput function can be used to generate graphical output, for either control images or test images. The first argument to this function is a function (with no arguments) that will generate graphical output and the second argument is a directory where image files will be generated. The following code defines a simple function, f, that generates a single plot and the output from that function is generated in a single file in a directory called "Control". The result of the function call is a character vector of the files that were generated.

By default the name of the output file is generated by deparsing the first argument - in this case, f becomes "f" - but it is also possible to specify a name for the output file in the call to gdiffOutput. In the code below, we change the name used for the output file to "fun". A numeric suffix is also appended to the output file name automatically.

If the function that we give to gdiffOutput generates several plots, the result will be multiple files. The example below generates two plots in two separate files. This demonstrates why the numeric suffix is always appended to the output file names.

The function that draws plots is called with no arguments, so the function must use data that is always available. In the example below, the function h makes use of the datasets::mtcars data set.

The example in the code below uses a closure to define a function with the required data stored in the function's environment.

By default, the directory in which to generate output files is created if it does not exist and emptied if it does exist. Specifying clean=FALSE means that an existing output directory will not be emptied. However, this is not allowed if the directory already contains 'gdiff' output. The code below attempts to add output to the "Control" directory without cleaning it first, but the directory already contains 'gdiff' output, so we get an error.

We will see an example later where specifying clean=FALSE can be useful.

Graphics devices

By default, output is generated on a PNG graphics device, but the device argument to the gdiffOutput function can be used to select a different graphics device. This expands the range of testing scenarios to include different graphics devices. The options are currently pngDevice, pdfDevice, cario_pdf_device, and postscriptDevice. The following code generates output on a PDF device.

The value of the device function can also be a list of devices, in which case output is generated for each device. The following code generates each page of output twice: once on a PNG device and once on a PDF device.

The device functions allow arguments to be passed to the graphics device, for example to control the size or resolution of the device. We will see some examples later.

R sessions

Output is generated within a "session" of R, which by default is the current R session. The following code makes that explicit.

The libPaths argument to the currentSession function can be used to specify a directory from which packages will be loaded when generating output. This expands the range of testing scenarios to allow the same code to be tested on different versions of a package (and its dependencies).

The following code installs two different versions of the 'grImport' package (Murrell, 2009) into two different directories.

We can now generate control output based on one version of 'grImport' and test output based on the other version of 'grImport'. The resulting images are displayed in each case to show that they are different.

We can generate output within a separate R session with the localSession function. The following code repeats the 'grImport' example, but uses a separate R session to generate both control and test output. Notice that the test function grImport now no longer needs to detach the 'grImport' package (because the package is loaded within a separate R session).

The localSession function also has an Rpath argument so that we can generate output within a separate R session using a different version of R. This expands the range of testing scenarios to include comparisons between different versions of R.

The following code generates output from the axis help page examples using a separate R session running R version 3.5.3 for control output and the current R session, which is R version 3.6.1, for test output. The test output generates an extra file of output because an extra example was added to the axis help page in R version 3.6.0.

There is also a remoteSession function and a dockerSession function so that we can generate output on different platforms. The remoteSession function requires either a host name (and possibly a user) or a "cluster" object (as generated by parallel::makeCluster) and the dockerSession function requires the name of a docker image. In both cases, the libPaths and the Rpath can also be set. This expands the range of testing scenarios to include comparisons between different operating system platforms and interactions between package versions, R versions, and operating system platforms.

The following code repeats the axis example, but this time we generate the test output in a docker container that has R version 3.6.1 installed.

This sort of "remote" output generation requires some set up. For example, the 'gdiff' package at least must be installed on the remote system or the docker image, and of course R itself must also be installed.

There are two other high-level functions for generating graphical output. The gdiffExamplesOutput() function generates output from the examples section of the specified function (either a function name or a function object). The following code generates 4 files of output from the examples on the help page for the plot function.

The gdiffPackageOutput() function generates output from all examples sections of all exported functions in a package. The following code generates 10 files of output by running all of the examples in the 'gridBezier' package.

3. Low expectations of graphical output

The previous section demonstrated the flexibility that the 'gdiff' package provides for generating graphical output. This section looks at how we can compare two sets of output. We will see that the low expectation that 'gdiff' has for graphical output files further expands the range of testing scenarios that are supported.

The 'gdiff' package performs comparisons of control and test images based on the following assumptions:

Control images are in a separate directory from test images.
Images to be compared have the same name (but are in different directories as per assumption 1).
The control and test directories only contain image files to be compared.
Images are the same if PNG versions of the images are identical based on per-pixel comparison.

Given existing test and control output, the gdiffCompare function can be used to compare two sets of output. We just have to name the directory that contains the control output, the directory that contains the test output, and the directory to use for any comparison output.

The following code creates a single control output file and a single test output file, both based on the same function, but with control output generated by R version 3.5.3 and test output generated by R version 3.6.1.

The following call to gdiffCompare compares the two output files and reports that they are identical (so the output from plot(1) has not changed from R version 3.5.3 to R version 3.6.1).

Comparisons are performed on PNG output, so if the output files are not PNG files, they are converted to PNG format for comparison (using magick::image_convert).

There is also a gdiff function, which can be used to generate output and perform the comparison in one go. By default, this function uses the names "Control", "Test", and "Compare" for the three directories. Many of the examples in this section will use gdiff because it makes it easier to set up the output that we want to compare. The following code uses the function f to generate control output and test output and compare the resulting output files.

The example above shows one of the possible outcomes of a comparison: the output files are identical. Another very simple case is when the files differ. This example demonstrates that the session argument to gdiff can be a named list of sessions, one for generating control output and one for generating test output.

A third (and fourth) possible case can arise when a control output file has no corresponding test output file (or vice versa). The following code demonstrates the comparison result in this situation. This example shows that the first argument to gdiff can also be a list of functions, one for generating control output and one for test output. It is also possible to specify NULL, in which case no output is generated (and the output directory is not cleaned). We have to specify a name in the call to gdiff in this case because we have provided a list of functions to generate output (so gdiff cannot generate output file names by itself).

The following code shows a more realistic scenario: we check the output from the axis function between R versions 3.5 and 3.6 to show that all output remained the same, but there is an additional example in R 3.6.

The above examples demonstrate that gdiff (and gdiffCompare) support all of the testing scenarios that were described in the previous section on generating graphical output. The next examples make use of the fact that files to be compared just have to have the same name (but reside in different directories) to support some additional testing scenarios.

The following code shows that we can use different R functions to generate output, but use the same name for the output files. In this case, we compare the grid::grid.bezier function with the gridBezier::grid.Bezier function.

The images that were generated by the code above are shown below, in the order control, test, and comparison. This demonstrates the slight difference between the X-Spline approximation to a Bezier curve that grid.bezier produces and the true Bezier curve that grid.Bezier produces.

The following code uses the same approach (controlling the names of output files) to compare output from two different graphics devices, pdf versus cairo_pdf.

The small differences that show up here come from small differences in fonts (Helvetica for the standard PDF device versus TeXGyreHeros for the Cairo PDF device, plus the fact that the standard PDF device actually draws the data symbol using the ZapfDingbats font).

The next example demonstrates that we do not even require the control or test output to be generated by R. Here we compare the output generated by MetaPost (Hobby and the MetaPost development team, 2018) with the output generated by the 'metapost' R package (Murrell, 2019b). This example involves multiple steps: we generate MetaPost output manually, with a specific name in a specific directory; then we call gdiff with list(control=NULL, test=f) so that we do not generate any control output, name="scurve" so that the test file has the same name as the control file, controlDir=MetaPostControl to specify where the control output is located, device=postscritpDevice so that the test output is generated in the same format as the control output, clean=list(control=FALSE, test=TRUE, compare=TRUE) so that we do not clean the control directory, and the comparison occurs between the manually-generated PostScript file and the PostScript file that gets generated by f.

The results show that the default line width in R is not the same as the default line width in MetaPost (in fact the default line width in MetaPost is hard for R to match because it is effectively zero, or the thinnest line that PostScript can produce).

In this final comparison example, we show that, although it is not possible to add output to a directory that already contains 'gdiff' output, we can generate output in more than one directory and perform a comparison based on multiple control and/or test directories. In this case, the output files are all (deliberately) identical.

The following code shows that the controlDir and the testDir arguments to gdiff accept multiple directories.

4. Comparison results

This section describes the structure of the results from a 'gdiff' comparison.

The return value from gdiff (and gdiffCompare) is a list with class "gdiffComparison". The 'gdiff' package provides a print method for this class that produces a simple summary of the comparison results.

The details argument can be used to simplify the output even further. This may be useful when a large number of images are being compared.

A "gdiffComparison" is a simple list so it is easy to present the results of a comparison in other ways. For example, the following code embeds the control, test, and difference images from a comparison within HTML tags for inclusion in this report.

"))) ]]> "))) ]]>

5. Session information

When we generate output, the 'gdiff' package creates a .gdiffSession file in the output directory. This file contains information about the R session that was used to generate the output. When we perform a comparison, this information is read in and included in the comparison result.

In the following code, we perform a comparison using two different sessions and the resulting session information shows the different R versions being used in those two sessions.

6. Extending 'gdiff'

We can generate control and test output on a range of devices by specifying different values for the device argument to the various 'gdiff' functions. A few common graphics devices are already supported (e.g., PNG and PDF), but the range can easily be expanded by defining support for a new graphics device.

A new device must be a "gdiffDevice" object, as generated by the gdiffDevice function. This function takes a name for the device, a suffix (which defaults to the name) to be used on output files generated by the device, and two functions, open and close. The open function takes a name argument that should be used to generate file names for output files; this function should open a graphics device. The close function defaults to a call to dev.off.

The code below shows the definition of a very basic SVG device.

The predefined devices provide convenience functions that allow for parameters to be passed to the open function. The code below shows the definition of an SVG device convenience function. The predefined output devices in the 'gdiff' package provide other examples.

The following code shows the new svgDevice in action.

It is also possible to define a new type of 'gdiff' session. A session must be a "gdiffSession" object, as generated by the gdiffSession function. The only argument to that function is a class for the new type of session. A generateOutput method should then be written for that new session class. This function is responsible for generating output files; it will typically perform some set up steps and clean up steps and call gdiffGenerateOutput to generate the output files. The predefined sessions in 'gdiff' provide a range of examples.

7. Internal Details

This section describes some important internal design details of the 'gdiff' package. This is to help future authors with further development of the package, possibly including the original package author's future self.

The public functions in 'gdiff' that generate output files and compare output files are all based on a core function called gdiffCore. This function can generate control output, generate test output, and perform comparisons, or any subset of those operations. The argument list to gdiffCore is basically a superset of the arguments to the public functions that generate output or perform comparisons, except for the first argument.

The first argument to gdiffCore, called codeFun, is a "function generator". This is a function that creates a named list of functions. This function generator is passed, via the generateOutput method for the 'gdiff' session, all the way in to gdiffGenerateOutput. The function generator is called to generate a named list of functions and then each function in that list is called on each graphics device to generate output files.

A simple example of a function generator is the one used by function methods of gdiff and gdiffOutput. These just take the function argument that they are provided and store it in a list, with the appropriate name.

A more complex function generator example is the one used by gdiffPackage and gdiffPackageOutput. This involves generating a list of functions based off the list of exported symbols for a given package. For each exported symbol we generate a function that essentially calls the example function.

8. Discussion

The 'gdiff' package provides tools for detecting changes in graphical output, but this sort of testing is also provided by several other R packages, including 'graphicsQC', 'vdiffr', and 'vtest'.

One motivation for the 'gdiff' package was to create a simpler version of 'graphicsQC'. In particular, 'gdiff' has a much simpler design for keeping track of the list of output files that have been generated and the results of comparisons. This allows the 'gdiff' package to be more flexible in terms of the testing scenarios that are supported and makes 'gdiff' easier to extend.

The 'vdiffr' package integrates with the 'testthat' package (Wickham, 2011) to add visual testing. The 'gdiff' package supports a wider range of testing scenarios than 'vdiffr', including comparing between R versions, and 'gdiff' is more independent of the testing framework. This likely makes it harder to integrate with 'testthat', but makes is easier to use in other contexts. The 'gdiff' package also supports multiple graphics devices, while 'vdiffr' is focused on the SVG format.

The 'gdiff' package is very much focused on detecting differences between output files, but has nothing to say about which of two different images is the correct one. The 'vdiffr' package provides support for validating images and keeping track of "model" images that have been verified as correct. The 'vdiffr' package also provides some useful details like platform-independent fonts and provides multiple ways to view the differences between images.

The 'vtest' package is similar to 'gdiff' in that it compares raster versions of images, though it takes a very different approach to tracking output files and performing comparisons. The main limitation of the 'vtest' package compared to 'gdiff' is that it currently appears to only work with 'ggplot2' images and PDF output.

Integration with R CMD check

The 'gdiff' package could be used with R CMD check by generating a set of control images within the package, then running tests that generate test images and compare them with the control images. However, R CMD check is just one testing scenario (where we generate results from one version of R and one version of a package against output that was generated with previous version of a package and some unknown version of R). The 'gdiff' package aims to also support a wider range of testing scenarios.

Future work

It would be useful to support some of the nice features from 'vdiffr' within 'gdiff'. In particular, it would be useful to make use of cross-platform fonts to avoid spurious differences arising between different operating system platforms.

It would also be useful to expand the convenience functions for generating output. For example, we could add a function to generate output from the tests/ directory within a package.

9. Technical requirements

The examples and discussion in this report relate to version 0.1-0 of the 'gdiff' package.

This report was generated within a Docker container (see Resources section below).

10. Resources

The raw source file for this report, a valid XML transformation of the source file, a 'knitr' document generated from the XML file, two R files and the bibtex file that are used to generate the table of contents and reference sections, two XSL files and an R file that are used to transform the XML to the 'knitr' document, and a Makefile that contains code for the other transformations and coordinates everything. These materials are also available on github.
This report was generated within a Docker container. The Docker command to build the report is included in the Makefile above. The Docker image for the container is available from Docker Hub; alternatively, the image can be rebuilt from its Dockerfile.
The dockerSession example made use of a Docker image that is available from Docker Hub or via its Dockerfile.

How to cite this report

Murrell, P. (2020). "Visual Testing for Graphics in R" Technical Report 2020-01, Department of Statistics, The University of Auckland. Version 1. [ bib | DOI | http ]

11. References

[Campbell and Csardi, 2019]: Campbell, C. and Csardi, G. (2019). Perform Fuzzy Image Matching. https://github.com/MangoTheCat/visualTest. [ bib ]
[Chang, 2019]: Chang, W. (2019). vtest: Tools for visual testing of R packages. R package version 0.0.3. [ bib ]
[Henry et al., 2019]: Henry, L., Sutherland, C., Hong, D., Luciani, T. J., Decorde, M., and Lise, V. (2019). vdiffr: Visual Regression Testing and Graphical Diffing. R package version 0.3.1. [ bib | http ]
[Hobby and the MetaPost development team, 2018]: Hobby, J. D. and the MetaPost development team (2018). METAPOST a user's manual. [ bib | .pdf ]
[Lang, 2019]: Lang, D. T. (2019). Sxslt: R interface to libxslt. http://www.omegahat.org/Sxslt, http://www.omegahat.org. [ bib ]
[Murrell, 2009]: Murrell, P. (2009). Importing vector graphics: The grImport package for R. Journal of Statistical Software, 30(4):1--37. [ bib | http ]
[Murrell, 2019a]: Murrell, P. (2019a). gdiff: Graphical Difference Testing. R package version 0.1-0. [ bib | http ]
[Murrell, 2019b]: Murrell, P. (2019b). metapost: Interface to 'MetaPost'. R package version 1.0-6. [ bib | http ]
[Murrell and Gardiner, 2009]: Murrell, P. and Gardiner, S. (2009). Quality control for statistical graphics: The graphicsQC package for R. Journal of Statistical Software, 30(1):1--28. [ bib | http ]
[Ooms, 2019]: Ooms, J. (2019). magick: Advanced Graphics and Image-Processing in R. R package version 2.2. [ bib | http ]
[R Core Team, 2018]: R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [ bib | http ]
[Wickham, 2011]: Wickham, H. (2011). testthat: Get started with testing. The R Journal, 3:5--10. [ bib | .pdf ]