by Paul Murrell
Monday 29 February 2016
The New Zealand Herald ran an article by Peter Lyons titled "Cost of living getting tougher for Mr Median Kiwi". The article discusses the problem that the overall inflation rate (for the past seven years, 2008 to 2015) hides the fact that prices have changed by different amounts for different areas of expenditure. For example, overall inflation was 11 percent, but the cost of housing increased by 38 percent while communication costs dropped by 18 percent.
The article is interesting, but also frustrating because there are no links provided to raw inflation data and there is no information about how the calculations were performed. This presents two problems: it is difficult to challenge the claims made in the article (because we have insufficient information to repeat what the article author did), and it is difficult to explore the problem further with questions of our own.
A small web page was created in response to the Herald article. That web page provides links to public sources of CPI data and uses those data to produce a plot showing CPI over time for different areas of expenditure, which visually confirms the claims made in the original Herald article. This demonstrates that someone with skills in data acquisition, data processing, and visualisation can find and explore the data behind the Herald article and answer further questions about it.
However, there are many other questions that could be asked of these data and not everyone has the skills necessary to explore the data for themselves. The links to the CPI data from the web page point to Excel spreadsheets or CSV files and not everyone has the skills to generate a plot from that sort of raw data. The web page also provides links to the R code that was used to process the data and draw the plot, in the form of XML source documents. However, not everyone is comfortable with crawling through XML documents or with exploring and running R code.
In summary, the problem is that we have two documents that present the results of analysis or visualisation of CPI data, but it is difficult to replicate, let alone build upon, either of these documents, even when all of the raw materials for doing so have been made available.
This report documents the creation of an OpenAPI pipeline from information on the CPI web page. The purpose of creating this pipeline is to make it very easy to replicate the plot from the web page and to make it easier for others to make use of the CPI data to answer other questions.
This section considers the simple situation where we want to reproduce what the author of the web page did. To do this, we need the data from the web page and we need the R code that was used to create the plot.
There are several obstacles in our way. Although the web page author was a good citizen and provided links to XML source documents that contain the R code, plus links to a Makefile for transforming that XML document to other formats, including a .Rhtml file for use with 'knitr', we require specific knowledge and skills to make use of these resources. We need to know how to work with some combination of R, XML, and 'knitr', and even then only some of the R code in the XML source document is relevant, so we must know which parts to extract. We also require additional code of our own if we want to make an SVG version of the plot (like the original). There is also the problem that the code behind the web page assumes that the data set is a CSV file in the current working directory.
In summary, there are a number of resources, including code and data files, that need to be accessed, brought together, and organised so that we can run code to produce a plot. We will look at building an OpenAPI pipeline to make this task easier.
An OpenAPI pipeline is built from OpenAPI modules, so we begin by creating a module. The purpose of this module is to download the CSV file of CPI data and create a local copy.
A module is an XML file that describes a set of inputs, a source script to execute, and a set of outputs. For this module, there is one input, the URL for the CSV file, and one output, the local CSV file. The source script is R code that downloads the CSV file from the URL and creates a local copy.
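To make this concrete, here is a sketch of what such a module file might look like. The element and attribute names are an illustrative approximation of the OpenAPI module schema, not copied from the actual cpi-data.xml, and the input name cpiUrl is a hypothetical placeholder; the embedded source script uses base R's download.file().

```xml
<!-- Sketch of a module file; element and attribute names are
     illustrative approximations, not the exact OpenAPI schema. -->
<module version="0.3">
  <platform name="R"/>
  <description>Download a local copy of the CPI CSV file</description>
  <!-- one input: the URL of the CSV file (name is a placeholder) -->
  <input name="cpiUrl" type="external" format="URL"/>
  <!-- the source script: R code run by the glue system -->
  <source><![CDATA[
    download.file(cpiUrl, destfile = "cpi-data.csv")
  ]]></source>
  <!-- one output: the local copy of the CSV file -->
  <output name="cpifile" type="internal" format="CSV file"
          ref="cpi-data.csv"/>
</module>
```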
An OpenAPI module is run using a glue system, an example of which is provided by the R package 'conduit'. The module XML file needs to be downloaded to a directory, then the following R code loads and runs the module.
library(conduit)
m <- loadModule("data-module", "cpi-data.xml")
runModule(m)
The result of running the module consists of the outputs from the module: in this case, the downloaded CSV file.
This module represents a basic API for some of the information on the web page. The web page contains a URL that describes an internet location from which we can download a CSV file by manually clicking on a link in the web page; this module is an easily executable version of that information.
The purpose of the next module in our pipeline is to extract R code from the web page for drawing the plot. This module has one input, the URL for the XML source document for the web page, and one output, a text file containing R code. The source script is R code that finds the R code chunks within the XML document and extracts the relevant pieces to a local text file.
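As a concrete illustration, the following base R sketch shows one way such an extraction could work. It assumes, purely for illustration, that the R chunks in the page's XML source are wrapped in <rcode> elements with each tag on its own line; the actual element names and the extraction code used by the module may differ.

```r
# Illustrative sketch only: assumes R chunks sit inside <rcode> elements,
# with each opening and closing tag on its own line. The real XML
# vocabulary used by the web page may differ.
extractRCode <- function(xmlLines) {
    starts <- grep("<rcode>", xmlLines, fixed = TRUE)
    ends <- grep("</rcode>", xmlLines, fixed = TRUE)
    # collect the lines between each pair of tags
    unlist(mapply(function(s, e) xmlLines[(s + 1):(e - 1)],
                  starts, ends, SIMPLIFY = FALSE))
}
# In the module, something along these lines would then write the
# extracted code to a local text file:
#   writeLines(extractRCode(readLines(xmlUrl)), "cpi-plot.R")
```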
library(conduit)
m <- loadModule("script-module", "cpi-script.xml")
runModule(m)
The result of running the module consists of the outputs from the module: the extracted R script file. This module encapsulates the expertise required to extract the R code from the web page in an easily executable form.
The purpose of the final module in our pipeline is to run the R script to produce a plot. This module has two inputs, a local CSV file and an R script, and one output, an SVG plot. The source script for this module is R code that starts an SVG graphics device to record plotting output and runs the R script (the second input to the module).
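A minimal sketch of such a source script, written as a small helper function for clarity; the function name and default file name are hypothetical placeholders, and in the module the real paths would be supplied by the two inputs.

```r
# Sketch: run an extracted R script while an SVG graphics device is
# open, so that any plotting the script does is captured in svgFile.
# (Function and argument names are hypothetical placeholders.)
runPlotScript <- function(scriptFile, svgFile = "cpi-plot.svg") {
    svg(svgFile)          # start an SVG graphics device
    on.exit(dev.off())    # ensure the device is closed, writing the file
    source(scriptFile)    # run the plotting code (which reads the CSV)
}
```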
This module can only be run on its own if its inputs, the CSV file and the R script, are placed in the same directory as the module itself. This module is designed primarily to be run as part of a pipeline, so that its inputs are provided by other modules, as described in the next section.
An OpenAPI pipeline is an XML file that specifies a set of modules, plus a specification of how those modules are connected together.
In this case, we will build the pipeline from the three modules that were described above: one to download the CPI data from the web page, one to extract the R code from the web page, and one that runs the R code using the CPI data and produces a plot. The simple structure of this pipeline is shown below. This pipeline has two inputs (the URL of the CSV file, for the data module, and the URL of the web page, for the R script module) and one output (the SVG plot from the plot module).
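To give a feel for what the pipeline file contains, here is a sketch of such a pipeline XML. The element and attribute names are an illustrative approximation of the OpenAPI pipeline schema, not copied from the actual cpi-pipeline.xml, and the component and port names are hypothetical placeholders.

```xml
<!-- Sketch of a pipeline file; names are illustrative placeholders. -->
<pipeline version="0.3">
  <!-- the three modules that make up the pipeline -->
  <component name="cpi-data"   type="module" ref="cpi-data.xml"/>
  <component name="cpi-script" type="module" ref="cpi-script.xml"/>
  <component name="cpi-plot"   type="module" ref="cpi-plot.xml"/>
  <!-- connect module outputs to module inputs -->
  <pipe>
    <start component="cpi-data" output="cpifile"/>
    <end   component="cpi-plot" input="cpifile"/>
  </pipe>
  <pipe>
    <start component="cpi-script" output="script"/>
    <end   component="cpi-plot"   input="script"/>
  </pipe>
</pipeline>
```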
An OpenAPI pipeline is run using a glue system in a similar way to running a module. The pipeline XML file and the module XML files need to be downloaded to a directory, then the following R code loads and runs the pipeline.
library(conduit)
p <- loadPipeline("cpi", "cpi-pipeline.xml")
runPipeline(p)
The result of running the pipeline consists of the outputs from all of the modules in the pipeline: the extracted R script, the downloaded CSV file, and the SVG plot. The plot is shown below.
The pipeline has created a simple API from the web page: a set of inputs and a set of outputs, plus a description of what the pipeline does. If we want to reproduce the plot from the web page, all we need to do is download this pipeline and execute it.
This section considers another simple scenario where we want to build on the information from the web page and explore the CPI data in a different way.
One of the panels in the plot from the web page is titled "Miscellaneous goods and services", but it is not clear what "price groups" that title refers to.
It is possible to write R code that reads a local CSV file to extract information about the price groups within "Miscellaneous goods and services" and produces a text file containing the result. We can then create a module for that script with one input and one output.
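As an illustration, the core of such a script might look like the following. The column names 'expenditure' and 'group' are assumptions about the CPI CSV layout, not the actual column names in the file.

```r
# Sketch: list the distinct price groups within one expenditure area.
# The column names 'expenditure' and 'group' are assumptions about the
# CSV layout; the real file's columns may be named differently.
miscGroups <- function(cpi,
                       area = "Miscellaneous goods and services") {
    unique(cpi$group[cpi$expenditure == area])
}
# In the module, something like:
#   writeLines(miscGroups(read.csv(cpifile)), "groups.txt")
```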
We can then create a pipeline that reuses the data module from our first pipeline to supply the CSV input to the new groups module.
Running this pipeline produces a text file that contains the names of the price groups within "Miscellaneous goods and services". The contents of that text file are shown below.
Personal care
Personal effects
Insurance
Credit services
Other miscellaneous services
This pipeline demonstrates, in a very trivial way, the idea that OpenAPI modules make it easier to reuse the information from the web page. In this case, the data module, which provides an executable way to download the CSV data, was reused to allow a different exploration of the CPI data.
With the creation of the OpenAPI pipeline, it is much easier to reproduce the plot from the web page. We just need to download an XML file that describes the pipeline and then execute the pipeline.
Because the pipeline is made up of reusable modules that can be combined in a simple way, it is also easier to reuse the resources from the web page for other purposes.
The work involved in creating a new module has been deliberately ignored because the OpenAPI architecture allows that work to be carried out by a variety of contributors. Anyone could develop R code to perform a task, and that work can be done with no knowledge of OpenAPI. Anyone could develop the module XML for that R code, and anyone could develop the pipeline XML that combines modules. Once that work has been done, all that is required is to download the pipeline XML and execute it.
OpenAPI is not the only way that this work could be carried out. For example, an expert could write a single script to do all of the work herself. The point of difference is the way that OpenAPI allows the work to be divided up into writing separate scripts for separate parts of the work, creating modules to wrap the scripts, and combining modules into pipelines, which can then be used by a very wide audience.
Another issue that has been ignored is the fact that a pipeline can only be run if all of the necessary software required to run the code in the pipeline is available. For example, if the pipeline contains an R module, we must have (an appropriate version of) R installed, we may need (appropriate versions of) R packages installed, and we may need (appropriate versions of) system libraries installed.
Work is ongoing to explore ways to satisfy system requirements by specifying a "host" machine on which to run a module, where the host machine is known to have the necessary software installed.
There still remains the requirement that we need a glue system to run the pipeline itself, which currently means installing R and the 'conduit' package. However, it should not be difficult to build a simpler interface for a glue system (e.g., a web form) and to have the glue system run on a server (e.g., a shiny server).
A final issue is the visibility of modules and pipelines; how do people find a module or pipeline for a specific task? The obvious solution would be some sort of searchable repository for modules and pipelines. No such repository exists yet, but there are many analogous solutions to mimic or build from (e.g., the network of CRAN mirrors for R and package repositories for Linux distributions).
The OpenAPI architecture bears some resemblance to workflow systems, such as KNIME or Galaxy. However, OpenAPI accommodates more distributed resources and pipeline components and is more focused on making it easy to generate new components. This sort of comparison is discussed in more detail in the report "Helping people to connect with data".
There are also similarities between OpenAPI and reproducible research or literate document tools, such as 'knitr' and IPython. However, OpenAPI is aimed more at creating reusable components than at wholesale replication of documents. Jupyter Notebooks have also recently been extended to create microservices from code chunks within notebooks, but OpenAPI modules are designed to work with any resource rather than a specific document format.
In the beginning, there was a Herald article on inflation that piqued the reader's curiosity. Then there was a web page that investigated CPI data a little differently and provided links to the data and code that were used in the investigation. This document has described how an OpenAPI pipeline was created from the web page so that the code from the web page can be rerun easily and so that the data and the code from the web page can be accessed and reused easily.
This document demonstrates how the OpenAPI architecture can be used to create executable, reusable components from existing resources. The system does not work perfectly, but the development of the OpenAPI architecture is ongoing.
The examples and discussion in this document relate to OpenAPI version 0.3 and version 0.3 of the 'conduit' package.
This report was generated on Ubuntu 14.04 64-bit running R 3.2.3.
An OpenAPI Pipeline for CPI Data by Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.