The OpenAPI project attempts to make it easier to connect people with data. This document discusses some of the challenges faced in connecting people with data, and how OpenAPI addresses these challenges. It describes the technical specifications of the OpenAPI architecture, and provides some examples of OpenAPI in action. This document also discusses projects similar to OpenAPI, and how OpenAPI is distinct from these.

What is OpenAPI: blue sky version

The problem: connecting people with data

OpenAPI's main aim is to make it easier for people to connect with data. Connecting with data is not just a simple case of making raw data available to people (though this is one component of it). When people are fully connected with data they can examine, manipulate, display, compare, interpret and share it. We believe the following things are required to connect people with data:

While many people possess one or several of these attributes, it is very rare to find someone who possesses all of them. Access to data is getting easier thanks to initiatives to make data widely and freely available, like Open Data, Open Government and Open Access. Domain knowledge, data science skills, statistical graphics skills and graphical design skills can all be acquired through education and experience. OpenAPI intends to help connect with data those people who do not have the luxury of acquiring special skills and knowledge. OpenAPI will do this by allowing anyone to contribute using those skills that they already possess.

Possible solutions

One solution to the problem of connecting people with data is providing software or resources containing all of the required attributes on the user's behalf. Some examples of this are:

The OpenAPI solution: everyone contributes a small amount

Rather than attempt to capture all of the necessary attributes in one program or service, OpenAPI attempts to capture small contributions which can be combined to suit the user. This means that contributors only require some of the attributes that we have listed. This allows for small contributions like:

In OpenAPI each of these contributions is meaningful. Further, they allow for someone who has none of the required attributes to combine other users' contributions in a meaningful way. Even if a contributor has no particular facility in any of the areas mentioned so far, she can still contribute my making data and scripts available to other users of OpenAPI.

At its simplest OpenAPI is about modules. Each module can ask for inputs, describe some work to be done, and produce some outputs. Modules can be combined in pipelines which plug one module's outputs into another module's inputs. The small contributions listed above could each be captured in a module, and a further contribution would combine them as a pipeline.

What is OpenAPI: technical version

The OpenAPI system is comprised of three main parts:

  1. Modules

    A module is the smallest piece in the OpenAPI system. It describes a job to be done. This can be range from something simple like passing a data file into OpenAPI, to a complicated script in a language like R or python. Each module has zero or more inputs, and zero or more outputs. Modules are specified using XML.

  2. Pipelines

    A pipeline describes how modules' inputs and outputs are connected to each other to perform a series of tasks. A pipeline contains a list of components (modules, or other pipelines) and a list of pipes which describe how to connect component outputs and inputs. Pipelines are specified using XML.

  3. Glue systems

    An OpenAPI glue system can read module and pipeline XML and execute the code in modules to produce the desired outputs.

The following sections expand on each of these parts of the OpenAPI architecture in more detail.

Modules

A module is the most basic component of OpenAPI, the smallest piece of work which can be run to produce an outcome. In general a module will describe some code to be executed. The code is executed, not by the OpenAPI glue system, but by some specified platform or service. A module can require zero or more inputs for its source scripts to be executed, and can produce zero or more outputs.

A module is described in XML, with a root element called 'module'. A basic outline of module XML follows:

<module>
  <description> ... </description>
  <platform name="..." />
  <input name="..." type="...">
    <format>...</format>
  </input>
  <source>
    ...
  </source>
  <output name="..." type="...">
    <format>...</format>
  </output>
</module>
  

Pipelines

A pipeline describes how a collection of modules are to be connected together to create more complex jobs. The simplest pipeline describes how the output from one module is to be connected to the input of another module. When the pipeline is run, the first module's source scripts are executed. Its output is then provided to the second module, and the second module's scripts can be executed.

Pipelines can connect a module to a module, connect a module to another pipeline, or even connect two pipelines. Both modules and pipelines are treated as <component>s, which are be connected by <pipe>s.

Each pipe describes the connection of an output from one component to an input of another component. Pipelines can be made increasingly complex by adding more components and connecting them with pipes. Pipelines can be made as complex as the user desires, as long as the various components have all they need to run, and there are no loops in the order of operation.

A pipeline is described in XML, with a root element called 'pipeline'. A basic outline of pipeline XML follows:

<pipeline>
  <description>...</description>
  <component name="..." ref="..." type="..."/>
  <component name="..." ref="..." type="..."/>
  <pipe>
    <start component="..." output="..."/>
    <end component="..." input="..."/>
  </pipe>
</pipeline>

Glue system(s)

A glue system's job is to interpret OpenAPI pipelines and modules, to execute the scripts they provide, and to return the results to the user. There is not a single official OpenAPI glue system; instead we specify here what is expected of a glue system, and provide the details of a prototype glue system.

Requirements

An OpenAPI glue system must do the following:

What else might a glue system do?

This section describes some features of a glue system that are desirable, but not required by the OpenAPI architecture.

A glue system might also offer:

Examples

Modules

simpleGraph/layoutGraph.xml:

<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/">
  <description>Lays out a graphNEL graph using the Rgraphviz package</description>
  <platform name="R"/>
  <input name="myGraph" type="internal">
    <format formatType="text">R "graphNEL" object</format>
  </input>
  <source><![CDATA[library(Rgraphviz)
Ragraph <- agopen(myGraph, "myGraph")]]></source>
  <output name="Ragraph" type="internal">
    <format formatType="text">R "Ragraph" object</format>
  </output>
</module>

This module requires the 'R' platform. It has one 'internal' input, "myGraph", an R graphNEL object. The module contains one source element, specified as inline R script. The module produces one 'internal' output, "Ragraph", an R Ragraph object.

Pipelines

simpleGraph/pipeline.xml:

<?xml version="1.0"?>
<pipeline xmlns="http://www.openapi.org/2014/">
  <description>This pipeline creates a directed graphNEL graph, lays it out using the Rgraphviz package, and then plots the graph as a PNG file using the gridGraphviz package</description>
  <component name="plotGraph" ref="plotGraph.xml" type="module"/>
  <component name="createGraph" ref="createGraph.xml" type="module"/>
  <component name="layoutGraph" ref="layoutGraph.xml" type="module"/>
  <pipe>
    <start component="createGraph" output="directedGraph"/>
    <end component="layoutGraph" input="myGraph"/>
  </pipe>
  <pipe>
    <start component="layoutGraph" output="Ragraph"/>
    <end component="plotGraph" input="Ragraph"/>
  </pipe>
</pipeline>

This pipeline contains three components, named "plotGraph", "createGraph", and "layoutGraph". The contents of the "layoutGraph" component are given by the 'ref' attribute, which contains the relative file location "layoutGraph.xml", taken to be in the same location as the pipeline file itself, as no search 'path' is given. The 'type' of the "layoutGraph" component is a "module". This module is the same as in the example above.

The other two components, also modules, can be examined at simpleGraph/plotGraph.xml and simpleGraph/createGraph.xml.

This pipeline contains two pipes. The first connects the output "directedGraph" in the component "createGraph" to the input "myGraph" in the component "layoutGraph". The second pipe connects the "layoutGraph" output "Ragraph" to the "plotGraph" input "Ragraph".

The following image shows a graphical representation of the pipeline's three components and two connecting pipes:

Graph of pipeline example

Graph of pipeline example

Glue system: conduit

The conduit package is a prototype built to demonstrate the features of an OpenAPI glue system. It is available as a package for the R statistical environment.

Basic usage: conduit in action

This section shows how the pipeline example above can be read into and executed in conduit.

Installing conduit

The conduit package was built using R version 3.1.x on a 64-bit Linux machine. It requires the following R packages be installed: XML, graph, RBGL, and RCurl. The devtools package is required to install conduit directly from github. The Rgraphviz and gridGraphviz packages are required to run the modules in this example.

Version 0.1 of the conduit, used in this report, is available for download here.

Source code for conduit is available at https://github.com/anhinton/conduit for those who would like to build and install the package manually.

To install the conduit v0.1 using devtools:

devtools::install_github("anhinton/conduit@v0.1")

Load the conduit package:

library(conduit)
Reading a pipeline from an XML file

The loadPipeline() function is used to read a pipeline XML file into conduit. The loadPipeline() function requires a name argument, and a ref argument which describes the XML file location. The function returns a pipeline object.

We will load the pipeline used in our pipelines example above. The pipeline XML file is in a sub-directory of the R working directory called 'simpleGraph'.

## load the simpleGraph pipeline
simpleGraph <- 
    loadPipeline(name = "simpleGraph", ref="simpleGraph/pipeline.xml")

According to its description:

This pipeline creates a directed graphNEL graph, lays it out using the Rgraphviz package, and then plots the graph as a PNG file using the gridGraphviz package
Run a pipeline in conduit

The runPipeline() function is used to run a pipeline object in R. It requires a pipeline object as its only argument. This function will create a directory for the pipeline in the 'pipelines' sub-directory of the current working directory. If a 'pipelines' sub-directory does not exist it will be created. Each module in the pipeline will create output in a named directory found in ./pipelines/PIPELINE_NAME/modules.

## run the simpleGraph pipeline
runPipeline(simpleGraph)

This creates the following files:

  1. pipelines/simpleGraph/modules/createGraph/directedGraph.rds
  2. pipelines/simpleGraph/modules/createGraph/script.R
  3. pipelines/simpleGraph/modules/layoutGraph/Ragraph.rds
  4. pipelines/simpleGraph/modules/layoutGraph/script.R
  5. pipelines/simpleGraph/modules/plotGraph/example.png
  6. pipelines/simpleGraph/modules/plotGraph/script.R

File number 5, pipelines/simpleGraph/modules/plotGraph/example.png is the output file we require, the PNG image of the graph. The image is shown below:

Pipeline output: PNG image file of graph

Pipeline output: PNG image file of graph

Technical details

We will further explore the different features of an OpenAPI glue system by examining conduit more closely. An R package, conduit provides functions for reading, running and creating OpenAPI modules and pipelines.

The conduit package can be installed from https://github.com/anhinton/conduit

Discussion

How is OpenAPI different?

We are aware that OpenAPI bears some resemblance to a variety of other projects in the broad categories of data access and analysis. It is our opinion, however, that several features make OpenAPI unique.

There are too many other systems to line them all up and knock them all down, so this section provides a comparison between OpenAPI and a selection of related projects that represent a broad range of alternatives.

There are several projects that are structurally similar to OpenAPI's use of modules and pipelines: Mortar + Luigi is a platform for building and running data pipelines; Flow-based Programming is a programming paradigm which treats applications as opaque processes which pass data to other processes through externally defined connections; Web Services Description Language (WSDL) is an XML-based language used to define what is offered by a web service.

Unlike these projects, OpenAPI is intended to be simple; it is aimed at non-technical users, and not at developers. It does not attempt to define data types, does not permit looping, and is not meant to provide a paradigm for program design. Rather than attempting to solve the entire problem of connecting people with data through complex and exhaustive programming, OpenAPI merely specifies a framework for describing working with data. The XML used to describe modules and pipelines is simple enough for a user to type out herself. OpenAPI is able to combine these simple pieces to provide as much complexity and power as its users imagine but it is aimed primarily at small scale, single person projects rather than large scale production systems.

Some projects offer hosted web services and collaboration tools for data analysis: OpenCPU provides an HTTP API for performing data analysis using R; RCloud is a web-based platform intended for producing interactive web-based graphics; ElasticR provides a platform for cloud-based data analysis. Other projects offer bundled, comprehensive data workflow solutions: KNIME offers a graphical environment for assembling pipelines of data analysis modules for pharmaceutical and life science research; VisTrails is a workflow management system for scientific data exploration and visualisation; Alteryx provides a graphical environment for data manipulation and analysis for business users.

OpenAPI is not an attempt to provide a single, comprehensive solution to the problem of connecting people with data, nor is it an attempt to combine existing solutions in domain-specific contexts. Instead OpenAPI allows users to combine the contributions of other users, and to share these combinations in turn and build small contributions. OpenAPI is unique in that it is possible to make small but meaningful contributions. A contributor can do something as simple as writing an XML module for a piece of code, and will have made a meaningful contribution. Another contributor could write a pipeline which connects the contributions of several modules from various sources. Yet another contributor could write a small script and that, even without intending to be, is a contribution in the OpenAPI model.

The Galaxy Project also attempts to connect people with data, specifically in the field of biomedical research, and offers both a hosted and an installable platform for this. Galaxy is a web-based platform. This means most users will need nothing more than their regular web browser to use Galaxy. Galaxy solves the problem of access to data by hosting data sets, and then providing the option to share these data sets on public servers. The problems of domain knowledge, data science skills, and various graphical skills are not solved directly by Galaxy. Rather, these are addressed through Galaxy users sharing their data, workflows, and visualisations through the Galaxy Pages service. These Pages allow other users to examine and reproduce the workflows of other users. This then serves as both a published record of the findings of a project, and an educational tool for helping users create work of their own.

Galaxy differs from OpenAPI by being specifically aimed at biomedical research. Like OpenAPI, a user can make use of someone else's workflow to pursue her own research. However, a high level of domain knowledge and skills are still required to make use of and understand the platform. Unlike OpenAPI pipelines, Galaxy workflows must be run on Galaxy servers, using the tools available on this platform. OpenAPI pipelines can be run in any glue system, and can make use of any platforms that the glue system supports. OpenAPI glue systems can be anything from a self-contained application on a single machine, to a distributed system of platform resources over a local or worldwide network.

Ubuntu Juju uses a visual environment to connect applications and services to each other by representing them as nodes called charms, and connecting these nodes. While the metaphors are similar, Juju is aimed at large-scale pipelines, abstracting across whole systems; OpenAPI is meant for much smaller pipelines.

Finally, OpenAPI does not insist on there being one solution, one glue system, one server on which all the work must take place. Rather it allows for modules and pipelines to be brought into already existing setups. Developers are encouraged to provide support for additional platforms in existing glue systems, or even to create whole new glue systems with whatever features they desire. Because module and pipeline XML wraps around existing scripts a contributor's work is not trapped in OpenAPI; the scripts remain in plaintext to be copied and pasted anywhere one pleases. Contributors do not have to learn whole new languages and methods of programming to begin participating in the OpenAPI project, nor do their solutions become locked into a particular service or product.

The unintended benefits of OpenAPI

Though not the original aim of the project, we have so far noticed the following benefits to working in modules and pipelines:

Challenges and future work

One of the challenges OpenAPI faces is ensuring that a module's system requirements are met. While a module's platform program may be present on the user's system this is no guarantee that the program will meet the requirements of the module's source scripts. In some cases these requirements can be met within the modules scripts; for example, a module on the "R" platform might require a package from CRAN be installed: the module could be set up to first run a script to check if the appropriate package is installed, and could even install it.

Other module scripts might make greater demands on a system, and may require specific versions of programs and their dependencies be installed. One solution to this might be to provide a mechanism whereby a module's sources can be executed by the module's platform program on another machine. An IP address and authentication credentials (or similar) could be specified along with the module's platform, and the addressed machine could be set up so as to guarantee the module script in question can be executed. This has the drawback of the machine needing to be instantly available to the user.

Two programs for creating virtual machine environments, Docker and Vagrant, might provide machines which meets the requirements of a module. Both of these programs are capable of providing virtual machines on a user's machine, and both can be set up via provisioning scripts to provide uniform environments to meet a module's requirements. Both also provide networks interfaces which could be addressed in the way speculated for a remote machine.

Furthermore, OpenAPI glue systems themselves may have certain system requirements. It is conceivable that an entire glue system might be run on a remote machine or in a virtual environment like Docker or Vagrant.

Summary

In attempting to connect people with data, OpenAPI addresses the need for domain knowledge, and skills in data science, statistical graphics, and graphic design. The OpenAPI solution is to allow meaningful contributions from people who have some of these skills, rather than requiring that a user have all of them. In allowing small contributions OpenAPI invites more people to be involved. Those without any particular data skills can make use of the contributions and examples of those who do.

The OpenAPI architecture consists of modules, pipelines, and glue systems. A module is the smallest piece of OpenAPI, meant to encapsulate a simple data task. Pipelines are arrangements of modules, linked together to produce more complex outcomes. Glue systems are software which can interpret modules and pipelines, and can execute the scripts within.


Ashley Noel Hinton was supported by a Univeristy of Auckland Faculty of Science Research Development Grant.

This document was produced using the knitr package for R, and the pandoc document conversion program. The document source is available for download.

Creative Commons License
Introducing OpenAPI by Ashley Noel Hinton and Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.


References

“About Gapminder.” Gapminder. http://www.gapminder.org/about-gapminder/.

“Alteryx.” Alteryx, Inc. http://www.alteryx.com/.

Carey, Vince, Li Long, and R. Gentleman. RBGL: An Interface to the BOOST Graph Library. http://www.bioconductor.org.

Chinnici, Roberto, Jean-Jacques Moreau, Arthur Ryman, and Sanjiva Weerawarana, eds. 2007. “Web Services Description Language (WSDL) Version 2.0 Part 1: Core Language.” W3C Recommendation. W3C. http://www.w3.org/TR/wsdl20.

“Docker.” Docker, Inc. https://www.docker.com/.

“ElasticR.” Cloud Era Ltd. https://www.elasticr.com/welcome.

“Galaxy Wiki.” https://wiki.galaxyproject.org/.

Gentleman, R., Elizabeth Whalen, W. Huber, and S. Falcon. Graph: Graph: A Package to Handle Graph Data Structures.

Hansen, Kasper Daniel, Jeff Gentry, Li Long, Robert Gentleman, Seth Falcon, Florian Hahne, and Deepayan Sarkar. Rgraphviz: Provides Plotting Capabilities for R Graph Objects.

Hinton, Ashley Noel, and Paul Murrell. 2014. Conduit: Prototype Glue System for Openapi.

“KNIME.” Knime.com AG. http://www.knime.org/about.

Lang, Duncan Temple. 2013. XML: Tools for Parsing and Generating XML Within R and S-Plus. http://CRAN.R-project.org/package=XML.

———. 2014. RCurl: General Network (HTTP/FTP/.) Client Interface for R. http://CRAN.R-project.org/package=RCurl.

“Luigi.” Mortar. https://help.mortardata.com/technologies/luigi.

Morrison, J Paul Rodker. “Flow-Based Programming.” http://www.jpaulmorrison.com/fbp/.

Murrell, Paul, and Ashley Noel Hinton. 2014. GridGraphviz: Drawing Graphs with Grid. http://R-Forge.R-project.org/projects/gridgraph/.

“OpenCPU.” https://www.opencpu.org/.

“Orange.” University of Ljubljana. http://orange.biolab.si/.

Python Software Foundation. 2014. The Python Language Reference (version 2.7). https://docs.python.org/2.7/reference/index.html.

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

“RCloud.” AT&T. http://stats.research.att.com/RCloud/.

“Ubuntu Juju.” Canonical, Ltd. https://juju.ubuntu.com/.

“Vagrant.” HashiCorp. https://www.vagrantup.com/.

“VisTrails.” http://www.vistrails.org/index.php/Main_Page.

Wickham, Hadley, and Winston Chang. 2014. Devtools: Tools to Make Developing R Code Easier. http://CRAN.R-project.org/package=devtools.

Xie, Yihui. 2014. Knitr: A General-Purpose Package for Dynamic Report Generation in R. http://yihui.name/knitr/.