\documentclass[article]{jss}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% almost as usual
\author{Paul Murrell\\The University of Auckland}
\title{Viewing Data: The \pkg{rdataviewer} Package}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Paul Murrell, The University of Auckland} %% comma-separated
\Plaintitle{Viewing Data: The rdataviewer Package} %% without formatting
\Shorttitle{Viewing Data} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{
  The \pkg{rdataviewer} package is a prototype software tool for the \R{} environment that implements several new ideas for viewing data sets. The ``shape'' of the entire data set is always visible to the user, to assist with learning about data structures. In addition, the view of the data set may be interactively zoomed and, for external text files, only the visible part of the data set is loaded into memory. Together these allow very large data sets to be viewed and navigated.
}
\Keywords{\R{}, data structures, large data}
\Plainkeywords{R, data structures, large data} %% without formatting
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
  Paul Murrell\\
  Department of Statistics\\
  The University of Auckland\\
  38 Princes Street, Auckland\\
  New Zealand\\
  E-mail: \email{paul@stat.auckland.ac.nz}\\
  URL: \url{http://www.stat.auckland.ac.nz/~paul/}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/1/31336-5053
%% Fax: +43/1/31336-734

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\R}{\proglang{R}}
\newcommand{\gtk}{\proglang{GTK+}}
\newcommand{\qt}{\proglang{Qt}}
\newcommand{\tcltk}{\proglang{Tcl/Tk}}
\newcommand{\xml}{\proglang{XML}}

<>=
options(prompt="R> ", continue="R+ ")
@

\begin{document}

%% include your article here, just as usual
%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

\section{Introduction}
%% Note: If there is markup in \(sub)section, then it has to be escape as above.
\label{section:intro}

This article describes several new ideas for ways to view data files and data structures, with applications to teaching and viewing large data sets. These ideas are implemented in a prototype package for the \R{} system \citep{R} called \pkg{rdataviewer} \citep{pkg:rdataviewer}.
\subsection{Teaching students about data structures}

One of the difficulties that many students face when they first encounter the \R{} environment, particularly those students without a background in computing, is the concept of data structures. Something that does not help the students' plight is the way that data structures are displayed on screen.

For example, in \R{}, a simple vector of values is conceptually a 1-dimensional data structure, typically thought of as a \emph{column} of values. However, due to the limitations of screen real estate, a vector is printed on screen horizontally, with values wrapping across several lines when there are too many values to be displayed on one line. The following simple example, consisting of a character vector containing the letters of the English alphabet, suffices to demonstrate the problem. This does not look very much like a single column of values to the uninitiated.

<>=
options(width=60)
@
<<>>=
letters
@

A conceptually 2-dimensional structure, like a matrix, can actually be easier to grasp in simple cases because the values are displayed by \R{} in rows and columns, as shown below.

<<>>=
head(VADeaths)
@

However, for anything but very small data sets, this convenient conceptual and visual correspondence rapidly breaks down. Again, the uninitiated can struggle to see that the following output represents a 2-dimensional structure with 6 rows and 11 columns. More columns just make the problem worse.

<<>>=
head(mtcars)
@

One solution is to arrange the display of values so that there is a correspondence with the conceptual shape of the data structure. For example, in a spreadsheet view, as shown by the \code{View()} function in \R{}, or the more sophisticated data viewers provided by GUIs such as \pkg{Rcmdr} \citep{pkg:Rcmdr} or \pkg{JGR} \citep{pkg:JGR}, the values of a vector appear in a single column and the values in a data frame are shown in rows and columns (see Figure \ref{figure:View}).
However, because the entire data structure is not visible, the overall shape is still unknown. Students easily lapse into thinking that the shape corresponds only to those values that are visible.

\begin{figure}
\hspace*{\fill}
\includegraphics[width=1in]{viewletters.png}
\hspace*{\fill}
\includegraphics[width=4in]{viewmtcars.png}
\hspace*{\fill}
\caption{\label{figure:View}A vector (left) and a data frame (right) as shown by the \code{View()} function in \R{}.}
\end{figure}

A solution advocated in this paper is to supplement the spreadsheet view with an image that presents the overall shape of the \emph{entire} data structure when viewing values from the data structure. For example, Figure \ref{figure:shape} shows two consoles that display the shapes of the full \code{letters} vector and the full \code{mtcars} data frame.

<>=
library("rdataviewer")
x <- letters
data <- viewerData(x)
tcltkViewer(simpleViewer(data, dev=viewerDeviceVp(data)))
data <- viewerDataFrame(mtcars)
tcltkViewer(simpleViewer(data, dev=viewerDeviceVp(data)))
@

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{lettersShape.png}
\hspace*{\fill}
\includegraphics[width=2in]{mtcarsShape.png}
\hspace*{\fill}
\caption{\label{figure:shape}Two examples of simple images (the rectangular areas above the two buttons) that show the overall shape of a vector (left) and a data frame (right). These should be compared with the restricted view of these data that is typical of existing data viewers (see Figure \ref{figure:View}).}
\end{figure}

\subsection{Viewing large data sets}

Printing out the raw data values in a data structure, as shown in the previous section, may seem like a somewhat pointless exercise. It is rarely the case, except for very small data sets, that we actually need to view an individual data value within a data set because interest more often focuses on the results of analyses or summaries of the raw data values.
Even for the purpose of checking data, it makes more sense to view data summaries or plots rather than attempting to view thousands of individual raw data values.

Nevertheless, it is often useful to view at least a sample of the values within a data set, if only to get an impression of the sorts of data values that lie within. For example, although we might not want to view all 578 rows of the \code{ChickWeight} data set, it is still useful to view the first few rows to gain an impression of the data values within each of the four columns, as shown below.

<<>>=
head(ChickWeight)
@

If a data set is stored within a text file, it is also useful to view the raw values within the text file in order to discover the text format that has been used.

With appropriate subsetting tools, such as those available in \R{}, it is also relatively straightforward to view \emph{any} simple subset of a data set (not just the \emph{first} few rows). For example, the following code views the $115^{\rm th}$ to the $120^{\rm th}$ rows of the \code{ChickWeight} data set.

<<>>=
ChickWeight[115:120, ]
@

However, this is an example of a simple task that can be performed much more conveniently via a graphical user interface, where ``Page Up'' and ``Page Down'' keys, or a scroll bar, allow for much faster exploration of multiple views of the data. Again, a spreadsheet type of view, as provided by the \code{View()} function in \R{}, provides this sort of facility.

The problem comes when the data set that we want to browse is large. In such cases, it can still be useful to view raw values (as we will see in Section \ref{section:extend}), but standard spreadsheet or even text viewing software may not be able to cope.

Two problems in particular are addressed in this article: being able to view a large data set or a large text file \emph{at all}; and being able to view a useful portion of the data set or text file.
In the former case, viewing software may struggle to even open a file that is hundreds of megabytes in size, or larger. In the latter case, if a data set has hundreds of columns and thousands of rows, it is useful to be able to decide how many of those rows and columns are currently viewed. Only being able to view 10 columns and 50 rows at a time is less useful than also being able to view 100 columns and 5000 rows at a time if we so choose.

The proposed solution to the first problem is to provide a viewer tool that only loads into computer memory as much of the data set as is required to show the current view of the data. This allows data sets of arbitrary size to be viewed.

The proposed solution to the second problem is to provide a viewer tool that allows interactive zooming of the current view, to allow more rows and columns to be viewed at once. For example, Figure \ref{figure:zoom} shows two different views of the same data set at different levels of zoom.

<>=
x11(width=4, height=4)
planets <- read.csv("exoplanets.csv")
data <- viewerDataFrame(planets)
tcltkViewer(simpleViewer(data, dev=viewerDeviceVp(data)))
@

\begin{figure}
\hspace*{\fill}
\includegraphics[width=3.5in]{viewzoom1.png}
\hspace*{\fill}

\vspace{.2in}

\hspace*{\fill}
\includegraphics[width=3.5in]{viewzoom2.png}
\hspace*{\fill}
\caption{\label{figure:zoom}Two views of a data set with 148 rows and 13 columns. On top is the default view and below that the view has been zoomed so that all rows of the data are visible. When text becomes very small, as in the bottom window, there is an option to draw simple lines of the appropriate length rather than text.}
\end{figure}

\section[The rdataviewer package]{The \pkg{rdataviewer} package}

The \pkg{rdataviewer} package is a software prototype for trying out the ideas described in Section \ref{section:intro}. The main function in the package is called \code{view()} and it takes a single argument, which is the data set to view.
For example, the following code is used to view the \code{mtcars} data frame. The result is a pair of windows, one an \R{} graphics window and one a \tcltk{} window (see Figure \ref{figure:rdataviewer}).

<<>>=
library("rdataviewer")
@
<<>>=
view(mtcars)
@

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{tcltkwindow.png}
\hspace*{\fill}
\includegraphics[width=3in]{rgraphicswindow.png}
\hspace*{\fill}
\caption{\label{figure:rdataviewer}The two windows that are generated by the \code{view()} function: a \tcltk{} window (on the left) and an \R{} graphics window (on the right).}
\end{figure}

The \R{} graphics window displays the current view of the data set. This will typically be just a portion of the data set, as in Figure \ref{figure:rdataviewer}. The \tcltk{} window serves two functions: it allows the user to modify the current view of the data set (through various key strokes) and it displays a diagram of the overall shape of the data set with a red rectangle to represent what subset of the data is currently being viewed. In Figure \ref{figure:rdataviewer}, we see that the \code{mtcars} data set is a square-ish two-dimensional structure (grey and white regions are used to indicate different columns within the data set) and we are currently viewing roughly the top-left quarter of the data.

Navigation of the current view is row- and column-based. For example, the right arrow navigates the view one column to the right and the down arrow navigates the current view one row down. The page down and page up keys, plus control-home and control-end also work. The \tcltk{} window must be the active window, otherwise key strokes are ignored.

Of more relevance to this article are the key strokes for \emph{zooming} the current view. With the shift key down, the right arrow increases the zoom factor so that \emph{at least} one \emph{fewer} column is visible.
With the control key down, the right arrow decreases the zoom factor so that at least one \emph{more} column is visible. For example, Figure \ref{figure:navzoom} shows the two \pkg{rdataviewer} windows after the right arrow key has been pressed twice, with the control key held down; now all of the \code{wt} and \code{qsec} columns are also visible. Notice also that the red region within the \tcltk{} window has grown to reflect the fact that a larger portion of the data is now visible.

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{tcltkzoom.png}
\hspace*{\fill}
\includegraphics[width=3in]{rgraphicszoom.png}
\hspace*{\fill}
\caption{\label{figure:navzoom}The two windows from Figure \ref{figure:rdataviewer} after two right arrow key strokes, with the control key held down. The current view has zoomed out so that two more complete columns are now visible.}
\end{figure}

This simple example demonstrates the first idea of showing the overall shape of the data set while viewing only a subset of the actual values, plus the idea of allowing simple zooming to control how much of the data set is being viewed.

The next section looks at part of the design of the \pkg{rdataviewer} package in more detail in order to demonstrate the ideas for viewing large data sets.

\section[The design of rdataviewer]{The design of \pkg{rdataviewer}}

The \pkg{rdataviewer} package is based on four fundamental classes: a \code{ViewerData} object provides the data set to be viewed; a \code{ViewerState} object records the current view of the data (what portion of the data is visible); a \code{ViewerDevice} object provides somewhere to draw the current view of the data; and a \code{Viewer} object contains one of each of the three other objects, provides an interface to change the current view, and coordinates the updating of the \code{ViewerDevice} with changes in the \code{ViewerState}.
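
As a rough illustration of how four such classes might fit together, the following sketch defines minimal \code{S4} versions of them. The slot names, the \code{ViewerDeviceCurrent} class, the \code{dimensions()} generic, and the \code{scrollDown()} helper are all hypothetical names chosen for this sketch; they are not taken from the \pkg{rdataviewer} source, which may be organised quite differently.

<<eval=FALSE>>=
library("methods")

## The data source: a virtual parent class plus one concrete subclass.
## (Slot names here are assumptions made for illustration.)
setClass("ViewerData", representation("VIRTUAL"))
setClass("ViewerDataFrame", contains = "ViewerData",
         representation = representation(x = "data.frame"))

## The current view: which rows and columns are visible.
setClass("ViewerState",
         representation = representation(firstRow = "numeric",
                                         firstCol = "numeric",
                                         nRows = "numeric",
                                         nCols = "numeric"))

## Somewhere to draw: again a virtual parent and a concrete subclass.
setClass("ViewerDevice", representation("VIRTUAL"))
setClass("ViewerDeviceCurrent", contains = "ViewerDevice",
         representation = representation(name = "character"))

## The viewer holds one of each of the other three objects.
setClass("Viewer",
         representation = representation(data = "ViewerData",
                                         state = "ViewerState",
                                         dev = "ViewerDevice"))

## A hypothetical generic that every data source must implement.
setGeneric("dimensions", function(data) standardGeneric("dimensions"))
setMethod("dimensions", "ViewerDataFrame", function(data) dim(data@x))

## Move the view one row down, without scrolling past the last full view.
scrollDown <- function(v) {
    maxFirst <- max(1, dimensions(v@data)[1] - v@state@nRows + 1)
    v@state@firstRow <- min(v@state@firstRow + 1, maxFirst)
    v
}

v <- new("Viewer",
         data = new("ViewerDataFrame", x = mtcars),
         state = new("ViewerState", firstRow = 1, firstCol = 1,
                     nRows = 10, nCols = 4),
         dev = new("ViewerDeviceCurrent", name = "current"))
v <- scrollDown(v)
v@state@firstRow
@

In a sketch like this, the navigation code only ever talks to the data source through generics such as \code{dimensions()}, so a new kind of data source can be supported by adding a subclass and its methods, without touching the state or device classes.
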
The \pkg{rdataviewer} package provides a \code{Viewer} interface that is based on the \pkg{tcltk} GUI toolkit. This allows the package to be used without having to install a separate GUI toolkit like \gtk{} or \qt{}. However, the design allows for different front ends to be added. The package also provides a \code{ViewerDevice} based on the current \R{} graphics device and a default \code{ViewerState}.

All we need to provide for the \code{view()} function is a \code{ViewerData} object. The \code{view()} function will accept raw \R{} data structures as we saw in the previous section, but we can obtain greater control if we explicitly generate a \code{ViewerData} object ourselves. For example, the result shown in Figure \ref{figure:rdataviewer} could also have been obtained with the following code, which explicitly creates a \code{ViewerData} object from the \code{mtcars} data frame.

<<>>=
view(viewerDataFrame(mtcars))
@

\subsection[The ViewerData class]{The \code{ViewerData} class}

A \code{ViewerData} object contains the actual data set that is to be viewed and it is required to provide the following information about the data set:

\begin{description}
\item[dimensions:] The total number of rows and columns in the data set.
\item[column widths:] The width of each column (number of characters).
\item[column names:] A name for each column (for a specified range of columns).
\item[text representation:] A text version of the data set (for a specified range of rows and columns).
\end{description}

For example, when the data set is a data frame, like \code{mtcars}, the \code{viewerDataFrame()} function creates a \code{ViewerData} object containing the data frame and it calculates the dimensions of the data set using the \code{dim()} function.
The column names, column widths, and text representation of the data set are all obtained by capturing the output of a call to \code{print()} for the appropriate subset of the data frame (the rows and columns of the data frame that are currently visible).

The reason for this design of the \code{ViewerData} class is that it allows new classes to be derived for other kinds of data sets.

\subsection[Extending the ViewerData class]{Extending the \code{ViewerData} class}
\label{section:extend}

The \code{viewerDataFrame()} function creates an object of class \code{ViewerDataFrame}, which extends the \code{ViewerData} class and allows \R{} data frames to be viewed with the \code{view()} function. We can extend the \code{ViewerData} class in other ways to allow other data sources to be viewed.

For example, the \code{viewerDataText()} function creates an object of class \code{ViewerDataText}, which allows plain text files to be viewed. A \code{ViewerDataText} object contains the name of a text file and provides the required information from the file in the following ways:

\begin{description}
\item[dimensions:] There is only one ``column'' and, when the \code{ViewerDataText} object is created, the file is searched for line breaks to determine how many lines (``rows'') it has.
\item[column widths:] The width of the single column is calculated from the maximum number of characters between line breaks.
\item[column names:] This is just the name of the file.
\item[text representation:] A file ``seek'' is used to read only the rows of the file that are currently being viewed.
\end{description}

This implementation allows very large files to be viewed because only the viewed portion of the file is ever read into memory. For example, the following code shows a view of the start of a 300 MB \xml{} file with 4,772,013 lines (see Figure \ref{figure:viewmetrix}).
% NOTE: got permission from Ian Howell via Joe Gamman (2009-08-06 dataViewer/)
% NOTE: I unnecessarily asked for permission again (2011-10-03 dataViewer/)
% Hope that doesn't undo the first one!
% See /home/fos/pmur002/Talks/Dept2009/Metrix/ for more code and files

{\small
<>=
view(viewerDataText("/scratch/Metrix/BlindData/MEENDL30102008_001-blind.XML",
                    index=TRUE))
@
}

<>=
# For manual testing
x11(width=4, height=4)
view(viewerDataText("/scratch/Metrix/BlindData/MEENDL30102008_001-blind.XML"))
@

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{tcltkmetrix.png}
\hspace*{\fill}
\includegraphics[width=3in]{rgraphicsmetrix.png}
\hspace*{\fill}
\caption{\label{figure:viewmetrix}The \pkg{rdataviewer} windows for viewing the Metrix data set (a 300 MB \xml{} file).}
\end{figure}

Although it may take a few seconds for the windows to appear, just being able to view this file is a minor victory. For example, heavyweight text editors like \proglang{vi} \citep{vi} and \proglang{Emacs} \citep{emacs} struggle with it (\proglang{vi} takes several minutes to open the file and \proglang{Emacs} fails to open it at all).

Furthermore, navigation within the file is almost instantaneous. For example, it is possible to navigate to the middle of the file by typing \code{2000000} and then \code{G}.

It is also possible to zoom the view as normal. For example, Figure \ref{figure:zoommetrix} shows the view after pressing Page Up several times, with the control key held down. In this view, the raw data values are not legible, but it is still interesting to see and explore the regular structure of the raw values. Also shown in Figure \ref{figure:zoommetrix} is this zoomed view after browsing down the file by pressing Page Down several times. This shows a clear change in the data.
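
The combination of a one-off scan for line breaks and a file ``seek'' is what makes this kind of navigation cheap. The following is a minimal sketch of that idea in plain \R{}, not the code used by \code{ViewerDataText}; the function names \code{indexLines()} and \code{readRows()}, the chunk size, and the small temporary file standing in for a large one are all assumptions made for illustration.

<<eval=FALSE>>=
## A small text file to stand in for a very large one.
f <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d of the file", 1:1000), f)

## Scan the file once, in fixed-size chunks, and record the byte
## offset at which each line starts (hypothetical helper).
indexLines <- function(file, chunkSize = 65536) {
    con <- file(file, "rb")
    on.exit(close(con))
    offsets <- 0                 # the first line starts at byte 0
    pos <- 0                     # number of bytes scanned so far
    repeat {
        bytes <- readBin(con, "raw", chunkSize)
        if (length(bytes) == 0) break
        nl <- which(bytes == as.raw(10L))   # newline positions in chunk
        offsets <- c(offsets, pos + nl)     # a line starts after each one
        pos <- pos + length(bytes)
    }
    ## Discard an offset that points at the very end of the file.
    offsets[offsets < pos]
}

## Read only rows first, ..., first + n - 1 by seeking to the start
## of the first requested row (hypothetical helper).
readRows <- function(file, offsets, first, n) {
    con <- file(file, "rb")
    on.exit(close(con))
    seek(con, where = offsets[first])
    readLines(con, n = n)
}

idx <- indexLines(f)
length(idx)                      # the number of "rows" in the file
readRows(f, idx, first = 500, n = 3)
@

Only the index (one number per line) and the requested rows are ever held in memory, so the cost of jumping to an arbitrary row does not depend on where that row sits in the file. The price is the initial scan, which is consistent with the few seconds of delay before the windows appear.
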
\begin{figure}
\hspace*{\fill}
\includegraphics[width=2.9in]{zoommetrix.png}
\hspace*{\fill}
\includegraphics[width=2.9in]{changemetrix.png}
\hspace*{\fill}
\caption{\label{figure:zoommetrix}Two different views of the Metrix data set. On the left, the original view (Figure \ref{figure:viewmetrix}) has been zoomed out so that only the regular structure of the data values is visible. On the right, this zoomed view has been browsed to discover a phase change in the structure of the data values.}
\end{figure}

We can zoom in to this area of the file to see that the change involves the disappearance of \code{QUALITYFLAG} values after row 2936 (see Figure \ref{figure:changemetrix}).

\begin{figure}
\hspace*{\fill}
\includegraphics[width=4in]{zoomchangemetrix.png}
\hspace*{\fill}
\caption{\label{figure:changemetrix}A zoomed view of the phase change in the structure of the Metrix data set (see Figure \ref{figure:zoommetrix}).}
\end{figure}

The \code{ViewerData} class has also been extended via the \code{ViewerDataMySQL} class so that an external \proglang{MySQL} database can be used as a data source for the \code{view()} function (see the \code{viewerDataMySQL()} function).

\section{Discussion}

The \pkg{rdataviewer} package is a software prototype for experimenting with several new ideas for viewing data sets. The package has one main function, \code{view()}, which creates two windows: one for viewing the data set and one for controlling what portion of the data set is visible.

A feature of the latter window is a diagrammatic representation of the overall shape of the data set being viewed, which may help students to understand the nature of data structures.

A feature of the navigation interface is that it is possible to zoom the view of the data, so that useful amounts of large data sets can be viewed at once.

The \pkg{rdataviewer} package is designed so that only the viewed portion of the data set is needed in memory at any one time.
This allows large data sets to be viewed and browsed.

The package is built upon a set of \code{S4} classes and generic functions so that it is easy for others to experiment further with alternative user interfaces, alternative displays of the data, and alternative data sources.

\subsection{Future directions}

The \pkg{rdataviewer} package is primarily a test implementation for trying out different ideas for viewing data sets. It is hoped that some of these ideas might be adopted in more complete, sophisticated, and user-friendly systems, not necessarily restricted to \R{}.

Nevertheless, the package itself may provide a useful tool if no other tool exists for navigating large raw data sets. Furthermore, the package may provide a basis for further experimentation, particularly with regard to alternative data sources. For example, it would be interesting to experiment with an interface to data structures that are stored on disk via the \pkg{ff} package \citep{pkg:ff}.

The demonstration in this article is limited to data set formats with a rectangular structure (vectors, data frames, and plain text files). There is plenty of scope to explore other sorts of data structures, particularly recursive structures like lists.

The implementation considered in this article also only considers the task of \emph{viewing} the data. It would also be useful to experiment with allowing editing of data values, if only to transform the data (for example, reordering of the rows).

% There is no support for character encodings in text files.

\bibliography{rdataviewer}

\end{document}