\documentclass[article]{jss}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% almost as usual
\author{Paul Murrell\\The University of Auckland}
\title{Viewing Data: The \pkg{rdataviewer} Package}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Paul Murrell, The University of Auckland} %% comma-separated
\Plaintitle{Viewing Data: The rdataviewer Package} %% without formatting
\Shorttitle{Viewing Data} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{
The \pkg{rdataviewer} package is a prototype software tool for the
\R{} environment that implements several new ideas for viewing data
sets.  The ``shape'' of the entire data set is always visible to the user, 
to assist
with learning about data structures.  In addition,
 the view of the data set may be
interactively zoomed and, for external text files, only the visible 
part of the data set is loaded into memory.  Together these allow very large
data sets to be viewed and navigated.
}
\Keywords{\R{}, data structures, large data}
\Plainkeywords{R,  data structures, large data} %% without formatting
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
  Paul Murrell\\
  Department of Statistics\\
  The University of Auckland\\
  38 Princes Street, Auckland\\
  New Zealand\\
  E-mail: \email{paul@stat.auckland.ac.nz}\\
  URL: \url{http://www.stat.auckland.ac.nz/~paul/}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/1/31336-5053
%% Fax: +43/1/31336-734

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\R}{\proglang{R}}
\newcommand{\gtk}{\proglang{GTK+}}
\newcommand{\qt}{\proglang{Qt}}
\newcommand{\tcltk}{\proglang{Tcl/Tk}}
\newcommand{\xml}{\proglang{XML}}

<<echo=FALSE>>=
options(prompt="R> ", continue="R+ ")
@ 

\begin{document}

%% include your article here, just as usual
%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

\section{Introduction}
%% Note: If there is markup in \(sub)section, then it has to be escape as above.
\label{section:intro}

This article describes several new ideas for ways to view
data files and data structures, with applications to teaching and viewing
large data sets.  These ideas are implemented
in a prototype package for the \R{} system \citep{R}
called \pkg{rdataviewer} \citep{pkg:rdataviewer}.

\subsection{Teaching students about data structures}

One of the difficulties that many students face when they first encounter
the \R{} environment, particularly
those students 
without a background in computing,  is the concept of data structures.
Something that does not help the students' plight is the way that 
data structures are displayed on screen.  For example, in \R{}, a simple
vector of values is conceptually
a 1-dimensional data structure, typically thought of 
as a \emph{column} of values.  However, due to the limitations of 
screen real estate, a vector is printed on screen horizontally, with
values wrapping across several lines when there are too many values to be
displayed on one line.  The following simple example, consisting
of a character vector containing the letters of the English alphabet,
suffices to demonstrate the problem.  The output does not look
very much like a single column of values to the uninitiated.

<<echo=FALSE>>=
options(width=60)
@ 

<<>>=
letters
@ 

A conceptually 2-dimensional structure, like a matrix, can actually
be easier to grasp in simple cases because the values are displayed
by \R{} in rows and columns, as shown below.

<<>>=
head(VADeaths)
@ 

However, for anything but very small data sets, this convenient conceptual
and visual correspondence
rapidly breaks down.  Again, the uninitiated can struggle to see that
the following output represents a 2-dimensional structure
with 6 rows and 11 columns.  More columns just make the problem worse.

<<>>=
head(mtcars)
@ 
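
The dimensions themselves can be queried directly with base \R{} functions,
as in the following calls, but the answer is a set of numbers rather than
a picture of the shape, and the student has to know to ask for it.

<<>>=
# Number of values in the vector, then rows and columns of the data frame
length(letters)
dim(head(mtcars))
dim(mtcars)
@ 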

One solution is to arrange the display of values so that
there is a correspondence with the conceptual shape of the data structure.
For example, in a spreadsheet view, as shown by the \code{View()}
function in \R{}, or the more sophisticated data viewers provided by
GUIs such as \pkg{Rcmdr} \citep{pkg:Rcmdr} or \pkg{JGR} \citep{pkg:JGR},
the values of a vector appear in a single column
and the values in a data frame are shown in rows and columns
(see Figure \ref{figure:View}).
However, because the entire data structure is not visible, the 
overall shape is still unknown.  Students easily lapse into 
thinking that the shape corresponds only to those values that
are visible.

\begin{figure}
\hspace*{\fill}
\includegraphics[width=1in]{viewletters.png}
\hspace*{\fill}
\includegraphics[width=4in]{viewmtcars.png}
\hspace*{\fill}
\caption{\label{figure:View}A vector (left) and a data frame (right)
as shown by the \code{View()} function in \R{}.}
\end{figure}

The solution advocated in this article is to supplement the spreadsheet view
with an image that
presents the overall shape of the \emph{entire} data structure 
alongside the values that are currently being viewed.
For example, Figure \ref{figure:shape} shows two consoles that 
display the shapes of the full \code{letters} vector and the full
\code{mtcars} data frame.

<<echo=FALSE, eval=FALSE>>=
library("rdataviewer")
x <- letters
data <- viewerData(x)
tcltkViewer(simpleViewer(data, dev=viewerDeviceVp(data)))
data <- viewerDataFrame(mtcars)
tcltkViewer(simpleViewer(data, dev=viewerDeviceVp(data)))
@ 

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{lettersShape.png}
\hspace*{\fill}
\includegraphics[width=2in]{mtcarsShape.png}
\hspace*{\fill}
\caption{\label{figure:shape}Two examples of simple images (the rectangular 
areas above the two buttons)
that show the overall shape of a vector (left) and a data frame (right).
These should be compared with the restricted view of these data that is 
typical of existing data viewers (see Figure \ref{figure:View}).}
\end{figure}

\subsection{Viewing large data sets}

Printing out the raw data values in a data structure, 
as shown in the previous section, may seem like a somewhat pointless 
exercise.  It is rarely the case, except for very small data sets, 
that we actually need to view an individual data value within a data set
because interest more often focuses on the results of analyses or
summaries of the raw data values.  Even for the purpose of 
checking data, it makes more sense to view data summaries 
or plots rather than attempting to view thousands of 
individual raw data values.  

Nevertheless, it is often useful to
view at least a sample of the values within a data set, if only 
to get an impression of the sorts of data values that lie within.
For example, although we might not want to view all 578 rows 
of the \code{ChickWeight} data set, it is still useful to view
the first few rows to gain an impression of the data values within
each of the four columns, as shown below.

<<>>=
head(ChickWeight)
@ 

If a data set is stored within a text file, it is also useful to 
view the raw values within the text file in order to discover the 
text format that has been used.

With appropriate subsetting tools, such as those available in \R{}, 
it is also relatively straightforward to view \emph{any} simple subset of 
a data set (not just the \emph{first} few rows).  
For example, the following code views the $115^{\rm th}$
 to the $120^{\rm th}$  rows of the \code{ChickWeight} data set.

<<>>=
ChickWeight[115:120, ]
@ 

However, this is an example of a simple task that can be performed 
much more conveniently via a graphical user interface, where 
``Page Up'' and ``Page Down'' keys, or a scroll bar,
allow for much faster exploration of multiple views of the data.
Again, a spreadsheet type of view, as provided by the \code{View()} 
function in \R{}, provides this sort of facility.
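
Without a graphical interface, paging has to be imitated by hand.
The following is a minimal sketch of such a helper; the function
\code{pageRows()} is hypothetical (it is not part of \R{} or of any package)
and is shown only to make the point that each new view requires another
function call.

<<eval=FALSE>>=
# Hypothetical helper: return page number 'page' of a data frame,
# where each page holds 'size' consecutive rows.
pageRows <- function(df, page, size = 6) {
    first <- (page - 1) * size + 1
    if (page < 1 || first > nrow(df))
        stop("page out of range")
    # The final page may hold fewer than 'size' rows
    last <- min(page * size, nrow(df))
    df[first:last, , drop = FALSE]
}
# With six rows per page, page 20 holds rows 115 to 120
pageRows(ChickWeight, 20)
@ 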

The problem comes when
the data set that we want to browse is large.  In such cases, it 
can still be useful to view raw values (as we will see in Section
\ref{section:extend}), but standard spreadsheet or even text viewing
software may not be able to cope.  

Two problems in particular are addressed in this article:
being able to view a large data set or a large text file \emph{at all};
and being able to view a useful portion of the data set or text file.
In the former case, viewing software may struggle to even open 
a file that is hundreds of megabytes in size, or larger.
In the latter case, if a data set has hundreds of columns and thousands 
of rows, it is useful to be able to decide how many of those rows and 
columns are currently viewed.  
Only being able to view 10 columns and 50 rows at a time is
less useful than also being able to view 100 columns and 5000 rows 
at a time if we so choose.

The proposed solution to the first problem is to provide a viewer tool
that only loads into computer memory as much of the data set as 
is required to show the current view of the data.
This allows data sets of arbitrary size to be viewed.
The proposed solution to the second problem is to provide
a viewer tool that allows interactive zooming of the current view,
to allow more rows and columns to be viewed at once.
For example, Figure \ref{figure:zoom} shows
two different views of the same data set at different levels of zoom.

<<echo=FALSE, eval=FALSE>>=
x11(width=4, height=4)
planets <- read.csv("exoplanets.csv")
data <- viewerDataFrame(planets)
tcltkViewer(simpleViewer(data, dev=viewerDeviceVp(data)))
@ 

\begin{figure}
\hspace*{\fill}
\includegraphics[width=3.5in]{viewzoom1.png}
\hspace*{\fill}

\vspace{.2in}
\hspace*{\fill}
\includegraphics[width=3.5in]{viewzoom2.png}
\hspace*{\fill}
\caption{\label{figure:zoom}Two views of a data set with 148 rows and 
13 columns.  On top is the default view and below that the view
has been zoomed so that all rows of the data are visible.
When text becomes very small, as in the bottom window,
there is an option to draw simple lines of the appropriate length
rather than text.}
\end{figure}

\section[The rdataviewer package]{The \pkg{rdataviewer} package}

The \pkg{rdataviewer} package is a software prototype 
for trying out the ideas described in Section \ref{section:intro}.
The main function in the package is called \code{view()} and it takes
a single argument, which is the data set to view.  
For example, the following code is used to view the \code{mtcars}
data frame.
The result is a pair of windows, one an \R{} graphics window
and one a \tcltk{} window (see Figure \ref{figure:rdataviewer}).

<<>>=
library("rdataviewer")
@ 

<<>>=
view(mtcars)
@ 

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{tcltkwindow.png}
\hspace*{\fill}
\includegraphics[width=3in]{rgraphicswindow.png}
\hspace*{\fill}
\caption{\label{figure:rdataviewer}The two windows that are generated
by the \code{view()} function:  a \tcltk{} window (on the left) and 
an \R{} graphics window (on the right).}
\end{figure}

The \R{} graphics window displays the current view of the data set.
This will typically be just a portion of the data set, as in
Figure \ref{figure:rdataviewer}.
The \tcltk{} window serves two functions:  it allows the user to 
modify the current view of the data set (through various key strokes) 
and it displays a diagram of the overall shape of the data set
with a red rectangle to represent what subset of the data is currently
being viewed.  In Figure \ref{figure:rdataviewer}, we see that the
\code{mtcars} data set is a square-ish two-dimensional structure
(grey and white regions are used to indicate different columns within the
data set) and we are currently 
viewing roughly the top-left quarter of the data.
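
A diagram of this kind can be approximated with a few calls to the
\pkg{grid} package.  The following is an illustrative sketch only, not the
drawing code used by \pkg{rdataviewer}; the function \code{drawShape()},
its arguments, and the way that column widths are estimated are all
assumptions made for this example.

<<eval=FALSE>>=
library("grid")
# Hypothetical sketch: draw the shape of a data set with 'nr' rows and
# column widths 'widths' (in characters), then outline the visible
# rows ('rows') and columns ('cols') in red.
drawShape <- function(nr, widths, rows, cols) {
    right <- cumsum(widths) / sum(widths)
    left <- right - widths / sum(widths)
    grid.newpage()
    pushViewport(viewport(width = 0.9, height = 0.9))
    # One full-height rectangle per column, alternately grey and white
    grid.rect(x = left, y = 0, width = right - left, height = 1,
              just = c("left", "bottom"),
              gp = gpar(col = NA,
                        fill = rep(c("grey80", "white"),
                                   length.out = length(widths))))
    # Border around the whole data set
    grid.rect(gp = gpar(fill = NA))
    # Red outline around the current view (row 1 is at the top)
    grid.rect(x = left[min(cols)], y = 1 - max(rows) / nr,
              width = right[max(cols)] - left[min(cols)],
              height = (max(rows) - min(rows) + 1) / nr,
              just = c("left", "bottom"),
              gp = gpar(col = "red", fill = NA, lwd = 2))
    popViewport()
    # Return the horizontal extent of the view as proportions
    invisible(c(left = left[min(cols)], right = right[max(cols)]))
}
# Estimate each column width from its name and its formatted values
widths <- pmax(nchar(names(mtcars)),
               sapply(mtcars, function(x) max(nchar(format(x)))))
drawShape(nrow(mtcars), widths, rows = 1:16, cols = 1:5)
@ 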

Navigation of the current view is row- and column-based.  For example,
the right arrow navigates the view one column to the right and the down arrow
navigates the current view one row down.  The Page Down and Page
Up keys, plus control-home and control-end, also work.  
The \tcltk{} window must be the active window, otherwise key strokes
are ignored.

Of more relevance to this article are the key strokes for \emph{zooming} the 
current view.  With the shift key down, the right arrow increases the 
zoom factor so that \emph{at least} one \emph{fewer} column is visible.  
With the control
key down, the right arrow decreases the zoom factor so that at least one 
\emph{more} column is visible.  For example, Figure
\ref{figure:navzoom} shows the two \pkg{rdataviewer} windows 
after the right arrow key has been pressed twice, with the control
key held down;  now all of the \code{wt} and \code{qsec}
columns are also visible.  Notice also that the red region within
the \tcltk{} window has grown to reflect the fact that a larger
portion of the data is now visible.
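
The rule that a zoom step reveals at least one more complete column can be
expressed in terms of column widths.  The sketch below is hypothetical and
is not taken from the package: it assumes that the view is described by the
number of characters that fit across the window (\code{chars}), so that
zooming out means increasing that number (and reducing the font size to
match).  The column widths used here are made-up values.

<<eval=FALSE>>=
# Hypothetical sketch of column-based zooming.
# 'widths' are column widths in characters, 'first' is the leftmost
# visible column, 'chars' is how many characters fit across the view.
visibleCols <- function(widths, first, chars) {
    fits <- cumsum(widths[first:length(widths)]) <= chars
    sum(fits)    # number of complete columns visible
}
zoomOut <- function(widths, first, chars) {
    n <- visibleCols(widths, first, chars)
    # Index of the next column to reveal (or the last column)
    last <- min(first + n, length(widths))
    # Smallest view width that shows one more complete column
    max(chars, sum(widths[first:last]))
}
widths <- c(20, 5, 4, 6, 4, 5, 6, 6)    # made-up widths
chars <- 40
visibleCols(widths, 1, chars)
chars <- zoomOut(widths, 1, chars)
visibleCols(widths, 1, chars)
@ 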

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{tcltkzoom.png}
\hspace*{\fill}
\includegraphics[width=3in]{rgraphicszoom.png}
\hspace*{\fill}
\caption{\label{figure:navzoom}The two windows 
from Figure \ref{figure:rdataviewer} after two right arrow
key strokes, with the control key held down.
The current view has zoomed out so that two more complete columns are
now visible.  }
\end{figure}

The \code{mtcars} example demonstrates the first idea of showing the overall
shape of the data set while viewing only a subset of the actual values,
plus the idea of allowing simple zooming to control how much of the
data set is being viewed.
The next section looks at part of the design of the \pkg{rdataviewer} package 
in more detail in order to demonstrate the ideas for viewing
large data sets.

\section[The design of rdataviewer]{The design of \pkg{rdataviewer}}

The \pkg{rdataviewer} package is based on four fundamental classes:
a \code{ViewerData} object provides the data set to be viewed;
a \code{ViewerState} object records the current view of the data
(what portion of the data is visible);
a \code{ViewerDevice} object provides somewhere to draw the current
view of the data; 
and a \code{Viewer} object contains one of each 
of the three other objects, provides an interface to
change the current view, and coordinates the updating of the
\code{ViewerDevice} with changes in the \code{ViewerState}.
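
The relationship between these four classes can be sketched with
\proglang{S4} class definitions.  The skeleton below is illustrative only:
the class names carry a \code{Sketch} prefix to make clear that they are not
the package's classes, and all slot names are assumptions.

<<eval=FALSE>>=
library("methods")
# Virtual base classes for the data source and the drawing surface
setClass("SketchViewerData", representation("VIRTUAL"))
setClass("SketchViewerDevice", representation("VIRTUAL"))
# The state records which part of the data is visible (assumed slots)
setClass("SketchViewerState",
         representation(firstRow = "numeric", firstCol = "numeric",
                        fontsize = "numeric"))
# A viewer holds one object of each of the other three classes
setClass("SketchViewer",
         representation(data = "SketchViewerData",
                        state = "SketchViewerState",
                        dev = "SketchViewerDevice"))
# Minimal concrete subclasses so that a viewer can be constructed
setClass("SketchVectorData", representation(x = "character"),
         contains = "SketchViewerData")
setClass("SketchNullDevice", representation(name = "character"),
         contains = "SketchViewerDevice")
v <- new("SketchViewer",
         data = new("SketchVectorData", x = letters),
         state = new("SketchViewerState",
                     firstRow = 1, firstCol = 1, fontsize = 10),
         dev = new("SketchNullDevice", name = "none"))
is(v@data, "SketchViewerData")
@ 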

The \pkg{rdataviewer} package provides a \code{Viewer} interface
that is based on the \pkg{tcltk} GUI toolkit.  This allows the
package to be used without having to install a separate GUI toolkit like
\gtk{} or \qt{}.  However, the design allows for different front ends
to be added.
The package also provides a \code{ViewerDevice} 
based on the current \R{} graphics device and a default \code{ViewerState}.

All we need to provide for the \code{view()} function is a 
\code{ViewerData} object.  The \code{view()} function will accept 
raw \R{} data structures as we saw in the previous section, 
but we can obtain greater
control if we explicitly generate a \code{ViewerData} object ourselves.
For example, the result shown in Figure \ref{figure:rdataviewer}
could also have been obtained with the following code, which 
explicitly creates a \code{ViewerData} object from the 
\code{mtcars} data frame.

<<>>=
view(viewerDataFrame(mtcars))
@ 

\subsection[The ViewerData class]{The \code{ViewerData} class}

A \code{ViewerData} object contains the actual data set that
is to be viewed and it is required to provide the following information about 
the data set:
\begin{description}
\item[dimensions:]The total number of rows and columns in the data set.
\item[column widths:]  The width of each column (number of characters).
\item[column names:] A name for each column (for a specified range
of columns).
\item[text representation:] A text version of the data set (for a 
specified range of rows and columns).
\end{description}

For example, when the data set is a data frame, like \code{mtcars}, the
\code{viewerDataFrame()} function creates a \code{ViewerData} object
containing the data frame and it calculates the dimensions of the
data set using the \code{dim()} function.
The column names, column widths, and text representation of the 
data set are all obtained by capturing the output of a call to 
\code{print()} for the 
appropriate subset of the data frame (the rows and columns of the 
data frame that are currently visible).
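
The idea of recovering the text, the column names, and the widths from
printed output can be tried directly in base \R{}.  The following sketch is
not the package's code, and the choice of rows and columns is arbitrary;
it simply shows that the captured output is a character vector with one
header line followed by one line per row.

<<>>=
# Capture what print() would show for a small subset of the data frame
txt <- capture.output(print(mtcars[1:3, 1:4]))
txt
# Width in characters of the header line
nchar(txt[1])
@ 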

The reason for this design of the \code{ViewerData} class is 
that it allows new classes to be derived for other kinds of data sets.

\subsection[Extending the ViewerData class]{Extending the \code{ViewerData} class}
\label{section:extend}

The \code{viewerDataFrame()} function creates an object of class
\code{ViewerDataFrame}, which extends the \code{ViewerData} class and
allows \R{} data frames to be viewed with the \code{view()} function.
We can extend the \code{ViewerData} class in other ways to allow
other data sources to be viewed. 
For example, 
the \code{viewerDataText()} function creates an object of class
\code{ViewerDataText}, which allows plain text files to be viewed.

A \code{ViewerDataText} object contains the name of a text file
and provides the required information from the file in the following
ways:

\begin{description}
\item[dimensions:]There is only one ``column'' and, when the
\code{ViewerDataText} object is created, the file is searched
for line breaks to determine how many lines (``rows'') it has.
\item[column widths:]  The width of the single column is 
calculated from the maximum number of characters between line breaks.
\item[column names:] This is just the name of the file.
\item[text representation:] A file ``seek'' is used to read only the 
rows of the file that are currently being viewed.
\end{description}
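
The steps above can be sketched with base \R{} connections.  This is a
minimal sketch under stated assumptions, not the code used by
\code{ViewerDataText}: the functions \code{indexLines()} and
\code{readRows()} are hypothetical, and the example uses a small temporary
file.  Note that the \R{} documentation for \code{seek()} discourages its
use on Windows.

<<eval=FALSE>>=
# Hypothetical sketch: record the byte offset at which each line
# starts, reading the file in chunks so it is never held in memory.
indexLines <- function(path, chunk = 1e6) {
    con <- file(path, "rb")
    on.exit(close(con))
    offsets <- 0
    pos <- 0
    repeat {
        bytes <- readBin(con, "raw", n = chunk)
        if (length(bytes) == 0) break
        nl <- which(bytes == as.raw(10L))    # line feed characters
        offsets <- c(offsets, pos + nl)
        pos <- pos + length(bytes)
    }
    # Drop an offset that points at the end of the file
    offsets[offsets < pos]
}
# Read only lines 'first' to 'last' by seeking to the first of them
readRows <- function(path, offsets, first, last) {
    con <- file(path, "rb")
    on.exit(close(con))
    seek(con, where = offsets[first])
    readLines(con, n = last - first + 1)
}
tmp <- tempfile()
writeLines(sprintf("line %d", 1:1000), tmp)
idx <- indexLines(tmp)
length(idx)                     # the number of lines ("rows")
readRows(tmp, idx, 500, 502)
@ 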

The \code{ViewerDataText} implementation allows very large files to be 
viewed because only the viewed portion of the file is ever read into memory.
For example, the following code shows a view of the start of a
300 MB \xml{} file with 4,772,013 lines
(see Figure \ref{figure:viewmetrix}).  

% NOTE: got permission from Ian Howell via Joe Gamman (2009-08-06 dataViewer/)

% NOTE: I unnecessarily asked for permission again (2011-10-03 dataViewer/)
% Hope that doesn't undo the first one!

% See /home/fos/pmur002/Talks/Dept2009/Metrix/ for more code and files

{\small
<<viewmetrix, eval=FALSE>>=
view(viewerDataText("/scratch/Metrix/BlindData/MEENDL30102008_001-blind.XML",
                    index=TRUE))
@ 
}

<<echo=FALSE, eval=FALSE>>=
# For manual testing
x11(width=4, height=4)
view(viewerDataText("/scratch/Metrix/BlindData/MEENDL30102008_001-blind.XML"))
@ 

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2in]{tcltkmetrix.png}
\hspace*{\fill}
\includegraphics[width=3in]{rgraphicsmetrix.png}
\hspace*{\fill}
\caption{\label{figure:viewmetrix}The 
\pkg{rdataviewer} windows for viewing the Metrix data set
(a 300 MB \xml{} file).}
\end{figure}

Although it may take a few seconds for the windows to appear,
just being able to view this file is a minor victory.  For example,
heavyweight text editors like \proglang{vi} \citep{vi} 
and \proglang{Emacs} \citep{emacs}
struggle with it (\proglang{vi} takes several minutes to open the
file and \proglang{Emacs} fails to open it at all).

Furthermore, navigation within the file is almost instantaneous.
For example, it is possible to navigate to roughly the middle of the file by typing 
\code{2000000} and then \code{G}.  It is also possible to zoom
the view as normal.  For example, Figure \ref{figure:zoommetrix}
shows the view after pressing Page Up several times, with the control
key held down.  In this view, the raw data values are not legible,
but it is still interesting to see and explore 
the regular structure of the raw values.
Also shown in Figure \ref{figure:zoommetrix} is this zoomed view after
browsing down the file by pressing Page Down several times.
This shows a clear change in the data.

\begin{figure}
\hspace*{\fill}
\includegraphics[width=2.9in]{zoommetrix.png}
\hspace*{\fill}
\includegraphics[width=2.9in]{changemetrix.png}
\hspace*{\fill}
\caption{\label{figure:zoommetrix}Two different views of the Metrix data set.
On the left, the original view (Figure \ref{figure:viewmetrix}) has been
zoomed out so that only the regular structure of the data values is visible.
On the right, this zoomed view has been browsed to discover a phase change
in the structure of the data values.}
\end{figure}

We can zoom in to this area of the file to see that the change
involves the disappearance of \code{QUALITYFLAG} values after
row 2936 (see Figure \ref{figure:changemetrix}).

\begin{figure}
\hspace*{\fill}
\includegraphics[width=4in]{zoomchangemetrix.png}
\hspace*{\fill}
\caption{\label{figure:changemetrix}A zoomed view of the phase change
in the structure of the Metrix data set (see Figure \ref{figure:zoommetrix}).}
\end{figure}

The \code{ViewerData} class has also been extended via the
\code{ViewerDataMySQL} class so that an external \proglang{MySQL} database
can be used as a data source for the \code{view()} function
(see the \code{viewerDataMySQL()} function).
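
For a database source, the natural way to fetch only the visible rows is to
ask the database for one window of rows at a time.  The following sketch
only builds the query text; the helper \code{windowQuery()}, the table name,
and the use of \code{LIMIT} and \code{OFFSET} are assumptions for
illustration and do not describe how \code{ViewerDataMySQL} is implemented.
With an open database connection, the resulting string could be passed to
a function such as \code{dbGetQuery()} from the \pkg{DBI} package.

<<eval=FALSE>>=
# Hypothetical helper: SQL for 'nRows' rows starting at row 'firstRow'
windowQuery <- function(table, firstRow, nRows) {
    sprintf("SELECT * FROM %s LIMIT %d OFFSET %d",
            table, as.integer(nRows), as.integer(firstRow - 1))
}
# "measurements" is a made-up table name
windowQuery("measurements", firstRow = 101, nRows = 50)
@ 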

\section{Discussion}

The \pkg{rdataviewer} package is a software prototype for experimenting
with several new ideas for viewing data sets.  The package has one main
function, \code{view()}, which creates two windows:  one for viewing the
data set and one for controlling what portion of the data set is visible.
A feature of the latter window is a diagrammatic representation of
the overall shape of the data set being viewed, which may help
students to understand the nature of data structures.
A feature of the navigation interface is that it is possible to zoom
the view of the data, so that useful amounts of large data sets can be
viewed at once.  
The \pkg{rdataviewer} package is designed so that only the viewed portion
of the data set is needed in memory at any one time.  This allows large
data sets to be viewed and browsed.  

The package is built upon a set of \code{S4} classes and generic functions
so that it is easy for others to experiment further with alternative
user interfaces, alternative displays of the data, and alternative
data sources.

\subsection{Future directions}

The \pkg{rdataviewer} package is primarily a test implementation for
trying out different ideas for viewing data sets.  It is hoped
that some of these ideas might be adopted in more complete,
sophisticated, and user-friendly systems, not necessarily 
restricted to \R{}.

Nevertheless, the package itself may provide a useful tool if no other
tool exists for navigating large raw data sets.  Furthermore,
the package may provide a basis for further experimentation, particularly
with regard to alternative data sources.  For example, it would be 
interesting to experiment with an interface to data structures that are
stored on disk via the \pkg{ff} package \citep{pkg:ff}.

The demonstration in this article is limited to data set formats with 
a rectangular structure (vectors, data frames, and plain text files).
There is plenty of scope to explore other sorts of data structures,
particularly recursive structures like lists.

The implementation described in this article also considers only
the task of \emph{viewing} the data.  It would also be 
useful to experiment with allowing editing of data values, if
only to transform the data (for example, reordering of the rows).

% There is no support for character encodings in text files.

\bibliography{rdataviewer}


\end{document}