9. Data Processing

In previous chapters, we have encountered some useful tools for specific computing tasks. For example, in Chapter 7 we learned about SQL for extracting data from a relational database.

Given a data set that has been stored in a relational database, we could now write a piece of SQL code to extract data from that data set. But what if we want to extract data from the database at a specific time of day? What if we want to repeat that task every day for a year? SQL has no concept of when to execute a task.

In Chapter 5, we learned about the benefits of the XML language for storing data, but we also learned that it is a very verbose format. We could conceive of designing the structure of an XML document, but would we really be prepared to write large amounts of XML by hand? What we want to be able to do is to get the computer to write the file of XML code for us.
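As a rough illustration of getting the computer to write XML for us, the following sketch uses R (the language adopted later in this chapter) to generate one XML element per data value. The element names `temperatures` and `reading`, and the data values themselves, are invented for this example.

```r
# Hypothetical data values; not taken from any real data set
values <- c(12.5, 14.1, 13.8)

# Build one <reading> element per value
readings <- paste0("  <reading>", values, "</reading>")

# Wrap the generated elements in a root element
xml <- c("<temperatures>", readings, "</temperatures>")

# Write the XML to a temporary file and also show it on screen
xmlFile <- tempfile(fileext = ".xml")
writeLines(xml, xmlFile)
cat(xml, sep = "\n")
```

The same few lines of code would work unchanged for three values or three thousand, which is the point of generating verbose formats by program rather than by hand.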

In order to perform these sorts of tasks, we need a programming language. With a programming language, we will be able to tell the computer to perform a task at a certain time, or to repeat a task a certain number of times. We will be able to create files of information and we will be able to perform calculations with data values.
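A minimal sketch of two of these building blocks, written in R (the language adopted later in this chapter): asking the computer's clock for the current time, and repeating a task a fixed number of times. The "task" here is only a stand-in that records which repetition ran. Running code at a set time of day is often delegated to an operating-system scheduler, which this sketch does not attempt to show.

```r
# Ask the computer's clock for the current date and time
now <- Sys.time()
print(now)

# Repeat a task three times; the task is a placeholder that
# just records a label for each repetition
runs <- character(0)
for (i in 1:3) {
    runs <- c(runs, paste("run", i))
}
print(runs)
```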

The purpose of this chapter is to explore and enumerate some of the tasks that a programming language will allow us to perform. As with previous topics, it is important to be aware of what tasks are actually possible as well as to look at the technical details of how to carry out each task.

As we might expect, a programming language will let us do a lot more than the specific languages like HTML, XML, and SQL can do, but this will come at a cost because we will need to learn a few more complex concepts. However, the potential benefit is limitless. This is the chapter where we truly realize the promise of taking control of our computing environment.

The other important purpose of this chapter is to introduce a specific programming language, so that we can perform tasks in practice.

There are many programming languages to choose from, but in this chapter we will use the R language because it is relatively simple to learn and because it is particularly well suited to working with data.

Computer hardware

So, what can we do with a programming language?

To answer that question, it will again be useful to compare a programming language with the more specific languages that we have already met, and with what those languages let us do.

SQL lets us talk to database systems; we can ask database software to extract specific data from database tables. With HTML, we can talk to web browsers; we can instruct the browser software to draw certain content on a web page. Working with XML is a bit more promiscuous because we are essentially speaking to any software system that might be interested in our data. However, we are limited to only being able to say “here are the data”.

A programming language is more general and more powerful than any of these, mainly because a programming language allows us to talk not just to other software systems, but also to the computer hardware.

In order to understand the significance of this, we need to have a very basic understanding of the fundamental components of computer hardware and what we can do with them. Figure 9.1 shows a simple diagram of the basic components in a standard computing environment, and each component is briefly described below.

**Figure 9.1:** A diagram illustrating the basic components of a standard computing environment.

The Central Processing Unit (CPU) is the part of the hardware that can perform calculations or process our data. It is only capable of a small set of operations--basic arithmetic, plus the ability to compare values, and the ability to shift values between locations in computer memory--but it performs these tasks exceptionally quickly. Complex tasks are performed by combining many, many simple operations in the CPU.

The CPU also has access to a clock for determining times.

Being able to talk to the CPU means that we can perform arbitrarily complex calculations. This starts with simple things like determining minima, maxima, and averages for numeric values, but includes sophisticated data analysis techniques, and there really is no upper bound.
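The simplest of these calculations can be sketched in R as follows; the data values are hypothetical.

```r
# Hypothetical numeric values
x <- c(3, 8, 1, 6)

smallest <- min(x)   # 1
largest <- max(x)    # 8
average <- mean(x)   # 4.5

print(c(smallest, largest, average))
```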

Computer memory, the hardware where data values are stored, comes in many different forms. Random Access Memory (RAM) is the term usually used to describe the memory that sits closest to the CPU. This memory is temporary--data values in RAM only last while the computer is running (they disappear when the computer is turned off)--and it is fast--values can be transferred to the CPU for processing and the result can be transferred back to RAM very rapidly. RAM is also usually relatively small.

Loosely speaking, RAM corresponds to the popular notion of short-term memory.

All processing of data typically involves, at a minimum, both RAM and the CPU. Data values are stored temporarily in RAM, shifted to the CPU to perform arithmetic or comparisons or something more complex, and the result is then stored back in RAM. A fundamental feature of a programming language is the ability to store values in RAM and specify the calculations that should be carried out by the CPU.

Being able to store data in RAM means that we can accomplish a complex task by performing a series of simpler steps. After each step, we record the intermediate result in memory so that we can use that result in subsequent steps.
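A small sketch of this idea in R, using invented values: each step stores its result under a name, and later steps reuse those stored results.

```r
# Hypothetical data values held in RAM
x <- c(2, 4, 6, 8)

# Step 1: add up the values
total <- sum(x)             # 20

# Step 2: count the values
n <- length(x)              # 4

# Step 3: use both intermediate results to get the average
average <- total / n        # 5

# Step 4: use the average to see how far each value is from it
deviations <- x - average   # -3 -1 1 3
print(deviations)
```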

The keyboard is one example of input hardware.

Most computers also have a mouse or touchpad. These are also examples of input hardware, but for the purposes of this book we are mostly interested in being able to enter code and data via a keyboard.

We will not be writing code to control the keyboard; this hardware component is more relevant to us as the primary way in which we will communicate our instructions and data to the computer.

Most computers have a very large repository of computer memory, known as mass storage, such as a hard drive. Mass storage is much larger and much slower than RAM, but it has the significant advantage that data values persist when the power goes off. It is where we save all of our files and documents.

A related set of hardware components includes external storage devices, such as CD and DVD drives and thumb drives (memory sticks), which allow the data to be physically transported away from the machine.

Where RAM corresponds to short-term memory, mass storage corresponds to long-term memory.

It is essential to be able to access mass storage because that is where the original data values will normally reside. With access to mass storage, we can also permanently store the results of our data processing as well as our computer code.

The computer screen is one example of output hardware. The screen is important as the place where text and images are displayed to show the results of our calculations.

Being able to control what is displayed on screen is important for viewing the results of our calculations and possibly for sharing those results with others.

Most modern computers are connected to a network of some kind, which consists of other computers, printers, and so on, and, in many cases, the internet as a whole.

As with mass storage, the importance of having access to the network is that this may be where the original data values reside.

A programming language will allow us to work with these hardware components to perform all manner of useful tasks, such as reading files or documents from a hard disk into RAM, calculating new values, and displaying those new values on the computer screen.
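That sequence of reading from disk, calculating, and displaying might look like the following in R. This is only a sketch: the file is a temporary one created on the spot to stand in for real data on mass storage, and its contents and the column name `celsius` are invented.

```r
# Create a small file to stand in for data on mass storage
dataFile <- tempfile(fileext = ".csv")
writeLines(c("celsius", "10", "20", "30"), dataFile)

# Read the file from disk into RAM
temps <- read.csv(dataFile)

# Calculate new values from the values now held in RAM
temps$fahrenheit <- temps$celsius * 9 / 5 + 32

# Display the result on screen
print(temps)
```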

How this chapter is organized

This chapter begins with a task that we might naturally think to perform by hand, but which can be carried out much more efficiently and accurately if instead we write code in a programming language to perform the task. The aims of this opening section are to show how useful a little programming knowledge can be and to demonstrate how an overall task can be broken down into smaller tasks that we can perform by writing code.

Sections 9.2 and 9.3 provide an initial introduction to the R programming language for performing these sorts of tasks. These sections will allow us to write and run some very basic R code.

Section 9.4 introduces the important idea of data structures--how data values are stored in RAM. In this section, we will learn how to enter data values by hand and how those values can be organized. All data processing tasks require data values to be loaded into RAM before we can perform calculations on the data values. Different processing tasks will require the data values to be organized and stored in different ways, so it is important to understand what options exist for storing data in RAM and how to work with each of the available options.
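As a brief preview, and only as a sketch with invented values, here are two ways of organizing hand-entered data in R: a vector, which holds values of a single type, and a data frame, which holds a table whose columns may be of different types.

```r
# A vector: a sequence of values of the same type
heights <- c(1.62, 1.75, 1.80)

# A data frame: a table with one text column and one numeric column
people <- data.frame(name = c("Ann", "Bob", "Cy"),
                     height = heights)

# Show a compact summary of how the values are organized
str(people)
```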

Some additional details about data structures are provided in Section 9.6, but before that, Section 9.5 provides a look at one of the most basic data processing tasks, which is extracting a subset from a large set of values. Being able to break a large data set into smaller pieces is one of the fundamental small steps that we can perform with a programming language. Solutions to more complex tasks are based on combinations of these small steps.
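A short sketch of subsetting in R, using hypothetical values: square brackets select values either by position or by a condition.

```r
# Hypothetical data values
x <- c(15, 3, 22, 8, 19)

# Subset by position: the first two values
firstTwo <- x[1:2]   # 15 3

# Subset by condition: only the values greater than 10
large <- x[x > 10]   # 15 22 19

print(firstTwo)
print(large)
```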

Section 9.7 addresses how to get data from external files into RAM so that we can process them. Most data sets are stored permanently in some form of mass storage, so it is crucial to know how to load data from various storage formats into RAM using R.

Section 9.8 describes a number of data processing tasks. Much of the chapter up to this point is laying the foundation. This section starts to provide information for performing powerful calculations with data values. Again, the individual techniques that we learn in this section provide the foundation for completing more complex tasks. Section 9.8.12 provides a larger case study that demonstrates how the smaller steps can be combined to carry out a more substantial data processing exercise.

Section 9.9 looks at the special case of processing text data. We will look at tools for searching within text, extracting subsets from text, and splitting and recombining text values. Section 9.9.2 describes regular expressions, which are an important tool for searching for patterns within text.
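The following sketch previews these kinds of text operations using functions from base R; the text values are invented for the example.

```r
# Hypothetical text values
txt <- c("apple pie", "banana split", "cherry tart")

# Search within text: which values contain "an"?
hasAn <- grepl("an", txt)          # FALSE TRUE FALSE

# Extract a subset of each text value: the first three characters
prefix <- substr(txt, 1, 3)        # "app" "ban" "che"

# Split each value into words, then recombine the first one
words <- strsplit(txt, " ")
joined <- paste(words[[1]], collapse = "-")   # "apple-pie"

# A regular expression: values that start with "b" or "c"
startsBC <- grepl("^[bc]", txt)    # FALSE TRUE TRUE
```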

Section 9.10 describes how to format the results of data processing, either for display on screen or for use in research reports.
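As a small taste of what formatting involves, this sketch uses base R functions to control how numbers appear; the values and the label text are hypothetical.

```r
# A hypothetical result to be reported
avg <- 1234.5678

# Round to two decimal places
rounded <- round(avg, 2)                    # 1234.57

# Embed a value in a label with one decimal place
label <- sprintf("Mean: %.1f", avg)         # "Mean: 1234.6"

# Insert commas to make a large number easier to read
big <- format(1234567, big.mark = ",")      # "1,234,567"

cat(label, "\n")
```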

Section 9.11 very briefly discusses some more advanced ideas about writing code in a programming language, and Section 10 contains a few comments on, and pointers to, alternative software and programming languages for data processing. These are more advanced topics and provide a small glimpse of areas to explore further.

Paul Murrell

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 New Zealand License.