5. Data Storage

Chapter 2 dealt with the ideas and tools that are necessary for producing computer code; the concern there was how we communicate our instructions to the computer.

In this chapter, we move over to the computer's view of the world for a while and investigate how the computer takes information and stores it.

There are several good reasons why researchers need to know about data storage options. One is that we may not have control over the format in which data is given to us. For example, data from NASA's Live Access Server is in a format decided by NASA and we are unlikely to be able to convince NASA to provide it in a different format. This means that we must know about different formats in order to gain access to data.

Another common situation is that we may have to transfer data between different applications or between different operating systems. This effectively involves temporary data storage, so it is useful to understand how to select an appropriate storage format.

We may also be involved in deciding the format for archiving a data set. There is no overall best storage format; the correct choice will depend on the size and complexity of the data set and what the data set will be used for. We need an overview of the relative merits of all of the data storage options in order to make an informed decision for any particular situation.

In this chapter, we will see a number of different data storage options and we will discuss the strengths and weaknesses of each.

How this chapter is organized

This chapter begins with a case study that is used to introduce some of the important ideas and issues that arise when data is stored electronically. The purpose of this section is to begin thinking about the problems that can arise and to begin establishing some criteria that we can use to compare different data storage options.

The remaining sections each describe one possible approach to data storage: Section 5.2 discusses plain text formats, Section 5.3 discusses binary formats, Section 5.4 looks briefly at the special case of spreadsheets, Section 5.5 describes XML, and Section 5.6 addresses relational databases.

Each section explains the relevant technology, including how to use it in an appropriate way, and discusses the strengths and weaknesses of that storage option.




Paul Murrell

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 New Zealand License.