

5.3 Binary formats


In this section we will consider the option of storing data using a binary format. The purpose of this section is to understand the structure of binary formats and to consider the benefits and drawbacks of using a binary format to store data.

A binary format is a more complex storage solution than a plain text format, but it will typically provide faster and more flexible access to the data and use up less memory.

A file with a binary format is simply a block of computer memory, just like a file with a plain text format. The difference lies in how the bytes of computer memory are used.

In order to understand the difference, we need to learn a bit more about how computer memory works.


5.3.1 More on computer memory

We already know that one way to use a byte of computer memory is to interpret it as a character. This is what plain text formats do.

However, there are many other ways that we can make use of a byte of memory. One particularly straightforward use is to interpret the byte as an integer value.

A sequence of memory bits can be interpreted as a binary (base 2) number. For example, the byte 00000001 can represent the value 1, 00000010 can represent the value 2, 00000011 the value 3, and so on.

With 8 bits in a byte, we can store 256 ($2^8$) different integers (e.g., from 0 up to 255).

This is one way that binary formats can differ from plain text formats. If we want to store the value 48 in a plain text format, we must use two bytes, one for the digit 4 and one for the digit 8. In a binary format, we could instead just use one byte and store the binary representation of 48, which is 00110000.
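The comparison can be sketched in a few lines of Python. This is only an illustration; the variable names are made up for the example.

```python
# Store the value 48 as plain text: one byte per digit character.
as_text = "48".encode("ascii")

# Store the same value as a one-byte binary integer.
as_binary = (48).to_bytes(1, byteorder="big")

print(len(as_text), as_text.hex())                  # two bytes, shown in hexadecimal
print(len(as_binary), format(as_binary[0], "08b"))  # one byte, shown as bits
```

The text version needs one byte for each digit, while the binary version holds the whole value in a single byte.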

One limitation of representing a number in computer memory this way is that we cannot store large integers, so in practice, a binary format will store an integer value using two bytes (for a maximum value of $2^{16} - 1$ = 65,535) or four bytes (maximum of $2^{32} - 1$ = 4,294,967,295). If we need to store negative numbers, we can use 1 bit for the sign. For example, using two bytes per integer, we can get a range of -32,767 to 32,767 ($\pm(2^{15} - 1)$).
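As a rough check of these limits, the following Python sketch uses the standard struct module to pack integers into a fixed number of bytes. The variable names are illustrative. Note that struct stores signed integers using the two's complement scheme rather than a separate sign bit, so its two-byte signed range extends to -32,768, one value further than the range quoted above.

```python
import struct

# Largest values that fit in two and four bytes when every bit is used
# for the magnitude (unsigned integers).
max_two_bytes = 2**16 - 1
max_four_bytes = 2**32 - 1

# "<H" is the struct code for a two-byte unsigned integer (little-endian).
packed = struct.pack("<H", max_two_bytes)

# One more than the maximum no longer fits in two bytes.
try:
    struct.pack("<H", max_two_bytes + 1)
    overflowed = False
except struct.error:
    overflowed = True

# "<h" is a two-byte signed integer; struct uses two's complement here.
smallest_signed = struct.pack("<h", -32768)

print(max_two_bytes, max_four_bytes, packed.hex(), overflowed)
```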

Another way to interpret a byte of computer memory is as a real number. This is not as straightforward as for integers, but we do retain the basic idea that each pattern of bits corresponds to a different number.

In practice, a minimum of 4 bytes is used to store real numbers. The following example of raw computer memory shows the bit patterns corresponding to the numbers 0.1 to 0.5, where each number is stored as a real number using 4 bytes per number.

 0  :  11001101 11001100 11001100 00111101  |  0.1
 4  :  11001101 11001100 01001100 00111110  |  0.2
 8  :  10011010 10011001 10011001 00111110  |  0.3
12  :  11001101 11001100 11001100 00111110  |  0.4
16  :  00000000 00000000 00000000 00111111  |  0.5

The correspondence between the bit patterns and the real numbers being stored is far from intuitive, but this is another common way to make use of bytes of memory.
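The bit patterns in the display above can be reproduced with a short Python sketch. It assumes that the numbers are stored in the widely used IEEE 754 single-precision layout with the least significant byte first (little-endian), which is what the displayed bytes are consistent with. The function name float32_bits is made up for this example.

```python
import struct

def float32_bits(x):
    """Return the four bytes of x, stored as a 4-byte real, as bit strings."""
    raw = struct.pack("<f", x)  # "<f": 4-byte IEEE 754 float, little-endian
    return " ".join(format(byte, "08b") for byte in raw)

for value in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(float32_bits(value), "|", value)
```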

As with integers, there is a maximum real number that can be stored using four bytes of memory. In practice, it is common to use eight bytes per real number, or even more. With 8 bytes per real number, the range is roughly $\pm 1.8 \times 10^{308}$.

However, there is another problem with storing real numbers this way. Recall that, for integers, the first state can represent 0, the second state can represent 1, the third state can represent 2, and so on. With $k$ bits, we can only go as high as the integer $2^k - 1$, but at least we know that we can account for all of the integers up to that point.

Unfortunately, we cannot do the same thing for real numbers. We could say that the first state represents 0, but what does the second state represent? 0.1? 0.01? 0.00000001? Suppose we chose 0.01, so the first state represents 0, the second state represents 0.01, the third state represents 0.02, and so on. We can now only go as high as $0.01 \times (2^k - 1)$, and we have missed all of the numbers between 0.01 and 0.02 (and all of the numbers between 0.02 and 0.03, and infinitely many others).

This is another important limitation of storing information on a computer: there is a limit to the precision that we can achieve when we store real numbers this way. Most real values cannot be stored exactly on a computer. Examples of this problem include not only exotic values such as transcendental numbers (e.g., $\pi$ and $e$), but also very simple everyday values such as $\frac{1}{3}$ or even 0.1. Fortunately, this is not as dreadful as it sounds, because even if the exact value cannot be stored, a value very close to the true value can be stored. For example, if we use eight bytes to store a real number, then we can store the distance of the earth from the sun to the nearest millimeter.
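The limited precision is easy to observe. The following Python sketch relies on the fact that Python stores real numbers using eight bytes, and uses the standard decimal module only to print the value that was actually stored. The variable names are illustrative.

```python
from decimal import Decimal

# Decimal(float) reveals the exact value that was actually stored for 0.1.
stored_tenth = Decimal(0.1)
print(stored_tenth)

# Tiny storage errors mean that arithmetic results may differ slightly
# from the value we expect.
total = 0.1 + 0.2
print(total == 0.3)              # the stored values are not exactly equal
print(abs(total - 0.3) < 1e-15)  # but they are extremely close
```

In practice, this is why real numbers are usually compared using a small tolerance rather than tested for exact equality.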

In summary, when information is stored in a binary format, the bytes of computer memory can be used in a variety of ways. To drive the point home even further, the following displays show exactly the same block of computer memory interpreted in three different ways.

First, we treat each byte as a single character.

0  :  74 65 73 74  |  test

Next, we interpret the memory as two, two-byte integers.

0  :  74 65 73 74  |  25972 29811

Finally, we can also interpret the memory as a single, four-byte real number.

0  :  74 65 73 74  |  7.713537e+31

Which one of these is the correct interpretation? It depends on which particular binary format has been used to store the data.
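The three interpretations can be reproduced with a short Python sketch. It assumes little-endian byte order for the integers and the real number, which is what the displayed values are consistent with. The variable names are made up for the example.

```python
import struct

# The same four bytes of memory that are shown in the displays.
block = bytes.fromhex("74657374")

as_characters = block.decode("ascii")      # four single-byte characters
as_integers = struct.unpack("<2H", block)  # two 2-byte unsigned integers
(as_real,) = struct.unpack("<f", block)    # one 4-byte real number

print(as_characters)
print(as_integers)
print(f"{as_real:.6e}")
```

Nothing about the bytes themselves changes between the three calls; only the rule used to interpret them differs.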

The characteristic feature of a binary format is that there is not a simple rule for determining how many bits or how many bytes constitute a basic unit of information.

It is necessary for there to be a description of the rules for the binary format that states what information is stored and how many bits or bytes are used for each piece of information.

Consequently, it is much harder to write software for binary formats, so less software is available to do the job.

Given that a description is necessary to have any chance of reading a binary format, proprietary formats, where the file format description is kept private, are extremely difficult to deal with. Open standards become more important than ever.

The main point is that we require specific software to be able to work with data in a binary format. However, on the positive side, binary formats are generally more efficient and faster than text-based formats. In the next section, we will look at an example of a binary format and use it to demonstrate some of these ideas.


5.3.2 Case study: Point Nemo (continued)

In this example, we will use the Point Nemo temperature data (see Section 1.1) again, but this time with the data stored in a binary format called netCDF.

The first 60 bytes of the netCDF format file are shown in Figure 5.7, with each byte interpreted as a single character.

Figure 5.7: The first 60 bytes of the netCDF format file that contains the surface temperatures at Point Nemo. The data are shown here as unstructured bytes to demonstrate that, without knowledge of the structure of the binary format, there is no way to determine where the different data values reside or what format the values have been stored in (apart from the fact that there are obviously some text values in the file).
 

 0  :  43 44 46 01 00 00 00 00 00 00 00 0a  |  CDF.........
12  :  00 00 00 01 00 00 00 04 54 69 6d 65  |  ........Time
24  :  00 00 00 30 00 00 00 00 00 00 00 00  |  ...0........
36  :  00 00 00 0b 00 00 00 02 00 00 00 04  |  ............
48  :  54 69 6d 65 00 00 00 01 00 00 00 00  |  Time........

We will learn more about this format below. For now, the only point to make is that, while we can see that some of the bytes in the file appear to be text values, it is not clear from this raw display of the data where the text data starts and ends, what values the non-text bytes represent, or what each value is for.

Compared to a plain text file, this is a complete mess and we need software that understands the netCDF format in order to extract useful values from the file.

5.3.3 NetCDF

The Point Nemo temperature data set has been stored in a format called network Common Data Form (netCDF). In this section, we will learn a little bit about the netCDF format in order to demonstrate some general features of binary formats.

NetCDF is just one of a huge number of possible binary formats. It is a useful one to know because it is widely used and it is an open standard.

The precise bit patterns in the raw computer memory examples shown in this section are particular to this data set and the netCDF data format. Any other data set and any other binary format would produce completely different representations in memory. The point of looking at this example is that it provides a concrete demonstration of some useful general features of binary formats.

The first thing to note is that the structure of a binary format tends to be much more flexible than that of a text-based format. A binary format can use any number of bytes, in any order.

Like most binary formats, the structure of a netCDF file consists of header information, followed by the raw data itself. The header information includes information about how many data values have been stored, what sorts of values they are, and where within the file the header ends and the data values begin.

Figure 5.8 shows a structured view of the start of the netCDF format for the Point Nemo temperature data. This should be contrasted with the unstructured view in Figure 5.7.

Figure 5.8: The start of the header information in the netCDF file that contains the surface temperatures at Point Nemo, with the structure of the netCDF format revealed so that separate data fields can be observed. The first three bytes are characters, the fourth byte is a single-byte integer, the next sixteen bytes are four-byte integers, and the last four bytes are characters. This structured display should be compared to the unstructured bytes shown in Figure 5.7.
 

   0  :  43 44 46                    |  CDF       
 
   3  :  01                          |  1       
 
   4  :  00 00 00 00 00 00 00 0a     |   0 10
  12  :  00 00 00 01 00 00 00 04     |   1  4 
 
  20  :  54 69 6d 65                 |  Time

The netCDF file begins with three bytes that are interpreted as characters, specifically the three characters ‘C’, ‘D’, and ‘F’. This start to the file indicates that this is a netCDF file. The fourth byte in the file is a single-byte integer and specifies which version of netCDF is being used, in this case, version 1, or the “classic” netCDF format. This part of the file will be the same for any (classic) netCDF file.

The next sixteen bytes in the file are all four-byte integers and the four bytes after that are each single-byte characters again. This demonstrates the idea of binary formats, where values are packed into memory next to each other, with different values using different numbers of bytes.
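The packing of differently sized values can be sketched in Python with the standard struct module. The bytes below are copied from Figure 5.8. The format string is an assumption that follows the description in the figure caption, with the four-byte integers read most significant byte first (big-endian), as the displayed values suggest. The variable names are made up, and no attempt is made here to say what each integer means.

```python
import struct

# The first 24 bytes of the file, copied from Figure 5.8.
header = bytes.fromhex(
    "434446" "01" "00000000" "0000000a" "00000001" "00000004" "54696d65"
)

# ">"  = big-endian with no padding between fields
# "3s" = three characters
# "B"  = one single-byte integer
# "4I" = four 4-byte integers
# "4s" = four characters
magic, version, *integers, name = struct.unpack(">3sB4I4s", header)

print(magic.decode("ascii"), version, integers, name.decode("ascii"))
```

Without the format string, which plays the role of the format description, there would be no way to know where one field ends and the next begins.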

Another classic feature of binary formats is that the header information contains pointers to the location of the raw data within the file and information about how the raw data values are stored. This information is not shown in Figure 5.8, but in this case, the raw data is located at byte 624 within the file and each temperature value is stored as an eight-byte real number.

Figure 5.9 shows the raw computer memory starting at byte 624 that contains the temperature values within the netCDF file. These are the first eight temperature values from the Point Nemo data set in a binary format.

Figure 5.9: The block of bytes within the Point Nemo netCDF file that contains the surface temperature values. Each temperature value is an eight-byte real number.
 

 624  :  40 71 6e 66 66 66 66 66  |  278.9
 632  :  40 71 80 00 00 00 00 00  |  280.0
 640  :  40 71 6e 66 66 66 66 66  |  278.9
 648  :  40 71 6e 66 66 66 66 66  |  278.9
 656  :  40 71 5c cc cc cc cc cd  |  277.8
 664  :  40 71 41 99 99 99 99 9a  |  276.1
 672  :  40 71 41 99 99 99 99 9a  |  276.1
 680  :  40 71 39 99 99 99 99 9a  |  275.6
...

In order to emphasize the difference in formats, the display below shows the raw computer memory from the plain text format of the Point Nemo data set. Compare these five bytes, which store the number 278.9 as five characters, with the first eight bytes in Figure 5.9, which store the same number as an eight-byte real number.

359  :  32 37 38 2e 39  |  278.9

The section of the file shown in Figure 5.9 also allows us to discuss the issue of speed of access to data stored in a binary format.

The fundamental issue is that it is possible to calculate the location of a data value within the file. For example, if we want to access the fifth temperature value, 277.8, within this file, we know with certainty, because the header information has told us, that this value is 8 bytes long and it starts at byte number 656: the offset of 624, plus 4 times 8 bytes (the size of each temperature value).
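The calculation can be sketched in Python. Because the real file is not reproduced here, the example builds an in-memory stand-in: 624 placeholder bytes where the header would be, followed by the eight values copied from Figure 5.9. The names DATA_OFFSET, VALUE_SIZE, and read_value are made up for the illustration, and the values are assumed to be big-endian eight-byte reals, as the displayed bytes suggest.

```python
import io
import struct

# Stand-in for the file: 624 placeholder bytes in place of the header,
# followed by the eight temperature values copied from Figure 5.9.
values_hex = (
    "40716e6666666666" "4071800000000000" "40716e6666666666"
    "40716e6666666666" "40715ccccccccccd" "407141999999999a"
    "407141999999999a" "407139999999999a"
)
fake_file = io.BytesIO(bytes(624) + bytes.fromhex(values_hex))

DATA_OFFSET = 624  # where the raw data begins
VALUE_SIZE = 8     # bytes per temperature value

def read_value(f, index):
    """Read the value at a zero-based index by jumping straight to it."""
    f.seek(DATA_OFFSET + index * VALUE_SIZE)
    (value,) = struct.unpack(">d", f.read(VALUE_SIZE))
    return value

print(read_value(fake_file, 4))  # the fifth temperature value
```

The call to seek() moves directly to the calculated position; none of the preceding bytes need to be read.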

Contrast this simple calculation with finding the fifth temperature value in a text-based format like the CSV format for the Point Nemo temperature data. The raw bytes representing the first few lines of the CSV format are shown in Figure 5.10.

Figure 5.10: A block of bytes from the start of the Point Nemo CSV file that contains the surface temperature values. Each character in the file occupies one byte.
 

  0  :  64 61 74 65 2c 74 65 6d 70 0a 20  |  date,temp. 
 11  :  31 36 2d 4a 41 4e 2d 31 39 39 34  |  16-JAN-1994
 22  :  2c 32 37 38 2e 39 0a 20 31 36 2d  |  ,278.9. 16-
 33  :  46 45 42 2d 31 39 39 34 2c 32 38  |  FEB-1994,28
 44  :  30 0a 20 31 36 2d 4d 41 52 2d 31  |  0. 16-MAR-1
 55  :  39 39 34 2c 32 37 38 2e 39 0a 20  |  994,278.9. 
 66  :  31 36 2d 41 50 52 2d 31 39 39 34  |  16-APR-1994
 77  :  2c 32 37 38 2e 39 0a 20 31 36 2d  |  ,278.9. 16-
 88  :  4d 41 59 2d 31 39 39 34 2c 32 37  |  MAY-1994,27
 99  :  37 2e 38 0a 20 31 36 2d 4a 55 4e  |  7.8. 16-JUN
110  :  2d 31 39 39 34 2c 32 37 36 2e 31  |  -1994,276.1

The fifth temperature value in this file starts at byte 97 within the file (using the same offsets as Figure 5.10, which count from 0), but there is no simple way to calculate that fact. The only way to find that value is to start at the beginning of the file and read one character at a time until we have counted six commas (the field separators in a CSV file). Similarly, because not all data values have the same length, in terms of number of bytes of memory, the end of the data value can only be found by continuing to read characters until we find the end of the line (in this file, the byte 0a).
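The sequential search can be sketched in Python. The bytes are rebuilt from Figure 5.10, and the function name find_value is made up for the illustration. Positions are counted from zero, as the offsets in Figure 5.10 are, so the scan reports that the value begins at offset 97, immediately after the sixth comma.

```python
# The start of the CSV file, rebuilt from the bytes in Figure 5.10.
csv_bytes = (
    b"date,temp\n"
    b" 16-JAN-1994,278.9\n"
    b" 16-FEB-1994,280\n"
    b" 16-MAR-1994,278.9\n"
    b" 16-APR-1994,278.9\n"
    b" 16-MAY-1994,277.8\n"
    b" 16-JUN-1994,276.1"
)

def find_value(data, commas_to_skip):
    """Scan from the first byte, one byte at a time, to locate a value."""
    position = 0
    commas_seen = 0
    while commas_seen < commas_to_skip:
        if data[position] == ord(","):
            commas_seen += 1
        position += 1
    start = position
    # Keep reading until the end-of-line byte (0a) marks the end of the value.
    while position < len(data) and data[position] != ord("\n"):
        position += 1
    return start, data[start:position].decode("ascii")

print(find_value(csv_bytes, 6))
```

Every byte before the value has to be inspected, so the amount of work grows with how far into the file the value lies.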

The difference is similar to finding a particular scene in a movie on a DVD disc compared to a VHS tape.

In general, it is possible to jump directly to a specific location within a binary format file, whereas it is necessary to read a text-based format from the beginning and one character at a time. This feature of accessing binary formats is called random access and it is generally faster than the typical sequential access of text files.

This example simply demonstrates how access to data stored in a binary file can be faster. It does not mean that access is always faster, and it certainly does not mean that access speed should be the deciding factor when choosing a data format. In some situations, however, a binary format will be a good choice because data can be accessed quickly.

5.3.4 PDF documents

It is worthwhile briefly mentioning Adobe's Portable Document Format (PDF) because so many documents, including research reports, are now published in this binary format.

While PDF is not used directly as a data storage format, it is common for a report in PDF format to contain tables of data, so this is one way in which data may be received.

Unfortunately, extracting data from a table in a PDF document is not straightforward. A PDF document is primarily a description of how to display information. Any data values within a PDF document will be hopelessly entwined with information about how the data values should be displayed.

One simple way to extract the data values from a PDF document is to use the text selection tool in a PDF viewer, such as Adobe Reader, to cut-and-paste from the PDF document to a text format. However, there is no guarantee that data values extracted this way will be arranged in tidy rows and columns.

5.3.5 Other types of data

So far, we have seen how numbers and text values can be stored in computer memory. However, not all data values are simple numbers or text values.

For example, consider the date value March 1st 2006. How should that be stored in memory?

The short answer is that any value can be converted to either a number or a piece of text and we already know how to store those sorts of values. However, the decision of whether to use a numeric or a textual representation for a data value is not always straightforward.

Consider the problem of storing information on gender. There are (usually) only two possible values: male and female.

One way to store this information would be as text: “male” and “female”. However, that approach would take up at least 4 to 6 bytes per observation. We could do better, in terms of the amount of memory used, by storing the information as an integer, with 1 representing male and 2 representing female, thereby only using as little as one byte per observation. We could do even better by using just a single bit per observation, with “on” representing male and “off” representing female.

On the other hand, storing the text value “male” is much less likely to lead to confusion than storing the number 1 or setting a bit to “on”; it is much easier to remember or intuit that the text “male” corresponds to the male gender.

An ideal solution would be to store just the numbers, so that we use up less memory, but also record the mapping that relates the value 1 to the male gender.
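One way to picture this solution is the following Python sketch, which stores one small integer per observation and keeps the mapping alongside. The coding (1 for male, 2 for female) follows the example above; the variable names and the sample observations are made up.

```python
# The mapping between stored codes and their meanings.
GENDER_LABELS = {1: "male", 2: "female"}

# Made-up sample observations, for illustration only.
observations = ["female", "male", "male", "female"]

# Store one single-byte integer per observation instead of the full text.
codes_by_label = {label: code for code, label in GENDER_LABELS.items()}
stored = bytes(codes_by_label[label] for label in observations)

# The mapping lets us recover the original text values when needed.
recovered = [GENDER_LABELS[code] for code in stored]

print(len(stored), list(stored), recovered == observations)
```

The stored codes are only meaningful as long as the mapping travels with them.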

5.3.5.1 Dates


The most common ways to store date values are as either text, such as “March 1 2006”, or as a number, for example, the number of days since the first day of January, 1970.

The advantage of storing a date as a number is that certain operations, such as calculating the number of days between two dates, become trivial. The problem is that the stored value only makes sense if we also know the reference date.
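Both points can be illustrated with a small Python sketch based on the standard datetime module. The reference date of 1 January 1970 follows the example above; the function names to_days and from_days are made up.

```python
from datetime import date, timedelta

# The reference date; 1 January 1970 follows the example in the text.
REFERENCE = date(1970, 1, 1)

def to_days(d):
    """Convert a date to the number of days since the reference date."""
    return (d - REFERENCE).days

def from_days(n):
    """Convert a stored day count back to a date."""
    return REFERENCE + timedelta(days=n)

stored = to_days(date(2006, 3, 1))
print(stored)

# With numeric dates, the gap between two dates is a simple subtraction.
print(stored - to_days(date(2006, 1, 1)))

# The same number means a different day under a different reference date.
print(from_days(stored), date(1900, 1, 1) + timedelta(days=stored))
```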

Storing dates as text avoids the problem of having to know a reference date, but a number of other complications arise.

One problem with storing dates as text is that the format can differ between different countries. For example, the second month of the year is called February in English-speaking countries, but something else in other countries. A more subtle and dangerous problem arises when dates are written in formats like this: 01/03/06. In some countries, that is the first of March 2006, but in other countries it is the third of January 2006.

The solution to these problems is to use the international standard for expressing dates, ISO 8601. This standard specifies that a date should consist of four digits indicating the year, followed by two digits indicating the month, followed by two digits indicating the day, with dashes in between each component. For example, the first day of March 2006 is written: 2006-03-01.
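The ambiguity, and the way that ISO 8601 removes it, can be sketched in Python using the standard datetime module. The variable names are illustrative.

```python
from datetime import date, datetime

ambiguous = "01/03/06"

# The same text gives two different dates, depending on the convention.
day_first = datetime.strptime(ambiguous, "%d/%m/%y").date()
month_first = datetime.strptime(ambiguous, "%m/%d/%y").date()
print(day_first.isoformat(), month_first.isoformat())

# An ISO 8601 date has only one reading.
iso_date = date.fromisoformat("2006-03-01")
print(iso_date.year, iso_date.month, iso_date.day)

# A side benefit: ISO 8601 dates sort correctly even when treated as text.
print(sorted(["2006-03-01", "2005-12-31", "2006-01-03"]))
```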

Dates (a particular day) are usually distinguished from date-times, which specify not only a particular day, but also the hour, minute, second, and even fractions of a second within that day. Date-times are more complicated to work with because they depend on location; mid-day on the first of March 2006 happens at different times for different countries (in different time zones). Daylight saving time just makes things worse.

ISO 8601 includes specifications for adding time information, including a time zone, to a date. As with simple dates, it is also common to store date-times as numbers, for example, as the number of seconds since the beginning of 1970.
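The dependence on location can be illustrated with a Python sketch. The two fixed offsets used here (UTC and thirteen hours ahead of UTC) are chosen purely for illustration; a real application would typically use named time zones so that daylight saving changes are handled as well.

```python
from datetime import datetime, timedelta, timezone

# Two fixed offsets, chosen for illustration: UTC and UTC+13:00.
utc = timezone.utc
plus_13 = timezone(timedelta(hours=13))

# Mid-day on the first of March 2006 in each location.
midday_utc = datetime(2006, 3, 1, 12, 0, tzinfo=utc)
midday_plus_13 = datetime(2006, 3, 1, 12, 0, tzinfo=plus_13)

# The ISO 8601 text form records the offset along with the date and time.
print(midday_plus_13.isoformat())

# As seconds since the beginning of 1970 (UTC), the two mid-days differ.
gap = midday_utc.timestamp() - midday_plus_13.timestamp()
print(gap / 3600, "hours apart")
```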


5.3.5.2 Money


There are two major issues with storing monetary values. The first is that the currency should be recorded; NZ$1.00 is very different from US$1.00. This issue applies, of course, to any value with a unit, such as temperature, weight, and distances.

The second issue with storing monetary values is that values need to be recorded exactly. Typically, we want to keep values to exactly two decimal places at all times. Monetary data may be stored in a special format to allow for this emphasis on accuracy.
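The need for a special format can be seen in a short Python sketch. Python's decimal module and a count of whole cents are shown here only as two possible examples of exact storage, not as the approach used by any particular data format. The variable names are made up.

```python
from decimal import Decimal

# Stored as ordinary real numbers, ten cents plus twenty cents
# is not exactly thirty cents.
as_real = 0.10 + 0.20
print(as_real == 0.30)

# A decimal type keeps exactly two decimal places.
as_decimal = Decimal("0.10") + Decimal("0.20")
print(as_decimal == Decimal("0.30"), as_decimal)

# Another option is to store a whole number of cents as an integer.
as_cents = 10 + 20
print(as_cents, f"{as_cents // 100}.{as_cents % 100:02d}")
```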


5.3.5.3 Metadata


In most situations, a single data value in isolation has no inherent meaning. We have just seen several explicit examples: a gender value stored as a number only makes sense if we also know which number corresponds to which gender; a date stored as a number of days only makes sense if we know the reference date to count from; monetary values, temperatures, weights, and distances all require a unit to be meaningful.

The information that provides the context for a data value is an example of metadata.

Other examples of metadata include information about how data values were collected, such as where data values were recorded and who recorded them.

How should we store this information about the data values?

The short answer is that each piece of metadata is just itself a data value, so, in terms of computer memory, each individual piece of metadata can be stored as either a number or as text.

The larger question is how to store the metadata so that it is somehow connected to the raw data values. Deciding what data format to use for a particular data set should also take into account whether and how effectively metadata can be included alongside the core data values.


Recap


A block of bytes in computer memory can be interpreted in many ways. A binary format specifies how each byte or set of bytes in a file should be interpreted.

Extremely large numbers cannot be stored in computer memory using standard representations, and there is a limit to the precision with which real numbers can be stored.

Binary formats tend to use up less memory and provide faster access to data compared to text-based formats.

It is necessary to have specific software to work with a binary format, so it can be more expensive and it can be harder to share data.

Paul Murrell

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 New Zealand License.