Manipulating the Network Packets Data Set with R

Introduction

The data are measurements made on each packet of information that passes through the University of Auckland Internet Gateway. These measurements include a timestamp that represents the time at which the packet reached the network location, and the size of the packet, as a number of bytes. The data set described here is provided as a space-delimited ASCII text file. This is a very small sample of just 46 packets.

Variables

The data set contains the following variables:

timestamp - The time at which the packet arrived at the Internet Gateway, as a number of seconds since January 1st 1970.

size - The size of the packet, in bytes.
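
Because the timestamp counts seconds from the start of 1970, it can be converted to a calendar date and time. The following is a minimal sketch, assuming the count is relative to midnight UTC (the description above does not name a time zone). It uses the first timestamp in the sample file; the variable names are illustrative only.

```r
# First timestamp in the sample file, in seconds since 1970-01-01.
ts <- 1156748010.47817

# Assumption: the origin is midnight UTC; the data description
# does not state a time zone.
arrival <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# Format to whole seconds (the fractional part is dropped by %S).
arrivalText <- format(arrival, "%Y-%m-%d %H:%M:%S")
print(arrivalText)
```

Passing a different tz value to format() would show the same instant in local time instead.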

Data format

The data set is provided as a space-delimited ASCII text file called packetSample.txt.

A larger file, containing packets from a single minute of network traffic, is also available (7MB).

The file NetPacketsMeta.xml provides a StatDataML description of the data set.

Q: Reading the data into R

The first task is to read the data set into R. The result should look like this:

          V1   V2
1 1156748010   60
2 1156748010 1254
3 1156748010 1514
4 1156748010 1494
5 1156748010  114
6 1156748010 1514

A:

packets <- read.table("packetSample.txt")
head(packets)
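
The default column names V1 and V2 are not very descriptive. As an optional variation (not part of the exercise), the sketch below names the columns after the variables described earlier and declares their types. So that it runs without the file, it reads six lines typed in from the start of the sample; with the real file, the file name would be given in place of the text argument.

```r
# The first six lines of the sample, typed in so that the example is
# self-contained; normally use read.table("packetSample.txt", ...).
sampleLines <- c("1156748010.47817 60",
                 "1156748010.47865 1254",
                 "1156748010.47878 1514",
                 "1156748010.47890 1494",
                 "1156748010.47892 114",
                 "1156748010.47891 1514")

# col.names and colClasses are standard read.table arguments;
# the names "timestamp" and "size" follow the variable list.
namedPackets <- read.table(text = sampleLines,
                           col.names = c("timestamp", "size"),
                           colClasses = c("numeric", "integer"))
str(namedPackets)
```

With named columns, later code can refer to namedPackets$timestamp rather than packets[, 1].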

Q: Displaying significant digits

Take a look at the original text file. You will see that the timestamp values appear to be much more precise than the values that are shown above when we read the file into R. For example, the first timestamp value is 1156748010.47817.

The problem is not that R cannot represent these numbers precisely; the problem is that R does not display numbers to full accuracy by default.

Change the number of significant digits that R displays so that we can see these values with full precision. The result should look like this:

                V1   V2
1 1156748010.47817   60
2 1156748010.47865 1254
3 1156748010.47878 1514
4 1156748010.47890 1494
5 1156748010.47892  114
6 1156748010.47891 1514

A:

options(digits=15)
head(packets)
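
Changing the digits option affects all later output in the session. The sketch below shows two alternatives that leave the global setting alone, plus a way to change the option temporarily and put it back. It uses two timestamps typed in from the sample rather than the packets data frame, so that it runs on its own.

```r
# Two timestamps from the sample file.
x <- c(1156748010.47817, 1156748010.47865)

# Option 1: ask print() for more digits for this call only.
print(x, digits = 15)

# Option 2: format with a fixed number of decimal places.
fixed <- sprintf("%.5f", x)
print(fixed)

# Option 3: change the option, but keep the old value so that
# it can be restored afterwards.
startDigits <- getOption("digits")
old <- options(digits = 15)
print(x)
options(old)
```

The value returned by options() is a list of the previous settings, which is why passing it back to options() restores them.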

Q: Numerical precision

The timestamp values are both extremely large and extremely precise. In fact, they are approaching the limits of precision for storing numeric values in eight bytes (R stores real values as double-precision floating-point numbers, which carry roughly 15 to 16 significant decimal digits).

The following code calculates the average time gap between packets.

mean(diff(packets[, 1]))

[1] 0.00226244396633572

Now consider the following code, which reads in just the timestamp values, but ignores the first 8 digits in each timestamp.

packetLines <- readLines("packetSample.txt")
con <- textConnection(gsub("[0-9]{8}| [0-9]+$", "", packetLines))
precisePackets <- scan(con)
close(con)
head(precisePackets)

[1] 10.47817 10.47865 10.47878 10.47890 10.47892 10.47891

Why do we get a different answer when we calculate the average time gap between these values?

mean(diff(precisePackets))

[1] 0.00226244444444445

A:

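Calculations on values that are close to the limits of numerical precision can become inaccurate because the results are not representable to very many significant figures. A double-precision value holds only about 15 to 16 significant digits. The full timestamps spend 10 of those on the digits before the decimal point, so the differences between them are only accurate to about 6 or 7 decimal places. Dropping the first 8 digits leaves far more of the available digits for the fractional part, so the differences, and their mean, are calculated more accurately.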
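
The size of the effect can be estimated from the gap between neighbouring representable values at each magnitude. The sketch below assumes IEEE 754 double precision, which is what .Machine$double.digits reports on common platforms; the helper function spacing() is written for this illustration and is not part of R.

```r
# Gap between neighbouring doubles near x. A double stores
# .Machine$double.digits (normally 53) significant bits, so values in
# [2^e, 2^(e+1)) are spaced 2^(e - 52) apart.
# spacing() is a hypothetical helper, defined here for illustration.
spacing <- function(x) {
    2^(floor(log2(abs(x))) - (.Machine$double.digits - 1))
}

fullGap  <- spacing(1156748010.47817)  # a full timestamp
shortGap <- spacing(10.47817)          # after dropping the first 8 digits

print(c(full = fullGap, short = shortGap))

# How many times finer the representation of the shortened values is.
print(fullGap / shortGap)
```

The gap for the full timestamps is a little over two ten-millionths of a second, which is why differences between them cannot be trusted beyond about the seventh decimal place.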

The following code restores the default setting for the number of significant digits that R displays.

options(digits=7)

Paul Murrell
