The data are measurements made on each packet of information that passes through the University of Auckland Internet Gateway. These measurements include a timestamp, which records the time at which the packet reached the network location, and the size of the packet, as a number of bytes. The data set described herein is provided in a space-delimited ASCII text file format. This is a very small sample of just 46 packets.
The data set contains the following variables:
timestamp - The time at which the packet arrived at the Internet Gateway, as a number of seconds since January 1st 1970.
size - The size of the packet, in bytes.
The data set is provided as a space-delimited ASCII text file called packetSample.txt.
A larger file, containing packets from a single minute of network traffic, is also available (7MB).
The file NetPacketsMeta.xml provides a StatDataML description of the data set.
The first task is to read the data set into R. The result should look like this:
          V1   V2
1 1156748010   60
2 1156748010 1254
3 1156748010 1514
4 1156748010 1494
5 1156748010  114
6 1156748010 1514
A:
packets <- read.table("packetSample.txt")
head(packets)
Take a look at the original text file. You will see that the timestamp values appear to be much more precise than the values that are shown above when we read the file into R. For example, the first timestamp value is 1156748010.47817.
The problem is not that R cannot represent these numbers precisely; the problem is that R does not display numbers to full accuracy by default.
Change the number of significant digits that R displays so that we can see these values with full precision. The result should look like this:
                V1   V2
1 1156748010.47817   60
2 1156748010.47865 1254
3 1156748010.47878 1514
4 1156748010.47890 1494
5 1156748010.47892  114
6 1156748010.47891 1514
A:
options(digits=15)
head(packets)
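The digits option changes how numbers are displayed for the rest of the session. As a small sketch of an alternative, a single value can be displayed with more digits without changing the global setting. The variable names ts and shown are illustrative, and the value is the first timestamp typed in by hand.

```r
# First timestamp from the file, typed in by hand for illustration
ts <- 1156748010.47817

# Printed using whatever the current digits option is (7 by default)
print(ts)

# Ask for more significant digits for this one call only
print(ts, digits = 15)

# Or build a string with a fixed number of decimal places
shown <- sprintf("%.5f", ts)
shown
```

In each case only the display changes; the value stored in ts is the same.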
The timestamp values are both extremely large and extremely precise. In fact, they are approaching the limits of precision for storing numeric values in eight bytes (which is what R uses for real values), a format that holds roughly 15 to 16 significant decimal digits.
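One way to see how close the timestamps are to the limit is to estimate the gap between neighbouring representable numbers around one of them. The sketch below assumes R's usual double-precision storage and reads the number of significand bits from .Machine$double.digits; the names ts, exponent, and spacing are illustrative.

```r
# First timestamp from the file, typed in by hand for illustration
ts <- 1156748010.47817

# Largest power of two at or below the timestamp
exponent <- floor(log2(ts))

# Gap between adjacent representable values near ts
spacing <- 2^(exponent - (.Machine$double.digits - 1))
spacing

# The file records timestamps in steps of 1e-05 seconds, so the
# storage gap is only a few tens of times finer than the data
1e-05 / spacing
```

A gap of this size means that the fifth decimal place of each timestamp is stored with very little room to spare.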
The following code calculates the average time gap between packets.
mean(diff(packets[, 1]))
[1] 0.00226244396633572
Now consider the following code, which reads in just the timestamp values, but ignores the first 8 digits in each timestamp.
packetLines <- readLines("packetSample.txt")
con <- textConnection(gsub("[0-9]{8}| [0-9]+$", "", packetLines))
precisePackets <- scan(con)
close(con)
head(precisePackets)
[1] 10.47817 10.47865 10.47878 10.47890 10.47892 10.47891
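To see what the regular expression in this code does, it can be applied to a single line. The sample line below is an assumption about the file layout, built from the first timestamp and packet size shown earlier; the names line and trimmed are illustrative.

```r
# One line as it presumably appears in packetSample.txt
line <- "1156748010.47817 60"

# "[0-9]{8}" removes the first run of eight digits;
# " [0-9]+$" removes the space and packet size at the end of the line
trimmed <- gsub("[0-9]{8}| [0-9]+$", "", line)
trimmed

# Convert the remaining text to a number
as.numeric(trimmed)
```

What remains is the last two digits of the whole seconds plus the fractional part, which needs far fewer significant digits to store.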
Why do we get a different answer when we calculate the average time gap between these values?
mean(diff(precisePackets))
[1] 0.00226244444444445
A:
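Calculations on values that are close to the limits of numerical precision can become inaccurate. Each full timestamp uses nearly all of the significant digits that R can store, so its last decimal places are already slightly rounded. Subtracting two nearly equal timestamps cancels the leading digits and leaves a small difference in which that rounding error is relatively large. Once the first 8 digits are removed, the same fractional digits are stored much more accurately, so the differences, and hence their mean, are more accurate.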
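The effect can be reproduced on a small scale with the six timestamps shown earlier. This sketch types the values in by hand (the names full, short, and exact are illustrative) and compares each mean gap against the value worked out in decimal arithmetic, (10.47891 - 10.47817) / 5 = 0.000148.

```r
# First six timestamps, with and without the leading eight digits
full  <- c(1156748010.47817, 1156748010.47865, 1156748010.47878,
           1156748010.47890, 1156748010.47892, 1156748010.47891)
short <- c(10.47817, 10.47865, 10.47878, 10.47890, 10.47892, 10.47891)

# Mean gap worked out by hand in decimal arithmetic
exact <- 0.000148

# Error in the mean gap for each version
errFull  <- abs(mean(diff(full))  - exact)
errShort <- abs(mean(diff(short)) - exact)

# The full timestamps give a far larger error than the shortened ones
c(full = errFull, short = errShort)
```

The shortened values are not exact either, but their error is many orders of magnitude smaller than the error from the full timestamps.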
The following code restores the default setting for the number of significant digits that R displays.
options(digits=7)