9.1 Case study: The Population Clock

Recording the current world population estimate is very easy to do by hand: we simply navigate a web browser to the population clock web page and type out, cut-and-paste, or even just write down the current population value.

Navigating to a web page and downloading the information is not actually very difficult. This is an example of interacting with the network component of the computing environment. Downloading a web page is something that we can expect any modern programming language to be able to do, given the appropriate URL (which is visible in the “navigation bar” of the web browser in Figure 9.2).
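As a minimal sketch of the download step, R's readLines() function reads text from a file path or a URL and returns one character value per line. The address of the population clock page is not reproduced in this text, so the sketch reads from a small local stand-in file instead; the name pageLocation and the shortened HTML excerpt are assumptions made for illustration only.

```r
# Stand-in for the web page: a temporary local file holding a short,
# hypothetical excerpt of the population clock HTML.
pageLocation <- tempfile(fileext = ".html")
writeLines(c("<html>",
             "<body>",
             "    <div id=\"worldnumber\">6,617,746,521</div>",
             "</body>",
             "</html>"),
           pageLocation)

# readLines() accepts a file path or a URL and returns the text,
# one element per line. With the real page, pageLocation would be
# the URL shown in the browser (an assumption in this sketch).
clockHTML <- readLines(pageLocation)

# The result is plain text held in RAM: a character vector.
length(clockHTML)
class(clockHTML)
```

Only the value of pageLocation would need to change to read from the network rather than from a local file.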

We will not focus on understanding all of the details of the R code examples in this section; that is the purpose of the remainder of this chapter. The code is provided here as concrete evidence that the task can be done and as a simple visual indication of the level of effort and complexity involved.

Conceptually, the above code says “read the HTML code from the network location given by the URL and store it in RAM under the name clockHTML.” The images below illustrate this idea, showing how the information that we input at the keyboard (the URL) leads to the location of a file containing HTML code on the network, which is read into RAM and given the name clockHTML. The image on the left shows the main hardware components involved in this process in general and the image on the right shows the actual data values and files involved in this particular example. We will use diagrams like this throughout the chapter to illustrate which hardware components we are dealing with when we perform different tasks.

It is important to realize that the result of the R code above is not a nice picture of the web page like we see in a browser. Instead, we have the raw HTML code that describes the web page (see Figure 9.3).

This is actually a good thing because it would be incredibly difficult for the computer to extract the population information from a picture.

Figure 9.3: Part of the HTML code for the World Population Clock web page (see Figure 9.2). Line numbers cited in the text count from the first line of the listing and are just for reference.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" 
      xml:lang="en" lang="en">
<head>
    <title>World POPClock Projection</title>
    <link rel="stylesheet" 
          href="popclockworld%20Files/style.css" 
          type="text/css">
    <meta name="author" content="Population Division">
    <meta http-equiv="Content-Type" 
          content="text/html; charset=iso-8859-1">
    <meta name="keywords" content="world, population">
    <meta name="description" 
          content="current world population estimate">
    <style type="text/css">
        #worldnumber {
            text-align: center;
            font-weight: bold;
            font-size: 400%;
            color: #ff0000;
        }
    </style>
</head>
<body>
    <div id="cb_header">
    <a href="http://www.census.gov/">
    <img src="popclockworld%20Files/cb_head.gif" 
         alt="U.S. Census Bureau" 
         border="0" height="25" width="639">
    </a>
    </div>

    <h1>World POPClock Projection</h1>

    <p></p>
    According to the <a href="http://www.census.gov/ipc/www/">
    International Programs Center</a>, U.S. Census Bureau, 
    the total population of the World, projected to 09/12/07
    at 07:05 GMT (EST+5) is<br><br>
    <div id="worldnumber">6,617,746,521</div>
    <p></p>
    <hr>

...

The HTML code is better than a picture because the HTML code has a clear structure. If information has a pattern or structure, it is much easier to write computer code to navigate within the information. We will exploit the structure in the HTML code to get the computer to extract the relevant population value for us.

However, before we do anything with this HTML code, it is worth taking note of what sort of information we have. From Chapter 2, we know that HTML code is just plain text, so what we have downloaded is a plain text file. This means that, in order to extract the world population value from the HTML code, we will need to know something about how to perform text processing. We are going to need to search within the text to find the piece we want, and we are going to need to extract just that piece from the larger body of text.

The current population value on the web page is contained within the HTML code in a div element whose id attribute has the unique value "worldnumber" (line 41 in Figure 9.3). This makes it very easy to find the line that contains the population estimate because we just need to search for the pattern id="worldnumber". This text search task can be performed using the following code:
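A minimal sketch of this search uses R's grep() function, which returns the positions of the elements that contain a pattern. The clockHTML value below is a short hypothetical stand-in for the downloaded page, so the matching line is the third one rather than line 41 of the full listing.

```r
# Hypothetical stand-in for the downloaded HTML, one string per line.
clockHTML <- c("<html>",
               "<body>",
               "    <div id=\"worldnumber\">6,617,746,521</div>",
               "</body>",
               "</html>")

# grep() returns the position of every element containing the pattern;
# the result is stored in RAM under the name popLineNum.
popLineNum <- grep('id="worldnumber"', clockHTML)
```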

This code says “find the line of HTML code that contains the text id="worldnumber" and store the answer in RAM under the name popLineNum.” The HTML code is fetched from RAM, we supply the pattern to search for by typing it at the keyboard, the computer searches the HTML code for our pattern and finds the matching line, and the result of our search is stored back in RAM.

We can see the value that has been stored in RAM by typing the appropriate name.
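Sketched with the same hypothetical stand-in for the page, displaying the stored value looks like this. The call to is.numeric() is an addition for illustration; it simply confirms what sort of value was stored.

```r
# Hypothetical stand-in for the downloaded HTML, one string per line.
clockHTML <- c("<html>",
               "<body>",
               "    <div id=\"worldnumber\">6,617,746,521</div>",
               "</body>",
               "</html>")
popLineNum <- grep('id="worldnumber"', clockHTML)

# Typing the name on its own displays the value stored under that name.
popLineNum

# Check what sort of value this is.
is.numeric(popLineNum)
```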

Notice that the result this time is not text; it is a number representing the appropriate line within the HTML code.

Also notice that each time we store a value in RAM, we provide a label for the value so that we can access the value again later. We stored the complete set of HTML code with the label clockHTML, and we have now also stored the result of our search with the label popLineNum.

What we want is the actual line of HTML code rather than just the number telling us which line, so we need to use popLineNum to extract a subset of the text in clockHTML. This action is performed by the following code.
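A minimal sketch of this subsetting step, again using a short hypothetical stand-in for the downloaded page: square brackets select elements of a vector by position.

```r
# Hypothetical stand-in for the downloaded HTML, one string per line.
clockHTML <- c("<html>",
               "<body>",
               "    <div id=\"worldnumber\">6,617,746,521</div>",
               "</body>",
               "</html>")
popLineNum <- grep('id="worldnumber"', clockHTML)

# Use the line number to pick out just that line of the HTML code
# and store it in RAM under the name popLine.
popLine <- clockHTML[popLineNum]
```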

Again, this task involves using information that we already have in RAM to calculate a new data value, and we store the new value back in RAM with the label popLine.

As before, we can just type the name to see the value that has been stored in RAM. The new value in this case is a line of text.
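The sketch below displays such a line of text, with the value typed in directly so that the example stands alone. When R prints a piece of text it surrounds it with double quotes and shows any double quotes inside the text with a backslash in front; writeLines() is used here as an optional extra to show the text exactly as it is stored.

```r
# The line of HTML code, typed in directly for this sketch.
popLine <- "    <div id=\"worldnumber\">6,617,746,521</div>"

# Typing the name displays the value, quoted and with escaped quotes.
popLine

# writeLines() shows the raw text without quoting or escapes.
writeLines(popLine)

# This value is text, not a number.
is.character(popLine)
```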

In many of the code examples throughout this chapter, we will follow this pattern: in one step, calculate a value and store it in RAM, with a label; then, in a second step, type the name of the label to display the value that has been stored.

Now that we have the important line of HTML code, we want to extract just the number, 6,617,746,521, from that line. This task consists of getting rid of the HTML tags. This is a text search-and-replace task and can be performed using the following code:

> popText <- gsub('^.*<div id="worldnumber">|</div>.*$', 
                   "", popLine)
> popText

[1] "6,617,746,521"

This code says “delete the start and end div tags (and any spaces in front of the start tag).” We have used a regular expression, '^.*<div id="worldnumber">|</div>.*$', to specify the part of the text that we want to get rid of, and we have specified "", which means an empty piece of text, as the text to replace it with.

Section 9.9 describes text processing tasks and regular expressions in more detail.
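To see how the regular expression above does its work, it can help to apply its two alternatives (the parts on either side of the | character) one at a time. The sketch below does this with sub(); the names afterStart and afterEnd are hypothetical and used only for this illustration.

```r
# The line of HTML code, typed in directly for this sketch.
popLine <- "    <div id=\"worldnumber\">6,617,746,521</div>"

# First alternative: everything from the start of the line (^)
# up to and including the start div tag.
afterStart <- sub('^.*<div id="worldnumber">', "", popLine)

# Second alternative: the end div tag and everything after it,
# up to the end of the line ($).
afterEnd <- sub('</div>.*$', "", afterStart)

# The combined pattern removes both pieces in a single call.
popText <- gsub('^.*<div id="worldnumber">|</div>.*$', "", popLine)

# Both routes leave the same piece of text.
identical(afterEnd, popText)
```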

At this point, we are close to having what we want, but we are not quite there yet because the value that we have for the world's population is still a piece of text, not a number. This is a very important point. We always need to be aware of exactly what sort of information we are dealing with. As described in Chapter 5, computers represent different sorts of values in different ways, and certain operations are only possible with certain types of data. For example, we ultimately want to be able to perform arithmetic on the population value that we are getting from this web site. That means that we must have a number; it does not make sense to perform arithmetic with text values.

Thus, the final thing we need to do is turn the text of the population estimate into a number so that we can later carry out mathematical operations. This process is called type coercion and appropriate code is shown below.
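A minimal sketch of the coercion, with the text value typed in directly and the result stored under the name pop (a name assumed for this sketch): gsub() removes the commas and as.numeric() converts the remaining digits into a number.

```r
# The population value as text, typed in directly for this sketch.
popText <- "6,617,746,521"

# Coercing the text as it stands fails: the result is a missing value
# (NA), because the commas are not part of a valid number.
is.na(suppressWarnings(as.numeric(popText)))

# Remove the commas first, then coerce the text to a number.
pop <- as.numeric(gsub(",", "", popText))
pop

# Arithmetic is now possible.
pop / 1e9
```

We use as.numeric() rather than as.integer() in this sketch because the value is larger than the biggest whole number that R's integer type can hold (2,147,483,647).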

Notice that we have to process the text still further to remove the commas that are so useful for human viewers but a complete distraction for computers.

And now we have what we were after: the current U.S. Census Bureau estimate of the world's population from the World Population Clock web site.

This first step provides a classic demonstration of the difference between performing a task by hand and writing code to get a computer to do the work. The manual method is simple, requires no new skills, and takes very little time. On the other hand, the computer code approach requires learning new information (it will take substantial chunks of this chapter to explain the code we have used so far), so it is more difficult and takes longer (the first time). However, the computer code approach will pay off in the long run, as we are about to see.