Manipulating the Indian Mothers Data Set with R

Paul Murrell


Table of Contents

Introduction
Variables
Data format
Exercises

Introduction

The data originate from the Demographic and Health Surveys (DHS) program. Deepankar Basu provided an SPSS file containing data for 1000 mothers; the data set described here is supplied as a space-delimited ASCII text file. It contains demographic information for 1000 Indian mothers: age, years of education, social status, and whether the mother has employment outside the home. It also records the gender of each of the mother's children and a count of how many of those children are currently alive.

Variables

The data set contains the following variables:

age - The mother's age.
edu - The number of years of formal schooling that the mother has received.
alive - How many of the mother's children are still alive.
middle - Whether the mother is middle-class (coded 0 or 1).
poor - Whether the mother is poor (coded 0 or 1).
work - Whether the mother works for pay outside the home (coded 0 or 1).
cord1 to cord14 - The gender of the mother's first child, second child, etc., up to the fourteenth child (1 for a boy, 2 for a girl; missing if the mother has no such child).

Data format

The data set is provided as a space-delimited ASCII text file called india.txt.

The file indianMothersMeta.xml provides a StatDataML description of the data set.

Exercises

Reading the data into R
Reshaping the data
Recoding the gender variable
Tables of counts

Q: Reading the data into R

The first task is to read the data set into R. The result should look like this:

  cord1 cord2 cord3 cord4 cord5 cord6 cord7 cord8 cord9 cord10 cord11 cord12 cord13 cord14 age edu alive middle poor work
1     2     1     1    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  30   0     3      0    1    1
2     2    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  32   0     1      0    1    0
3     2     1     1     1     2    NA    NA    NA    NA     NA     NA     NA     NA     NA  28   0     5      0    1    1
4     2     1     1     2    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  39   0     4      1    0    0
5     1     1    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  20   0     2      1    0    0
6     2     1     1    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  25   0     3      1    0    1
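
One possible way to do this is with read.table(). The sketch below assumes that india.txt begins with a header row of column names and that missing values are written as NA; neither detail is stated above, so the arguments may need adjusting for the real file. So that the example runs on its own, it writes the six rows shown above to a temporary file (sampleFile is a stand-in name used only for this illustration).

```r
# Stand-in for india.txt: the six rows shown above, written to a temporary
# file so that this example does not depend on the real data file.
sampleLines <- c(
    "cord1 cord2 cord3 cord4 cord5 cord6 cord7 cord8 cord9 cord10 cord11 cord12 cord13 cord14 age edu alive middle poor work",
    "2 1 1 NA NA NA NA NA NA NA NA NA NA NA 30 0 3 0 1 1",
    "2 NA NA NA NA NA NA NA NA NA NA NA NA NA 32 0 1 0 1 0",
    "2 1 1 1 2 NA NA NA NA NA NA NA NA NA 28 0 5 0 1 1",
    "2 1 1 2 NA NA NA NA NA NA NA NA NA NA 39 0 4 1 0 0",
    "1 1 NA NA NA NA NA NA NA NA NA NA NA NA 20 0 2 1 0 0",
    "2 1 1 NA NA NA NA NA NA NA NA NA NA NA 25 0 3 1 0 1"
)
sampleFile <- tempfile(fileext = ".txt")
writeLines(sampleLines, sampleFile)

# Whitespace is the default field separator and "NA" the default
# missing-value string, so only header = TRUE needs to be given.
# With the real file this would be: read.table("india.txt", header = TRUE)
india <- read.table(sampleFile, header = TRUE)
head(india)
```

Columns that are entirely missing in this small sample (cord6 to cord14) are read as logical NA columns; with the full file, any cord column that contains at least one gender code is read as integer.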

Q: Reshaping the data

The original format, with one column (variable) for each child, is inefficient: most mothers have nowhere near 14 children, so the cord columns contain large swathes of missing values. The table below shows the number of mothers with 1, 2, 3, etc. children:

  1   2   3   4   5   6 
 61 447 296 125  52  19 

So, in fact, there is no need (in this particular set of 1000 Indian mothers) for any column beyond cord6.
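
A table like the one above could be produced by counting the non-missing cord values in each row. The following is a minimal sketch rather than the original solution; it uses a toy stand-in for the india data frame built from the six mothers shown earlier, and the names children, cords, nChildren and childCounts are introduced here only for illustration.

```r
# Toy stand-in for the india data frame: the six mothers shown earlier.
children <- list(c(2, 1, 1), 2, c(2, 1, 1, 1, 2),
                 c(2, 1, 1, 2), c(1, 1), c(2, 1, 1))
# Pad each mother's children with NA up to 14 columns.
cords <- t(sapply(children, function(x) c(x, rep(NA, 14 - length(x)))))
colnames(cords) <- paste0("cord", 1:14)
india <- data.frame(cords,
                    age = c(30, 32, 28, 39, 20, 25), edu = 0,
                    alive = c(3, 1, 5, 4, 2, 3),
                    middle = c(0, 0, 0, 1, 1, 1),
                    poor = c(1, 1, 1, 0, 0, 0),
                    work = c(1, 0, 1, 0, 0, 1))

# Number of children per mother = number of non-missing cord values.
cordCols <- grep("^cord", names(india), value = TRUE)
nChildren <- rowSums(!is.na(india[cordCols]))

# How many mothers have 1, 2, 3, ... children.
childCounts <- table(nChildren)
childCounts
```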

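A more efficient approach is to have just one row for each child, with an id column identifying the mother, a born column recording the birth order, and a value column holding the gender code. We want to produce a data frame (referred to as indiaLong in the later exercises) that looks like this: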

  age edu alive middle poor work id  born value
1  30   0     3      0    1    1  1 cord1     2
2  32   0     1      0    1    0  2 cord1     2
3  28   0     5      0    1    1  3 cord1     2
4  39   0     4      1    0    0  4 cord1     2
5  20   0     2      1    0    0  5 cord1     1
6  25   0     3      1    0    1  6 cord1     2
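
The reshaping can be done in several ways; the sketch below uses only base R and is not necessarily how the output above was generated. It repeats the demographic columns once per cord column, stacks the cord values into a single column, and then drops rows with a missing value so that one row per child remains (dropping the missing rows is an assumption about the intended result). The toy india data frame and the names cordCols, otherCols and n are introduced here for illustration.

```r
# Toy stand-in for the india data frame: the six mothers shown earlier.
children <- list(c(2, 1, 1), 2, c(2, 1, 1, 1, 2),
                 c(2, 1, 1, 2), c(1, 1), c(2, 1, 1))
cords <- t(sapply(children, function(x) c(x, rep(NA, 14 - length(x)))))
colnames(cords) <- paste0("cord", 1:14)
india <- data.frame(cords,
                    age = c(30, 32, 28, 39, 20, 25), edu = 0,
                    alive = c(3, 1, 5, 4, 2, 3),
                    middle = c(0, 0, 0, 1, 1, 1),
                    poor = c(1, 1, 1, 0, 0, 0),
                    work = c(1, 0, 1, 0, 0, 1))

cordCols <- grep("^cord", names(india), value = TRUE)
otherCols <- setdiff(names(india), cordCols)
n <- nrow(india)

indiaLong <- data.frame(
    # Demographic columns, repeated once for every cord column.
    india[rep(seq_len(n), times = length(cordCols)), otherCols],
    # Mother identifier (row number in the wide data frame).
    id = rep(seq_len(n), times = length(cordCols)),
    # Birth order, kept as a factor with all 14 levels.
    born = factor(rep(cordCols, each = n), levels = cordCols),
    # Gender codes, stacked column by column to match 'born'.
    value = unlist(india[cordCols], use.names = FALSE)
)

# Keep one row per actual child and renumber the rows.
indiaLong <- indiaLong[!is.na(indiaLong$value), ]
rownames(indiaLong) <- NULL
head(indiaLong)
```

Keeping born as a factor with all 14 levels means that later tables still show the (empty) rows for cord7 to cord14.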

Q: Recoding the gender variable

The gender variable in the data set (called "value" in the reshaped format) is just a numeric value: 1 or 2. This would be much better as a factor with meaningful level labels.

Create a new variable called "gender" for the indiaLong data frame, which is a factor with levels "boy" (value equals 1) and "girl" (value equals 2).

The result should look like this:

  age edu alive middle poor work id  born value gender
1  30   0     3      0    1    1  1 cord1     2   girl
2  32   0     1      0    1    0  2 cord1     2   girl
3  28   0     5      0    1    1  3 cord1     2   girl
4  39   0     4      1    0    0  4 cord1     2   girl
5  20   0     2      1    0    0  5 cord1     1    boy
6  25   0     3      1    0    1  6 cord1     2   girl
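
One way to do the recoding is with factor(), giving the numeric codes as levels and the names as labels. The sketch below uses a toy stand-in for indiaLong that contains only the id, born and value columns of the six rows shown above; the real data frame would also carry the demographic columns.

```r
# Toy stand-in for indiaLong: the six rows of reshaped output shown above
# (demographic columns omitted for brevity).
indiaLong <- data.frame(
    id = 1:6,
    born = factor(rep("cord1", 6), levels = paste0("cord", 1:14)),
    value = c(2, 2, 2, 2, 1, 2)
)

# Map code 1 to "boy" and code 2 to "girl"; any other code would become NA.
indiaLong$gender <- factor(indiaLong$value,
                           levels = c(1, 2),
                           labels = c("boy", "girl"))
indiaLong
```

Listing the levels explicitly fixes the order of the labels (boy first) regardless of which code happens to appear first in the data.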

Q: Tables of counts

This exercise generates some simple counts to look at the number of girls versus the number of boys, under various conditions.

How many boys/girls overall?

 boy girl 
1421 1296 

How many boys/girls cross-classified by birth order?

        gender
born     boy girl
  cord1  527  473
  cord2  479  460
  cord3  261  231
  cord4  111   85
  cord5   33   38
  cord6   10    9
  cord7    0    0
  cord8    0    0
  cord9    0    0
  cord10   0    0
  cord11   0    0
  cord12   0    0
  cord13   0    0
  cord14   0    0

How many boys/girls cross-classified by social status?

            gender  boy girl
middle poor                 
0      0              3    1
       1            410  370
1      0           1008  925
       1              0    0
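
Tables with the same layout as the three above can be produced with table(), xtabs() and ftable(). The following is a sketch, not necessarily the original solution; it runs on a toy indiaLong built from the 18 children of the six mothers shown earlier, so its counts differ from the full-data counts above. The names genderCounts, byBirthOrder, statusCounts and byStatus are introduced here for illustration.

```r
# Toy stand-in for indiaLong: the 18 children of the six mothers shown
# earlier, listed mother by mother.
indiaLong <- data.frame(
    middle = rep(c(0, 0, 0, 1, 1, 1), times = c(3, 1, 5, 4, 2, 3)),
    poor   = rep(c(1, 1, 1, 0, 0, 0), times = c(3, 1, 5, 4, 2, 3)),
    born   = factor(paste0("cord", c(1:3, 1, 1:5, 1:4, 1:2, 1:3)),
                    levels = paste0("cord", 1:14)),
    value  = c(2, 1, 1,  2,  2, 1, 1, 1, 2,  2, 1, 1, 2,  1, 1,  2, 1, 1)
)
indiaLong$gender <- factor(indiaLong$value, levels = c(1, 2),
                           labels = c("boy", "girl"))

# Boys and girls overall.
genderCounts <- table(indiaLong$gender)
genderCounts

# Boys and girls by birth order (unused cord levels appear as zero rows).
byBirthOrder <- xtabs(~ born + gender, data = indiaLong)
byBirthOrder

# Boys and girls by social status, flattened into a two-dimensional layout
# with middle and poor as row variables and gender as the column variable.
statusCounts <- xtabs(~ middle + poor + gender, data = indiaLong)
byStatus <- ftable(statusCounts)
byStatus
```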