Storage Format Options for the Indian Mothers Data Set

Introduction

The data originated from the Demographic and Health Survey program. A Stata file containing data for 1000 mothers was provided by Deepankar Basu. The data set described here is provided as a space-delimited ASCII text file. It contains demographic information for 1000 Indian mothers, including age, years of education, social status, and whether the mother is employed outside the home. There is also information about the gender of each of the mother's children and a count of how many of those children are currently alive.

Variables

The data set contains the following variables:

age - The mother's age.

edu - The number of years of formal schooling that the mother has received.

alive - How many of the mother's children are still alive.

middle - Whether the mother is middle-class.

poor - Whether the mother is poor.

work - Whether the mother works for pay outside the home.

cord1 to cord14 - The gender of the mother's first child, second child, etc., up to the fourteenth child.

Data format

The data set is provided as a space-delimited ASCII text file called india.txt.

The file indianMothersMeta.xml provides a StatDataML description of the data set.
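
As a minimal sketch of what reading such a file involves, the Python below parses a small space-delimited sample. The sample rows, the presence of a header line, and the helper name parse_space_delimited are assumptions made for illustration; they are not taken from india.txt.

```python
def parse_space_delimited(text, missing="NA"):
    """Parse whitespace-delimited text whose first line holds column names.

    Fields equal to `missing` become None; all other fields are kept as
    strings, because the file itself does not declare any types.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(
                f"expected {len(header)} fields, got {len(fields)}"
            )
        rows.append({
            name: (None if value == missing else value)
            for name, value in zip(header, fields)
        })
    return rows


# Hypothetical sample, invented for illustration only.
sample = """age edu alive cord1 cord2
30 5 2 1 2
24 0 1 2 NA
"""

rows = parse_space_delimited(sample)
print(rows[1]["cord2"])  # the "NA" field was treated as missing
```

Keeping every field as a string is deliberate here: deciding that a column holds integers or category codes requires knowledge that lives outside the text file.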

Q: The original storage format

The data were first obtained in the form of a Stata binary .dta file.

The .dta format is described in detail online.

What are the advantages and disadvantages of having the data stored in this binary format?

A:

The primary disadvantage of a binary format is that it usually requires a specific piece of software in order to access the data in the file.

The severity of this problem is dependent on whether the format is proprietary and hidden away or open and publicly documented.

In this case, the .dta format is publicly documented, so it is not necessary to have Stata to access the file. For example, the foreign package in R has a function read.dta() that can read .dta files.

Nevertheless, a binary format can only be used if appropriate software is available, so it is less easy to share with others than, for example, a plain text format.
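
The dependence on software that knows the format can be illustrated with a small sketch. The Python below packs three values into bytes using an arbitrary layout chosen for this example (it is not the .dta layout, and the record is invented) and shows that the bytes decode correctly only for a reader that knows that layout, whereas the text form can be read directly.

```python
import struct

# Hypothetical record: age, years of education, children alive.
record = (30, 5, 2)

# Arbitrary layout chosen for this sketch: three little-endian 16-bit integers.
LAYOUT = "<3h"

binary_form = struct.pack(LAYOUT, *record)
text_form = " ".join(str(value) for value in record)

print(binary_form)  # raw bytes, meaningless without knowing LAYOUT
print(text_form)    # 30 5 2

# Decoding recovers the record only because the reader knows the layout.
decoded = struct.unpack(LAYOUT, binary_form)

# The same bytes read under a wrong assumption (six unsigned bytes)
# yield a different set of numbers.
misread = struct.unpack("<6B", binary_form)
print(decoded, misread)
```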

Q: The storage format provided

The data are provided in this exercise in a plain text, space-delimited format.

What are the advantages and disadvantages of having the data stored in this plain text format?

A:

First of all, a plain text format is easy to share because almost any software can read plain text files.

The main problem with this sort of plain text format is that the file itself does not contain information about its own structure. For example, the file itself does not say that it is space-delimited.

Furthermore, the file does not contain any information about the type of values in the data set. For example, although the file consists mostly of numbers, nothing in it says whether a given number is a count, a measurement, or a code standing for a category.

An extension of this problem is the fact that there is no metadata in the file. For example, gender values are stored as either 1 or 2, but the file itself does not contain information to say which value represents "male" and which represents "female". Also, the meaning of the "NA" values is not stated anywhere. Even several of the variable names, such as cord1, are quite unhelpful.
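
One way to picture the missing information is as a separate description that travels with the file. The Python sketch below is hypothetical throughout: the dictionary keys, the type choices, the sample rows, and in particular the placeholder labels attached to the codes 1 and 2 are assumptions, since the text file does not say which code means which.

```python
# Hypothetical metadata for a space-delimited file. None of this can be
# recovered from the text file itself; it has to be supplied separately.
metadata = {
    "delimiter": " ",
    "missing": "NA",
    "types": {"age": int, "edu": int, "alive": int, "cord1": int},
    # Placeholder labels: the file does not say which code is which gender.
    "labels": {"cord1": {1: "label for code 1", 2: "label for code 2"}},
}


def interpret(line, names, meta):
    """Convert one raw line into typed, labelled values using the metadata."""
    result = {}
    for name, field in zip(names, line.split(meta["delimiter"])):
        if field == meta["missing"]:
            result[name] = None
            continue
        value = meta["types"][name](field)
        # Replace a code with its label when one is defined for this column.
        result[name] = meta["labels"].get(name, {}).get(value, value)
    return result


# Hypothetical rows, invented for illustration only.
names = ["age", "edu", "alive", "cord1"]
row = interpret("30 5 2 1", names, metadata)
row_with_missing = interpret("24 0 1 NA", names, metadata)
print(row)
print(row_with_missing)
```

Without the metadata dictionary, the same lines are just strings of digits; every conversion in interpret() relies on information the plain text file does not carry.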

Paul Murrell