The data originated from the Demographic and Health Survey program. A Stata file containing data for 1000 mothers was provided by Deepankar Basu. The data set described here is provided as a space-delimited ASCII text file. It contains demographic information for 1000 Indian mothers, including age, years of education, social status, and whether the mother is employed outside the home. It also records the gender of each of the mother's children and a count of how many of those children are currently alive.
The data set contains the following variables:
age - The mother's age.
edu - The number of years of formal schooling that the mother has received.
alive - How many of the mother's children are still alive.
middle - Whether the mother is middle-class.
poor - Whether the mother is poor.
work - Whether the mother works for pay outside the home.
cord1 to cord14 - The gender of the mother's first child, second child, and so on, up to the fourteenth child.
The data set is provided as a space-delimited ASCII text file called india.txt.
The file indianMothersMeta.xml provides a StatDataML description of the data set.
The data were first obtained in the form of a Stata binary .dta file.
The .dta format is described in detail online.
What are the advantages and disadvantages of having the data stored in this binary format?
A:
The primary disadvantage of a binary format is that it usually requires a specific piece of software in order to access the data in the file.
How severe this problem is depends on whether the format is proprietary and undocumented or open and publicly documented.
In this case, the .dta format is publicly documented, so it is not necessary to have Stata to access the file. For example, the foreign package in R has a function read.dta() that can read .dta files.
Nevertheless, a binary format can only be used if appropriate software is available, so it is less easy to share with others than, for example, a plain text format.
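The dependence on knowing the layout can be illustrated with a small sketch. The record layout used here (three 16-bit integers for age, edu, and alive) is a made-up assumption for illustration only; it is not the .dta format. The same bytes decode to different numbers when the reader guesses the layout wrongly, whereas the plain text version of the record can be read directly.

```python
import struct

# Hypothetical record layout (an assumption, NOT the .dta specification):
# three little-endian unsigned 16-bit integers for age, edu, and alive.
RECORD_FORMAT = "<HHH"


def encode_record(age, edu, alive):
    """Pack one record into 6 bytes of binary data."""
    return struct.pack(RECORD_FORMAT, age, edu, alive)


def decode_record(raw, fmt=RECORD_FORMAT):
    """Unpack bytes using the given layout, which must be known in advance."""
    return struct.unpack(fmt, raw)


raw = encode_record(30, 8, 3)

right = decode_record(raw)          # reader knows the layout
wrong = decode_record(raw, ">HHH")  # reader assumes the wrong byte order
text = "30 8 3"                     # the same record as plain text

print(raw)    # opaque bytes
print(right)  # values recovered correctly
print(wrong)  # different values from the very same bytes
print(text)   # readable in any text editor
```

Nothing in the bytes themselves signals that the second decoding is wrong, which is why documentation of a binary format (or software that embodies it) is essential.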
The data are provided in this exercise in a plain text, space-delimited format.
What are the advantages and disadvantages of having the data stored in this plain text format?
A:
First of all, a plain text format is easy to share because almost any software can read plain text files.
The main problem with this sort of plain text format is that the file itself does not contain information about its own structure. For example, the file itself does not say that it is space-delimited.
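As a minimal sketch of this point, the following code splits one made-up line of values (not taken from the real file) using two different guesses about the delimiter. The function name split_line is hypothetical. The file gives no indication of which guess is correct; that knowledge has to come from outside the file.

```python
def split_line(line, delimiter=None):
    """Split one line of text into fields.

    delimiter=None means "any run of whitespace", which suits a
    space-delimited file; any other string is used as the separator.
    """
    return line.strip().split(delimiter)


# A made-up line in the style of a space-delimited data file.
line = "31 10 2 0 1 0"

as_space = split_line(line)       # reader assumes space-delimited
as_comma = split_line(line, ",")  # reader assumes comma-delimited

print(as_space)  # six separate fields
print(as_comma)  # one field containing the whole line
```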
Furthermore, the file does not contain any information about the type of values in the data set. For example, although there are lots of numbers in the file, it is not clear what the values represent.
An extension of this problem is the fact that there is no metadata in the file. For example, gender values are stored as either 1 or 2, but the file itself does not contain information to say which value represents "male" and which represents "female". Also, the meaning of the "NA" values is not stated anywhere. Even several of the variable names, such as cord1, are quite unhelpful.
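The sketch below shows how much a reader has to supply from outside the file in order to interpret a single row. Everything in it is an assumption made for illustration: the shortened column list, the mapping of 1 and 2 to "male" and "female" (which could just as well be reversed), the treatment of "NA" as a missing value, and the sample row itself.

```python
# ASSUMPTION: a shortened, hypothetical column order; the real file has
# more columns and does not state their order.
COLUMNS = ["age", "edu", "alive", "cord1", "cord2"]

# ASSUMPTION: the file does not say which code means which gender;
# this mapping could be the wrong way round.
GENDER_LABELS = {1: "male", 2: "female"}


def parse_value(token):
    """Convert a text token to an int, treating "NA" as missing (None)."""
    return None if token == "NA" else int(token)


def parse_row(line):
    """Turn one space-delimited line into a dict keyed by column name."""
    values = [parse_value(token) for token in line.split()]
    return dict(zip(COLUMNS, values))


def label_gender(code):
    """Translate a gender code to a label; missing codes stay missing."""
    return None if code is None else GENDER_LABELS[code]


row = parse_row("28 5 1 2 NA")  # made-up row, not from the real file
print(row)
print(label_gender(row["cord1"]), label_gender(row["cord2"]))
```

Each of the constants and conversion rules above is a piece of metadata that a self-describing format would carry alongside the data.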