Manipulating the Crohn's Disease Data Set with R

Paul Murrell


Table of Contents

Introduction
Variables
Data format
Overall Goal
Exercises

Introduction

These data are genetic sequences for 387 individuals from a study of Crohn's disease, which is an inflammatory bowel disease (Daly et al., 2001a, Nature Genetics, 29, 223-228). The data are in a text-based LINKAGE format.

Variables

The data set contains the following variables:

pedigree - The pedigree of the individual (which genetic family tree the individual belongs to).
ID - A unique identifier for the individual.
parentA - The identifier for one parent (zero means parent not in data set).
parentB - The identifier for the other parent (zero means parent not in data set).
gender - The individual's gender.
status - Whether the individual has Crohn's disease.
marker1A to marker103B - The genotype for the individual at 103 different locations. marker1A gives the individual's first allele at locus 1 and marker1B gives the individual's second allele at locus 1 and so on.
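
The naming scheme above can be generated rather than typed out. The following is a minimal sketch, assuming the six descriptive variables come first and are followed by the 103 pairs of marker columns in the order listed; the object names marker_names and col_names are illustrative and are not used by the exercises, which keep R's default V1, V2, ... names.

```r
# Two allele columns (A and B) for each of the 103 loci.
marker_names <- paste0("marker", rep(1:103, each = 2), c("A", "B"))

# Six descriptive variables followed by the marker columns.
col_names <- c("pedigree", "ID", "parentA", "parentB", "gender", "status",
               marker_names)

length(col_names)
head(col_names, 10)
```

If names like these are wanted on the data frame, they could be assigned with names(crohns) <- col_names once the data have been read in.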

Data format

The data set is provided as a tab-delimited ASCII text file called Dalydata.txt. The format is designed for use with the LINKAGE software.

The file CrohnsMeta.xml provides a StatDataML description of the data set.

Overall Goal

The overall goal of this set of exercises is to convert the data set from the original LINKAGE format to a format that is appropriate for use with the PHASE software package.

In the PHASE format, the information for each individual is stored on three lines. The first line gives the individual's unique identifier, the second line gives the first allele at each locus, and the third line gives the second allele at each locus. Instead of alleles being in pairs of columns, they are in pairs of rows. Furthermore, any zeroes in the original allele information, which indicate missing values, must be encoded as question marks (e.g., individual 412 has missing values at the 18th locus).

The desired end result is a new text file with the original data in the new format. We will tackle this overall task as a series of smaller exercises. This is, of course, just one way that this task could be performed using R.

Exercises

Reading the data into R
Extracting individual IDs
Extracting individual alleles
Replacing zeroes with question marks
Pasting rows together
Combining lines of output
Writing the new file
Q: Reading the data into R

The first step is to read the original data set into R. The result should look like this (except with many more columns):

head(crohns[, 1:20])
        
      V1  V2  V3  V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 PED054 430   0   0  1  0  1  3  3   1   4   1   4   2   2   1   3   1   2   4
2 PED054 412 430 431  2  2  1  3  1   3   4   1   4   2   2   1   3   1   4   2
3 PED054 431   0   0  2  0  3  3  3   3   1   1   2   2   1   1   1   1   2   2
4 PED058 438   0   0  1  0  3  3  3   3   1   1   2   2   1   1   1   1   2   2
5 PED058 470 438 444  2  2  3  3  3   3   1   1   2   2   1   1   1   1   2   2
6 PED058 444   0   0  2  0  3  3  3   3   1   1   2   2   1   1   1   1   2   2
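
One possible way to do the reading is with read.table(). The sketch below is self-contained: it writes three rows and twelve columns copied from the listing above to a temporary file that stands in for Dalydata.txt, and it assumes the file is tab-delimited with no header row, as described under Data format. The names sample_rows and sample_file are illustrative only.

```r
# Stand-in for Dalydata.txt: three individuals and the first 12 columns
# copied from the listing above, written to a temporary tab-delimited file.
sample_rows <- c(
  "PED054\t430\t0\t0\t1\t0\t1\t3\t3\t1\t4\t1",
  "PED054\t412\t430\t431\t2\t2\t1\t3\t1\t3\t4\t1",
  "PED054\t431\t0\t0\t2\t0\t3\t3\t3\t3\t1\t1"
)
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_rows, sample_file)

# There is no header row, so read.table() names the columns V1, V2, ...
crohns <- read.table(sample_file, header = FALSE, sep = "\t")
head(crohns)
```

With the real data, sample_file would be replaced by "Dalydata.txt". Leaving sep at its default, which splits on any white space, is an alternative if the tabs in the file turn out to be irregular.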
Q: Extracting individual IDs

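The next step is to extract the unique identifier for each individual from the complete data set. This is just a matter of extracting the second column from the overall data frame. The column is numeric in the data frame, whereas the result shown below is a character vector, so the values need converting to text (for example, with as.character()). The result should look like this: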

head(ids)
        
[1] "430" "412" "431" "438" "470" "444"
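
A minimal sketch of this step follows. It rebuilds a three-row miniature of crohns from the values shown earlier so that it runs on its own, and it assumes the IDs are wanted as text, as the quoted output above suggests.

```r
# Miniature of the crohns data frame: three individuals, first 8 columns.
crohns <- read.table(header = FALSE, text = "
PED054 430   0   0 1 0 1 3
PED054 412 430 431 2 2 1 3
PED054 431   0   0 2 0 3 3
")

# The second column holds the IDs; convert the numbers to text.
ids <- as.character(crohns[[2]])
ids
```

crohns$V2 or crohns[, 2] would select the same column; [[2]] is used here only because it does not depend on the column name.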
Q: Extracting individual alleles

In this step, we will extract the first allele at each locus for each individual. Again, this is just a subsetting task, though it is a little more complex than the previous one. The result this time is a data frame (just columns 7, 9, 11, ... from the original data frame). The result should look like this:

head(allele1[, 1:10])
        
  V7 V9 V11 V13 V15 V17 V19 V21 V23 V25
1  1  3   4   4   2   3   2   3   3   4
2  1  1   4   4   2   3   4   3   3   2
3  3  3   1   2   1   1   2   2   3   4
4  3  3   1   2   1   1   2   2   3   4
5  3  3   1   2   1   1   2   2   3   2
6  3  3   1   2   1   1   2   2   3   4
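
The column selection can be sketched with seq(), which generates every second column number starting at 7. The example below uses a three-row, twelve-column miniature of crohns copied from the values shown earlier; first_cols and second_cols are illustrative names, and allele2 is included on the assumption that the second alleles (columns 8, 10, 12, ...) will be needed in the same way later on.

```r
# Miniature of the crohns data frame: three individuals, three loci.
crohns <- read.table(header = FALSE, text = "
PED054 430   0   0 1 0 1 3 3 1 4 1
PED054 412 430 431 2 2 1 3 1 3 4 1
PED054 431   0   0 2 0 3 3 3 3 1 1
")

# First alleles are in columns 7, 9, 11, ...; second alleles in 8, 10, 12, ...
first_cols <- seq(7, ncol(crohns), by = 2)
second_cols <- seq(8, ncol(crohns), by = 2)

allele1 <- crohns[, first_cols]
allele2 <- crohns[, second_cols]
allele1
```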
Q: Replacing zeroes with question marks

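This step replaces any zeroes in the alleles with question marks. Because a question mark is text, the result is a character matrix rather than a data frame of numbers. The result should look like this: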

head(allele1text[, 1:10])
        
     V7  V9  V11 V13 V15 V17 V19 V21 V23 V25
[1,] "1" "3" "4" "4" "2" "3" "2" "3" "3" "4"
[2,] "1" "1" "4" "4" "2" "3" "4" "3" "3" "2"
[3,] "3" "3" "1" "2" "1" "1" "2" "2" "3" "4"
[4,] "3" "3" "1" "2" "1" "1" "2" "2" "3" "4"
[5,] "3" "3" "1" "2" "1" "1" "2" "2" "3" "2"
[6,] "3" "3" "1" "2" "1" "1" "2" "2" "3" "4"
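
One way to do the replacement is to convert the data frame to a character matrix and then overwrite the cells that equal "0". The sketch below uses an invented three-row miniature of allele1 that contains two zeroes; those values are for illustration only and are not taken from the real data.

```r
# Invented miniature of allele1 with two zeroes (missing values).
allele1 <- data.frame(V7  = c(1L, 1L, 3L),
                      V9  = c(3L, 0L, 3L),
                      V11 = c(4L, 4L, 0L))

# Convert to a matrix and store the values as text; the dimensions and
# column names are kept.
allele1text <- as.matrix(allele1)
storage.mode(allele1text) <- "character"

# Replace every cell that is exactly "0" with a question mark.
allele1text[allele1text == "0"] <- "?"
allele1text
```

Converting to character before the replacement means the result has the same type whether or not any zeroes are present.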
Q: Pasting rows together

Now that we have the data values that we require for the first allele for each individual, we need to turn each row of values into a line of text. This requires pasting the values on each row together. The paste() function can do the pasting and the apply() function can be used to do the work on each row. The result should look like this:

head(strtrim(allele1lines, 60))
        
[1] "1 3 4 4 2 3 2 3 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 1 2 4 3 3 4 2 "
[2] "1 1 4 4 2 3 4 3 3 2 2 2 1 1 2 2 3 ? 1 3 1 2 3 1 2 4 3 2 4 1 "
[3] "3 3 1 2 1 1 2 2 3 4 4 ? 2 3 2 2 3 3 1 3 1 2 3 1 2 3 2 3 2 2 "
[4] "3 3 1 2 1 1 2 2 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 1 2 4 3 3 4 2 "
[5] "3 3 1 2 1 1 2 2 3 2 2 2 1 1 2 2 3 2 1 3 1 2 3 1 2 3 2 2 4 1 "
[6] "3 3 1 2 1 1 2 2 3 4 4 ? 2 3 4 ? 3 3 1 3 1 4 3 3 2 4 3 3 2 2 "
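
The combination of apply() and paste() described above can be sketched as follows. The two-row character matrix is an invented miniature of allele1text, and a single space is assumed as the separator, which is what the output above suggests.

```r
# Invented miniature of allele1text: two individuals, three loci.
allele1text <- matrix(c("1", "3", "4",
                        "1", "?", "4"),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, c("V7", "V9", "V11")))

# For each row (MARGIN = 1), paste the values into one string.
allele1lines <- apply(allele1text, 1, paste, collapse = " ")
allele1lines
```

The collapse argument is passed through apply() to paste(), so each row becomes a single string rather than a vector of separate values.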
Q: Combining lines of output

Once the previous three steps have been repeated for the second allele (columns 8, 10, 12, ... of the original data frame), we have all of the information that we need for the PHASE format, but we still need to combine the lines in the right order. One way to do this is to create an empty character vector with three elements per individual and then assign the IDs to elements 1, 4, 7, ..., the first-allele lines to elements 2, 5, 8, ..., and the second-allele lines to elements 3, 6, 9, .... The result should look like this:

head(strtrim(crohnsLines, 60))
        
[1] "430"                                                         
[2] "1 3 4 4 2 3 2 3 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 1 2 4 3 3 4 2 "
[3] "3 1 1 2 1 1 4 2 3 2 2 1 1 1 2 2 3 2 1 3 1 2 3 1 2 3 2 2 2 1 "
[4] "412"                                                         
[5] "1 1 4 4 2 3 4 3 3 2 2 2 1 1 2 2 3 ? 1 3 1 2 3 1 2 4 3 2 4 1 "
[6] "3 3 1 2 1 1 2 2 3 4 4 1 2 3 2 2 3 ? 1 3 1 2 3 1 2 3 2 3 2 2 "
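
A minimal sketch of the interleaving follows, using two individuals and three loci taken from the start of the output shown earlier. It assumes the second-allele lines have been built in the same way as the first-allele lines; allele2lines and n are illustrative names.

```r
# Miniature inputs: IDs and one line of text per individual per allele.
ids <- c("430", "412")
allele1lines <- c("1 3 4", "1 1 4")
allele2lines <- c("3 1 1", "3 3 1")

# Empty character vector with three elements per individual.
n <- length(ids)
crohnsLines <- character(3 * n)

# Fill every third element, starting at positions 1, 2 and 3.
crohnsLines[seq(1, by = 3, length.out = n)] <- ids
crohnsLines[seq(2, by = 3, length.out = n)] <- allele1lines
crohnsLines[seq(3, by = 3, length.out = n)] <- allele2lines
crohnsLines
```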
Q: Writing the new file

The final step is to write the lines that we have generated to a new file, using the writeLines() function.
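
As a sketch, the call might look like the following. The output file name crohns-phase.txt is hypothetical, and the file is written to a temporary directory here so that the example can be run without leaving anything behind.

```r
# Miniature of crohnsLines: three lines for each of two individuals.
crohnsLines <- c("430", "1 3 4", "3 1 1",
                 "412", "1 1 4", "3 3 1")

# Hypothetical output file name, placed in R's temporary directory.
out_file <- file.path(tempdir(), "crohns-phase.txt")

# Each element of the character vector becomes one line of the file.
writeLines(crohnsLines, out_file)

# Read the file back to check what was written.
readLines(out_file)
```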