Initial steps in creating a database for the TAAS records

1) Original data

We received data on the mathematics test results from the Texas Assessment of Academic Skills (TAAS) in grades 3 to 8 during the years 1994 to 2000 in the form of files of newline-delimited records with comma-separated fields. There were two different types of files: the data from 1994 to 1999 were stored as one file per grade level, but the data from year 2000 were in a single file that has an additional grade field.

We used two Python scripts to concatenate selected fields from these files into a single data file with tab-separated fields matching the SQL definition

CREATE TABLE rawdata (
    year      INT2,
    grade     INT2,
    district  INTEGER,
    campus    INTEGER,
    dname     VARCHAR(15),
    cname     VARCHAR(15),
    sex       CHAR(1),
    disadvg   CHAR(3),
    ethnicity VARCHAR(8),
    stuid     VARCHAR(9),
    tlimth    INT2   -- mathematics score (Texas Learning Index - Math)
);

While concatenating the records we converted two-digit years to four-digit years. [The scripts are raw_data.py and raw_data_2000.py.]

The resulting file is very large, nearly a gigabyte. We imported these records into the database using PostgreSQL's COPY command. Although the rows could be inserted one at a time, a single COPY of the whole table is much more efficient.

2) Eliminating redundancy

There is considerable redundancy in the raw data table. Each record stores the district number as an integer and the district name as a character string; likewise, both the campus number and the campus name are stored in each record. We can eliminate these redundancies by creating a table of campus numbers and names and a table of district numbers and names, then removing the campus names and district names from the raw data.

We must, however, be careful when creating the campus and district tables. We don't know that the campus names and the district names are recorded consistently. In fact they aren't. Hence we create the campus and district tables in two stages.
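The concatenation step can be sketched in Python. This is a minimal sketch, not the actual raw_data.py or raw_data_2000.py: the year-first column order and the field positions are assumptions for illustration.

```python
import csv

def four_digit_year(yy):
    """Map a two-digit TAAS year to four digits: 94-99 -> 1994-1999, 00 -> 2000."""
    y = int(yy)
    return 1900 + y if y >= 94 else 2000 + y

def convert(infile, outfile, grade=None):
    """Copy comma-separated records to tab-separated fields, widening the year.

    For the 1994-1999 files the grade comes from the file name, so the
    caller supplies it; the 2000 file already carries a grade field in
    each record, so grade is left as None and the field passes through.
    """
    for row in csv.reader(infile):
        fields = [str(four_digit_year(row[0]))]
        if grade is not None:
            fields.append(str(grade))
        fields.extend(row[1:])
        outfile.write("\t".join(fields) + "\n")
```

The resulting tab-separated file is what the single COPY command loads into rawdata in one pass.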
In the first stage we create a table of counts of each distinct combination of district and dname. In the second stage we select the dname that occurs most frequently with the district number.

CREATE TEMPORARY TABLE district_count AS
    SELECT district, dname, count(*) FROM rawdata GROUP BY district, dname;

CREATE TABLE district (
    district INTEGER PRIMARY KEY,
    dname    VARCHAR(15),
    nobs     INTEGER);

INSERT INTO district
    SELECT d.district, d.dname, r.s
    FROM district_count d,
         (SELECT district, max(count) AS m, sum(count) AS s
          FROM district_count GROUP BY district) AS r
    WHERE d.district = r.district AND d.count = r.m;

The last SELECT statement uses a subselect to determine which row contains the maximum count.

[There are only 3 districts with more than one dname. One looks like an inconsistency but the other two are probably coding errors.

taas=# SELECT * FROM district_count WHERE district IN
taas-#     (SELECT district FROM district_count GROUP BY district HAVING count(*) > 1);
 district |      dname      | count
----------+-----------------+-------
     8903 | BRAZOS ISD      |   672
     8903 | WALLIS-ORCHARD  |  1846
    57803 | NORTH HILLS SCH |   614
    57803 | THE NORTH HILLS |   198
    57809 | NOVA            |    39
    57809 | WEST OAK CLIFF  |    16
(6 rows)
]

The campus table is created similarly. Each campus is associated with a district. After checking that the association is unique, we add a district column to the campus table with the requirement that any value in this column must correspond to a value of the primary key in the district table. Such a requirement is called a referential integrity constraint.

CREATE TABLE campus (
    campus   INTEGER PRIMARY KEY,
    cname    VARCHAR(15),
    district INTEGER REFERENCES district ON UPDATE CASCADE,
    nobs     INTEGER);

Having created these two tables we can remove the district, dname, and cname columns from the data and store only the campus for each observation. This reduces the size of the table by more than 300 MB.
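The two-stage logic (count each (district, dname) pair, then keep the name with the maximum count along with the total) can be mirrored in ordinary Python as a cross-check. A sketch, assuming the pairs have been read out of rawdata by some other means:

```python
from collections import Counter

def modal_names(pairs):
    """For each district, pick the dname that occurs most often and report
    the total observation count, mirroring district_count plus the
    max/sum subselect in the INSERT above."""
    counts = Counter(pairs)                     # stage 1: district_count
    totals = Counter(d for d, _ in pairs)       # sum(count) per district
    best = {}
    for (district, name), n in counts.items():  # stage 2: keep the max
        if district not in best or n > best[district][1]:
            best[district] = (name, n)
    return {d: (name, totals[d]) for d, (name, _) in best.items()}
```

For district 8903 in the listing above (672 'BRAZOS ISD' rows, 1846 'WALLIS-ORCHARD' rows) this yields ('WALLIS-ORCHARD', 2518), matching what the INSERT stores.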
If it is important to have records that contain the omitted information we can use a view, a type of virtual table, to make it appear as if all the fields are present in each record.

3) Handling missing data

Often the demographic information, the sex and ethnicity of the student or whether the student is economically disadvantaged, has not been recorded.

taas=# SELECT sex, count(*) FROM rawdata GROUP BY sex;
 sex |  count
-----+---------
     | 1019025
 *   |      86
 F   | 4716084
 M   | 4724846
(4 rows)

We replace all values other than 'F' and 'M' with NULL, the missing value code in SQL.

UPDATE rawdata SET sex = NULL WHERE sex NOT IN ('F', 'M');

Similar replacements are made in the ethnicity and disadvg columns.

4) Extracting information on each student

The original records use a 9-character string, stuid, to identify the student. Because we will want to select records by student and to sort by student, we generate and store integer identifiers, which are easier to manipulate. The number of observations per student can be determined while doing this.

CREATE TABLE student (
    ids       SERIAL PRIMARY KEY,
    stuid     VARCHAR(9) NOT NULL UNIQUE,
    sex       CHAR(1),
    ethnicity CHAR(1),
    nobs      INT2 NOT NULL);

INSERT INTO student(stuid, nobs)
    SELECT stuid, count(*) AS nobs FROM rawdata GROUP BY stuid ORDER BY stuid;

The SERIAL data type in PostgreSQL generates sequential integers when the ids field is not specified in an insertion. Because the data are from years 1994 to 2000, the number of observations per student should be at most 7. However, several stuid's have many more than 7 observations.
taas=# SELECT * FROM student ORDER BY nobs DESC LIMIT 20;
   ids   |   stuid   | sex | ethnicity | nobs
---------+-----------+-----+-----------+-------
       1 |           |     |           | 27663
      15 | 000000000 |     |           |    58
 3584290 | 999999999 |     |           |    32
 3584292 | S         |     |           |    21
  601281 | 111111111 |     |           |    18
 1510555 | 4         |     |           |    16
 2467198 | 759658954 |     |           |    13
 2510187 | 773656634 |     |           |    13
 2506021 | 773534794 |     |           |    13
 2622198 | 795527644 |     |           |    12
 2544888 | 778542174 |     |           |    12
 2727049 | 815533174 |     |           |    12
 2463580 | 759572444 |     |           |    12
  748298 | 144273596 |     |           |    12
 2279793 | 718678374 |     |           |    12
 1075648 | 207544386 |     |           |    12
  349888 | 077292096 |     |           |    12
  872418 | 167410036 |     |           |    12
 2977249 | 871570524 |     |           |    11
 1068136 | 207291976 |     |           |    11
(20 rows)

The first five rows probably represent cases where the student id was unknown and the person recording the data invented a missing value code. The others may be cases where two or more students were accidentally assigned the same stuid code. We must take account of such cases when we generate the sex and ethnicity codes for each student.

We only assign the sex of a student if it has been recorded consistently. A student's ethnicity is self-reported and may change. The most common case is a student who initially identifies herself/himself as 'WHITE' and later as 'HISPANIC'. We create a separate category 'MIXED' for these cases.

Detecting a unique value of sex is done by first selecting all combinations of stuid and sex, then selecting the stuid's for which there is only one such row.

CREATE TEMPORARY TABLE unique_sex AS
    SELECT stuid, min(sex) AS sex
    FROM (SELECT stuid, sex FROM rawdata
          WHERE sex IS NOT NULL GROUP BY stuid, sex) AS m
    GROUP BY stuid HAVING count(*) = 1;

UPDATE student SET sex = u.sex FROM unique_sex u WHERE student.stuid = u.stuid;

DROP TABLE unique_sex;

The ethnicity is generated similarly, with adjustments to record the MIXED ethnicity. We also create indices on these columns to allow faster selection of students by sex, by ethnicity, or by number of observations.
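The same rules can be stated in Python, which may make the intent of the HAVING count(*) = 1 query and of the MIXED category clearer. A sketch over hypothetical (stuid, value) tuples, not the production code:

```python
def consistent_values(observations):
    """Return a value per stuid only when it was recorded consistently,
    mirroring SELECT ... GROUP BY stuid HAVING count(*) = 1."""
    seen = {}
    for stuid, value in observations:
        if value is not None:          # NULLs are ignored, as in the SQL
            seen.setdefault(stuid, set()).add(value)
    return {sid: next(iter(vals)) for sid, vals in seen.items()
            if len(vals) == 1}

def ethnicity_code(values):
    """Collapse a student's reported ethnicities: one consistent value is
    kept, several distinct values become the separate 'MIXED' category."""
    vals = {v for v in values if v is not None}
    if not vals:
        return None
    return vals.pop() if len(vals) == 1 else "MIXED"
```

A student recorded as both 'F' and 'M' is left with a NULL sex, while a student recorded as 'WHITE' and later 'HISPANIC' gets ethnicity 'MIXED'.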
CREATE INDEX student_sex_ind ON student(sex);
CREATE INDEX student_eth_ind ON student(ethnicity);
CREATE INDEX student_nobs_ind ON student(nobs);

At this point it is a good idea to run a PostgreSQL-specific operation called VACUUM ANALYZE to compact the storage used by the table and to collect summary statistics on it. (In PostgreSQL 7.2 the ANALYZE operation has been extended to store more information on the table and to complete faster.)

5) Creation of the results table and the all_data view

Having abstracted the information at the student level, the school (or campus) level, and the district level that had been redundantly stored in each record, we can create a results table that contains only the information relevant to each test score.

CREATE TABLE results AS
    SELECT s.ids, r.year, r.grade, r.campus,
           substr(r.disadvg, 1, 1) AS disadvg, r.tlimth
    FROM rawdata r, student s WHERE r.stuid = s.stuid;

DROP TABLE rawdata;

CREATE INDEX results_year_ind ON results(year);
CREATE INDEX results_grade_ind ON results(grade);
CREATE INDEX results_campus_ind ON results(campus);

Finally we create a view that describes how to associate information in the results table, the student table, the campus table, and the district table.

CREATE VIEW all_data AS
    SELECT r.year, r.grade, r.campus, c.cname, c.district, d.dname,
           r.ids, s.stuid, s.sex, s.ethnicity, r.tlimth,
           s.nobs AS sobs, c.nobs AS cobs, d.nobs AS dobs
    FROM results r, student s, campus c, district d
    WHERE r.ids = s.ids AND r.campus = c.campus AND c.district = d.district;

By selecting from the view we can easily generate records that have all the information in the original records, plus the number of observations on the student, the campus, and the district, yet are stored much more compactly. The rawdata table had 10.4 million rows of about 65 bytes each, for a total size of 700 MB.
Every time we access information in that table we must move that 700 MB of information from disk to memory to the processor, back to memory, back to disk, and so on. The results table also has 10.4 million rows, but each row is only 15 bytes, so this table is less than 1/4 the size of the raw data table. The student table has 3.6 million rows of 17 bytes each, for an additional 60 MB of data. The campus and district tables are of negligible total size because they have only about 6000 and 1200 rows, respectively.

The database tables we have created store the information from the raw data in a much more compact form. Furthermore, the information in these tables is consistent. For example, each campus is named consistently.

6) Flagging questionable data

Some of the data values are questionable. We have seen that the stuid value '' occurs over 27000 times and that 000000000, 111111111, and 999999999 each occur many times. We want to flag these stuids as questionable data. We could create another column in the student table to indicate whether a stuid is considered questionable. An alternative, which saves some storage in both the student table and the results table, is to negate the ids value wherever we think the data are questionable.

If we add a referential integrity constraint on the results table with the "ON UPDATE CASCADE" clause, changes to ids in the student table will be propagated to the ids values in the results table. This is a time-consuming operation on tables of this size, but it only needs to be done once. After that, each row in the results table carries an indicator of whether the data are legitimate.

ALTER TABLE results ADD CONSTRAINT idsfk
    FOREIGN KEY (ids) REFERENCES student(ids) ON UPDATE CASCADE;

UPDATE student SET ids = -ids WHERE sex IS NULL AND nobs > 8;

7) Using subselects

In our initial exploration of these data we concentrated on those students who had attended school in the Dallas Independent School District.
We want all the observations on each of these students, even if those observations did not occur in Dallas. We could create a table of ids values for students with at least one observation in Dallas, then match ids against that table in a second SELECT statement, but PostgreSQL allows us to combine those operations in a subselect.

SELECT year, grade, campus, ids, sex, ethnicity, disadvg, tlimth
FROM all_data
WHERE ids IN (SELECT DISTINCT ids FROM all_data WHERE dname LIKE 'DALLAS ISD%')
ORDER BY ids;
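The two-step alternative that the subselect replaces (first collect the Dallas ids, then keep every record for those students) looks like this in Python; the record keys here are hypothetical, chosen to match the view's column names.

```python
def dallas_students(records):
    """Two-step equivalent of the IN-subselect: collect the ids of students
    with at least one Dallas observation, then keep all records for those
    students, ordered by ids."""
    dallas_ids = {r["ids"] for r in records
                  if r["dname"].startswith("DALLAS ISD")}
    return sorted((r for r in records if r["ids"] in dallas_ids),
                  key=lambda r: r["ids"])
```

Note that a student's non-Dallas observations are kept, just as in the SQL: membership in dallas_ids, not the dname of the individual record, decides whether a record survives.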