Initial steps in creating a database for the TAAS records

1) Original data

We received data on the mathematics test results from the Texas Assessment of Academic Skills (TAAS) in grades 3 to 8 during the years 1994 to 2000 in the form of files of newline-delimited records with comma-separated fields. There were two different types of files: the data from 1994 to 1999 were stored as one file per grade level, but the data from year 2000 were in a single file that has an additional grade field.

We used two Python scripts to concatenate selected fields from these files into a single data file with tab-separated fields matching the SQL definition

CREATE TABLE rawdata (
    year      INT2,
    grade     INT2,
    district  INTEGER,
    campus    INTEGER,
    dname     VARCHAR(15),
    cname     VARCHAR(15),
    sex       CHAR(1),
    disadvg   CHAR(3),
    ethnicity VARCHAR(8),
    stuid     VARCHAR(9),
    tlimth    INT2   -- mathematics score (Texas Learning Index - Math)
);

While concatenating the records we converted two-digit years to four-digit years. [The scripts are raw_data.py and raw_data_2000.py.]

The resulting file is very large, nearly a gigabyte. We imported these records into the database using PostgreSQL's COPY command. Although the rows could be inserted one at a time, a single COPY of the whole table is much more efficient.

2) Eliminating redundancy

There is considerable redundancy in the raw data table. Each record stores the district number as an integer and the district name as a character string; likewise, both the campus number and the campus name are stored in each record. We can eliminate these redundancies by creating a table of campus numbers and names and a table of district numbers and names, then removing the campus names and district names from the raw data.

We must, however, be careful when creating the campus and district tables. We don't know that the campus names and the district names are recorded consistently. In fact they aren't. Hence we create the campus and district tables in two stages.
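The concatenation step can be sketched in Python. This is a minimal sketch, not the actual raw_data.py or raw_data_2000.py: the year-first column order and the field positions are assumptions for illustration.

```python
import csv

def four_digit_year(yy):
    """Map a two-digit TAAS year to four digits: 94-99 -> 1994-1999, 00 -> 2000."""
    y = int(yy)
    return 1900 + y if y >= 94 else 2000 + y

def convert(infile, outfile, grade=None):
    """Copy comma-separated records to tab-separated fields, widening the year.

    For the 1994-1999 files the grade comes from the file name, so the
    caller supplies it; the 2000 file already carries a grade field in
    each record, so grade is left as None and the field passes through.
    """
    for row in csv.reader(infile):
        fields = [str(four_digit_year(row[0]))]
        if grade is not None:
            fields.append(str(grade))
        fields.extend(row[1:])
        outfile.write("\t".join(fields) + "\n")
```

The resulting tab-separated file is what the single COPY command loads into rawdata in one pass.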
In the first stage we create a table of counts of each distinct combination of district and dname. In the second stage we select the dname that occurs most frequently with the district number.

CREATE TEMPORARY TABLE district_count AS
    SELECT district, dname, count(*) FROM rawdata GROUP BY district, dname;

CREATE TABLE district (
    district INTEGER PRIMARY KEY,
    dname    VARCHAR(15),
    nobs     INTEGER);

INSERT INTO district
    SELECT d.district, d.dname, r.s
    FROM district_count d,
         (SELECT district, max(count) AS m, sum(count) AS s
          FROM district_count GROUP BY district) AS r
    WHERE d.district = r.district AND d.count = r.m;

The last SELECT statement uses a subselect to determine which row contains the maximum count.

[There are only 3 districts with more than one dname. One looks like an inconsistency but the other two are probably coding errors.

taas=# SELECT * FROM district_count WHERE district IN
taas-#     (SELECT district FROM district_count GROUP BY district HAVING count(*) > 1);
 district |      dname      | count
----------+-----------------+-------
     8903 | BRAZOS ISD      |   672
     8903 | WALLIS-ORCHARD  |  1846
    57803 | NORTH HILLS SCH |   614
    57803 | THE NORTH HILLS |   198
    57809 | NOVA            |    39
    57809 | WEST OAK CLIFF  |    16
(6 rows)
]

The campus table is created similarly. Each campus is associated with a district. After checking that the association is unique, we add a district column to the campus table with the requirement that any value in this column must correspond to a value of the primary key in the district table. Such a requirement is called a referential integrity constraint.

CREATE TABLE campus (
    campus   INTEGER PRIMARY KEY,
    cname    VARCHAR(15),
    district INTEGER REFERENCES district ON UPDATE CASCADE,
    nobs     INTEGER);

Having created these two tables we can remove the district, dname, and cname columns from the data and store only the campus for each observation. This reduces the size of the table by more than 300 MB.
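The two-stage logic (count each (district, dname) pair, then keep the name with the maximum count along with the total) can be mirrored in ordinary Python as a cross-check. A sketch, assuming the pairs have been read out of rawdata by some other means:

```python
from collections import Counter

def modal_names(pairs):
    """For each district, pick the dname that occurs most often and report
    the total observation count, mirroring district_count plus the
    max/sum subselect in the INSERT above."""
    counts = Counter(pairs)                     # stage 1: district_count
    totals = Counter(d for d, _ in pairs)       # sum(count) per district
    best = {}
    for (district, name), n in counts.items():  # stage 2: keep the max
        if district not in best or n > best[district][1]:
            best[district] = (name, n)
    return {d: (name, totals[d]) for d, (name, _) in best.items()}
```

For district 8903 in the listing above (672 'BRAZOS ISD' rows, 1846 'WALLIS-ORCHARD' rows) this yields ('WALLIS-ORCHARD', 2518), matching what the INSERT stores.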
If it is important to have records that contain the omitted information we can use a view, a type of virtual table, to make it appear as if all the fields are present in each record.

3) Handling missing data

Often the demographic information, the sex and ethnicity of the student or whether the student is economically disadvantaged, has not been recorded.

taas=# SELECT sex, count(*) FROM rawdata GROUP BY sex;
 sex |  count
-----+---------
     | 1019025
 *   |      86
 F   | 4716084
 M   | 4724846
(4 rows)

We replace all values other than 'F' and 'M' with NULL, the missing value code in SQL.

UPDATE rawdata SET sex = NULL WHERE sex NOT IN ('F', 'M');

Similar replacements are made in the ethnicity and disadvg columns.

4) Extracting information on each student

The original records use a 9-character string, stuid, to identify the student. Because we will want to select records by student and to sort by student, we generate and store integer identifiers, which are easier to manipulate. The number of observations per student can be determined while doing this.

CREATE TABLE student (
    ids       SERIAL PRIMARY KEY,
    stuid     VARCHAR(9) NOT NULL UNIQUE,
    sex       CHAR(1),
    ethnicity CHAR(1),
    nobs      INT2 NOT NULL);

INSERT INTO student(stuid, nobs)
    SELECT stuid, count(*) AS nobs FROM rawdata GROUP BY stuid ORDER BY stuid;

The SERIAL data type in PostgreSQL generates sequential integers when the ids field is not specified in an insertion. Because the data are from years 1994 to 2000, the number of observations per student should be at most 7. However, several stuid's have many more than 7 observations.
taas=# SELECT * FROM student ORDER BY nobs DESC LIMIT 20;
   ids   |   stuid   | sex | ethnicity | nobs
---------+-----------+-----+-----------+-------
       1 |           |     |           | 27663
      15 | 000000000 |     |           |    58
 3584290 | 999999999 |     |           |    32
 3584292 | S         |     |           |    21
  601281 | 111111111 |     |           |    18
 1510555 | 4         |     |           |    16
 2467198 | 759658954 |     |           |    13
 2510187 | 773656634 |     |           |    13
 2506021 | 773534794 |     |           |    13
 2622198 | 795527644 |     |           |    12
 2544888 | 778542174 |     |           |    12
 2727049 | 815533174 |     |           |    12
 2463580 | 759572444 |     |           |    12
  748298 | 144273596 |     |           |    12
 2279793 | 718678374 |     |           |    12
 1075648 | 207544386 |     |           |    12
  349888 | 077292096 |     |           |    12
  872418 | 167410036 |     |           |    12
 2977249 | 871570524 |     |           |    11
 1068136 | 207291976 |     |           |    11
(20 rows)

The first five rows probably represent cases where the student id was unknown and the person recording the data invented a missing value code. The others may be cases where two or more students were accidentally assigned the same stuid code. We must take account of such cases when we generate the sex and ethnicity codes for each student.

We only assign the sex of a student if it has been recorded consistently. A student's ethnicity is self-reported and may change. The most common case is a student who initially identifies herself/himself as 'WHITE' and later as 'HISPANIC'. We create a separate category 'MIXED' for these cases.

Detecting a unique value of sex is done by first selecting all combinations of stuid and sex, then selecting the stuid's for which there is only one such row.

CREATE TEMPORARY TABLE unique_sex AS
    SELECT stuid, min(sex) AS sex
    FROM (SELECT stuid, sex FROM rawdata
          WHERE sex IS NOT NULL GROUP BY stuid, sex) AS m
    GROUP BY stuid HAVING count(*) = 1;

UPDATE student SET sex = u.sex FROM unique_sex u WHERE student.stuid = u.stuid;

DROP TABLE unique_sex;

The ethnicity is generated similarly, with adjustments to record the MIXED ethnicity. We also create indices on these columns to allow faster selection of students by sex, by ethnicity, or by number of observations.
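The same rules can be stated in Python, which may make the intent of the HAVING count(*) = 1 query and of the MIXED category clearer. A sketch over hypothetical (stuid, value) tuples, not the production code:

```python
def consistent_values(observations):
    """Return a value per stuid only when it was recorded consistently,
    mirroring SELECT ... GROUP BY stuid HAVING count(*) = 1."""
    seen = {}
    for stuid, value in observations:
        if value is not None:          # NULLs are ignored, as in the SQL
            seen.setdefault(stuid, set()).add(value)
    return {sid: next(iter(vals)) for sid, vals in seen.items()
            if len(vals) == 1}

def ethnicity_code(values):
    """Collapse a student's reported ethnicities: one consistent value is
    kept, several distinct values become the separate 'MIXED' category."""
    vals = {v for v in values if v is not None}
    if not vals:
        return None
    return vals.pop() if len(vals) == 1 else "MIXED"
```

A student recorded as both 'F' and 'M' is left with a NULL sex, while a student recorded as 'WHITE' and later 'HISPANIC' gets ethnicity 'MIXED'.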
CREATE INDEX student_sex_ind ON student(sex);
CREATE INDEX student_eth_ind ON student(ethnicity);
CREATE INDEX student_nobs_ind ON student(nobs);

At this point it is a good idea to run a PostgreSQL-specific operation called VACUUM ANALYZE to compact the storage used by the table and to collect summary statistics on it. (In PostgreSQL 7.2 the ANALYZE operation has been extended to store more information on the table and to complete faster.)

5) Creation of the results table and the all_data view

Having abstracted the information at the student level, the school (or campus) level, and the district level that had been redundantly stored in each record, we can create a results table that contains only the information relevant to each test score.

CREATE TABLE results AS
    SELECT s.ids, r.year, r.grade, r.campus,
           substr(r.disadvg, 1, 1) AS disadvg, r.tlimth
    FROM rawdata r, student s WHERE r.stuid = s.stuid;

DROP TABLE rawdata;

CREATE INDEX results_year_ind ON results(year);
CREATE INDEX results_grade_ind ON results(grade);
CREATE INDEX results_campus_ind ON results(campus);

Finally we create a view that describes how to associate information in the results table, the student table, the campus table, and the district table.

CREATE VIEW all_data AS
    SELECT r.year, r.grade, r.campus, c.cname, c.district, d.dname,
           r.ids, s.stuid, s.sex, s.ethnicity, r.tlimth,
           s.nobs AS sobs, c.nobs AS cobs, d.nobs AS dobs
    FROM results r, student s, campus c, district d
    WHERE r.ids = s.ids AND r.campus = c.campus AND c.district = d.district;

By selecting from the view we can easily generate records that have all the information in the original records, plus the number of observations on the student, the campus, and the district, yet are stored much more compactly. The rawdata table had 10.4 million rows of about 65 bytes each, for a total size of 700 MB.
Every time we access information in that table we must move that 700 MB of information from disk to memory to the processor, back to memory, back to disk, and so on. The results table also has 10.4 million rows, but each row is only 15 bytes, so this table is less than 1/4 the size of the raw data table. The student table has 3.6 million rows of 17 bytes each, for an additional 60 MB of data. The campus and district tables are of negligible total size because they have only about 6000 and 1200 rows, respectively.

The database tables we have created store the information from the raw data in a much more compact form. Furthermore, the information in these tables is consistent. For example, each campus is named consistently.

6) Flagging questionable data

Some of the data values are questionable. We have seen that the stuid value '' occurs over 27000 times and that 000000000, 111111111, and 999999999 each occur many times. We want to flag these stuids as questionable data. We could create another column in the student table to indicate whether a stuid is considered questionable. An alternative, which saves some storage in both the student table and the results table, is to negate the ids value wherever we think the data are questionable.

If we add a referential integrity constraint on the results table with the "ON UPDATE CASCADE" clause, changes to ids in the student table will be propagated to the ids values in the results table. This is a time-consuming operation on tables of this size, but it only needs to be done once. After that, each row in the results table carries an indicator of whether the data are legitimate.

ALTER TABLE results ADD CONSTRAINT idsfk
    FOREIGN KEY (ids) REFERENCES student(ids) ON UPDATE CASCADE;

UPDATE student SET ids = -ids WHERE sex IS NULL AND nobs > 8;

7) Using subselects

In our initial exploration of these data we concentrated on those students who had attended school in the Dallas Independent School District.
We want all the observations on each of these students, even if those observations did not occur in Dallas. We could create a table of ids values for students with at least one observation in Dallas, then match ids against that table in a second SELECT statement, but PostgreSQL allows us to combine those operations in a subselect.

SELECT year, grade, campus, ids, sex, ethnicity, disadvg, tlimth
FROM all_data
WHERE ids IN (SELECT DISTINCT ids FROM all_data WHERE dname LIKE 'DALLAS ISD%')
ORDER BY ids;
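The two-step alternative that the subselect replaces (first collect the Dallas ids, then keep every record for those students) looks like this in Python; the record keys here are hypothetical, chosen to match the view's column names.

```python
def dallas_students(records):
    """Two-step equivalent of the IN-subselect: collect the ids of students
    with at least one Dallas observation, then keep all records for those
    students, ordered by ids."""
    dallas_ids = {r["ids"] for r in records
                  if r["dname"].startswith("DALLAS ISD")}
    return sorted((r for r in records if r["ids"] in dallas_ids),
                  key=lambda r: r["ids"])
```

Note that a student's non-Dallas observations are kept, just as in the SQL: membership in dallas_ids, not the dname of the individual record, decides whether a record survives.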