Designing a Database for the Cod Stomach Data Set

Introduction

The data are measurements made on the stomach contents of Atlantic Cod. The data set described herein is provided in a colon-delimited ASCII text file format. This is a small sample from a larger data set with more stomach contents and more variables.

Variables

The data set contains the following variables:

region -

shipType - The type of ship (e.g., commercial versus research).

shipID - Unique ship identifier.

trip - Trip identifier; unique within a region.

set - Set identifier; unique within a particular ship on a particular trip. A set is, e.g., one set of a fishing net.

fishID - Unique fish identifier.

fishLength - Length of a fish (mm).

preyMass - Weight of lump from cod stomach (grams).

preyType - Species of lump from cod stomach. Special value Empty means no lumps found in stomach. Rare species gathered together under special label Other.

Data format

The data set is provided as a colon-delimited ASCII text file called cod.txt.

The file CodMeta.xml provides a StatDataML description of the data set.

Exercises

Designing a database

Q: Designing a database

This data set is provided in a plain text format. The task in this exercise is to produce a database design to store these data in third normalised form.

A:

We can start off by identifying a few simple entities within this data set. For example, thinking about the physical objects involved, there is clearly information about individual ships and information about individual fish, so we will have a fish_table and a ship_table.

It is abundantly clear that there are going to be several tables in the database; from a normalisation point of view, it is clear that we could not have a single table because there would be columns that do not relate to the primary key, or relate not only to the primary key, but also to each other; not to mention that we would have trouble finding a primary key for one big table in the first place.

For each ship we have one or two identification numbers. We need both numbers in order to uniquely identify each ship (different commercial vessels share the same ship_type), so we cannot use either variable on its own as a primary key. Furthermore, the ship_id for research vessels is missing (in database terms, the value will be NULL), which means that the ship_id variable cannot be used as part of the primary key. We will use an artificial, auto-increment key for this table.

ship_table  (ID [PK], ship_type, ship_id)

For each fish, we have a numeric label and the length of the fish; we also have lots of information about what was in the stomach of the fish, but we will leave that until later. The fish_id label is not unique for each fish, so again we will use an auto-increment primary key.

fish_table  (ID [PK], fish_id, fish_length)

Another important physical entity in the data set is a prey item (a lump found in a fish stomach). We could have a lump_table where, for each lump, we have information about the species of the prey item and the weight of the lump. We will add an auto-increment variable to provide a unique identifier for each lump in the entire data set.

lump_table  (ID [PK], prey_mass, prey)

We will develop this table more a little later.

The prey_type variable is a good example where we might like to create a new table for validating the species entered in the lump_table. Another reason for considering this approach is the possibility that we might be interested in a species that does not occur in the data set we have (but could conceivably occur in future studies). We could also use a prey_table to provide a mapping between the generic Other species category and individual species which have been grouped therein. We could use the species name itself as the primary key for the prey_table, but in terms of efficiency of storage, having a single integer identifier for each species requires less storage in the (large) lump_table than storing the full species label.

prey_table  (ID [PK], prey_type)

The relationship between the lump_table and the prey_table is many-to-one, so we place a prey foreign key in the lump_table.

Lumps are also fairly obviously related to fish; each lump comes from exactly one fish and each fish can have several lumps. We also place a fish foreign key in the lump_table.

lump_table  (ID [PK], prey_mass, 
             prey [FK prey_table.ID], 
             fish [FK fish_table.ID]))

It is worth pointing out that, through the lump_table, we have resolved the many-to-many relationship between fish and prey.

Now we come to the more complicated part of modelling the cod data set. How are fish and ships related to each other? And how do we bring in the other information in the data set (region, trip, and set)? At this point, thinking about physical entities does not help us much; we need to think in terms of the events involved in the data collection instead.

Initially, the situation does not look too bad; each fish was caught by exactly one ship (and each ship caught many fish). However, the process was not that simple. Each fish was caught in exactly one set (one check of the net or hooks) and each set occurred on exactly one trip. However, some trips involved several ships and some ships conducted more than one trip. There is another many-to-many relationship lurking within the data. To resolve this, we will focus on the fishing sets.

For each set we have a label, set_num. The set occurred on exactly one trip, so we can include the label for the trip (if we had more information on trips, such as a date, we might split the trip information into a separate table). The set was performed by exactly one ship, so we can include a foreign key to the ship table. This resolves the many-to-many relationship between ships and trips. Finally, we include information about the region. This is expanded into a separate table, which allows us to provide a more expansive description of each region. The original region code makes a nice primary key and because it is fairly compact, will not result in too much inefficiency in terms of space. An auto-increment variable provides a unique identifier for each set.

region_table  (region [PK], name)

set_table  (ID [PK], set_num, trip, 
            fish [FK fish_table.ID],
            region [FK region_table.region])

Designing a Database for the Cod Stomach Data Set

Paul Murrell

Introduction

Variables

Data format

Exercises