Dataset ("Row") Operations

Page Contents

Data Operations Menu

Data Operations Menu

Filter Dataset

This tool provides several methods for filtering the dataset. The window that opens has four options for you to choose from:

Filter by categorical levels

(Filter by) Levels of a categorical variable

After selecting a categorical variable from the drop down box, you can select which levels you want to retain in the data set.

Filter by numeric condition

(Filter by) Numeric condition

This allows you to define a condition with which to filter your data.
For example, you could include only the observations of height over 180 cm by

  • selecting height from the drop down menu,
  • clicking on the > symbol, and
  • entering the value 180 in the third box.

Filter by row number

(Filter by) Row number

Exclude a range of row numbers as follows:

  • Entering 101:1000 (and then Submit) will exclude all rows from 101 to 1000
  • Similarly, 1, 5, 99, 101:1000 will exclude rows 1, 5, 99, and everything from 101 to 1000

Filter randomly

(Filter by) Randomly

Essentially, this allows you to perform bootstrap randomisation manually.
The current behaviour is this:

  • "Sample Size", n, is the number of observations to draw for each sample,
  • "Number of Samples", m, is the number of samples to create in the new data set.
  • The output will be a data set with n x m rows, which must be smaller than the total number of rows in the data set.
  • The observations are drawn randomly without replacement from the data set.

Sort data by variables

Sort data Sort the rows of the data by one or more variables

  • The ordering will be nested, so that the data is first ordered by "Variable 1", and then "Variable 2", etc.
  • For categorical variables, the ordering will be based on the order of the variable (by default, this will be alphabetical unless manually changed in "Manipulate Variables" > "Categorical Variables" > "Reorder Levels").

Aggregate data

Aggregate data

This function essentially allows you to obtain "summaries" of all of the numeric variables in the data set for combinations of categorical variables.

  • Variables: if only one variable is specified, the new data set will have one row for each level of the variable.
    If two (or more) are specified, then there will be one row for each combination. For example, the categorical variables gender = {male, female} and ethnicity = {white, black, asian, other} will result in a data set with 2x4 rows.
  • Summaries: each row will have the chosen summaries given for each numeric variable in the data set.
    For example, if the data set has the variables gender (cat) and height (num), and if the user selects Mean and Sd, then the new data set will have the columns gender, height.Mean and height.Sd. In the rows, the values will be for that combination of categorical variables; the row for gender = female will have the mean height of the females, and the standard deviation of height for the females.
    A visual example of this would be do drag height into the Variable 1 slot, and gender into the Variable 2 slot. Clicking on "Get Summary" would provide the same information. The advantage of using Aggregate is that the summaries are calculated for every numeric variable in the data set, not just one of them.

Stack variables

Stack variables

Convert from table form (rows corresponding to subjects) to long form (rows corresponding to observations).

In many cases, the data may be in tabular form, in which multiple observations are made but placed in different columns. An example of this may be a study of blood pressure on patients using several medications. The columns of this data set may be:, gender, drug, Week1, Week2, Week3. Here, each patient has their own row in the data set, but each row contains three observations of blood pressure. gender drug Week1 Week2 Week 3
1 male A 130 125 120
2 male B 140 130 110
3 female A 120 119 116

We may want to convert to long form, where we have each observation in a new row, and use a categorical variable to differentiate the weeks. In this case, we would select Week1, Week2, and Week3 as the variables in the list. The new data set will have the columns, gender, drug, Stack.variable ("Week"), and stack.value ("blood pressure"). gender drug stack.variable stack.value
1 male A Week1 130
1 male A Week2 125
1 male A Week3 120
2 male B Week1 140
2 male B Week2 130
2 male B Week3 110
3 female A Week1 120
3 female A Week2 119
3 female A Week3 116

Of course, you can rename the variables as appropriate using "Manipulate Variables" > "Rename Variables".

Dataset operations

dataset operation menu

It offers three types of tools for the users to modify their dataset:
   • Reshape dataset
   • Separate column
   • Unite columns

reshape dataset

Reshape dataset

reshape wide to long

(Reshape) Wide to long

This allows you to select a column or multiple columns from your dataset.

  • One new column (default name key) is populated by the column name(s) of the selected column(s)
  • The other new column (default name value) will contain the column value from the selected columns.
  • The selected column(s) will be removed and two new columns will be added to the dataset.
  • A preview panel shows what the new dataset will look like.

reshape long to wide

(Reshape) Long to wide

You can select a column to spread out into multiple columns (the column is named key in the example).
It will use the column values of the selected column as a set of names for new columns.

You then select another column with corresponding values to be put into the new columns (the column is named value in the example).

Separate menu

Separate column into ...

Separate to columns

Separate a column into several columns

Allows you to separate a column into several columns using a user-defined separator.

  • It will separate at every instance of the separator until no further separators are found

If no separator is found, the additional columns formed will contain NAs.

In the example on to the right, we have asked to separate column A using an underscore (“_”) as a separator.

Because only column A is being separated, column B (or any other columns) is left unchanged in the resulting new dataset

The maximum number of fields in column A after separation is 3 ("A_0.7K_2K") so column A in the original dataset is being replaced by 3 columns with default column names (Col1, etc).

  • Expanding Change column names (click the "+") allows you to change the default column names to something else.

Separate to rows

Separate a column into several rows

Instead of forming more columns this version of Separate keeps the same number of columns, with the same names, but writes more rows.

Using the same data as in the example above, the entry "A_0.7K_2K" in column A in the original dataset results in 3 rows in column A in the new dataset.

  • The corresponding entries in any other columns (e.g. column B in the example) are duplicated.

Unite columns

Unite columns

Allows you to select multiple columns and “unite” them using a defined separator (defaults to “_”). The united column name will be the combination of the selected columns with a “.” in between.

In this example, column “v1999” and column “v2000” are united by “”. The new column name is “v1999.v2000”._

  • It is alloable to have no separator. Just clear the Enter the separator field (delete the “_”)

Merge/Join datasets

  • Join datasets
  • Append rows

Join Datasets

This "joins", or brings together, information in two data sets: the current dataset in iNZight and a newly imported dataset (read in using the Import data facility) shown at the mid-right.

Left Join: The most important joining method is called a Left Join, the main purpose of which is to add new variables to the original dataset by extracting the information from the new dataset.

Join datasets

Matching rows: The main problem is to identify what pieces of information belong together. The most straightforward case occurs where there is a variable in the original dataset which is a unique identifier. If that variable is also in the imported dataset (even if under a different name) we can use it to match up the data which belongs to the same unit/entity.

To partially automate the process, iNZight looks for variables with the same name in both datasets (originally x1, x2, and x3 in the Example to the right) and offers those for determining matches.

In the Example, we have rejected x3 using the delete button beside it and so have effectively told the program that it is units with the same values of both x1 and x2 that belong together.

  • The Preview panel shows us the effects of our choices

Click the Join button at the bottom once you are happy with the way the data is being joined.

The details of how the data is treated depend on the type of Join and we will document that after finishing describing the Example.

In the Example, x4 is a new variable so that has been added to the preview-dataset. A complication is x3 which is in both datasets but with different values for the "same" units. So the program has decided to make two variables, one for the x3 values from the original dataset and one for the x3 values from the new dataset.

Types of Join

Left Join

  • The joined dataset has rows corresponding to all of the rows in the original dataset and all of its columns.
  • Rows of the new dataset that do not have a match in the original dataset are not used.
  • The joined dataset also has the columns from the new dataset that were not used for matching.
  • Rows in the original that have no match in the newly imported dataset get NAs for the additional columns
  • Warning: Rows from the original dataset that have more than one match in the new dataset generate multiple rows in the joined dataset (which invalidates many simple analyses). For example, if there are 3 matches then the original (single) row will be replaced by 3 rows. The cell-values for the additional columns will be obtained from the new data set and the values for the original columns from repeating the original cell values.

How other joins differ from the Left Join

  • Inner Join: Only use rows corresponding to matches between the two datasets

  • Full Join (Outer Join): Also use all the non-matching rows from both data sets

[Right Join: iNZight does not have this. Just import the datasets in the reverse order and use a left join.]

The following are just used to filter data. No columns are added to the join from the new dataset.

  • Semi Join: Use only rows in the original which have a match in the new

  • Anti Join: Use only rows in the original which have no match in the new

Append rows

Append rows

Adds to the bottom of the original dataset rows from a newly imported dataset (imported using the Import facility provided).

  • The column names from both datasets are matched so the right data goes into the right columns
  • Columns that appear in one dataset and not the other will appear in the result
  • An error will be reported and not appending will be possible if two column names match but their types (e.g. numeric or categorical) do not match

  • If the Tick-box option is selected a timestamp variable called When_Added will be added to the dataset recording when the new observations were added
  • If there is already a variable called When_Added present in the original data (must be of type date-time, "(t)" in View Variables) the new timestamps will be appended to that existing variable

Where data is periodically being added to a dataset, this facility can be used to keep track of when each row was added -- thus facilitating analyses of the data as it was up until any particular time point

Validate Dataset

Often, we want to validate that the data in a dataset adheres to our expectations about how that data should behave based on external real world-knowledge.

  • Validating a data set necessarily requires user-supplied validation rules.

For example, we expect the heights of humans are not negative or someone cannot work for more than 24 hours in a single day.

The validation window gives us the ability to define or import rules, and check whether the data conforms to those rules identifying any exceptions.

Validate dataset

  • In the validation window, rules can be typed into the Validation Rules text box in the top left or imported from a text file using the Open Rules button.

Using the first example, the rule to check whether heights are above 0 can be written in this textbox as height > 0 in this text box (given there is a variable named height in our dataset).

  • To check all of the rules that have been defined in this textbox, click the Validate Dataset button.

The results of each rule are presented in a table at the bottom of the window and show the number of observations that were checked ("Total"), the number of passes and failures ("Passes" and "Fails" respectively), and the fail percentage ("Fails (%)"). Initially this table is sorted by failure percentage, but you can click on other column headers to order the list in other ways.

  • The Unique Identifier selection box at the top right of the validation window allows you to select the name of a variable that contains unique identifiers for the units/cases/rows in the data set.

This is more useful than employing row numbers (the default setting) because unique-identifier values remain unchanged in the data when rows are deleted whereas row numbers will change.

  • Double clicking on a row of the results table will generate a detailed breakdown of the results in the Details section on the right-hand side of the window.

This breakdown will provide details about the observations which failed on that particular rule, giving the row numbers (or unique-identifier values if a unique-identifier has been selected) of these observations and the values used to assess the rule.

Validation rules or changes to imported rule files are discarded once the validation window is closed.

  • If you would like to store the set of rules you have defined or save any changes to an imported rule file, this can be done using the Save Rules button.

They are saved into a text file on your computer that can be imported again using the "Open Rules" button or viewed using a text editor.

The rules you use to validate the dataset do not need to be simple comparisons between a variable and a static value as in the previous example. More complex rules can be built by performing calculations on the variables, e.g. weight / height^2 < 50 will verify that each observation's body mass index is below 50. The values of each variable contained in the calculation as well as the end result are provided in the detailed breakdown.

  • Instead of comparisons between a variable/calculation and a static value, we can compare against another variable or a calculation based on the data.

For example, to check that the income of an individual (contained in variable Income) is no more than 1000 times their number of hours per week (contained in variable Hours), we can use the following rule: Income <= Hours * 1000. This will calculate a different value to compare income against for each observation.

For more information on what is possible using validation rules, the vignettes and help files of the underlying R package (validate) might be useful: Introduction to Validate vignette and Validate package (in particular, the syntax section of the reference manual).

Restore Dataset

Restores the data set to the way it was when it was initially imported.