Dataset ("Row") Operations

Data Operations Menu

Filter Dataset

This tool provides several methods for filtering the dataset. The window that opens has four options for you to choose from:

Data Operations Menu

  1. Levels of a categorical variables Filter by categorical levels

    After selecting a categorical variable from the drop down box, you can select which levels you want to retain in the data set.

  2. Numeric condition Filter by numeric condition

    This allows you to define a condition with which to filter your data.

    For example, you could include only the observations of height over 180 cm by

    • selecting height from the drop down menu,
    • clicking on the > symbol, and
    • entering the value 180 in the third box.
  3. Row number Filter by row number

    Exclude a range of row numbers as follows:

    • Entering 101:1000 (and then Submit) will exclude all rows from 101 to 1000
    • Similarly, 1, 5, 99, 101:1000 will exclude rows 1, 5, 99, and everything from 101 to 1000
  4. Randomly Filter randomly

    Essentially, this allows you to perform bootstrap randomisation manually.

    The current behaviour is this:

    • "Sample Size", n, is the number of observations to draw for each sample,
    • "Number of Samples", m, is the number of samples to create in the new data set.
    • The output will be a data set with n x m rows, which must be smaller than the total number of rows in the data set.
    • The observations are drawn randomly without replacement from the data set.

Sort data by variables

Sort data

Sort the rows of the data by one or more variables. The ordering will be nested, so that the data is first ordered by "Variable 1", and then "Variable 2", etc. For categorical variables, the ordering will be based on the order of the variable (by default, this will be alphabetical unless manually changed in "Manipulate Variables" > "Categorical Variables" > "Reorder Levels").

Aggregate data

Aggregate data

This function essentially allows you to obtain "summaries" of all of the numeric variables in the data set for combinations of categorical variables.

  • Variables: if only one variable is specified, the new data set will have one row for each level of the variable.

    If two (or more) are specified, then there will be one row for each combination. For example, the categorical variables gender = {male, female} and ethnicity = {white, black, asian, other} will result in a data set with 2x4 rows.

  • Summaries: each row will have the chosen summaries given for each numeric variable in the data set.

    For example, if the data set has the variables gender (cat) and height (num), and if the user selects Mean and Sd, then the new data set will have the columns gender, height.Mean and height.Sd. In the rows, the values will be for that combination of categorical variables; the row for gender = female will have the mean height of the females, and the standard deviation of height for the females.

A visual example of this would be do drag height into the Variable 1 slot, and gender into the Variable 2 slot. Clicking on "Get Summary" would provide the same information. The advantage of using Aggregate is that the summaries are calculated for every numeric variable in the data set, not just one of them.

Stack variables

Stack variables

Convert from table form (rows corresponding to subjects) to long form (rows corresponding to observations).

In many cases, the data may be in tabular form, in which multiple observations are made but placed in different columns. An example of this may be a study of blood pressure on patients using several medications. The columns of this data set may be: patient.id, gender, drug, Week1, Week2, Week3. Here, each patient has their own row in the data set, but each row contains three observations of blood pressure.

patient.id gender drug Week1 Week2 Week 3
1 male A 130 125 120
2 male B 140 130 110
3 female A 120 119 116

We may want to convert to long form, where we have each observation in a new row, and use a categorical variable to differentiate the weeks. In this case, we would select Week1, Week2, and Week3 as the variables in the list. The new data set will have the columns patient.id, gender, drug, Stack.variable ("Week"), and stack.value ("blood pressure").

patient.id gender drug stack.variable stack.value
1 male A Week1 130
1 male A Week2 125
1 male A Week3 120
2 male B Week1 140
2 male B Week2 130
2 male B Week3 110
3 female A Week1 120
3 female A Week2 119
3 female A Week3 116

Of course, you can rename the variables as appropriate using "Manipulate Variables" > "Rename Variables".

Restore Dataset

Restores the data set to the way it was when it was initially imported.