Autor de la sección: Sebastian Jentschke

How do I identify outliers and filter them out from being used in analyses?

Outliers_Filter_Shortcut

  • open the Data tab and select Filter (either by using the symbol in the icon bar or the one in the bottom-left corner of the jamovi window)
    in order to access functions, press the fx icon in the filter settings
    there also is a switch where you can activate or deactivate the filter (see the comment in red below)
    you close the filter settings by pressing the arrow in the top-right corner
  • there are three large approaches, to exclude outliers:
    1. based upon z-scores (the absolute value should be larger 3.3; this equals to a probability of 0.1% = 1 / 1000; based upon a standard normal distribution ~ parametric)
    2. based upon the IQR (like in a box plot; based upon ranks and quantiles ~ non-parametric)
    3. based on the Mahalanobis distance (multivariate outliers)
    for 1. and 2., there exist functions in jamovi (see next bullet points), for 3. you have to use R-code (decribed two bullet point below); for 2. you could also do it visually (three bullet points below)
  • you can either use an function-based selection; the functions below filter out lines based on either the z-scores (first line), the interquartile range (IQR, second line) or by excluding certain rows / row numbers (e.g., based upon the output from the calculation of the Mahalanobis distance further below; third line):
    MAXABSZ([VARIABLE1], [VARIABLE2], …)
    
    MAXABSIQR([VARIABLE1], [VARIABLE2], …)
    
    IFMISS(MATCH(ROW(), [ROWNUMBER 1], [ROWNUMBER 2], …), 1, 0)
    

    Outliers_Filter_Settings

    the following code example detects multivariate outliers based upon the Mahalanobis distance (remember to adjust the variable names in VL)
    # this list should contain the names of your INDEPENDENT VARIABLES
    # you should not include your dependent variables
    # if you already use a filter set it to inactive
    # hint: you can get the names of your variable with names(data)
    # the syntax is adjusted for jamovi (the data frame is called data,
    # but can easily be used within R by just changing data to the name of your data frame
    VL = c('dan.sleep', 'baby.sleep', 'day')
    # brief explanation: the code calculates the Mahalanobis distance for all variables in VL,
    # then calculates the p-value (pchisq) and show lines with variables that had a p-value < 0.001
    row.names(data)[
        pchisq(unname(
            mahalanobis(data[, VL], colMeans(data[, VL]), cov(data[, VL]))),
            df=length(VL), lower.tail=FALSE) < 0.001]
    
    the output from the R-code tells you which lines you should de-select
    you use the scripts within the Rj editor, just copy-and-paste them and run them by hitting the ►-button (the little green triangle)
  • the filter conditions can then be combined using boolean and / or:
    MAXABSZ([VARIABLE1], [VARIABLE2], …) < 3.3 and
    MAXABSIQR([VARIABLE1], [VARIABLE2], …) < 3 and
    IFMISS(MATCH(ROW(), [ROWNUMBER 1], [ROWNUMBER 2], …), 1, 0)
    
  • instead of using the second line (MAXABSIQR) you could also de-select cases by excluding their respective row numbers in the dataset (as in the third line; you would then visually check the outliers in the box-plots under Descriptives, ensuring that the tick box Label outliers is set and exclude the row numbers that are marked as outliers)