
Remove highly correlated variables from a data frame to reduce pairwise redundancy and mitigate multicollinearity in predictive models. This preprocessing step is especially useful when the goal is prediction rather than interpretation, because hypotheses about individual predictors are then not the primary concern.

For example, in a genomic prediction study the authors removed highly correlated SNPs to avoid redundant information when working with thousands of markers, improving training efficiency and predictive performance (Wimmer et al. 2021).

In the paper "A Proposed Data Analytics Workflow and Example Using the R Caret Package", this filtering step is applied before model training, demonstrating how the core function caret::findCorrelation can be used to identify and remove highly correlated variable pairs.
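
To make the filtering idea concrete, the following is a minimal base R sketch of pairwise correlation filtering, written in the spirit of caret::findCorrelation. It is not the implementation used by corr_rm or caret. The helper name drop_correlated is hypothetical, and the sketch assumes that only numeric columns are compared with Pearson correlation and that the data contain no missing values.

```r
# Simplified illustration only: `drop_correlated` is a hypothetical helper,
# not part of corrp or caret. Assumes numeric columns without missing values.
drop_correlated <- function(df, cutoff = 0.75) {
  # Compare numeric columns only; other columns are left untouched.
  num <- df[vapply(df, is.numeric, logical(1))]
  cm <- abs(stats::cor(num))
  diag(cm) <- 0
  drop <- character(0)
  repeat {
    keep <- setdiff(colnames(cm), drop)
    sub <- cm[keep, keep, drop = FALSE]
    # Stop once no remaining pair exceeds the cutoff.
    if (length(keep) < 2 || max(sub) <= cutoff) break
    # Locate the most strongly correlated remaining pair.
    idx <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    # Drop the member of the pair with the larger mean absolute correlation.
    means <- rowMeans(sub[pair, , drop = FALSE])
    drop <- c(drop, pair[which.max(means)])
  }
  df[setdiff(names(df), drop)]
}

filtered <- drop_correlated(iris, cutoff = 0.75)
names(filtered)
```

Unlike corr_rm, this sketch ignores non-numeric columns such as Species, so it keeps them in the result. corr_rm works from the correlation object passed in c, which may also cover non-numeric variables.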

Note that while high correlation can bias some algorithms, such as clustering, toward redundant variables, it is much less problematic for tree-based learners.

Usage

corr_rm(df, c, ...)

# S3 method for class 'clist'
corr_rm(
  df,
  c,
  col = c("infer.value", "stat.value"),
  isig = TRUE,
  cutoff = 0.75,
  ...
)

# S3 method for class 'list'
corr_rm(
  df,
  c,
  col = c("infer.value", "stat.value"),
  isig = TRUE,
  cutoff = 0.75,
  ...
)

# S3 method for class 'cmatrix'
corr_rm(df, c, cutoff = 0.75, ...)

# S3 method for class 'matrix'
corr_rm(df, c, cutoff = 0.75, ...)

Arguments

df

[data.frame(1)]
The input data frame whose columns will be evaluated and filtered.

c

[clist(1) | cmatrix]
A correlation list (class clist) produced by corrp, or a correlation matrix (class cmatrix) from corr_matrix.

...

Additional arguments passed to the corr_rm methods.

col

[character(1)]
Name of the column in the correlation output to use, either "infer.value" or "stat.value".

isig

[logical(1)]
If TRUE, non-significant correlations are set to NA in the matrix (or FALSE in a clist); otherwise all values are retained.

cutoff

[numeric(1)]
Absolute correlation threshold above which one of a pair of variables will be dropped. Defaults to 0.75.

Value

data.frame
A filtered version of df with highly correlated variables removed.

References

Wimmer, V.; Albrecht, T.; Auinger, H.-J.; Schön, C.-C. (2021). Genomic prediction studies in plants and animals: Removing highly correlated SNPs to reduce redundancy. PLoS Genetics, 17(3), e1009243. URL: https://doi.org/10.3389/fgene.2021.611506

Jones, S.; Ye, Z.; Xie, Z.; Root, C.; Prasutchai, T.; Anderson, J.; Roggenburg, M.; Lanham, M. A. (2018). A Proposed Data Analytics Workflow and Example Using the R Caret Package. Midwest Decision Sciences Institute (MWDSI) Conference. URL: https://www.matthewalanham.com/Students/2018_MWDSI_R%20caret%20paper.pdf

Author

Igor D.S. Siciliani, Paulo H. dos Santos

Examples

iris_clist <- corrp(iris)
iris_cmatrix <- corr_matrix(iris_clist)
corr_rm(df = iris, c = iris_clist, cutoff = 0.75, col = "infer.value", isig = FALSE)
#>     Sepal.Length Sepal.Width
#> 1            5.1         3.5
#> 2            4.9         3.0
#> 3            4.7         3.2
#> 4            4.6         3.1
#> 5            5.0         3.6
#> 6            5.4         3.9
#> 7            4.6         3.4
#> 8            5.0         3.4
#> 9            4.4         2.9
#> 10           4.9         3.1
#> 11           5.4         3.7
#> 12           4.8         3.4
#> 13           4.8         3.0
#> 14           4.3         3.0
#> 15           5.8         4.0
#> 16           5.7         4.4
#> 17           5.4         3.9
#> 18           5.1         3.5
#> 19           5.7         3.8
#> 20           5.1         3.8
#> 21           5.4         3.4
#> 22           5.1         3.7
#> 23           4.6         3.6
#> 24           5.1         3.3
#> 25           4.8         3.4
#> 26           5.0         3.0
#> 27           5.0         3.4
#> 28           5.2         3.5
#> 29           5.2         3.4
#> 30           4.7         3.2
#> 31           4.8         3.1
#> 32           5.4         3.4
#> 33           5.2         4.1
#> 34           5.5         4.2
#> 35           4.9         3.1
#> 36           5.0         3.2
#> 37           5.5         3.5
#> 38           4.9         3.6
#> 39           4.4         3.0
#> 40           5.1         3.4
#> 41           5.0         3.5
#> 42           4.5         2.3
#> 43           4.4         3.2
#> 44           5.0         3.5
#> 45           5.1         3.8
#> 46           4.8         3.0
#> 47           5.1         3.8
#> 48           4.6         3.2
#> 49           5.3         3.7
#> 50           5.0         3.3
#> 51           7.0         3.2
#> 52           6.4         3.2
#> 53           6.9         3.1
#> 54           5.5         2.3
#> 55           6.5         2.8
#> 56           5.7         2.8
#> 57           6.3         3.3
#> 58           4.9         2.4
#> 59           6.6         2.9
#> 60           5.2         2.7
#> 61           5.0         2.0
#> 62           5.9         3.0
#> 63           6.0         2.2
#> 64           6.1         2.9
#> 65           5.6         2.9
#> 66           6.7         3.1
#> 67           5.6         3.0
#> 68           5.8         2.7
#> 69           6.2         2.2
#> 70           5.6         2.5
#> 71           5.9         3.2
#> 72           6.1         2.8
#> 73           6.3         2.5
#> 74           6.1         2.8
#> 75           6.4         2.9
#> 76           6.6         3.0
#> 77           6.8         2.8
#> 78           6.7         3.0
#> 79           6.0         2.9
#> 80           5.7         2.6
#> 81           5.5         2.4
#> 82           5.5         2.4
#> 83           5.8         2.7
#> 84           6.0         2.7
#> 85           5.4         3.0
#> 86           6.0         3.4
#> 87           6.7         3.1
#> 88           6.3         2.3
#> 89           5.6         3.0
#> 90           5.5         2.5
#> 91           5.5         2.6
#> 92           6.1         3.0
#> 93           5.8         2.6
#> 94           5.0         2.3
#> 95           5.6         2.7
#> 96           5.7         3.0
#> 97           5.7         2.9
#> 98           6.2         2.9
#> 99           5.1         2.5
#> 100          5.7         2.8
#> 101          6.3         3.3
#> 102          5.8         2.7
#> 103          7.1         3.0
#> 104          6.3         2.9
#> 105          6.5         3.0
#> 106          7.6         3.0
#> 107          4.9         2.5
#> 108          7.3         2.9
#> 109          6.7         2.5
#> 110          7.2         3.6
#> 111          6.5         3.2
#> 112          6.4         2.7
#> 113          6.8         3.0
#> 114          5.7         2.5
#> 115          5.8         2.8
#> 116          6.4         3.2
#> 117          6.5         3.0
#> 118          7.7         3.8
#> 119          7.7         2.6
#> 120          6.0         2.2
#> 121          6.9         3.2
#> 122          5.6         2.8
#> 123          7.7         2.8
#> 124          6.3         2.7
#> 125          6.7         3.3
#> 126          7.2         3.2
#> 127          6.2         2.8
#> 128          6.1         3.0
#> 129          6.4         2.8
#> 130          7.2         3.0
#> 131          7.4         2.8
#> 132          7.9         3.8
#> 133          6.4         2.8
#> 134          6.3         2.8
#> 135          6.1         2.6
#> 136          7.7         3.0
#> 137          6.3         3.4
#> 138          6.4         3.1
#> 139          6.0         3.0
#> 140          6.9         3.1
#> 141          6.7         3.1
#> 142          6.9         3.1
#> 143          5.8         2.7
#> 144          6.8         3.2
#> 145          6.7         3.3
#> 146          6.7         3.0
#> 147          6.3         2.5
#> 148          6.5         3.0
#> 149          6.2         3.4
#> 150          5.9         3.0
corr_rm(df = iris, c = iris_cmatrix, cutoff = 0.75, col = "infer.value", isig = FALSE)
#>     Sepal.Length Sepal.Width
#> 1            5.1         3.5
#> 2            4.9         3.0
#> 3            4.7         3.2
#> 4            4.6         3.1
#> 5            5.0         3.6
#> 6            5.4         3.9
#> 7            4.6         3.4
#> 8            5.0         3.4
#> 9            4.4         2.9
#> 10           4.9         3.1
#> 11           5.4         3.7
#> 12           4.8         3.4
#> 13           4.8         3.0
#> 14           4.3         3.0
#> 15           5.8         4.0
#> 16           5.7         4.4
#> 17           5.4         3.9
#> 18           5.1         3.5
#> 19           5.7         3.8
#> 20           5.1         3.8
#> 21           5.4         3.4
#> 22           5.1         3.7
#> 23           4.6         3.6
#> 24           5.1         3.3
#> 25           4.8         3.4
#> 26           5.0         3.0
#> 27           5.0         3.4
#> 28           5.2         3.5
#> 29           5.2         3.4
#> 30           4.7         3.2
#> 31           4.8         3.1
#> 32           5.4         3.4
#> 33           5.2         4.1
#> 34           5.5         4.2
#> 35           4.9         3.1
#> 36           5.0         3.2
#> 37           5.5         3.5
#> 38           4.9         3.6
#> 39           4.4         3.0
#> 40           5.1         3.4
#> 41           5.0         3.5
#> 42           4.5         2.3
#> 43           4.4         3.2
#> 44           5.0         3.5
#> 45           5.1         3.8
#> 46           4.8         3.0
#> 47           5.1         3.8
#> 48           4.6         3.2
#> 49           5.3         3.7
#> 50           5.0         3.3
#> 51           7.0         3.2
#> 52           6.4         3.2
#> 53           6.9         3.1
#> 54           5.5         2.3
#> 55           6.5         2.8
#> 56           5.7         2.8
#> 57           6.3         3.3
#> 58           4.9         2.4
#> 59           6.6         2.9
#> 60           5.2         2.7
#> 61           5.0         2.0
#> 62           5.9         3.0
#> 63           6.0         2.2
#> 64           6.1         2.9
#> 65           5.6         2.9
#> 66           6.7         3.1
#> 67           5.6         3.0
#> 68           5.8         2.7
#> 69           6.2         2.2
#> 70           5.6         2.5
#> 71           5.9         3.2
#> 72           6.1         2.8
#> 73           6.3         2.5
#> 74           6.1         2.8
#> 75           6.4         2.9
#> 76           6.6         3.0
#> 77           6.8         2.8
#> 78           6.7         3.0
#> 79           6.0         2.9
#> 80           5.7         2.6
#> 81           5.5         2.4
#> 82           5.5         2.4
#> 83           5.8         2.7
#> 84           6.0         2.7
#> 85           5.4         3.0
#> 86           6.0         3.4
#> 87           6.7         3.1
#> 88           6.3         2.3
#> 89           5.6         3.0
#> 90           5.5         2.5
#> 91           5.5         2.6
#> 92           6.1         3.0
#> 93           5.8         2.6
#> 94           5.0         2.3
#> 95           5.6         2.7
#> 96           5.7         3.0
#> 97           5.7         2.9
#> 98           6.2         2.9
#> 99           5.1         2.5
#> 100          5.7         2.8
#> 101          6.3         3.3
#> 102          5.8         2.7
#> 103          7.1         3.0
#> 104          6.3         2.9
#> 105          6.5         3.0
#> 106          7.6         3.0
#> 107          4.9         2.5
#> 108          7.3         2.9
#> 109          6.7         2.5
#> 110          7.2         3.6
#> 111          6.5         3.2
#> 112          6.4         2.7
#> 113          6.8         3.0
#> 114          5.7         2.5
#> 115          5.8         2.8
#> 116          6.4         3.2
#> 117          6.5         3.0
#> 118          7.7         3.8
#> 119          7.7         2.6
#> 120          6.0         2.2
#> 121          6.9         3.2
#> 122          5.6         2.8
#> 123          7.7         2.8
#> 124          6.3         2.7
#> 125          6.7         3.3
#> 126          7.2         3.2
#> 127          6.2         2.8
#> 128          6.1         3.0
#> 129          6.4         2.8
#> 130          7.2         3.0
#> 131          7.4         2.8
#> 132          7.9         3.8
#> 133          6.4         2.8
#> 134          6.3         2.8
#> 135          6.1         2.6
#> 136          7.7         3.0
#> 137          6.3         3.4
#> 138          6.4         3.1
#> 139          6.0         3.0
#> 140          6.9         3.1
#> 141          6.7         3.1
#> 142          6.9         3.1
#> 143          5.8         2.7
#> 144          6.8         3.2
#> 145          6.7         3.3
#> 146          6.7         3.0
#> 147          6.3         2.5
#> 148          6.5         3.0
#> 149          6.2         3.4
#> 150          5.9         3.0