Remove highly correlated variables from a data frame to reduce pairwise redundancy and mitigate multicollinearity in predictive models. This preprocessing step is especially useful when the goal is prediction rather than interpretation, because inference about individual predictors is not the primary concern.
For example, in a genomic prediction study the authors removed highly correlated SNPs to avoid redundant information when working with thousands of markers, improving training efficiency and predictive performance (Wimmer et al. 2021).
In the paper "A Proposed Data Analytics Workflow and Example Using the R Caret Package" (Jones et al. 2018), this filtering step is applied before model training, demonstrating how the core function caret::findCorrelation can be used to identify and remove highly correlated variable pairs.
Note that while high correlation can bias some algorithms, such as clustering, toward redundant variables, it is much less problematic for tree-based learners.
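The filtering idea described above can be sketched in base R. This is only an illustration under stated assumptions: drop_correlated() is a hypothetical helper, not part of this package or of caret, and its tie-breaking rule (drop the member of the most correlated pair that has the larger mean absolute correlation with the remaining variables) is an assumption that may differ from what corr_rm or caret::findCorrelation actually do.

```r
# Illustrative sketch only: a greedy pairwise correlation filter in base R.
# drop_correlated() is a hypothetical helper, not part of any package.
drop_correlated <- function(df, cutoff = 0.75) {
  # Only numeric columns take part in the correlation computation
  num <- df[vapply(df, is.numeric, logical(1))]
  r <- abs(cor(num, use = "pairwise.complete.obs"))
  r[is.na(r)] <- 0   # treat undefined correlations as "not correlated"
  diag(r) <- 0       # ignore self-correlations

  drop <- character(0)
  repeat {
    keep <- setdiff(colnames(r), drop)
    if (length(keep) < 2) break
    sub <- r[keep, keep, drop = FALSE]
    if (max(sub) <= cutoff) break

    # Locate the most strongly correlated remaining pair
    idx <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]

    # Assumed rule: drop the pair member that is, on average,
    # more correlated with the other remaining variables
    means <- rowMeans(sub[pair, , drop = FALSE])
    drop <- c(drop, pair[which.max(means)])
  }
  df[setdiff(names(df), drop)]
}

filtered <- drop_correlated(iris[1:4], cutoff = 0.75)
names(filtered)
```

Non-numeric columns are passed through untouched in this sketch; whether that matches the behaviour of corr_rm is not stated in this page.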
Usage
corr_rm(df, c, ...)
# S3 method for class 'clist'
corr_rm(
  df,
  c,
  col = c("infer.value", "stat.value"),
  isig = TRUE,
  cutoff = 0.75,
  ...
)
# S3 method for class 'list'
corr_rm(
  df,
  c,
  col = c("infer.value", "stat.value"),
  isig = TRUE,
  cutoff = 0.75,
  ...
)
# S3 method for class 'cmatrix'
corr_rm(df, c, cutoff = 0.75, ...)
# S3 method for class 'matrix'
corr_rm(df, c, cutoff = 0.75, ...)
Arguments
- df
[data.frame(1)]
The input data frame whose columns will be evaluated and filtered.
- c
[clist(1) | cmatrix]
A correlation list (class clist) produced by corrp, or a correlation matrix (class cmatrix) from corr_matrix.
- ...
Additional arguments passed to the corr_rm methods.
- col
[character(1)]
Name of the column in the correlation output to use (e.g., "infer.value").
- isig
[logical(1)]
If TRUE, non-significant correlations are set to NA in the matrix (or FALSE in a clist); otherwise all values are retained.
- cutoff
[numeric(1)]
Absolute correlation threshold above which one of a pair of variables will be dropped. Defaults to 0.75.
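To make the isig description more concrete, the following base R sketch masks non-significant correlations with NA before any cutoff would be applied. It is a rough illustration, not the package's implementation: mask_nonsignificant() is a hypothetical helper, and both the use of stats::cor.test (Pearson) and the 0.05 significance level are assumptions made for the example.

```r
# Illustrative sketch only: set non-significant Pearson correlations to NA.
# mask_nonsignificant() is hypothetical; alpha = 0.05 is an assumed level.
mask_nonsignificant <- function(df, alpha = 0.05) {
  r <- cor(df)
  for (i in seq_along(df)) {
    for (j in seq_along(df)) {
      if (i == j) next
      # Two-sided test of zero correlation between columns i and j
      p <- cor.test(df[[i]], df[[j]])$p.value
      if (p >= alpha) r[i, j] <- NA
    }
  }
  r
}

masked <- mask_nonsignificant(iris[1:4])
# Pairs whose correlation was masked as non-significant
which(is.na(masked), arr.ind = TRUE)
```

A masked pair can never exceed the cutoff, so in this sketch such a pair would not cause either variable to be dropped.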
References
Wimmer, V.; Albrecht, T.; Auinger, H.-J.; Schön, C.-C. (2021). Genomic prediction studies in plants and animals: Removing highly correlated SNPs to reduce redundancy. PLoS Genetics, 17(3), e1009243. URL: https://doi.org/10.3389/fgene.2021.611506
Jones, S.; Ye, Z.; Xie, Z.; Root, C.; Prasutchai, T.; Anderson, J.; Roggenburg, M.; Lanham, M. A. (2018). A Proposed Data Analytics Workflow and Example Using the R Caret Package. Midwest Decision Sciences Institute (MWDSI) Conference. URL: https://www.matthewalanham.com/Students/2018_MWDSI_R%20caret%20paper.pdf
Examples
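# Build a correlation list from iris, then convert it to a correlation matrix
iris_clist <- corrp(iris)
iris_cmatrix <- corr_matrix(iris_clist)
# Filter using the correlation list (clist method)
corr_rm(df = iris, c = iris_clist, cutoff = 0.75, col = "infer.value", isig = FALSE)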
#> Sepal.Length Sepal.Width
#> 1 5.1 3.5
#> 2 4.9 3.0
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5.0 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5.0 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> 11 5.4 3.7
#> 12 4.8 3.4
#> 13 4.8 3.0
#> 14 4.3 3.0
#> 15 5.8 4.0
#> 16 5.7 4.4
#> 17 5.4 3.9
#> 18 5.1 3.5
#> 19 5.7 3.8
#> 20 5.1 3.8
#> 21 5.4 3.4
#> 22 5.1 3.7
#> 23 4.6 3.6
#> 24 5.1 3.3
#> 25 4.8 3.4
#> 26 5.0 3.0
#> 27 5.0 3.4
#> 28 5.2 3.5
#> 29 5.2 3.4
#> 30 4.7 3.2
#> 31 4.8 3.1
#> 32 5.4 3.4
#> 33 5.2 4.1
#> 34 5.5 4.2
#> 35 4.9 3.1
#> 36 5.0 3.2
#> 37 5.5 3.5
#> 38 4.9 3.6
#> 39 4.4 3.0
#> 40 5.1 3.4
#> 41 5.0 3.5
#> 42 4.5 2.3
#> 43 4.4 3.2
#> 44 5.0 3.5
#> 45 5.1 3.8
#> 46 4.8 3.0
#> 47 5.1 3.8
#> 48 4.6 3.2
#> 49 5.3 3.7
#> 50 5.0 3.3
#> 51 7.0 3.2
#> 52 6.4 3.2
#> 53 6.9 3.1
#> 54 5.5 2.3
#> 55 6.5 2.8
#> 56 5.7 2.8
#> 57 6.3 3.3
#> 58 4.9 2.4
#> 59 6.6 2.9
#> 60 5.2 2.7
#> 61 5.0 2.0
#> 62 5.9 3.0
#> 63 6.0 2.2
#> 64 6.1 2.9
#> 65 5.6 2.9
#> 66 6.7 3.1
#> 67 5.6 3.0
#> 68 5.8 2.7
#> 69 6.2 2.2
#> 70 5.6 2.5
#> 71 5.9 3.2
#> 72 6.1 2.8
#> 73 6.3 2.5
#> 74 6.1 2.8
#> 75 6.4 2.9
#> 76 6.6 3.0
#> 77 6.8 2.8
#> 78 6.7 3.0
#> 79 6.0 2.9
#> 80 5.7 2.6
#> 81 5.5 2.4
#> 82 5.5 2.4
#> 83 5.8 2.7
#> 84 6.0 2.7
#> 85 5.4 3.0
#> 86 6.0 3.4
#> 87 6.7 3.1
#> 88 6.3 2.3
#> 89 5.6 3.0
#> 90 5.5 2.5
#> 91 5.5 2.6
#> 92 6.1 3.0
#> 93 5.8 2.6
#> 94 5.0 2.3
#> 95 5.6 2.7
#> 96 5.7 3.0
#> 97 5.7 2.9
#> 98 6.2 2.9
#> 99 5.1 2.5
#> 100 5.7 2.8
#> 101 6.3 3.3
#> 102 5.8 2.7
#> 103 7.1 3.0
#> 104 6.3 2.9
#> 105 6.5 3.0
#> 106 7.6 3.0
#> 107 4.9 2.5
#> 108 7.3 2.9
#> 109 6.7 2.5
#> 110 7.2 3.6
#> 111 6.5 3.2
#> 112 6.4 2.7
#> 113 6.8 3.0
#> 114 5.7 2.5
#> 115 5.8 2.8
#> 116 6.4 3.2
#> 117 6.5 3.0
#> 118 7.7 3.8
#> 119 7.7 2.6
#> 120 6.0 2.2
#> 121 6.9 3.2
#> 122 5.6 2.8
#> 123 7.7 2.8
#> 124 6.3 2.7
#> 125 6.7 3.3
#> 126 7.2 3.2
#> 127 6.2 2.8
#> 128 6.1 3.0
#> 129 6.4 2.8
#> 130 7.2 3.0
#> 131 7.4 2.8
#> 132 7.9 3.8
#> 133 6.4 2.8
#> 134 6.3 2.8
#> 135 6.1 2.6
#> 136 7.7 3.0
#> 137 6.3 3.4
#> 138 6.4 3.1
#> 139 6.0 3.0
#> 140 6.9 3.1
#> 141 6.7 3.1
#> 142 6.9 3.1
#> 143 5.8 2.7
#> 144 6.8 3.2
#> 145 6.7 3.3
#> 146 6.7 3.0
#> 147 6.3 2.5
#> 148 6.5 3.0
#> 149 6.2 3.4
#> 150 5.9 3.0
# Filter using the correlation matrix (cmatrix method)
corr_rm(df = iris, c = iris_cmatrix, cutoff = 0.75, col = "infer.value", isig = FALSE)
#> Sepal.Length Sepal.Width
#> 1 5.1 3.5
#> 2 4.9 3.0
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5.0 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5.0 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> 11 5.4 3.7
#> 12 4.8 3.4
#> 13 4.8 3.0
#> 14 4.3 3.0
#> 15 5.8 4.0
#> 16 5.7 4.4
#> 17 5.4 3.9
#> 18 5.1 3.5
#> 19 5.7 3.8
#> 20 5.1 3.8
#> 21 5.4 3.4
#> 22 5.1 3.7
#> 23 4.6 3.6
#> 24 5.1 3.3
#> 25 4.8 3.4
#> 26 5.0 3.0
#> 27 5.0 3.4
#> 28 5.2 3.5
#> 29 5.2 3.4
#> 30 4.7 3.2
#> 31 4.8 3.1
#> 32 5.4 3.4
#> 33 5.2 4.1
#> 34 5.5 4.2
#> 35 4.9 3.1
#> 36 5.0 3.2
#> 37 5.5 3.5
#> 38 4.9 3.6
#> 39 4.4 3.0
#> 40 5.1 3.4
#> 41 5.0 3.5
#> 42 4.5 2.3
#> 43 4.4 3.2
#> 44 5.0 3.5
#> 45 5.1 3.8
#> 46 4.8 3.0
#> 47 5.1 3.8
#> 48 4.6 3.2
#> 49 5.3 3.7
#> 50 5.0 3.3
#> 51 7.0 3.2
#> 52 6.4 3.2
#> 53 6.9 3.1
#> 54 5.5 2.3
#> 55 6.5 2.8
#> 56 5.7 2.8
#> 57 6.3 3.3
#> 58 4.9 2.4
#> 59 6.6 2.9
#> 60 5.2 2.7
#> 61 5.0 2.0
#> 62 5.9 3.0
#> 63 6.0 2.2
#> 64 6.1 2.9
#> 65 5.6 2.9
#> 66 6.7 3.1
#> 67 5.6 3.0
#> 68 5.8 2.7
#> 69 6.2 2.2
#> 70 5.6 2.5
#> 71 5.9 3.2
#> 72 6.1 2.8
#> 73 6.3 2.5
#> 74 6.1 2.8
#> 75 6.4 2.9
#> 76 6.6 3.0
#> 77 6.8 2.8
#> 78 6.7 3.0
#> 79 6.0 2.9
#> 80 5.7 2.6
#> 81 5.5 2.4
#> 82 5.5 2.4
#> 83 5.8 2.7
#> 84 6.0 2.7
#> 85 5.4 3.0
#> 86 6.0 3.4
#> 87 6.7 3.1
#> 88 6.3 2.3
#> 89 5.6 3.0
#> 90 5.5 2.5
#> 91 5.5 2.6
#> 92 6.1 3.0
#> 93 5.8 2.6
#> 94 5.0 2.3
#> 95 5.6 2.7
#> 96 5.7 3.0
#> 97 5.7 2.9
#> 98 6.2 2.9
#> 99 5.1 2.5
#> 100 5.7 2.8
#> 101 6.3 3.3
#> 102 5.8 2.7
#> 103 7.1 3.0
#> 104 6.3 2.9
#> 105 6.5 3.0
#> 106 7.6 3.0
#> 107 4.9 2.5
#> 108 7.3 2.9
#> 109 6.7 2.5
#> 110 7.2 3.6
#> 111 6.5 3.2
#> 112 6.4 2.7
#> 113 6.8 3.0
#> 114 5.7 2.5
#> 115 5.8 2.8
#> 116 6.4 3.2
#> 117 6.5 3.0
#> 118 7.7 3.8
#> 119 7.7 2.6
#> 120 6.0 2.2
#> 121 6.9 3.2
#> 122 5.6 2.8
#> 123 7.7 2.8
#> 124 6.3 2.7
#> 125 6.7 3.3
#> 126 7.2 3.2
#> 127 6.2 2.8
#> 128 6.1 3.0
#> 129 6.4 2.8
#> 130 7.2 3.0
#> 131 7.4 2.8
#> 132 7.9 3.8
#> 133 6.4 2.8
#> 134 6.3 2.8
#> 135 6.1 2.6
#> 136 7.7 3.0
#> 137 6.3 3.4
#> 138 6.4 3.1
#> 139 6.0 3.0
#> 140 6.9 3.1
#> 141 6.7 3.1
#> 142 6.9 3.1
#> 143 5.8 2.7
#> 144 6.8 3.2
#> 145 6.7 3.3
#> 146 6.7 3.0
#> 147 6.3 2.5
#> 148 6.5 3.0
#> 149 6.2 3.4
#> 150 5.9 3.0