Introduction
The corrp package provides an efficient way to compute
correlations between variables in a dataset. It supports various
correlation measures for numerical and categorical data, making it a
powerful tool for general correlation analysis. In this vignette, we
will explore how to use corrp for different types of
correlations, both with numerical and categorical variables.
Usage Example
We will use the penguins dataset from the palmerpenguins
package, which contains 344 observations of 8 variables: four numerical
measurements (bill_length_mm, bill_depth_mm, flipper_length_mm, and
body_mass_g), three categorical variables (species, island, and sex),
and year as an integer.
Basic Usage of corrp
1. Computing Correlations with corrp
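The corrp function computes correlations between every pair of
variables. You select the correlation method for each kind of pair
(numerical-numerical, numerical-categorical, and
categorical-categorical) through the cor.nn, cor.nc, and cor.cc
arguments.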
library(corrp)
library(palmerpenguins) # provides the penguins dataset

# Compute correlations for the penguins dataset
results <- corrp(penguins,
  cor.nn = "pearson", # Correlation for numerical-numerical variables
  cor.nc = "lm", # Correlation for numerical-categorical variables
  cor.cc = "cramer", # Correlation for categorical-categorical variables
  verbose = FALSE,
  parallel = FALSE
)
results
#> $data
#> infer infer.value stat stat.value isig msg
#> 1 Cramer's V 1.0000000000 P-value 4.997501e-04 TRUE
#> 2 Cramer's V 0.6536913120 P-value 4.997501e-04 TRUE
#> 3 Linear Model 0.8405724476 P-value 1.380984e-88 TRUE
#> 4 Linear Model 0.8224108384 P-value 1.446616e-81 TRUE
#> 5 Linear Model 0.8801791989 P-value 1.587418e-107 TRUE
#> 6 Linear Model 0.8212726295 P-value 3.744505e-81 TRUE
#> 7 Cramer's V 0.0120817001 P-value 9.840080e-01 FALSE
#> 8 Linear Model 0.0451061249 P-value 7.145911e-01 FALSE
#> 9 Cramer's V 0.6536913120 P-value 4.997501e-04 TRUE
#> 10 Cramer's V 1.0000000000 P-value 4.997501e-04 TRUE
#> 11 Linear Model 0.3777991082 P-value 9.209312e-12 TRUE
#> 12 Linear Model 0.6264906937 P-value 1.933066e-36 TRUE
#> 13 Linear Model 0.6023124915 P-value 5.101675e-33 TRUE
#> 14 Linear Model 0.6237160373 P-value 4.946012e-36 TRUE
#> 15 Cramer's V 0.0131518104 P-value 9.870065e-01 FALSE
#> 16 Linear Model 0.0541461702 P-value 6.160305e-01 FALSE
#> 17 Linear Model 0.8405724476 P-value 1.380984e-88 TRUE
#> 18 Linear Model 0.3777991082 P-value 9.209312e-12 TRUE
#> 19 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 20 Pearson Correlation -0.2286256359 P-value 9.999874e-01 FALSE
#> 21 Pearson Correlation 0.6530956387 P-value 3.605670e-42 TRUE
#> 22 Pearson Correlation 0.5894511102 P-value 7.693068e-33 TRUE
#> 23 Linear Model 0.3440777822 P-value 1.094256e-10 TRUE
#> 24 Pearson Correlation 0.0326568975 P-value 2.763062e-01 FALSE
#> 25 Linear Model 0.8224108384 P-value 1.446616e-81 TRUE
#> 26 Linear Model 0.6264906937 P-value 1.933066e-36 TRUE
#> 27 Pearson Correlation -0.2286256359 P-value 9.999874e-01 FALSE
#> 28 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 29 Pearson Correlation -0.5777916963 P-value 1.000000e+00 FALSE
#> 30 Pearson Correlation -0.4720156602 P-value 1.000000e+00 FALSE
#> 31 Linear Model 0.3726732882 P-value 2.066410e-12 TRUE
#> 32 Pearson Correlation -0.0481815979 P-value 8.096031e-01 FALSE
#> 33 Linear Model 0.8801791989 P-value 1.587418e-107 TRUE
#> 34 Linear Model 0.6023124915 P-value 5.101675e-33 TRUE
#> 35 Pearson Correlation 0.6530956387 P-value 3.605670e-42 TRUE
#> 36 Pearson Correlation -0.5777916963 P-value 1.000000e+00 FALSE
#> 37 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 38 Pearson Correlation 0.8729788986 P-value 1.566418e-105 TRUE
#> 39 Linear Model 0.2551688758 P-value 2.391097e-06 TRUE
#> 40 Pearson Correlation 0.1510679183 P-value 2.870361e-03 TRUE
#> 41 Linear Model 0.8212726295 P-value 3.744505e-81 TRUE
#> 42 Linear Model 0.6237160373 P-value 4.946012e-36 TRUE
#> 43 Pearson Correlation 0.5894511102 P-value 7.693068e-33 TRUE
#> 44 Pearson Correlation -0.4720156602 P-value 1.000000e+00 FALSE
#> 45 Pearson Correlation 0.8729788986 P-value 1.566418e-105 TRUE
#> 46 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 47 Linear Model 0.4249869909 P-value 4.897247e-16 TRUE
#> 48 Pearson Correlation 0.0218621307 P-value 3.455017e-01 FALSE
#> 49 Cramer's V 0.0120817001 P-value 9.830085e-01 FALSE
#> 50 Cramer's V 0.0131518104 P-value 9.905047e-01 FALSE
#> 51 Linear Model 0.3440777822 P-value 1.094256e-10 TRUE
#> 52 Linear Model 0.3726732882 P-value 2.066410e-12 TRUE
#> 53 Linear Model 0.2551688758 P-value 2.391097e-06 TRUE
#> 54 Linear Model 0.4249869909 P-value 4.897247e-16 TRUE
#> 55 Cramer's V 0.9939935065 P-value 4.997501e-04 TRUE
#> 56 Linear Model 0.0004666282 P-value 9.932315e-01 FALSE
#> 57 Linear Model 0.0451061249 P-value 7.145911e-01 FALSE
#> 58 Linear Model 0.0541461702 P-value 6.160305e-01 FALSE
#> 59 Pearson Correlation 0.0326568975 P-value 2.763062e-01 FALSE
#> 60 Pearson Correlation -0.0481815979 P-value 8.096031e-01 FALSE
#> 61 Pearson Correlation 0.1510679183 P-value 2.870361e-03 TRUE
#> 62 Pearson Correlation 0.0218621307 P-value 3.455017e-01 FALSE
#> 63 Linear Model 0.0004666282 P-value 9.932315e-01 FALSE
#> 64 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> varx vary
#> 1 species species
#> 2 species island
#> 3 species bill_length_mm
#> 4 species bill_depth_mm
#> 5 species flipper_length_mm
#> 6 species body_mass_g
#> 7 species sex
#> 8 species year
#> 9 island species
#> 10 island island
#> 11 island bill_length_mm
#> 12 island bill_depth_mm
#> 13 island flipper_length_mm
#> 14 island body_mass_g
#> 15 island sex
#> 16 island year
#> 17 bill_length_mm species
#> 18 bill_length_mm island
#> 19 bill_length_mm bill_length_mm
#> 20 bill_length_mm bill_depth_mm
#> 21 bill_length_mm flipper_length_mm
#> 22 bill_length_mm body_mass_g
#> 23 bill_length_mm sex
#> 24 bill_length_mm year
#> 25 bill_depth_mm species
#> 26 bill_depth_mm island
#> 27 bill_depth_mm bill_length_mm
#> 28 bill_depth_mm bill_depth_mm
#> 29 bill_depth_mm flipper_length_mm
#> 30 bill_depth_mm body_mass_g
#> 31 bill_depth_mm sex
#> 32 bill_depth_mm year
#> 33 flipper_length_mm species
#> 34 flipper_length_mm island
#> 35 flipper_length_mm bill_length_mm
#> 36 flipper_length_mm bill_depth_mm
#> 37 flipper_length_mm flipper_length_mm
#> 38 flipper_length_mm body_mass_g
#> 39 flipper_length_mm sex
#> 40 flipper_length_mm year
#> 41 body_mass_g species
#> 42 body_mass_g island
#> 43 body_mass_g bill_length_mm
#> 44 body_mass_g bill_depth_mm
#> 45 body_mass_g flipper_length_mm
#> 46 body_mass_g body_mass_g
#> 47 body_mass_g sex
#> 48 body_mass_g year
#> 49 sex species
#> 50 sex island
#> 51 sex bill_length_mm
#> 52 sex bill_depth_mm
#> 53 sex flipper_length_mm
#> 54 sex body_mass_g
#> 55 sex sex
#> 56 sex year
#> 57 year species
#> 58 year island
#> 59 year bill_length_mm
#> 60 year bill_depth_mm
#> 61 year flipper_length_mm
#> 62 year body_mass_g
#> 63 year sex
#> 64 year year
#>
#> $index
#> i j
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 5 1
#> 6 6 1
#> 7 7 1
#> 8 8 1
#> 9 1 2
#> 10 2 2
#> 11 3 2
#> 12 4 2
#> 13 5 2
#> 14 6 2
#> 15 7 2
#> 16 8 2
#> 17 1 3
#> 18 2 3
#> 19 3 3
#> 20 4 3
#> 21 5 3
#> 22 6 3
#> 23 7 3
#> 24 8 3
#> 25 1 4
#> 26 2 4
#> 27 3 4
#> 28 4 4
#> 29 5 4
#> 30 6 4
#> 31 7 4
#> 32 8 4
#> 33 1 5
#> 34 2 5
#> 35 3 5
#> 36 4 5
#> 37 5 5
#> 38 6 5
#> 39 7 5
#> 40 8 5
#> 41 1 6
#> 42 2 6
#> 43 3 6
#> 44 4 6
#> 45 5 6
#> 46 6 6
#> 47 7 6
#> 48 8 6
#> 49 1 7
#> 50 2 7
#> 51 3 7
#> 52 4 7
#> 53 5 7
#> 54 6 7
#> 55 7 7
#> 56 8 7
#> 57 1 8
#> 58 2 8
#> 59 3 8
#> 60 4 8
#> 61 5 8
#> 62 6 8
#> 63 7 8
#> 64 8 8
#>
#> attr(,"class")
#> [1] "clist" "list"
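Which of cor.nn, cor.nc, and cor.cc applies depends on the types of the two columns in each pair. The sketch below uses a hypothetical helper, pair_type(), written only to illustrate that mapping. It is not part of corrp and does not claim to reproduce its internal logic. It treats integer columns such as year as numerical, which matches the output above, where year is paired with the other measurements through Pearson correlation.

```r
# Hypothetical helper (not part of corrp): report which corrp argument
# would govern a given pair of columns, based on their types.
pair_type <- function(x, y) {
  n_numeric <- sum(is.numeric(x), is.numeric(y))
  switch(as.character(n_numeric),
    "2" = "cor.nn", # numerical-numerical
    "1" = "cor.nc", # numerical-categorical
    "0" = "cor.cc" # categorical-categorical
  )
}

# Toy data frame standing in for a mixed dataset (made-up rows)
toy <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  island = factor(c("Torgersen", "Biscoe", "Dream")),
  body_mass_g = c(3750, 5000, 3700),
  year = c(2007L, 2008L, 2009L)
)

pair_type(toy$body_mass_g, toy$year) # both numeric
pair_type(toy$species, toy$body_mass_g) # one numeric, one factor
pair_type(toy$species, toy$island) # both factors
```

Checking column types this way before calling corrp can help anticipate which method each pair will receive.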
2. Exploring the Results
The result returned by corrp is an object of class "clist", which
contains the correlation values and associated statistical information.
# Access the correlation data
results$data
#> infer infer.value stat stat.value isig msg
#> 1 Cramer's V 1.0000000000 P-value 4.997501e-04 TRUE
#> 2 Cramer's V 0.6536913120 P-value 4.997501e-04 TRUE
#> 3 Linear Model 0.8405724476 P-value 1.380984e-88 TRUE
#> 4 Linear Model 0.8224108384 P-value 1.446616e-81 TRUE
#> 5 Linear Model 0.8801791989 P-value 1.587418e-107 TRUE
#> 6 Linear Model 0.8212726295 P-value 3.744505e-81 TRUE
#> 7 Cramer's V 0.0120817001 P-value 9.840080e-01 FALSE
#> 8 Linear Model 0.0451061249 P-value 7.145911e-01 FALSE
#> 9 Cramer's V 0.6536913120 P-value 4.997501e-04 TRUE
#> 10 Cramer's V 1.0000000000 P-value 4.997501e-04 TRUE
#> 11 Linear Model 0.3777991082 P-value 9.209312e-12 TRUE
#> 12 Linear Model 0.6264906937 P-value 1.933066e-36 TRUE
#> 13 Linear Model 0.6023124915 P-value 5.101675e-33 TRUE
#> 14 Linear Model 0.6237160373 P-value 4.946012e-36 TRUE
#> 15 Cramer's V 0.0131518104 P-value 9.870065e-01 FALSE
#> 16 Linear Model 0.0541461702 P-value 6.160305e-01 FALSE
#> 17 Linear Model 0.8405724476 P-value 1.380984e-88 TRUE
#> 18 Linear Model 0.3777991082 P-value 9.209312e-12 TRUE
#> 19 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 20 Pearson Correlation -0.2286256359 P-value 9.999874e-01 FALSE
#> 21 Pearson Correlation 0.6530956387 P-value 3.605670e-42 TRUE
#> 22 Pearson Correlation 0.5894511102 P-value 7.693068e-33 TRUE
#> 23 Linear Model 0.3440777822 P-value 1.094256e-10 TRUE
#> 24 Pearson Correlation 0.0326568975 P-value 2.763062e-01 FALSE
#> 25 Linear Model 0.8224108384 P-value 1.446616e-81 TRUE
#> 26 Linear Model 0.6264906937 P-value 1.933066e-36 TRUE
#> 27 Pearson Correlation -0.2286256359 P-value 9.999874e-01 FALSE
#> 28 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 29 Pearson Correlation -0.5777916963 P-value 1.000000e+00 FALSE
#> 30 Pearson Correlation -0.4720156602 P-value 1.000000e+00 FALSE
#> 31 Linear Model 0.3726732882 P-value 2.066410e-12 TRUE
#> 32 Pearson Correlation -0.0481815979 P-value 8.096031e-01 FALSE
#> 33 Linear Model 0.8801791989 P-value 1.587418e-107 TRUE
#> 34 Linear Model 0.6023124915 P-value 5.101675e-33 TRUE
#> 35 Pearson Correlation 0.6530956387 P-value 3.605670e-42 TRUE
#> 36 Pearson Correlation -0.5777916963 P-value 1.000000e+00 FALSE
#> 37 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 38 Pearson Correlation 0.8729788986 P-value 1.566418e-105 TRUE
#> 39 Linear Model 0.2551688758 P-value 2.391097e-06 TRUE
#> 40 Pearson Correlation 0.1510679183 P-value 2.870361e-03 TRUE
#> 41 Linear Model 0.8212726295 P-value 3.744505e-81 TRUE
#> 42 Linear Model 0.6237160373 P-value 4.946012e-36 TRUE
#> 43 Pearson Correlation 0.5894511102 P-value 7.693068e-33 TRUE
#> 44 Pearson Correlation -0.4720156602 P-value 1.000000e+00 FALSE
#> 45 Pearson Correlation 0.8729788986 P-value 1.566418e-105 TRUE
#> 46 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> 47 Linear Model 0.4249869909 P-value 4.897247e-16 TRUE
#> 48 Pearson Correlation 0.0218621307 P-value 3.455017e-01 FALSE
#> 49 Cramer's V 0.0120817001 P-value 9.830085e-01 FALSE
#> 50 Cramer's V 0.0131518104 P-value 9.905047e-01 FALSE
#> 51 Linear Model 0.3440777822 P-value 1.094256e-10 TRUE
#> 52 Linear Model 0.3726732882 P-value 2.066410e-12 TRUE
#> 53 Linear Model 0.2551688758 P-value 2.391097e-06 TRUE
#> 54 Linear Model 0.4249869909 P-value 4.897247e-16 TRUE
#> 55 Cramer's V 0.9939935065 P-value 4.997501e-04 TRUE
#> 56 Linear Model 0.0004666282 P-value 9.932315e-01 FALSE
#> 57 Linear Model 0.0451061249 P-value 7.145911e-01 FALSE
#> 58 Linear Model 0.0541461702 P-value 6.160305e-01 FALSE
#> 59 Pearson Correlation 0.0326568975 P-value 2.763062e-01 FALSE
#> 60 Pearson Correlation -0.0481815979 P-value 8.096031e-01 FALSE
#> 61 Pearson Correlation 0.1510679183 P-value 2.870361e-03 TRUE
#> 62 Pearson Correlation 0.0218621307 P-value 3.455017e-01 FALSE
#> 63 Linear Model 0.0004666282 P-value 9.932315e-01 FALSE
#> 64 Pearson Correlation 1.0000000000 P-value 0.000000e+00 TRUE
#> varx vary
#> 1 species species
#> 2 species island
#> 3 species bill_length_mm
#> 4 species bill_depth_mm
#> 5 species flipper_length_mm
#> 6 species body_mass_g
#> 7 species sex
#> 8 species year
#> 9 island species
#> 10 island island
#> 11 island bill_length_mm
#> 12 island bill_depth_mm
#> 13 island flipper_length_mm
#> 14 island body_mass_g
#> 15 island sex
#> 16 island year
#> 17 bill_length_mm species
#> 18 bill_length_mm island
#> 19 bill_length_mm bill_length_mm
#> 20 bill_length_mm bill_depth_mm
#> 21 bill_length_mm flipper_length_mm
#> 22 bill_length_mm body_mass_g
#> 23 bill_length_mm sex
#> 24 bill_length_mm year
#> 25 bill_depth_mm species
#> 26 bill_depth_mm island
#> 27 bill_depth_mm bill_length_mm
#> 28 bill_depth_mm bill_depth_mm
#> 29 bill_depth_mm flipper_length_mm
#> 30 bill_depth_mm body_mass_g
#> 31 bill_depth_mm sex
#> 32 bill_depth_mm year
#> 33 flipper_length_mm species
#> 34 flipper_length_mm island
#> 35 flipper_length_mm bill_length_mm
#> 36 flipper_length_mm bill_depth_mm
#> 37 flipper_length_mm flipper_length_mm
#> 38 flipper_length_mm body_mass_g
#> 39 flipper_length_mm sex
#> 40 flipper_length_mm year
#> 41 body_mass_g species
#> 42 body_mass_g island
#> 43 body_mass_g bill_length_mm
#> 44 body_mass_g bill_depth_mm
#> 45 body_mass_g flipper_length_mm
#> 46 body_mass_g body_mass_g
#> 47 body_mass_g sex
#> 48 body_mass_g year
#> 49 sex species
#> 50 sex island
#> 51 sex bill_length_mm
#> 52 sex bill_depth_mm
#> 53 sex flipper_length_mm
#> 54 sex body_mass_g
#> 55 sex sex
#> 56 sex year
#> 57 year species
#> 58 year island
#> 59 year bill_length_mm
#> 60 year bill_depth_mm
#> 61 year flipper_length_mm
#> 62 year body_mass_g
#> 63 year sex
#> 64 year year
The result of corrp is a list with two tables: data and index.
- data: A table containing all the statistical results. The columns of this table are as follows:
  - infer: The method or metric used to assess the relationship between the variables (e.g., Maximal Information Coefficient or Predictive Power Score).
  - infer.value: The value or score obtained from the specified inference method, representing the strength or quality of the relationship between the variables.
  - stat: The statistical test or measure associated with the inference method (e.g., P-value or F1_weighted).
  - stat.value: The numerical value corresponding to the statistical test or measure, providing additional context about the inference (e.g., significance or performance score).
  - isig: A logical value indicating whether the statistical result is significant (TRUE) or not, based on predefined criteria (e.g., threshold for P-value).
  - msg: A message or error related to the inference process.
  - varx: The name of the first variable in the analysis (independent variable or feature).
  - vary: The name of the second variable in the analysis (dependent/target variable).
- index: A table that contains the pairs of indices used in each inference of the data table.
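One way to read the index table is as the position of each row of data in a square matrix of variables. The sketch below uses a small hand-made object with the same data and index layout (the values are made up for illustration) and fills a matrix with infer.value using base R. In the printed output above, j follows varx and i follows vary, and the sketch assumes that convention.

```r
# Hand-made stand-in for a corrp result with two variables, a and b.
# The numbers are illustrative, not computed from real data.
toy <- list(
  data = data.frame(
    infer.value = c(1, 0.4, 0.4, 1),
    varx = c("a", "a", "b", "b"),
    vary = c("a", "b", "a", "b"),
    stringsAsFactors = FALSE
  ),
  index = data.frame(i = c(1, 2, 1, 2), j = c(1, 1, 2, 2))
)

vars <- unique(toy$data$varx)
m <- matrix(NA_real_,
  nrow = length(vars), ncol = length(vars),
  dimnames = list(vars, vars)
)

# Each row of `index` gives the matrix cell for the same row of `data`
m[cbind(toy$index$i, toy$index$j)] <- toy$data$infer.value
m
```

A matrix in this shape is convenient for heatmaps or for looking up one pair by name, for example m["b", "a"].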
3. Filtering Significant Correlations
To focus on significant correlations, you can filter the results based on significance or another criterion. Here, we filter the results for all correlations that are significant according to the default p-value threshold of 0.05.
# Filter significant correlations (p-value < 0.05)
significant_results <- subset(results$data, isig)
significant_results
#> infer infer.value stat stat.value isig msg
#> 1 Cramer's V 1.0000000 P-value 4.997501e-04 TRUE
#> 2 Cramer's V 0.6536913 P-value 4.997501e-04 TRUE
#> 3 Linear Model 0.8405724 P-value 1.380984e-88 TRUE
#> 4 Linear Model 0.8224108 P-value 1.446616e-81 TRUE
#> 5 Linear Model 0.8801792 P-value 1.587418e-107 TRUE
#> 6 Linear Model 0.8212726 P-value 3.744505e-81 TRUE
#> 9 Cramer's V 0.6536913 P-value 4.997501e-04 TRUE
#> 10 Cramer's V 1.0000000 P-value 4.997501e-04 TRUE
#> 11 Linear Model 0.3777991 P-value 9.209312e-12 TRUE
#> 12 Linear Model 0.6264907 P-value 1.933066e-36 TRUE
#> 13 Linear Model 0.6023125 P-value 5.101675e-33 TRUE
#> 14 Linear Model 0.6237160 P-value 4.946012e-36 TRUE
#> 17 Linear Model 0.8405724 P-value 1.380984e-88 TRUE
#> 18 Linear Model 0.3777991 P-value 9.209312e-12 TRUE
#> 19 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 21 Pearson Correlation 0.6530956 P-value 3.605670e-42 TRUE
#> 22 Pearson Correlation 0.5894511 P-value 7.693068e-33 TRUE
#> 23 Linear Model 0.3440778 P-value 1.094256e-10 TRUE
#> 25 Linear Model 0.8224108 P-value 1.446616e-81 TRUE
#> 26 Linear Model 0.6264907 P-value 1.933066e-36 TRUE
#> 28 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 31 Linear Model 0.3726733 P-value 2.066410e-12 TRUE
#> 33 Linear Model 0.8801792 P-value 1.587418e-107 TRUE
#> 34 Linear Model 0.6023125 P-value 5.101675e-33 TRUE
#> 35 Pearson Correlation 0.6530956 P-value 3.605670e-42 TRUE
#> 37 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 38 Pearson Correlation 0.8729789 P-value 1.566418e-105 TRUE
#> 39 Linear Model 0.2551689 P-value 2.391097e-06 TRUE
#> 40 Pearson Correlation 0.1510679 P-value 2.870361e-03 TRUE
#> 41 Linear Model 0.8212726 P-value 3.744505e-81 TRUE
#> 42 Linear Model 0.6237160 P-value 4.946012e-36 TRUE
#> 43 Pearson Correlation 0.5894511 P-value 7.693068e-33 TRUE
#> 45 Pearson Correlation 0.8729789 P-value 1.566418e-105 TRUE
#> 46 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 47 Linear Model 0.4249870 P-value 4.897247e-16 TRUE
#> 51 Linear Model 0.3440778 P-value 1.094256e-10 TRUE
#> 52 Linear Model 0.3726733 P-value 2.066410e-12 TRUE
#> 53 Linear Model 0.2551689 P-value 2.391097e-06 TRUE
#> 54 Linear Model 0.4249870 P-value 4.897247e-16 TRUE
#> 55 Cramer's V 0.9939935 P-value 4.997501e-04 TRUE
#> 61 Pearson Correlation 0.1510679 P-value 2.870361e-03 TRUE
#> 64 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> varx vary
#> 1 species species
#> 2 species island
#> 3 species bill_length_mm
#> 4 species bill_depth_mm
#> 5 species flipper_length_mm
#> 6 species body_mass_g
#> 9 island species
#> 10 island island
#> 11 island bill_length_mm
#> 12 island bill_depth_mm
#> 13 island flipper_length_mm
#> 14 island body_mass_g
#> 17 bill_length_mm species
#> 18 bill_length_mm island
#> 19 bill_length_mm bill_length_mm
#> 21 bill_length_mm flipper_length_mm
#> 22 bill_length_mm body_mass_g
#> 23 bill_length_mm sex
#> 25 bill_depth_mm species
#> 26 bill_depth_mm island
#> 28 bill_depth_mm bill_depth_mm
#> 31 bill_depth_mm sex
#> 33 flipper_length_mm species
#> 34 flipper_length_mm island
#> 35 flipper_length_mm bill_length_mm
#> 37 flipper_length_mm flipper_length_mm
#> 38 flipper_length_mm body_mass_g
#> 39 flipper_length_mm sex
#> 40 flipper_length_mm year
#> 41 body_mass_g species
#> 42 body_mass_g island
#> 43 body_mass_g bill_length_mm
#> 45 body_mass_g flipper_length_mm
#> 46 body_mass_g body_mass_g
#> 47 body_mass_g sex
#> 51 sex bill_length_mm
#> 52 sex bill_depth_mm
#> 53 sex flipper_length_mm
#> 54 sex body_mass_g
#> 55 sex sex
#> 61 year flipper_length_mm
#> 64 year year
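Because data is an ordinary data frame, other filters work the same way. The sketch below uses a small hand-made table with the same column names, with values copied from four rows of the output above. It drops the trivial self-pairs where varx equals vary and keeps only significant pairs whose absolute infer.value is at least 0.5. The 0.5 cutoff is an arbitrary choice for illustration.

```r
# Four rows copied from the output above (rows 19, 20, 21, and 40)
toy_data <- data.frame(
  infer.value = c(1.0000000, -0.2286256, 0.6530956, 0.1510679),
  isig = c(TRUE, FALSE, TRUE, TRUE),
  varx = c(
    "bill_length_mm", "bill_length_mm",
    "bill_length_mm", "flipper_length_mm"
  ),
  vary = c(
    "bill_length_mm", "bill_depth_mm",
    "flipper_length_mm", "year"
  ),
  stringsAsFactors = FALSE
)

# Keep significant, non-trivial pairs with a reasonably large effect
strong <- subset(
  toy_data,
  isig & varx != vary & abs(infer.value) >= 0.5
)
strong
```

Only the bill_length_mm and flipper_length_mm pair survives this filter: the self-pair is excluded, one pair is not significant, and one is significant but weak.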
You can change the p-value threshold from its default of 0.05 with the
p.value argument of the corrp function.
# Set the p-value threshold to 0.3
results <- corrp(
  penguins,
  cor.nn = "pearson", # Correlation for numerical-numerical variables
  cor.nc = "lm", # Correlation for numerical-categorical variables
  cor.cc = "cramer", # Correlation for categorical-categorical variables
  verbose = FALSE,
  p.value = 0.30,
  parallel = FALSE
)
significant_results <- subset(results$data, isig)
significant_results
#> infer infer.value stat stat.value isig msg
#> 1 Cramer's V 1.0000000 P-value 4.997501e-04 TRUE
#> 2 Cramer's V 0.6536913 P-value 4.997501e-04 TRUE
#> 3 Linear Model 0.8405724 P-value 1.380984e-88 TRUE
#> 4 Linear Model 0.8224108 P-value 1.446616e-81 TRUE
#> 5 Linear Model 0.8801792 P-value 1.587418e-107 TRUE
#> 6 Linear Model 0.8212726 P-value 3.744505e-81 TRUE
#> 9 Cramer's V 0.6536913 P-value 4.997501e-04 TRUE
#> 10 Cramer's V 1.0000000 P-value 4.997501e-04 TRUE
#> 11 Linear Model 0.3777991 P-value 9.209312e-12 TRUE
#> 12 Linear Model 0.6264907 P-value 1.933066e-36 TRUE
#> 13 Linear Model 0.6023125 P-value 5.101675e-33 TRUE
#> 14 Linear Model 0.6237160 P-value 4.946012e-36 TRUE
#> 17 Linear Model 0.8405724 P-value 1.380984e-88 TRUE
#> 18 Linear Model 0.3777991 P-value 9.209312e-12 TRUE
#> 19 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 21 Pearson Correlation 0.6530956 P-value 3.605670e-42 TRUE
#> 22 Pearson Correlation 0.5894511 P-value 7.693068e-33 TRUE
#> 23 Linear Model 0.3440778 P-value 1.094256e-10 TRUE
#> 24 Pearson Correlation 0.0326569 P-value 2.763062e-01 TRUE
#> 25 Linear Model 0.8224108 P-value 1.446616e-81 TRUE
#> 26 Linear Model 0.6264907 P-value 1.933066e-36 TRUE
#> 28 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 31 Linear Model 0.3726733 P-value 2.066410e-12 TRUE
#> 33 Linear Model 0.8801792 P-value 1.587418e-107 TRUE
#> 34 Linear Model 0.6023125 P-value 5.101675e-33 TRUE
#> 35 Pearson Correlation 0.6530956 P-value 3.605670e-42 TRUE
#> 37 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 38 Pearson Correlation 0.8729789 P-value 1.566418e-105 TRUE
#> 39 Linear Model 0.2551689 P-value 2.391097e-06 TRUE
#> 40 Pearson Correlation 0.1510679 P-value 2.870361e-03 TRUE
#> 41 Linear Model 0.8212726 P-value 3.744505e-81 TRUE
#> 42 Linear Model 0.6237160 P-value 4.946012e-36 TRUE
#> 43 Pearson Correlation 0.5894511 P-value 7.693068e-33 TRUE
#> 45 Pearson Correlation 0.8729789 P-value 1.566418e-105 TRUE
#> 46 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 47 Linear Model 0.4249870 P-value 4.897247e-16 TRUE
#> 51 Linear Model 0.3440778 P-value 1.094256e-10 TRUE
#> 52 Linear Model 0.3726733 P-value 2.066410e-12 TRUE
#> 53 Linear Model 0.2551689 P-value 2.391097e-06 TRUE
#> 54 Linear Model 0.4249870 P-value 4.897247e-16 TRUE
#> 55 Cramer's V 0.9939935 P-value 4.997501e-04 TRUE
#> 59 Pearson Correlation 0.0326569 P-value 2.763062e-01 TRUE
#> 61 Pearson Correlation 0.1510679 P-value 2.870361e-03 TRUE
#> 64 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> varx vary
#> 1 species species
#> 2 species island
#> 3 species bill_length_mm
#> 4 species bill_depth_mm
#> 5 species flipper_length_mm
#> 6 species body_mass_g
#> 9 island species
#> 10 island island
#> 11 island bill_length_mm
#> 12 island bill_depth_mm
#> 13 island flipper_length_mm
#> 14 island body_mass_g
#> 17 bill_length_mm species
#> 18 bill_length_mm island
#> 19 bill_length_mm bill_length_mm
#> 21 bill_length_mm flipper_length_mm
#> 22 bill_length_mm body_mass_g
#> 23 bill_length_mm sex
#> 24 bill_length_mm year
#> 25 bill_depth_mm species
#> 26 bill_depth_mm island
#> 28 bill_depth_mm bill_depth_mm
#> 31 bill_depth_mm sex
#> 33 flipper_length_mm species
#> 34 flipper_length_mm island
#> 35 flipper_length_mm bill_length_mm
#> 37 flipper_length_mm flipper_length_mm
#> 38 flipper_length_mm body_mass_g
#> 39 flipper_length_mm sex
#> 40 flipper_length_mm year
#> 41 body_mass_g species
#> 42 body_mass_g island
#> 43 body_mass_g bill_length_mm
#> 45 body_mass_g flipper_length_mm
#> 46 body_mass_g body_mass_g
#> 47 body_mass_g sex
#> 51 sex bill_length_mm
#> 52 sex bill_depth_mm
#> 53 sex flipper_length_mm
#> 54 sex body_mass_g
#> 55 sex sex
#> 59 year bill_length_mm
#> 61 year flipper_length_mm
#> 64 year year
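In the tables above, negative Pearson correlations such as bill_depth_mm versus flipper_length_mm (about -0.578) carry p-values close to 1 and are not flagged as significant at either threshold. That pattern is what a one-sided test against a positive alternative produces. The sketch below shows the effect with base R's cor.test on synthetic data. Whether corrp uses such a one-sided default is an assumption here and should be checked against the package documentation.

```r
set.seed(42)

# Synthetic data with a clear negative linear relationship
x <- rnorm(200)
y <- -0.6 * x + rnorm(200, sd = 0.8)

# The same correlation tested against three alternatives
p_two_sided <- cor.test(x, y, alternative = "two.sided")$p.value
p_greater <- cor.test(x, y, alternative = "greater")$p.value # H1: r > 0
p_less <- cor.test(x, y, alternative = "less")$p.value # H1: r < 0

c(two.sided = p_two_sided, greater = p_greater, less = p_less)
```

With a strong negative correlation, the "greater" alternative yields a p-value near 1 while the other two are very small, so a fixed threshold would mark the pair as not significant under "greater" only.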
4. Correlation Types
The corrp function allows you to specify different
correlation methods based on the types of variables being compared:
- Numerical-Numerical Correlations: Options include PPS, Pearson, MIC, and Dcor.
- Numerical-Categorical Correlations: Options include PPS and LM.
- Categorical-Categorical Correlations: Options include PPS, Cramer’s V and Uncertainty Coefficient.
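As a point of reference for one of the options listed above, Cramér’s V can be written from its textbook definition, V = sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-squared statistic of the r-by-c contingency table and n is the number of observations. The helper cramers_v() below is a hypothetical base R sketch of that formula without any correction. It is not corrp's implementation, and corrp's values may differ in detail (for instance, the sex-sex value in the earlier output is about 0.994 rather than 1).

```r
# Textbook Cramér's V without bias or continuity correction
# (hypothetical helper, not taken from corrp)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

a <- rep(c("x", "y", "z"), each = 20)
b_same <- a # perfectly associated with a
b_indep <- rep(c("u", "v"), times = 30) # balanced within each level of a

v_same <- cramers_v(a, b_same) # perfect association
v_indep <- cramers_v(a, b_indep) # no association
c(same = v_same, independent = v_indep)
```

The two toy cases mark the ends of the scale: identical labels give the maximum value, and labels that are evenly spread across the other variable give the minimum.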
For example, let’s compute the correlations using different methods for numerical-numerical, numerical-categorical, and categorical-categorical data.
# Example of changing correlation methods
results_custom <- corrp(
  penguins,
  cor.nn = "mic",
  cor.nc = "pps",
  cor.cc = "uncoef",
  verbose = FALSE,
  parallel = FALSE
)
results_custom$data
#> infer infer.value stat stat.value isig
#> 1 Uncertainty coefficient 9.999973e-01 P-value 0.0000000 TRUE
#> 2 Uncertainty coefficient 5.022665e-01 P-value 0.0000000 TRUE
#> 3 Predictive Power Score 4.792168e-01 MAE 2.4535540 NA
#> 4 Predictive Power Score 4.489136e-01 MAE 0.9160688 NA
#> 5 Predictive Power Score 5.603289e-01 MAE 5.3396202 NA
#> 6 Predictive Power Score 4.375413e-01 MAE 379.0035986 NA
#> 7 Uncertainty coefficient 8.357163e-05 P-value 0.9880000 FALSE
#> 8 Predictive Power Score 0.000000e+00 MAE 0.6785903 NA
#> 9 Uncertainty coefficient 5.022665e-01 P-value 0.0000000 TRUE
#> 10 Uncertainty coefficient 9.999972e-01 P-value 0.0000000 TRUE
#> 11 Predictive Power Score 1.020869e-01 MAE 4.2569938 NA
#> 12 Predictive Power Score 2.631403e-01 MAE 1.2270729 NA
#> 13 Predictive Power Score 2.955518e-01 MAE 8.5994051 NA
#> 14 Predictive Power Score 2.773420e-01 MAE 491.1618030 NA
#> 15 Uncertainty coefficient 1.025293e-04 P-value 0.9600000 FALSE
#> 16 Predictive Power Score 0.000000e+00 MAE 0.6785903 NA
#> 17 Predictive Power Score 3.804169e-01 F1_weighted 0.5811024 NA
#> 18 Predictive Power Score 7.056978e-02 F1_weighted 0.2597492 NA
#> 19 Maximal Information Coefficient 9.999935e-01 P-value 0.0000000 TRUE
#> 20 Maximal Information Coefficient 3.080616e-01 P-value 0.0000000 TRUE
#> 21 Maximal Information Coefficient 4.610182e-01 P-value 0.0000000 TRUE
#> 22 Maximal Information Coefficient 3.862266e-01 P-value 0.0000000 TRUE
#> 23 Predictive Power Score 1.367963e-01 F1_weighted 0.4302709 NA
#> 24 Maximal Information Coefficient 1.525751e-01 P-value 0.3840000 FALSE
#> 25 Predictive Power Score 5.000552e-02 F1_weighted 0.3495137 NA
#> 26 Predictive Power Score 1.330752e-01 F1_weighted 0.2522539 NA
#> 27 Maximal Information Coefficient 3.080616e-01 P-value 0.0000000 TRUE
#> 28 Maximal Information Coefficient 9.999754e-01 P-value 0.0000000 TRUE
#> 29 Maximal Information Coefficient 6.458689e-01 P-value 0.0000000 TRUE
#> 30 Maximal Information Coefficient 5.032379e-01 P-value 0.0000000 TRUE
#> 31 Predictive Power Score 2.011406e-01 F1_weighted 0.4053895 NA
#> 32 Maximal Information Coefficient 1.251901e-01 P-value 0.2240000 FALSE
#> 33 Predictive Power Score 7.165068e-02 F1_weighted 0.3697977 NA
#> 34 Predictive Power Score 1.146489e-01 F1_weighted 0.2430145 NA
#> 35 Maximal Information Coefficient 4.610182e-01 P-value 0.0000000 TRUE
#> 36 Maximal Information Coefficient 6.458689e-01 P-value 0.0000000 TRUE
#> 37 Maximal Information Coefficient 9.997042e-01 P-value 0.0000000 TRUE
#> 38 Maximal Information Coefficient 7.044209e-01 P-value 0.0000000 TRUE
#> 39 Predictive Power Score 1.247432e-01 F1_weighted 0.4707015 NA
#> 40 Maximal Information Coefficient 1.476128e-01 P-value 0.0040000 TRUE
#> 41 Predictive Power Score 3.134767e-02 F1_weighted 0.3356898 NA
#> 42 Predictive Power Score 1.095742e-01 F1_weighted 0.2580854 NA
#> 43 Maximal Information Coefficient 3.862266e-01 P-value 0.0000000 TRUE
#> 44 Maximal Information Coefficient 5.032379e-01 P-value 0.0000000 TRUE
#> 45 Maximal Information Coefficient 7.044209e-01 P-value 0.0000000 TRUE
#> 46 Maximal Information Coefficient 9.999935e-01 P-value 0.0000000 TRUE
#> 47 Predictive Power Score 1.751586e-01 F1_weighted 0.4218315 NA
#> 48 Maximal Information Coefficient 1.430199e-01 P-value 0.1040000 FALSE
#> 49 Uncertainty coefficient 8.357163e-05 P-value 0.9680000 FALSE
#> 50 Uncertainty coefficient 1.025293e-04 P-value 0.9840000 FALSE
#> 51 Predictive Power Score 4.101157e-02 MAE 4.6045578 NA
#> 52 Predictive Power Score 3.788597e-02 MAE 1.6092622 NA
#> 53 Predictive Power Score 3.192199e-02 MAE 12.0482848 NA
#> 54 Predictive Power Score 7.332111e-02 MAE 644.5052594 NA
#> 55 Uncertainty coefficient 9.999986e-01 P-value 0.0000000 TRUE
#> 56 Predictive Power Score 0.000000e+00 MAE 0.6785903 NA
#> 57 Predictive Power Score 0.000000e+00 F1_weighted 0.2671039 NA
#> 58 Predictive Power Score 1.467916e-02 F1_weighted 0.1867607 NA
#> 59 Maximal Information Coefficient 1.525751e-01 P-value 0.1640000 FALSE
#> 60 Maximal Information Coefficient 1.251901e-01 P-value 0.1680000 FALSE
#> 61 Maximal Information Coefficient 1.476128e-01 P-value 0.0040000 TRUE
#> 62 Maximal Information Coefficient 1.430199e-01 P-value 0.0480000 TRUE
#> 63 Predictive Power Score 6.164871e-02 F1_weighted 0.4824119 NA
#> 64 Maximal Information Coefficient 9.987079e-01 P-value 0.0000000 TRUE
#> msg varx vary
#> 1 species species
#> 2 species island
#> 3 species bill_length_mm
#> 4 species bill_depth_mm
#> 5 species flipper_length_mm
#> 6 species body_mass_g
#> 7 species sex
#> 8 species year
#> 9 island species
#> 10 island island
#> 11 island bill_length_mm
#> 12 island bill_depth_mm
#> 13 island flipper_length_mm
#> 14 island body_mass_g
#> 15 island sex
#> 16 island year
#> 17 bill_length_mm species
#> 18 bill_length_mm island
#> 19 bill_length_mm bill_length_mm
#> 20 bill_length_mm bill_depth_mm
#> 21 bill_length_mm flipper_length_mm
#> 22 bill_length_mm body_mass_g
#> 23 bill_length_mm sex
#> 24 bill_length_mm year
#> 25 bill_depth_mm species
#> 26 bill_depth_mm island
#> 27 bill_depth_mm bill_length_mm
#> 28 bill_depth_mm bill_depth_mm
#> 29 bill_depth_mm flipper_length_mm
#> 30 bill_depth_mm body_mass_g
#> 31 bill_depth_mm sex
#> 32 bill_depth_mm year
#> 33 flipper_length_mm species
#> 34 flipper_length_mm island
#> 35 flipper_length_mm bill_length_mm
#> 36 flipper_length_mm bill_depth_mm
#> 37 flipper_length_mm flipper_length_mm
#> 38 flipper_length_mm body_mass_g
#> 39 flipper_length_mm sex
#> 40 flipper_length_mm year
#> 41 body_mass_g species
#> 42 body_mass_g island
#> 43 body_mass_g bill_length_mm
#> 44 body_mass_g bill_depth_mm
#> 45 body_mass_g flipper_length_mm
#> 46 body_mass_g body_mass_g
#> 47 body_mass_g sex
#> 48 body_mass_g year
#> 49 sex species
#> 50 sex island
#> 51 sex bill_length_mm
#> 52 sex bill_depth_mm
#> 53 sex flipper_length_mm
#> 54 sex body_mass_g
#> 55 sex sex
#> 56 sex year
#> 57 year species
#> 58 year island
#> 59 year bill_length_mm
#> 60 year bill_depth_mm
#> 61 year flipper_length_mm
#> 62 year body_mass_g
#> 63 year sex
#> 64 year year
The first example used Pearson, LM, and Cramér’s V, which primarily capture linear and straightforward associations among variables. This configuration replaces them with MIC, PPS, and the Uncertainty Coefficient, which are designed to detect non-linear, complex relationships and provide directional insights into predictive strength. Note that the PPS rows report a model performance metric (MAE or F1_weighted) in stat rather than a P-value, so isig is NA for those rows.
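To see why a measure of non-linear association can matter, consider a variable that is completely determined by another but not through a straight line. The base R sketch below uses synthetic data only and does not call corrp. It shows that Pearson correlation is essentially zero for a symmetric quadratic relationship, while it equals 1 for a linear one.

```r
# Evenly spaced values, symmetric around zero
x <- seq(-1, 1, length.out = 201)

y_quadratic <- x^2 # fully determined by x, but not linearly
y_linear <- 2 * x + 1 # fully determined by x, linearly

r_quadratic <- cor(x, y_quadratic)
r_linear <- cor(x, y_linear)

round(c(quadratic = r_quadratic, linear = r_linear), 6)
```

A linear measure reports no association in the quadratic case even though y can be predicted exactly from x, which is the kind of relationship that measures such as MIC or PPS are intended to pick up.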
Advanced Usage
1. Parallel Processing
You can enable parallel processing to speed up the computation,
especially when working with large datasets. Set the n.cores
argument to the number of cores you’d like to use.
To demonstrate the efficiency of the corrp package on
larger datasets, we benchmark the computation of correlations
on the eusilc dataset using parallel processing. The
following code compares execution times using 8 cores versus 2 cores. To
simulate a large dataset, we sample 100,000 rows with replacement, so
that the data consists of 28 columns and 100,000 rows, with 4 character
variables and the rest numeric.
library(corrp)
library(laeken)
# Load and prepare the eusilc dataset
data(eusilc)
eusilc <- na.omit(eusilc)
eusilc <- eusilc[sample(NROW(eusilc), 100000, replace = TRUE), ]
8 Cores
# Use bench to measure the performance of parallel processing using 8 cores
bench::mark(
  corrp(eusilc,
    cor.nn = "pearson",
    cor.nc = "lm",
    cor.cc = "cramer",
    n.cores = 8, # Enable parallel processing with 8 cores
    verbose = FALSE
  ),
  iterations = 10
)
- Minimum execution time: 20.8s
- Median execution time: 21.8s
- Iterations per second: 0.0458
- Total time: 2.55 minutes
2 Cores
# Use bench to measure the performance of parallel processing using 2 cores
r <- bench::mark(
  corrp(eusilc,
    cor.nn = "pearson",
    cor.nc = "lm",
    cor.cc = "cramer",
    n.cores = 2, # Enable parallel processing with 2 cores
    verbose = FALSE
  ),
  iterations = 1
)
- Minimum execution time: 49s
- Median execution time: 49.2s
- Iterations per second: 0.0203
- Total time: 4.92 minutes
Performance Comparison
Based on these benchmark results, increasing from 2 to 8 cores
substantially improves the performance of the corrp function on the
eusilc dataset. The median run took about 22 seconds with 8 cores,
compared with about 49 seconds with 2 cores, a speedup of roughly 2.3x.
The scaling is not linear: a 4x increase in cores yielded about a 2.3x
speedup. Even so, the improvement is worthwhile for correlation
computations, which can be computationally intensive on large datasets
like eusilc.
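The speedup figure can be checked directly from the reported median times. The short calculation below uses only the two medians listed above and also derives a parallel efficiency, defined here as the speedup divided by the increase in core count. That efficiency measure is added for illustration and is not reported by bench.

```r
median_2_cores <- 49.2 # seconds, from the 2-core benchmark
median_8_cores <- 21.8 # seconds, from the 8-core benchmark

speedup <- median_2_cores / median_8_cores
core_ratio <- 8 / 2

# Share of the ideal 4x speedup that was actually achieved
efficiency <- speedup / core_ratio

round(c(speedup = speedup, efficiency = efficiency), 2)
```

The speedup comes out at about 2.26 and the efficiency at about 0.56, meaning a little over half of the ideal 4x gain was realized in this benchmark.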
2. Custom Inferences with corr_fun
The corr_fun function can be used directly if you need
finer control over the correlation calculation for specific pairs of
variables. It allows you to specify the variables and methods for
computing the correlation.
# Using corr_fun to compute Pearson correlation between body_mass_g and flipper_length_mm
corr_fun(
  penguins,
  nx = "body_mass_g",
  ny = "flipper_length_mm",
  cor.nn = "pearson",
  verbose = FALSE
)
#> $infer
#> [1] "Pearson Correlation"
#>
#> $infer.value
#> [1] 0.8729789
#>
#> $stat
#> [1] "P-value"
#>
#> $stat.value
#> [1] 1.566418e-105
#>
#> $isig
#> [1] TRUE
#>
#> $msg
#> [1] ""
#>
#> $varx
#> [1] "body_mass_g"
#>
#> $vary
#> [1] "flipper_length_mm"
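The output above is a plain list whose fields have the same names as the columns of the data table. A result in that shape can be turned into a one-row data frame with base R, which makes it easy to collect several pairwise results in one table. The sketch below rebuilds the list by hand, with values copied from the output above, so it runs without calling corr_fun.

```r
# Values copied from the corr_fun output above
res <- list(
  infer = "Pearson Correlation",
  infer.value = 0.8729789,
  stat = "P-value",
  stat.value = 1.566418e-105,
  isig = TRUE,
  msg = "",
  varx = "body_mass_g",
  vary = "flipper_length_mm"
)

# A list of length-one fields converts directly to a one-row data frame
row <- as.data.frame(res, stringsAsFactors = FALSE)
row

# Several such lists can be stacked into one table
# (the same result is repeated here only to show the mechanics)
stacked <- do.call(
  rbind,
  lapply(list(res, res), as.data.frame, stringsAsFactors = FALSE)
)
stacked
```

In practice the list passed to lapply would hold the results of separate corr_fun calls for the pairs of interest.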
Conclusion
The corrp package provides a simple way to compute
correlations across different types of variables. If you are working
with mixed data, corrp offers a solution for your
correlation analysis needs. By leveraging parallel processing and a C++
implementation, corrp can handle large datasets efficiently.