Introduction
The corrp package provides an efficient way to compute
correlations between variables in a dataset. It supports several
correlation measures for numerical and categorical data, which makes it
suitable for general correlation analysis. In this vignette, we
explore how to use corrp for different types of correlations, with both
numerical and categorical variables.
Usage Example
We will use the built-in iris dataset, which includes
150 observations of 5 variables: 4 numerical variables and 1 categorical
variable (the species of the flower).
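The structure described above can be checked with base R before running any correlation. This is a small sketch that uses only the built-in iris data and base functions; the names col_classes, n_numeric, and n_factor are illustrative.

```r
# Inspect the built-in iris dataset (ships with R's datasets package)
data(iris)

dim(iris)                          # number of rows and columns
col_classes <- sapply(iris, class)
col_classes                        # class of each column

# Count numerical and categorical (factor) columns
n_numeric <- sum(sapply(iris, is.numeric))
n_factor  <- sum(sapply(iris, is.factor))
```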
Basic Usage of corrp
1. Computing Correlations with corrp
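The corrp function computes correlations between all pairs of
variables. You can select the correlation method for each pair type
(numerical-numerical, numerical-categorical, and
categorical-categorical) through the cor.nn, cor.nc, and cor.cc
arguments.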
# Load the package and compute correlations for the iris dataset
library(corrp)

results <- corrp(iris,
  cor.nn = "pearson", # Correlation for numerical-numerical variables
  cor.nc = "pps",     # Correlation for numerical-categorical variables
  cor.cc = "cramer",  # Correlation for categorical-categorical variables
  verbose = FALSE
)
results
#> $data
#> infer infer.value stat stat.value isig msg
#> 1 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 2 Pearson Correlation -0.1175698 P-value 9.240509e-01 FALSE
#> 3 Pearson Correlation 0.8717538 P-value 5.193337e-48 TRUE
#> 4 Pearson Correlation 0.8179411 P-value 1.162749e-37 TRUE
#> 5 Predictive Power Score 0.5591864 F1_weighted 7.028029e-01 NA
#> 6 Pearson Correlation -0.1175698 P-value 9.240509e-01 FALSE
#> 7 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 8 Pearson Correlation -0.4284401 P-value 1.000000e+00 FALSE
#> 9 Pearson Correlation -0.3661259 P-value 9.999980e-01 FALSE
#> 10 Predictive Power Score 0.3134401 F1_weighted 5.377587e-01 NA
#> 11 Pearson Correlation 0.8717538 P-value 5.193337e-48 TRUE
#> 12 Pearson Correlation -0.4284401 P-value 1.000000e+00 FALSE
#> 13 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 14 Pearson Correlation 0.9628654 P-value 2.337502e-86 TRUE
#> 15 Predictive Power Score 0.9167580 F1_weighted 9.404972e-01 NA
#> 16 Pearson Correlation 0.8179411 P-value 1.162749e-37 TRUE
#> 17 Pearson Correlation -0.3661259 P-value 9.999980e-01 FALSE
#> 18 Pearson Correlation 0.9628654 P-value 2.337502e-86 TRUE
#> 19 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 20 Predictive Power Score 0.9398532 F1_weighted 9.599148e-01 NA
#> 21 Predictive Power Score 0.4075487 MAE 4.076661e-01 NA
#> 22 Predictive Power Score 0.2012876 MAE 2.677963e-01 NA
#> 23 Predictive Power Score 0.7904907 MAE 3.280552e-01 NA
#> 24 Predictive Power Score 0.7561113 MAE 1.608119e-01 NA
#> 25 Cramer's V 1.0000000 P-value 4.997501e-04 TRUE
#> varx vary
#> 1 Sepal.Length Sepal.Length
#> 2 Sepal.Length Sepal.Width
#> 3 Sepal.Length Petal.Length
#> 4 Sepal.Length Petal.Width
#> 5 Sepal.Length Species
#> 6 Sepal.Width Sepal.Length
#> 7 Sepal.Width Sepal.Width
#> 8 Sepal.Width Petal.Length
#> 9 Sepal.Width Petal.Width
#> 10 Sepal.Width Species
#> 11 Petal.Length Sepal.Length
#> 12 Petal.Length Sepal.Width
#> 13 Petal.Length Petal.Length
#> 14 Petal.Length Petal.Width
#> 15 Petal.Length Species
#> 16 Petal.Width Sepal.Length
#> 17 Petal.Width Sepal.Width
#> 18 Petal.Width Petal.Length
#> 19 Petal.Width Petal.Width
#> 20 Petal.Width Species
#> 21 Species Sepal.Length
#> 22 Species Sepal.Width
#> 23 Species Petal.Length
#> 24 Species Petal.Width
#> 25 Species Species
#>
#> $index
#> i j
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 5 1
#> 6 1 2
#> 7 2 2
#> 8 3 2
#> 9 4 2
#> 10 5 2
#> 11 1 3
#> 12 2 3
#> 13 3 3
#> 14 4 3
#> 15 5 3
#> 16 1 4
#> 17 2 4
#> 18 3 4
#> 19 4 4
#> 20 5 4
#> 21 1 5
#> 22 2 5
#> 23 3 5
#> 24 4 5
#> 25 5 5
#>
#> attr(,"class")
#> [1] "clist" "list"
2. Exploring the Results
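The result returned by corrp is an object of class "clist". Its data
element contains the correlation values and associated statistical
information for each pair of variables, and its index element holds
the integer positions (i, j) of each pair.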
# Access the correlation data
results$data
#> infer infer.value stat stat.value isig msg
#> 1 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 2 Pearson Correlation -0.1175698 P-value 9.240509e-01 FALSE
#> 3 Pearson Correlation 0.8717538 P-value 5.193337e-48 TRUE
#> 4 Pearson Correlation 0.8179411 P-value 1.162749e-37 TRUE
#> 5 Predictive Power Score 0.5591864 F1_weighted 7.028029e-01 NA
#> 6 Pearson Correlation -0.1175698 P-value 9.240509e-01 FALSE
#> 7 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 8 Pearson Correlation -0.4284401 P-value 1.000000e+00 FALSE
#> 9 Pearson Correlation -0.3661259 P-value 9.999980e-01 FALSE
#> 10 Predictive Power Score 0.3134401 F1_weighted 5.377587e-01 NA
#> 11 Pearson Correlation 0.8717538 P-value 5.193337e-48 TRUE
#> 12 Pearson Correlation -0.4284401 P-value 1.000000e+00 FALSE
#> 13 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 14 Pearson Correlation 0.9628654 P-value 2.337502e-86 TRUE
#> 15 Predictive Power Score 0.9167580 F1_weighted 9.404972e-01 NA
#> 16 Pearson Correlation 0.8179411 P-value 1.162749e-37 TRUE
#> 17 Pearson Correlation -0.3661259 P-value 9.999980e-01 FALSE
#> 18 Pearson Correlation 0.9628654 P-value 2.337502e-86 TRUE
#> 19 Pearson Correlation 1.0000000 P-value 0.000000e+00 TRUE
#> 20 Predictive Power Score 0.9398532 F1_weighted 9.599148e-01 NA
#> 21 Predictive Power Score 0.4075487 MAE 4.076661e-01 NA
#> 22 Predictive Power Score 0.2012876 MAE 2.677963e-01 NA
#> 23 Predictive Power Score 0.7904907 MAE 3.280552e-01 NA
#> 24 Predictive Power Score 0.7561113 MAE 1.608119e-01 NA
#> 25 Cramer's V 1.0000000 P-value 4.997501e-04 TRUE
#> varx vary
#> 1 Sepal.Length Sepal.Length
#> 2 Sepal.Length Sepal.Width
#> 3 Sepal.Length Petal.Length
#> 4 Sepal.Length Petal.Width
#> 5 Sepal.Length Species
#> 6 Sepal.Width Sepal.Length
#> 7 Sepal.Width Sepal.Width
#> 8 Sepal.Width Petal.Length
#> 9 Sepal.Width Petal.Width
#> 10 Sepal.Width Species
#> 11 Petal.Length Sepal.Length
#> 12 Petal.Length Sepal.Width
#> 13 Petal.Length Petal.Length
#> 14 Petal.Length Petal.Width
#> 15 Petal.Length Species
#> 16 Petal.Width Sepal.Length
#> 17 Petal.Width Sepal.Width
#> 18 Petal.Width Petal.Length
#> 19 Petal.Width Petal.Width
#> 20 Petal.Width Species
#> 21 Species Sepal.Length
#> 22 Species Sepal.Width
#> 23 Species Petal.Length
#> 24 Species Petal.Width
#> 25 Species Species
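The long format above can be reshaped into a square matrix when a matrix view is more convenient. The following is a minimal sketch in base R, not a function of the package: it uses a hand-built stand-in (long) whose values are copied from the output above for three of the numerical variables, and the names long and m are illustrative.

```r
# Stand-in for a subset of results$data (values copied from the output above)
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
long <- data.frame(
  varx = rep(vars, each = 3),
  vary = rep(vars, times = 3),
  infer.value = c(
     1.0000000, -0.1175698,  0.8717538,
    -0.1175698,  1.0000000, -0.4284401,
     0.8717538, -0.4284401,  1.0000000
  ),
  stringsAsFactors = FALSE
)

# Empty matrix with variable names on both dimensions
m <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
            dimnames = list(vars, vars))

# Fill by name: rows are vary, columns are varx
m[cbind(long$vary, long$varx)] <- long$infer.value
m
```

With the real object, long would be replaced by results$data. The full matrix need not be symmetric: in the output above, the Predictive Power Score for the pair Sepal.Length and Species (row 5, 0.5591864) differs from the score for Species and Sepal.Length (row 21, 0.4075487).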
3. Filtering Significant Correlations
To focus on significant correlations, filter the results on the isig column, which flags the correlations that are significant at the default p-value threshold of 0.05. Rows computed with the Predictive Power Score have isig set to NA, because they report a model metric (F1_weighted or MAE) instead of a p-value.
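A minimal sketch of such a filter in base R follows. To keep it self-contained, it works on a small stand-in data frame (res) with four rows copied from the output shown earlier; res and significant are illustrative names, not objects created by the package.

```r
# Stand-in for results$data: four rows copied from the output shown earlier
res <- data.frame(
  infer = c("Pearson Correlation", "Pearson Correlation",
            "Predictive Power Score", "Cramer's V"),
  infer.value = c(0.8717538, -0.1175698, 0.5591864, 1.0000000),
  isig = c(TRUE, FALSE, NA, TRUE),
  varx = c("Sepal.Length", "Sepal.Length", "Sepal.Length", "Species"),
  vary = c("Petal.Length", "Sepal.Width", "Species", "Species"),
  stringsAsFactors = FALSE
)

# Keep rows flagged as significant; rows with isig == NA (PPS) are dropped
significant <- res[!is.na(res$isig) & res$isig, ]

# Optionally drop self-correlations (a variable paired with itself)
significant <- significant[significant$varx != significant$vary, ]
significant
```

With the real object, replace res with results$data. The explicit is.na() check matters: indexing with a logical vector that contains NA would otherwise produce rows filled with NA.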
4. Correlation Types
The corrp function allows you to specify different
correlation methods based on the types of variables being compared:
- Numerical-Numerical Correlations: Options include Pearson, MIC, and Dcor.
- Numerical-Categorical Correlations: Options include PPS and MIC.
- Categorical-Categorical Correlations: Options include Cramer’s V and Uncertainty Coefficient.
For example, let’s compute the correlations using different methods for numerical-numerical, numerical-categorical, and categorical-categorical data.
# Example of changing correlation methods
results_custom <- corrp(iris,
  cor.nn = "mic",
  cor.nc = "pps",
  cor.cc = "uncoef",
  verbose = FALSE
)
results_custom$data
#> infer infer.value stat stat.value isig msg
#> 1 Maximal Information Coefficient 0.9994870 P-value 0.0000000 TRUE
#> 2 Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE
#> 3 Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE
#> 4 Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE
#> 5 Predictive Power Score 0.5591864 F1_weighted 0.7028029 NA
#> 6 Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE
#> 7 Maximal Information Coefficient 0.9967831 P-value 0.0000000 TRUE
#> 8 Maximal Information Coefficient 0.4391362 P-value 0.0000000 TRUE
#> 9 Maximal Information Coefficient 0.4354146 P-value 0.0000000 TRUE
#> 10 Predictive Power Score 0.3134401 F1_weighted 0.5377587 NA
#> 11 Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE
#> 12 Maximal Information Coefficient 0.4391362 P-value 0.0000000 TRUE
#> 13 Maximal Information Coefficient 1.0000000 P-value 0.0000000 TRUE
#> 14 Maximal Information Coefficient 0.9182958 P-value 0.0000000 TRUE
#> 15 Predictive Power Score 0.9167580 F1_weighted 0.9404972 NA
#> 16 Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE
#> 17 Maximal Information Coefficient 0.4354146 P-value 0.0000000 TRUE
#> 18 Maximal Information Coefficient 0.9182958 P-value 0.0000000 TRUE
#> 19 Maximal Information Coefficient 0.9995144 P-value 0.0000000 TRUE
#> 20 Predictive Power Score 0.9398532 F1_weighted 0.9599148 NA
#> 21 Predictive Power Score 0.4075487 MAE 0.4076661 NA
#> 22 Predictive Power Score 0.2012876 MAE 0.2677963 NA
#> 23 Predictive Power Score 0.7904907 MAE 0.3280552 NA
#> 24 Predictive Power Score 0.7561113 MAE 0.1608119 NA
#> 25 Uncertainty coefficient 0.9999758 P-value 0.0000000 TRUE
#> varx vary
#> 1 Sepal.Length Sepal.Length
#> 2 Sepal.Length Sepal.Width
#> 3 Sepal.Length Petal.Length
#> 4 Sepal.Length Petal.Width
#> 5 Sepal.Length Species
#> 6 Sepal.Width Sepal.Length
#> 7 Sepal.Width Sepal.Width
#> 8 Sepal.Width Petal.Length
#> 9 Sepal.Width Petal.Width
#> 10 Sepal.Width Species
#> 11 Petal.Length Sepal.Length
#> 12 Petal.Length Sepal.Width
#> 13 Petal.Length Petal.Length
#> 14 Petal.Length Petal.Width
#> 15 Petal.Length Species
#> 16 Petal.Width Sepal.Length
#> 17 Petal.Width Sepal.Width
#> 18 Petal.Width Petal.Length
#> 19 Petal.Width Petal.Width
#> 20 Petal.Width Species
#> 21 Species Sepal.Length
#> 22 Species Sepal.Width
#> 23 Species Petal.Length
#> 24 Species Petal.Width
#> 25 Species Species
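To see which of the three method arguments applies to each pair, the pairs can be classified by column type. The sketch below is an illustration in base R, not the package's own logic: pair_type is a hypothetical helper, and it assumes that any non-numeric column counts as categorical, which may differ from how corrp treats logical or character columns.

```r
# Hypothetical helper: label a pair as "nn", "nc", or "cc"
pair_type <- function(x, y) {
  x_num <- is.numeric(x)
  y_num <- is.numeric(y)
  if (x_num && y_num) {
    "nn"   # numerical-numerical, governed by cor.nn
  } else if (!x_num && !y_num) {
    "cc"   # categorical-categorical, governed by cor.cc
  } else {
    "nc"   # mixed pair, governed by cor.nc
  }
}

# All ordered pairs of columns, in the same order as the output above
vars <- names(iris)
grid <- expand.grid(vary = vars, varx = vars, stringsAsFactors = FALSE)
grid$type <- unname(mapply(
  function(vx, vy) pair_type(iris[[vx]], iris[[vy]]),
  grid$varx, grid$vary
))

table(grid$type)
```

For iris this yields 16 numerical-numerical pairs, 8 mixed pairs, and 1 categorical-categorical pair, which matches the number of rows per method in the output above.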
Advanced Usage
1. Parallel Processing
You can enable parallel processing to speed up the computation,
especially when working with large datasets. Set the n.cores
argument to the number of cores you’d like to use.
# Using 2 cores for parallel processing
results_parallel <- corrp(iris,
  cor.nn = "pearson",
  cor.nc = "pps",
  cor.cc = "cramer",
  n.cores = 2,
  verbose = FALSE
)
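The value passed to n.cores should not exceed what the machine offers. One possible way to pick it, sketched here with the parallel package that ships with R, is to detect the available cores and leave one free. The names available and n_cores are illustrative, and the "leave one core free" rule is a common convention assumed here, not a recommendation from the package.

```r
library(parallel)

available <- detectCores()      # may return NA on some platforms
if (is.na(available)) available <- 1L

# Leave one core free, and never request fewer than one
n_cores <- max(1L, available - 1L)
n_cores
```

The resulting n_cores could then be supplied as n.cores = n_cores in the call above.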
2. Custom Inferences with corr_fun
The corr_fun function can be used directly if you need
finer control over the correlation calculation for specific pairs of
variables. It allows you to specify the variables and methods for
computing the correlation.
# Using corr_fun to compute Pearson correlation between Sepal.Length and Petal.Length
corr_fun(
  iris,
  nx = "Sepal.Length",
  ny = "Petal.Length",
  cor.nn = "pearson",
  verbose = FALSE
)
#> $infer
#> [1] "Pearson Correlation"
#>
#> $infer.value
#> [1] 0.8717538
#>
#> $stat
#> [1] "P-value"
#>
#> $stat.value
#> [1] 5.193337e-48
#>
#> $isig
#> [1] TRUE
#>
#> $msg
#> [1] ""
#>
#> $varx
#> [1] "Sepal.Length"
#>
#> $vary
#> [1] "Petal.Length"
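The Pearson estimate above can be cross-checked with cor.test from base R's stats package. This is a sketch under one assumption: the alternative = "greater" setting is inferred from the printed p-values (the negative correlations in the earlier output have p-values close to 1, which is what a one-sided test would give) and is not stated in the text.

```r
# Cross-check with base R; alternative = "greater" is an assumption
check <- cor.test(iris$Sepal.Length, iris$Petal.Length,
                  method = "pearson", alternative = "greater")

estimate <- unname(check$estimate)
p_value  <- check$p.value

round(estimate, 7)   # compare with $infer.value above
p_value < 0.05       # compare with $isig above
```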
Conclusion
The corrp package provides a simple way to compute
correlations across different types of variables. If you are working
with mixed data, corrp offers a solution for your
correlation analysis needs. By leveraging parallel processing and a C++
implementation, corrp can handle large datasets efficiently.