Introduction

The corrp package provides an efficient way to compute correlations between all pairs of variables in a dataset. It supports several correlation measures for numerical and categorical data, so a mixed-type data frame can be analysed in a single call. In this vignette, we explore how to use corrp for different types of correlations, with both numerical and categorical variables.

Usage Example

We will use the built-in iris dataset, which includes 150 observations of 5 variables: 4 numerical variables and 1 categorical variable (the species of the flower).

# Load the iris dataset
library(corrp)
data(iris)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
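
Because corrp picks a measure per pair of column types, it can help to count the pair types before running it. The following is a base-R sketch and not part of corrp; the names is_num, kind, and pair_kind are illustrative only.

```r
# Flag each column of iris as numerical ("n") or categorical ("c")
is_num <- vapply(iris, is.numeric, logical(1))
kind <- ifelse(is_num, "n", "c")

# Type label for every ordered pair of columns, including self-pairs
pair_kind <- outer(kind, kind, paste0)

# Count how many pairs fall into each type combination
table(pair_kind)
```

For iris this yields 16 numerical-numerical pairs, 8 mixed pairs (4 "nc" and 4 "cn"), and 1 categorical-categorical pair, counting ordered pairs and each variable paired with itself.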

Basic Usage of corrp

1. Computing Correlations with corrp

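The corrp function computes a correlation measure for every pair of variables. You select the measure for each kind of pair through three arguments: cor.nn for numerical-numerical pairs, cor.nc for numerical-categorical pairs, and cor.cc for categorical-categorical pairs.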

# Compute correlations for iris dataset
results <- corrp(iris,
  cor.nn = "pearson", # Correlation for numerical-numerical variables
  cor.nc = "pps", # Correlation for numerical-categorical variables
  cor.cc = "cramer", # Correlation for categorical-categorical variables
  verbose = FALSE
)

results
#> $data
#>                     infer infer.value        stat   stat.value  isig msg
#> 1     Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 2     Pearson Correlation  -0.1175698     P-value 9.240509e-01 FALSE    
#> 3     Pearson Correlation   0.8717538     P-value 5.193337e-48  TRUE    
#> 4     Pearson Correlation   0.8179411     P-value 1.162749e-37  TRUE    
#> 5  Predictive Power Score   0.5591864 F1_weighted 7.028029e-01    NA    
#> 6     Pearson Correlation  -0.1175698     P-value 9.240509e-01 FALSE    
#> 7     Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 8     Pearson Correlation  -0.4284401     P-value 1.000000e+00 FALSE    
#> 9     Pearson Correlation  -0.3661259     P-value 9.999980e-01 FALSE    
#> 10 Predictive Power Score   0.3134401 F1_weighted 5.377587e-01    NA    
#> 11    Pearson Correlation   0.8717538     P-value 5.193337e-48  TRUE    
#> 12    Pearson Correlation  -0.4284401     P-value 1.000000e+00 FALSE    
#> 13    Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 14    Pearson Correlation   0.9628654     P-value 2.337502e-86  TRUE    
#> 15 Predictive Power Score   0.9167580 F1_weighted 9.404972e-01    NA    
#> 16    Pearson Correlation   0.8179411     P-value 1.162749e-37  TRUE    
#> 17    Pearson Correlation  -0.3661259     P-value 9.999980e-01 FALSE    
#> 18    Pearson Correlation   0.9628654     P-value 2.337502e-86  TRUE    
#> 19    Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 20 Predictive Power Score   0.9398532 F1_weighted 9.599148e-01    NA    
#> 21 Predictive Power Score   0.4075487         MAE 4.076661e-01    NA    
#> 22 Predictive Power Score   0.2012876         MAE 2.677963e-01    NA    
#> 23 Predictive Power Score   0.7904907         MAE 3.280552e-01    NA    
#> 24 Predictive Power Score   0.7561113         MAE 1.608119e-01    NA    
#> 25             Cramer's V   1.0000000     P-value 4.997501e-04  TRUE    
#>            varx         vary
#> 1  Sepal.Length Sepal.Length
#> 2  Sepal.Length  Sepal.Width
#> 3  Sepal.Length Petal.Length
#> 4  Sepal.Length  Petal.Width
#> 5  Sepal.Length      Species
#> 6   Sepal.Width Sepal.Length
#> 7   Sepal.Width  Sepal.Width
#> 8   Sepal.Width Petal.Length
#> 9   Sepal.Width  Petal.Width
#> 10  Sepal.Width      Species
#> 11 Petal.Length Sepal.Length
#> 12 Petal.Length  Sepal.Width
#> 13 Petal.Length Petal.Length
#> 14 Petal.Length  Petal.Width
#> 15 Petal.Length      Species
#> 16  Petal.Width Sepal.Length
#> 17  Petal.Width  Sepal.Width
#> 18  Petal.Width Petal.Length
#> 19  Petal.Width  Petal.Width
#> 20  Petal.Width      Species
#> 21      Species Sepal.Length
#> 22      Species  Sepal.Width
#> 23      Species Petal.Length
#> 24      Species  Petal.Width
#> 25      Species      Species
#> 
#> $index
#>    i j
#> 1  1 1
#> 2  2 1
#> 3  3 1
#> 4  4 1
#> 5  5 1
#> 6  1 2
#> 7  2 2
#> 8  3 2
#> 9  4 2
#> 10 5 2
#> 11 1 3
#> 12 2 3
#> 13 3 3
#> 14 4 3
#> 15 5 3
#> 16 1 4
#> 17 2 4
#> 18 3 4
#> 19 4 4
#> 20 5 4
#> 21 1 5
#> 22 2 5
#> 23 3 5
#> 24 4 5
#> 25 5 5
#> 
#> attr(,"class")
#> [1] "clist" "list"
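
As a sanity check, individual Pearson rows can be compared against base R's cor.test. The sketch below assumes the p-values above come from a one-sided test with alternative = "greater". That assumption is inferred from the printed values (negative correlations show p-values close to 1) and is not taken from the corrp documentation.

```r
# Row 3 above: Sepal.Length vs Petal.Length (positive correlation)
ct_pos <- cor.test(iris$Sepal.Length, iris$Petal.Length,
  method = "pearson", alternative = "greater"
)
r_pos <- unname(ct_pos$estimate) # Pearson r
p_pos <- ct_pos$p.value          # one-sided p-value

# Row 2 above: Sepal.Length vs Sepal.Width (negative correlation)
ct_neg <- cor.test(iris$Sepal.Length, iris$Sepal.Width,
  method = "pearson", alternative = "greater"
)
r_neg <- unname(ct_neg$estimate)
p_neg <- ct_neg$p.value

# Compare these with infer.value and stat.value in the table above
round(c(r_pos = r_pos, r_neg = r_neg), 7)
c(p_pos = p_pos, p_neg = p_neg)
```

Under a one-sided "greater" alternative, a negative correlation is never flagged as significant, however strong it is. This is worth keeping in mind when reading the isig column.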

2. Exploring the Results

The result returned by corrp is an object of class "clist", which contains the correlation values and associated statistical information.

# Access the correlation data
results$data
#>                     infer infer.value        stat   stat.value  isig msg
#> 1     Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 2     Pearson Correlation  -0.1175698     P-value 9.240509e-01 FALSE    
#> 3     Pearson Correlation   0.8717538     P-value 5.193337e-48  TRUE    
#> 4     Pearson Correlation   0.8179411     P-value 1.162749e-37  TRUE    
#> 5  Predictive Power Score   0.5591864 F1_weighted 7.028029e-01    NA    
#> 6     Pearson Correlation  -0.1175698     P-value 9.240509e-01 FALSE    
#> 7     Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 8     Pearson Correlation  -0.4284401     P-value 1.000000e+00 FALSE    
#> 9     Pearson Correlation  -0.3661259     P-value 9.999980e-01 FALSE    
#> 10 Predictive Power Score   0.3134401 F1_weighted 5.377587e-01    NA    
#> 11    Pearson Correlation   0.8717538     P-value 5.193337e-48  TRUE    
#> 12    Pearson Correlation  -0.4284401     P-value 1.000000e+00 FALSE    
#> 13    Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 14    Pearson Correlation   0.9628654     P-value 2.337502e-86  TRUE    
#> 15 Predictive Power Score   0.9167580 F1_weighted 9.404972e-01    NA    
#> 16    Pearson Correlation   0.8179411     P-value 1.162749e-37  TRUE    
#> 17    Pearson Correlation  -0.3661259     P-value 9.999980e-01 FALSE    
#> 18    Pearson Correlation   0.9628654     P-value 2.337502e-86  TRUE    
#> 19    Pearson Correlation   1.0000000     P-value 0.000000e+00  TRUE    
#> 20 Predictive Power Score   0.9398532 F1_weighted 9.599148e-01    NA    
#> 21 Predictive Power Score   0.4075487         MAE 4.076661e-01    NA    
#> 22 Predictive Power Score   0.2012876         MAE 2.677963e-01    NA    
#> 23 Predictive Power Score   0.7904907         MAE 3.280552e-01    NA    
#> 24 Predictive Power Score   0.7561113         MAE 1.608119e-01    NA    
#> 25             Cramer's V   1.0000000     P-value 4.997501e-04  TRUE    
#>            varx         vary
#> 1  Sepal.Length Sepal.Length
#> 2  Sepal.Length  Sepal.Width
#> 3  Sepal.Length Petal.Length
#> 4  Sepal.Length  Petal.Width
#> 5  Sepal.Length      Species
#> 6   Sepal.Width Sepal.Length
#> 7   Sepal.Width  Sepal.Width
#> 8   Sepal.Width Petal.Length
#> 9   Sepal.Width  Petal.Width
#> 10  Sepal.Width      Species
#> 11 Petal.Length Sepal.Length
#> 12 Petal.Length  Sepal.Width
#> 13 Petal.Length Petal.Length
#> 14 Petal.Length  Petal.Width
#> 15 Petal.Length      Species
#> 16  Petal.Width Sepal.Length
#> 17  Petal.Width  Sepal.Width
#> 18  Petal.Width Petal.Length
#> 19  Petal.Width  Petal.Width
#> 20  Petal.Width      Species
#> 21      Species Sepal.Length
#> 22      Species  Sepal.Width
#> 23      Species Petal.Length
#> 24      Species  Petal.Width
#> 25      Species      Species
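
The data element is in long format, with one row per ordered pair. If a square matrix is easier to read, the long table can be reshaped with base R. The sketch below works on a hand-copied excerpt of the rows above (rows 1, 3, 11, and 13) so that it runs without corrp; the names long, vars, and wide are illustrative.

```r
# Hand-copied excerpt of results$data: two variables, four ordered pairs
long <- data.frame(
  varx = c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length"),
  vary = c("Sepal.Length", "Petal.Length", "Sepal.Length", "Petal.Length"),
  infer.value = c(1.0000000, 0.8717538, 0.8717538, 1.0000000),
  stringsAsFactors = FALSE
)

# Empty square matrix with variable names on both dimensions
vars <- unique(long$varx)
wide <- matrix(NA_real_,
  nrow = length(vars), ncol = length(vars),
  dimnames = list(vars, vars)
)

# Fill each cell by (row name, column name) using a two-column character index
wide[cbind(long$varx, long$vary)] <- long$infer.value
wide
```

The same indexing works on the full results$data table. Measures such as the Predictive Power Score are not symmetric, so the resulting matrix is generally not symmetric either.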

3. Filtering Significant Correlations

To focus on significant correlations, you can filter the results based on significance or another criterion. Here, we filter the results for all correlations that are significant according to the default p-value threshold of 0.05.

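# Filter significant correlations (p-value < 0.05)
# isig is a logical column, so test it element-wise; isTRUE(isig) would return a
# single FALSE for the whole column and select no rows.
# isig is NA for Predictive Power Score rows, so NA is excluded explicitly.
significant_results <- subset(results$data, !is.na(isig) & isig)
# Keeps the 11 rows flagged TRUE in results$data above
significant_results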
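
Note that isig is NA for the Predictive Power Score rows, so any row filter on it has to work element-wise and handle NA. A small base-R sketch on a toy vector (the name flag is illustrative) shows the difference between isTRUE, which tests for a single TRUE, and an element-wise test:

```r
# Toy stand-in for the isig column: TRUE, FALSE, and NA values
flag <- c(TRUE, FALSE, NA, TRUE)

isTRUE(flag)        # FALSE: flag is not a length-one TRUE
!is.na(flag) & flag # TRUE FALSE FALSE TRUE
which(flag)         # 1 4 (which() skips NA)
```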

4. Correlation Types

The corrp function allows you to specify different correlation methods based on the types of variables being compared:

  • Numerical-Numerical Correlations: Options include Pearson, MIC, and Dcor.
  • Numerical-Categorical Correlations: Options include PPS and MIC.
  • Categorical-Categorical Correlations: Options include Cramer’s V and Uncertainty Coefficient.

For example, let’s compute the correlations using different methods for numerical-numerical, numerical-categorical, and categorical-categorical data.

# Example of changing correlation methods
results_custom <- corrp(iris,
  cor.nn = "mic",
  cor.nc = "pps",
  cor.cc = "uncoef",
  verbose = FALSE
)
results_custom$data
#>                              infer infer.value        stat stat.value isig msg
#> 1  Maximal Information Coefficient   0.9994870     P-value  0.0000000 TRUE    
#> 2  Maximal Information Coefficient   0.2770503     P-value  0.0000000 TRUE    
#> 3  Maximal Information Coefficient   0.7682996     P-value  0.0000000 TRUE    
#> 4  Maximal Information Coefficient   0.6683281     P-value  0.0000000 TRUE    
#> 5           Predictive Power Score   0.5591864 F1_weighted  0.7028029   NA    
#> 6  Maximal Information Coefficient   0.2770503     P-value  0.0000000 TRUE    
#> 7  Maximal Information Coefficient   0.9967831     P-value  0.0000000 TRUE    
#> 8  Maximal Information Coefficient   0.4391362     P-value  0.0000000 TRUE    
#> 9  Maximal Information Coefficient   0.4354146     P-value  0.0000000 TRUE    
#> 10          Predictive Power Score   0.3134401 F1_weighted  0.5377587   NA    
#> 11 Maximal Information Coefficient   0.7682996     P-value  0.0000000 TRUE    
#> 12 Maximal Information Coefficient   0.4391362     P-value  0.0000000 TRUE    
#> 13 Maximal Information Coefficient   1.0000000     P-value  0.0000000 TRUE    
#> 14 Maximal Information Coefficient   0.9182958     P-value  0.0000000 TRUE    
#> 15          Predictive Power Score   0.9167580 F1_weighted  0.9404972   NA    
#> 16 Maximal Information Coefficient   0.6683281     P-value  0.0000000 TRUE    
#> 17 Maximal Information Coefficient   0.4354146     P-value  0.0000000 TRUE    
#> 18 Maximal Information Coefficient   0.9182958     P-value  0.0000000 TRUE    
#> 19 Maximal Information Coefficient   0.9995144     P-value  0.0000000 TRUE    
#> 20          Predictive Power Score   0.9398532 F1_weighted  0.9599148   NA    
#> 21          Predictive Power Score   0.4075487         MAE  0.4076661   NA    
#> 22          Predictive Power Score   0.2012876         MAE  0.2677963   NA    
#> 23          Predictive Power Score   0.7904907         MAE  0.3280552   NA    
#> 24          Predictive Power Score   0.7561113         MAE  0.1608119   NA    
#> 25         Uncertainty coefficient   0.9999758     P-value  0.0000000 TRUE    
#>            varx         vary
#> 1  Sepal.Length Sepal.Length
#> 2  Sepal.Length  Sepal.Width
#> 3  Sepal.Length Petal.Length
#> 4  Sepal.Length  Petal.Width
#> 5  Sepal.Length      Species
#> 6   Sepal.Width Sepal.Length
#> 7   Sepal.Width  Sepal.Width
#> 8   Sepal.Width Petal.Length
#> 9   Sepal.Width  Petal.Width
#> 10  Sepal.Width      Species
#> 11 Petal.Length Sepal.Length
#> 12 Petal.Length  Sepal.Width
#> 13 Petal.Length Petal.Length
#> 14 Petal.Length  Petal.Width
#> 15 Petal.Length      Species
#> 16  Petal.Width Sepal.Length
#> 17  Petal.Width  Sepal.Width
#> 18  Petal.Width Petal.Length
#> 19  Petal.Width  Petal.Width
#> 20  Petal.Width      Species
#> 21      Species Sepal.Length
#> 22      Species  Sepal.Width
#> 23      Species Petal.Length
#> 24      Species  Petal.Width
#> 25      Species      Species
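
To make one of the categorical-categorical options concrete, Cramer's V can be computed by hand from a chi-squared statistic. The sketch below uses the textbook formula with base R's chisq.test; it illustrates the measure and is not necessarily how corrp computes it internally. The names tab, chi2, and cramers_v are illustrative.

```r
# Contingency table of Species against itself (3 x 3, counts on the diagonal)
tab <- table(iris$Species, iris$Species)

# Pearson chi-squared statistic
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

n <- sum(tab)      # number of observations
k <- min(dim(tab)) # smaller of the number of rows and columns

# Cramer's V = sqrt(chi2 / (n * (k - 1)))
cramers_v <- sqrt(chi2 / (n * (k - 1)))
cramers_v
```

For a variable compared with itself the association is perfect, so the result should equal 1 up to floating-point error.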

Advanced Usage

1. Parallel Processing

You can enable parallel processing to speed up the computation, especially when working with large datasets. Set the n.cores argument to the number of cores you’d like to use.

# Using 2 cores for parallel processing
results_parallel <- corrp(iris,
  cor.nn = "pearson",
  cor.nc = "pps",
  cor.cc = "cramer",
  n.cores = 2,
  verbose = FALSE
)
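
When choosing a value for n.cores, it can be useful to check how many cores the machine reports. The following sketch uses detectCores from the base parallel package; the cap of 2 and the name n_use are arbitrary choices for illustration, not corrp recommendations.

```r
library(parallel)

# Number of cores reported by the system (may be NA on some platforms)
available <- detectCores()

# Leave one core free, never request more than 2, and never fewer than 1
n_use <- max(1L, min(2L, available - 1L, na.rm = TRUE))
n_use
```

The resulting value can then be passed as n.cores in the call above.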

2. Custom Inferences with corr_fun

The corr_fun function can be used directly if you need finer control over the correlation calculation for specific pairs of variables. It allows you to specify the variables and methods for computing the correlation.

# Using corr_fun to compute Pearson correlation between Sepal.Length and Petal.Length
corr_fun(
  iris,
  nx = "Sepal.Length",
  ny = "Petal.Length",
  cor.nn = "pearson",
  verbose = FALSE
)
#> $infer
#> [1] "Pearson Correlation"
#> 
#> $infer.value
#> [1] 0.8717538
#> 
#> $stat
#> [1] "P-value"
#> 
#> $stat.value
#> [1] 5.193337e-48
#> 
#> $isig
#> [1] TRUE
#> 
#> $msg
#> [1] ""
#> 
#> $varx
#> [1] "Sepal.Length"
#> 
#> $vary
#> [1] "Petal.Length"
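
Since corr_fun returns a plain named list, results for several pairs can be stacked into a table like results$data. The sketch below uses two hand-copied results (the values for rows 3 and 14 of the first example) so that it runs without corrp; res_a, res_b, and stacked are illustrative names.

```r
# Two results in the shape returned by corr_fun, copied from the output above
res_a <- list(
  infer = "Pearson Correlation", infer.value = 0.8717538,
  stat = "P-value", stat.value = 5.193337e-48,
  isig = TRUE, msg = "",
  varx = "Sepal.Length", vary = "Petal.Length"
)
res_b <- list(
  infer = "Pearson Correlation", infer.value = 0.9628654,
  stat = "P-value", stat.value = 2.337502e-86,
  isig = TRUE, msg = "",
  varx = "Petal.Length", vary = "Petal.Width"
)

# Convert each list to a one-row data frame, then bind the rows together
rows <- lapply(list(res_a, res_b), as.data.frame, stringsAsFactors = FALSE)
stacked <- do.call(rbind, rows)
stacked[, c("varx", "vary", "infer.value", "isig")]
```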

Conclusion

The corrp package provides a simple way to compute correlations across different types of variables, which makes it a practical choice for data that mixes numerical and categorical columns. With parallel processing and a C++ implementation, corrp is designed to handle large datasets efficiently.