Compute several types of correlation analysis (Pearson correlation, R^2 coefficient of linear regression, Cramer’s V measure of association, Distance Correlation, the Maximal Information Coefficient, Uncertainty coefficient and Predictive Power Score) on large data frames with mixed column classes (integer, numeric, factor and character) using a parallel R backend. The package also includes a C++ implementation of the Average Correlation Clustering Algorithm (ACCA) that works directly on the correlation matrix. This implementation differs from the original in that it works with mixed data and several correlation methods.
Details
The corrp package, under development by the Meantrix team and originally based on the cor2 function by Srikanth KS (talegari), gives R users a way to run correlation analysis on large data.frames, tibbles or data.tables through a parallel R backend and C++ functions.
The data.frame may have columns of these four classes: integer, numeric, factor and character. A character column is treated as a categorical variable.
In this package the correlation is computed automatically according to the following options:
integer/numeric - factor/categorical pair:
- correlation coefficient, or square root of the R^2 coefficient of linear regression;
- Predictive Power Score.
factor/categorical pair:
- Cramer’s V, a measure of association between two nominal variables;
- Uncertainty coefficient;
- Predictive Power Score.
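To make two of the measures listed above concrete, here is a minimal base R sketch. It is not the package's internal code: the helper cramers_v and the derived column petal_size are hypothetical names introduced only for this illustration.

```r
# Numeric vs. categorical: square root of R^2 from a linear regression
fit <- lm(Sepal.Length ~ Species, data = iris)
r_nc <- sqrt(summary(fit)$r.squared)

# Categorical vs. categorical: Cramer's V from the chi-squared statistic
# (hypothetical helper, not a corrp function)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Hypothetical second categorical variable, built by binning Petal.Length
petal_size <- cut(iris$Petal.Length,
                  breaks = c(0, 2, 5, Inf),
                  labels = c("short", "medium", "long"))
v <- cramers_v(iris$Species, petal_size)

r_nc
v
```

Both values lie between 0 and 1, which is what lets measures computed on different column types sit side by side in one matrix.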
All statistical tests are controlled by the significance level set in the p.value parameter. If a test does not reach significance at the p.value threshold, the output variable isig is FALSE by default. There is no statistical significance test for the pps algorithm, so isig = TRUE in that case. If an error occurs during computation, the correlation is NA by default.
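The significance and error handling described above can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: safe_cor is a hypothetical helper, the threshold comparison uses a two-sided Pearson test, and returning isig = FALSE on error is a choice made for this sketch.

```r
# Hypothetical helper illustrating the isig / NA behaviour (not a corrp function)
safe_cor <- function(x, y, p.value = 0.05) {
  tryCatch({
    test <- cor.test(x, y, method = "pearson")
    list(
      infer.value = unname(test$estimate),
      stat.value  = test$p.value,
      isig        = test$p.value < p.value  # significant only below the threshold
    )
  }, error = function(e) {
    # On any error the correlation is reported as NA
    # (isig = FALSE here is an assumption of this sketch)
    list(infer.value = NA_real_, stat.value = NA_real_, isig = FALSE)
  })
}

ok  <- safe_cor(iris$Sepal.Length, iris$Petal.Length)
bad <- safe_cor(letters[1:5], 1:5)  # non-numeric input makes cor.test fail

ok$isig
bad$infer.value
```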
Installation
Before you begin, ensure you have met the following requirement(s):
- You have R >= 3.6.2 installed.
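One way to check this requirement from an R session is sketched below, using only base R; the variable name r_ok is arbitrary.

```r
# Compare the running R version against the minimum stated above
r_ok <- getRversion() >= "3.6.2"
if (!r_ok) {
  message("R ", getRversion(), " is older than 3.6.2; please upgrade first.")
}
r_ok
```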
Install the development version from GitHub:
library('remotes')
remotes::install_github("meantrix/corrp@main")
Basic Usage
The corrp package provides seven main functions for correlation calculations, clustering and basic data manipulation: corrp, corr_fun, corr_matrix, corr_rm, acca, sil_acca and best_acca.
corrp
Next, we calculate the correlations for the iris data set using the Maximal Information Coefficient for numeric pairs, the Predictive Power Score algorithm for numeric/categorical pairs and the Uncertainty coefficient for categorical pairs.
results = corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps', cor.cc = 'uncoef', n.cores = 2, verbose = FALSE)
head(results$data)
# infer infer.value stat stat.value isig msg varx vary
# Maximal Information Coefficient 0.9994870 P-value 0.0000000 TRUE Sepal.Length Sepal.Length
# Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Length Sepal.Width
# Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE Sepal.Length Petal.Length
# Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE Sepal.Length Petal.Width
# Predictive Power Score 0.5591864 F1_weighted 0.7028029 TRUE Sepal.Length Species
# Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Width Sepal.Length
corr_matrix
Using the previous result we can create a correlation matrix as follows:
m = corr_matrix(results, col = 'infer.value', isig = TRUE)
m
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# Sepal.Length 0.9994870 0.2770503 0.7682996 0.6683281 0.4075487
# Sepal.Width 0.2770503 0.9967831 0.4391362 0.4354146 0.2012876
# Petal.Length 0.7682996 0.4391362 1.0000000 0.9182958 0.7904907
# Petal.Width 0.6683281 0.4354146 0.9182958 0.9995144 0.7561113
# Species 0.5591864 0.3134401 0.9167580 0.9398532 0.9999758
# attr(,"class")
# [1] "cmatrix" "matrix"
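Conceptually, this step pivots a long table of variable pairs into a square matrix. The base R sketch below shows the idea on a small made-up table; the data, the row/column orientation and the use of NA for non-significant pairs are all assumptions of this illustration, and corr_matrix itself may behave differently.

```r
# Hypothetical long-format results reusing the column names shown earlier
long <- data.frame(
  varx        = c("a", "a", "b", "b"),
  vary        = c("a", "b", "a", "b"),
  infer.value = c(1.0, 0.4, 0.3, 1.0),
  isig        = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

vars <- unique(c(long$varx, long$vary))
mat <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))

# Fill only the significant pairs; rows follow varx and columns follow vary
# (an arbitrary orientation chosen for this sketch)
keep <- long$isig
mat[cbind(long$varx[keep], long$vary[keep])] <- long$infer.value[keep]
mat
```

Note that the matrix need not be symmetric: in the iris output above, the Sepal.Length/Species entries differ depending on which variable comes first.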
Now we can cluster the data set variables with ACCA using the correlation matrix. As an example, consider two clusters (k = 2):
acca.res = acca(m, 2)
acca.res
# $cluster1
# [1] "Species" "Sepal.Length" "Petal.Width"
#
# $cluster2
# [1] "Petal.Length" "Sepal.Width"
#
# attr(,"class")
# [1] "acca_list" "list"
We can also calculate the average silhouette width for the clustering acca.res:
sil_acca(acca.res, m)
# [1] -0.02831006
# attr(,"class")
# [1] "corrpstat"
# attr(,"statistic")
# [1] "Silhouette"
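Observations with a large average silhouette width (close to 1) are very well clustered. A value near 0 or below, as in the output above, suggests that the clusters overlap or are poorly separated.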
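For readers unfamiliar with the statistic, here is a minimal base R sketch of an average silhouette width computed from a dissimilarity matrix. It is not the package's C++ implementation: avg_silhouette is a hypothetical helper, the dissimilarity 1 - |correlation| is an assumption of this sketch, singletons are given a silhouette of 0 by convention, and at least two clusters are assumed.

```r
# Hypothetical helper: average silhouette width for clustered variables
# d:        symmetric dissimilarity matrix with row/column names
# clusters: list of character vectors of variable names (at least two clusters)
avg_silhouette <- function(d, clusters) {
  vars  <- unlist(clusters, use.names = FALSE)
  label <- rep(seq_along(clusters), lengths(clusters))
  s <- vapply(seq_along(vars), function(i) {
    own <- vars[label == label[i] & vars != vars[i]]
    if (length(own) == 0) return(0)  # singleton cluster convention
    a <- mean(d[vars[i], own])       # mean distance within own cluster
    others <- setdiff(unique(label), label[i])
    b <- min(vapply(others, function(k) mean(d[vars[i], vars[label == k]]),
                    numeric(1)))     # mean distance to nearest other cluster
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Example on the numeric iris columns with Pearson correlation
d <- 1 - abs(cor(iris[, 1:4]))
clusters <- list(c("Sepal.Length", "Petal.Length", "Petal.Width"), "Sepal.Width")
sil <- avg_silhouette(d, clusters)
sil
```

Each variable's silhouette compares its mean dissimilarity to its own cluster (a) with that to the nearest other cluster (b), so the average always falls between -1 and 1.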
Contributing to corrp
To contribute to corrp, follow these steps:
- Fork this repository.
- Create a branch: git checkout -b <branch_name>
- Make your changes and commit them: git commit -m '<commit_message>'
- Push to the original branch: git push origin corrp/<location>
- Create the pull request.
Alternatively see the GitHub documentation on creating a pull request.
Bug Reports
If you have detected a bug (or want to ask for a new feature), please file an issue with a minimal reproducible example on GitHub.
License
This project uses the following license: GPL-3 License.