| Title: | Weighted Clustering of Quantitative and Categorical Variables |
|---|---|
| Description: | Methods for weighted hierarchical clustering of variables. Variables may be quantitative, categorical, or a mixture of both. Following the selection of a cutpoint, the resulting clusters may be summarized by their first principal component score for the purpose of dimensionality reduction. |
| Authors: | Brett Klamer [aut, cre] |
| Maintainer: | Brett Klamer <[email protected]> |
| License: | GPL-3 |
| Version: | 0.0.1 |
| Built: | 2026-05-18 06:05:23 UTC |
| Source: | https://bitbucket.org/bklamer/vclust |
Performs ascendant hierarchical clustering of variables for quantitative, categorical, or mixed datasets with row weights.
For each candidate cluster, a quantitative synthetic variable is computed as the first component of a weighted factorial method (FactoMineR::FAMD() for mixed data, FactoMineR::PCA() for purely quantitative data, FactoMineR::MCA() for purely categorical data).
The homogeneity of a cluster is defined as the sum, over variables in the cluster, of weighted squared Pearson correlations (for quantitative variables) and weighted squared correlation ratios () (for categorical variables) with the synthetic variable.
Clusters are agglomerated by minimizing the loss of homogeneity when two clusters are merged, reproducing the methods of ClustOfVar::hclustvar().
build_tree(data, weights = NULL, monotone = FALSE)build_tree(data, weights = NULL, monotone = FALSE)
data |
(data.frame) |
weights |
(Scalar character or |
monotone |
(Scalar logical: |
If missing values are present in your data, you must either impute them or subset to complete cases before analysis. For example,
data_cc <- data[complete.cases(data), ]
If variables with zero-variance or less than two factor levels are present in your data, you must exclude them before analysis. For example,
w <- data$weights
valid_vars <- function(x) {
if(is.numeric(x)) {
vclust:::weighted_var(x, w) > .Machine$double.eps
} else if(is.factor(x) || is.character(x) || is.logical(x)) {
nlevels(droplevels(as.factor(x))) > 1
} else {
FALSE
}
}
are_valid <- vapply(data, valid_vars, logical(1))
data_valid <- data[are_valid]
Let be quantitative variables and be categorical variables measured on observations, with .
Denote a candidate cluster of variables as .
A partition groups variables into clusters.
For a given cluster , a quantitative synthetic variable is defined as the variable most related to all variables in :
where is the weighted squared Pearson correlation and is the weighted squared correlation ratio (the fraction of the variance of explained by the categories of ).
In the unweighted ClustOfVar::hclustvar() method, is the first component of PCAmixdata::PCAmix() on the variables in , and the homogeneity of the cluster equals the first eigenvalue of PCAmixdata::PCAmix().
In this weighted extension, we replace PCAmixdata::PCAmix() using FactoMineR:
FactoMineR::FAMD for mixed clusters,
FactoMineR::PCA for purely quantitative clusters,
FactoMineR::MCA for purely categorical clusters,
each with row weights. The homogeneity is computed as
where is the first‑axis score (res$ind$coord[,1L]) of the corresponding weighted FactoMineR result.
Let denote row weights for , normalized to sum to one: .
The weighted mean, variance, and covariance are
and the weighted Pearson correlation is
For a numeric variable and a factor with groups indexed by , the weighted squared correlation ratio is
where , , and .
Following ClustOfVar::hclustvar(), the dissimilarity between two clusters and is the loss of homogeneity when merging:
At each step, the algorithm merges the pair with the smallest .
The dendrogram height recorded for the new node is .
A dendrogram inversion occurs when a merge step records a smaller dissimilarity than a preceding merge step.
In the dendrogram, this appears as a branch that merges at a lower height than its children branches.
Generally, hierarchical clustering should be defined by non-decreasing distances as the algorithm progresses and combines larger clusters.
For plotting stability, a base::cummax() adjustment (via monotone = TRUE) is used to enforce monotone heights.
This adjusts the recorded heights for plotting but does not change the merge order, cluster memberships, or variable ordering.
Dendrogram inversions are rare in practice.
A list of class "build_tree" with elements:
merge: Integer matrix of size describing the merge history in hclust convention.
Row i of merge describes the merging of clusters at step i of the clustering.
If an element j in the row is negative, then variable -j was merged at this stage.
If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm.
Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
height: Numeric vector of length with dissimilarities recorded at each merge.
order: A vector giving the permutation of the original variables suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches.
labels: Character vector of variable names.
clusmat: Integer matrix giving cluster memberships of variables at each level (columns correspond to clusters, where column j gives the cluster assignment when the tree is cut into j clusters).
call: The matched call.
data: The data frame of variables used for clustering, including the weight column (if any).
The attribute ".row.weights" stores the weight column name.
method: "build_tree"
The attribute ".row.weights" stores the weight column name.
Chavent M, Kuentz-Simonet V, Liquet B, Saracco J (2012). “ClustOfVar: An R Package for the Clustering of Variables.” Journal of Statistical Software, 50(13), 1–16. doi:10.18637/jss.v050.i13. https://www.jstatsoft.org/index.php/jss/article/view/v050i13.
Lê S, Josse J, Husson F (2008). “FactoMineR: A Package for Multivariate Analysis.” Journal of Statistical Software, 25(1), 1–18. doi:10.18637/jss.v025.i01.
#---------------------------------------------------------------------------- # build_tree() examples #---------------------------------------------------------------------------- library(vclust) # Example 1: Quantitative variables with equal weights set.seed(42) n <- 60 data <- data.frame( col1 = rnorm(n), col2 = rnorm(n), col3 = rnorm(n), col4 = rnorm(n), weights = rep(1, n) ) tree <- build_tree(data = data, weights = "weights") plot(tree) rect.hclust(tree = tree, k = 2) head(tree$merge) head(tree$height) # Example 2: Mixed data with equal weights set.seed(70) n <- 80 X <- data.frame( col1 = rnorm(n), col2 = rnorm(n) + 0.6*rnorm(n) ) Z <- data.frame( col3 = factor(sample(letters[1:3], n, TRUE)), col4 = factor(sample(c("L","M","H"), n, TRUE)) ) w <- rep(1, n) # equal weights data <- cbind(X, Z, w) tree_mix <- build_tree(data = data, weights = "w") plot(tree_mix) rect.hclust(tree = tree_mix, k = 2) head(tree_mix$merge) head(tree_mix$height) # Example 3: Mixed data with unequal weights set.seed(70) w <- rexp(n) # positive, exponential weights data <- cbind(X, Z, w) tree_mix <- build_tree(data = data, weights = "w") plot(tree_mix) rect.hclust(tree = tree_mix, k = 2) head(tree_mix$merge) head(tree_mix$height)#---------------------------------------------------------------------------- # build_tree() examples #---------------------------------------------------------------------------- library(vclust) # Example 1: Quantitative variables with equal weights set.seed(42) n <- 60 data <- data.frame( col1 = rnorm(n), col2 = rnorm(n), col3 = rnorm(n), col4 = rnorm(n), weights = rep(1, n) ) tree <- build_tree(data = data, weights = "weights") plot(tree) rect.hclust(tree = tree, k = 2) head(tree$merge) head(tree$height) # Example 2: Mixed data with equal weights set.seed(70) n <- 80 X <- data.frame( col1 = rnorm(n), col2 = rnorm(n) + 0.6*rnorm(n) ) Z <- data.frame( col3 = factor(sample(letters[1:3], n, TRUE)), col4 = factor(sample(c("L","M","H"), n, TRUE)) ) w <- rep(1, n) # equal weights data <- cbind(X, Z, w) tree_mix <- build_tree(data = data, weights = "w") plot(tree_mix) rect.hclust(tree = tree_mix, k = 2) head(tree_mix$merge) head(tree_mix$height) # Example 3: Mixed data with unequal weights set.seed(70) w <- rexp(n) # positive, exponential weights data <- cbind(X, Z, w) tree_mix <- build_tree(data = data, weights = "w") plot(tree_mix) rect.hclust(tree = tree_mix, k = 2) head(tree_mix$merge) head(tree_mix$height)
Computes a symmetric matrix of Pearson correlations, optionally using row weights.
cor_matrix(data, weights = NULL)cor_matrix(data, weights = NULL)
data |
(data.frame or matrix) |
weights |
(Scalar character or |
Pairwise complete associations are calculated for instances of missing values.
A numeric symmetric matrix with row and column names equal to names(data) (excluding the weight column, if any).
#---------------------------------------------------------------------------- # cor_matrix() examples #---------------------------------------------------------------------------- library(vclust) set.seed(123) n <- 200 df <- data.frame( x = rnorm(n), y = rnorm(n), z = rnorm(n), w = rexp(n, rate = 1) ) # Correlation matrix S1 <- cor_matrix(df, weights = "w") round(S1, 3)#---------------------------------------------------------------------------- # cor_matrix() examples #---------------------------------------------------------------------------- library(vclust) set.seed(123) n <- 200 df <- data.frame( x = rnorm(n), y = rnorm(n), z = rnorm(n), w = rexp(n, rate = 1) ) # Correlation matrix S1 <- cor_matrix(df, weights = "w") round(S1, 3)
Computes a symmetric matrix of squared association measures for a mix of numeric and categorical variables, optionally using row weights.
For numeric–numeric pairs the entry is the weighted squared Pearson correlation .
For numeric–factor pairs the entry is the weighted eta-squared .
For factor–factor pairs the entry is the weighted squared first canonical correlation .
cor_sq_matrix(data, weights = NULL)cor_sq_matrix(data, weights = NULL)
data |
(data.frame) |
weights |
(Scalar character or |
Pairwise complete associations are calculated for instances of missing values.
Character and logical variables in data are coerced to factors.
A numeric symmetric matrix with row and column names equal to names(data) (excluding the weight column, if any).
#---------------------------------------------------------------------------- # cor_sq_matrix() examples #---------------------------------------------------------------------------- library(vclust) set.seed(123) n <- 200 df <- data.frame( x = rnorm(n, 0, 1), y = 0.6 * rnorm(n) + 0.4 * rnorm(n), # another numeric g = factor(sample(letters[1:3], n, replace = TRUE, prob = c(0.4, 0.4, 0.2))), h = factor(sample(c("yes", "no"), n, replace = TRUE)), z = rnorm(n), w = rexp(n, rate = 1) ) # Use all variables S1 <- cor_sq_matrix(df, weights = "w") round(S1, 3) # Select a subset of variables explicitly, unweighted S2 <- cor_sq_matrix(df[c("x", "y", "g", "h")]) round(S2, 3) # Include character and logical columns; they are coerced to factors internally df$char_fac <- sample(c("A", "B", "C"), n, replace = TRUE) df$logi_fac <- sample(c(TRUE, FALSE), n, replace = TRUE) S3 <- cor_sq_matrix(df, weights = "w") round(S3, 3) # Missing values df$logi_fac[1:30] <- NA S4 <- cor_sq_matrix(df, weights = "w") round(S4, 3)#---------------------------------------------------------------------------- # cor_sq_matrix() examples #---------------------------------------------------------------------------- library(vclust) set.seed(123) n <- 200 df <- data.frame( x = rnorm(n, 0, 1), y = 0.6 * rnorm(n) + 0.4 * rnorm(n), # another numeric g = factor(sample(letters[1:3], n, replace = TRUE, prob = c(0.4, 0.4, 0.2))), h = factor(sample(c("yes", "no"), n, replace = TRUE)), z = rnorm(n), w = rexp(n, rate = 1) ) # Use all variables S1 <- cor_sq_matrix(df, weights = "w") round(S1, 3) # Select a subset of variables explicitly, unweighted S2 <- cor_sq_matrix(df[c("x", "y", "g", "h")]) round(S2, 3) # Include character and logical columns; they are coerced to factors internally df$char_fac <- sample(c("A", "B", "C"), n, replace = TRUE) df$logi_fac <- sample(c(TRUE, FALSE), n, replace = TRUE) S3 <- cor_sq_matrix(df, weights = "w") round(S3, 3) # Missing values df$logi_fac[1:30] <- NA S4 <- cor_sq_matrix(df, weights = "w") round(S4, 3)
build_tree()
Cuts a tree produced by build_tree() into k clusters, recomputing for each cluster the weighted synthetic variable (first component score).
cut_tree(x, k, cor_matrix = FALSE)cut_tree(x, k, cor_matrix = FALSE)
x |
An object returned by |
k |
(Scalar integer) |
cor_matrix |
(Scalar logical: |
A list with elements:
scores: data frame of synthetic variables for each cluster (first principal component).
scores_cor_matrix: Pearson correlation matrix for the synthetic variables.
vars: A four-column data frame:
cluster: Cluster membership
variable: Variable names
squared_association: If numeric, the weighted squared Pearson correlations with the cluster PC1. If factor, the weighted squared correlations with the cluster PC1.
association: If numeric, weighted Pearson correlation with the cluster PC1. If factor, the weighted correlations with the cluster PC1.
vars_cor_matrix: List of squared association matrices among variables within each cluster.
Similarity for numeric:numeric is squared Pearson's correlation ; numeric:categorical is the correlation ratio ; categorical:categorical is the squared canonical correlation between their dummy sets.
cluster: Integer vector (length ) assigning each variable to a cluster.
H: Homogeneity for each cluster, defined as the sum of squared associations with the cluster's first principal component.
Hprop: Proportion of total homogeneity accounted for by the partition . The gain in cohesion.
Defined as .
Where is the chosen cluster partition (K=k), is a single cluster partition (K=1), and is the singleton cluster partition (K=p).
size: Number of variables in each cluster.
k: The chosen number of clusters.
call: The matched call.
method: "cut_tree"
#---------------------------------------------------------------------------- # cut_tree() examples #---------------------------------------------------------------------------- library(vclust) # Example 1: Quantitative variables with equal weights set.seed(42) n <- 60 data <- data.frame( col1 = rnorm(n), col2 = rnorm(n), col3 = rnorm(n), col4 = rnorm(n), weights = rep(1, n) ) tree <- build_tree(data = data, weights = "weights") plot(tree) rect.hclust(tree = tree, k = 2) cut <- cut_tree(tree, k = 2) cut # Example 2: Mixed data with equal weights set.seed(70) n <- 80 data <- data.frame( col1 = rnorm(n), col2 = rnorm(n) + 0.6*rnorm(n), col3 = factor(sample(letters[1:3], n, TRUE)), col4 = factor(sample(c("L","M","H"), n, TRUE)), w = rep(1, n) ) tree2 <- build_tree(data = data, weights = "w") plot(tree2) rect.hclust(tree = tree2, k = 2) cut2 <- cut_tree(tree2, k = 2) cut2 # Example 3: Mixed data with unequal weights data$w <- rexp(n) tree3 <- build_tree(data = data, weights = "w") plot(tree3) rect.hclust(tree = tree3, k = 2) cut3 <- cut_tree(tree3, k = 2) cut3#---------------------------------------------------------------------------- # cut_tree() examples #---------------------------------------------------------------------------- library(vclust) # Example 1: Quantitative variables with equal weights set.seed(42) n <- 60 data <- data.frame( col1 = rnorm(n), col2 = rnorm(n), col3 = rnorm(n), col4 = rnorm(n), weights = rep(1, n) ) tree <- build_tree(data = data, weights = "weights") plot(tree) rect.hclust(tree = tree, k = 2) cut <- cut_tree(tree, k = 2) cut # Example 2: Mixed data with equal weights set.seed(70) n <- 80 data <- data.frame( col1 = rnorm(n), col2 = rnorm(n) + 0.6*rnorm(n), col3 = factor(sample(letters[1:3], n, TRUE)), col4 = factor(sample(c("L","M","H"), n, TRUE)), w = rep(1, n) ) tree2 <- build_tree(data = data, weights = "w") plot(tree2) rect.hclust(tree = tree2, k = 2) cut2 <- cut_tree(tree2, k = 2) cut2 # Example 3: Mixed data with unequal weights data$w <- rexp(n) tree3 <- build_tree(data = data, weights = "w") plot(tree3) rect.hclust(tree = tree3, k = 2) cut3 <- cut_tree(tree3, k = 2) cut3
Compute a low‑dimensional representation of a data frame that may contain both numeric and categorical variables.
The routine dispatches to FactoMineR::PCA(), FactoMineR::MCA(), or FactoMineR::FAMD() based on the variable types present.
famd(data, weights = NULL, ncp = 1L)famd(data, weights = NULL, ncp = 1L)
data |
(data.frame) |
weights |
(Scalar character or |
ncp |
(Scalar integer) |
If missing values are present in your data, you must either impute them or subset to complete cases before analysis. For example,
data_cc <- data[complete.cases(data), ]
If variables with zero-variance or less than two factor levels are present in your data, you must exclude them before analysis. For example,
w <- data$weights
valid_vars <- function(x) {
if(is.numeric(x)) {
vclust:::weighted_var(x, w) > .Machine$double.eps
} else if(is.factor(x) || is.character(x) || is.logical(x)) {
nlevels(droplevels(as.factor(x))) > 1
} else {
FALSE
}
}
are_valid <- vapply(data, valid_vars, logical(1))
data_valid <- data[are_valid]
Character and logical columns are coerced to factors, and factors have unused levels dropped. Observation weights are validated to be numeric, finite, non‑negative, and with positive sum.
Numeric variables are centered and standardized. Factor variables are encoded as a set of indicator variables and scaled so that the total contribution of that factor variable is comparable to a single quantitative variable. FAMD can be seen as a PCA performed on a concatenation of standardized numeric variables and properly scaled indicator variables for factor variables so that each original variable contributes equally.
An object returned by the corresponding FactoMineR function, of class "FAMD", "PCA", or "MCA" depending on the input variable types.
#---------------------------------------------------------------------------- # famd() examples #---------------------------------------------------------------------------- library(vclust) # Mixed data example with weights set.seed(123) n <- 100 df <- data.frame( x1 = rnorm(n), x2 = rnorm(n, mean = 2), grp = factor(sample(letters[1:3], n, replace = TRUE)), w = rexp(n) ) res_mix <- famd(df, weights = "w", ncp = 3) res_mix$eig # individual scores head(res_mix$ind$coord) # Numeric‑only (PCA) res_pca <- famd(iris[1:4], ncp = 2) res_pca$eig # Factor‑only (MCA) toy <- data.frame( a = factor(sample(c("yes", "no"), 50, TRUE)), b = factor(sample(c("L", "M", "H"), 50, TRUE)) ) res_mca <- famd(toy, ncp = 2) res_mca$eig#---------------------------------------------------------------------------- # famd() examples #---------------------------------------------------------------------------- library(vclust) # Mixed data example with weights set.seed(123) n <- 100 df <- data.frame( x1 = rnorm(n), x2 = rnorm(n, mean = 2), grp = factor(sample(letters[1:3], n, replace = TRUE)), w = rexp(n) ) res_mix <- famd(df, weights = "w", ncp = 3) res_mix$eig # individual scores head(res_mix$ind$coord) # Numeric‑only (PCA) res_pca <- famd(iris[1:4], ncp = 2) res_pca$eig # Factor‑only (MCA) toy <- data.frame( a = factor(sample(c("yes", "no"), 50, TRUE)), b = factor(sample(c("L", "M", "H"), 50, TRUE)) ) res_mca <- famd(toy, ncp = 2) res_mca$eig
Dendrogram of the hierarchical cluster of variables resulting from build_tree().
## S3 method for class 'build_tree' plot(x, type = "tree", sub = "", ...)## S3 method for class 'build_tree' plot(x, type = "tree", sub = "", ...)
x |
An object returned by |
type |
(Scalar character) |
sub |
(Scalar character) |
... |
Additional arguments passed to |
Compute the first canonical correlation coefficient between two categorical variables via their indicator (dummy) sets.
The coefficient equals the largest singular value of the standardized cross-covariance between the indicator sets and lies in [0, 1].
For a table, this reduces to the absolute coefficient and Cramer's .
weighted_cancor(x, y, w = NULL) weighted_cancor_sq(x, y, w = NULL)weighted_cancor(x, y, w = NULL) weighted_cancor_sq(x, y, w = NULL)
x |
(factor) |
y |
(factor) |
w |
(numeric or |
Let and denote categorical variables, and let be case weights (internally normalized to sum to one).
Denote the weighted joint distribution by
with
and marginals
The covariance blocks for the indicator sets are
The first canonical correlation is then
where is the largest singular value and the inverses are taken on the support of the positive eigenvalues.
This equals the dominant singular value in correspondence analysis of the contingency table built from and , i.e., the square root of the largest principal inertia.
The helper weighted_cancor_sq() returns , the weighted squared first canonical correlation coefficient, which takes values in .
weighted_cancor(): Scalar numeric in or NA_real_.
weighted_cancor_sq(): Scalar numeric or NA_real_.
Returns NA_real_ if all observations are removed during filtering or the total weight is nonpositive.
Returns 0 if all mass lies in a single level of either variable or if numerical degeneracy prevents computation.
#---------------------------------------------------------------------------- # weighted_cancor() examples #---------------------------------------------------------------------------- library(vclust) # 2x2 example: equals |phi| and Cramer's V x <- factor(rep(c("A", "B"), times = c(30, 70))) y <- factor(c(rep("U", 25), rep("V", 5), # mostly A-U rep("U", 20), rep("V", 50))) # mostly B-V weighted_cancor(x, y) # Compare to |phi| computed from weighted proportions tab <- prop.table(table(x, y)) px <- rowSums(tab) py <- colSums(tab) phi <- (tab[1,1]*tab[2,2] - tab[1,2]*tab[2,1]) / sqrt(px[1]*px[2]*py[1]*py[2]) phi <- as.numeric(phi) c(cc = weighted_cancor(x, y), abs_phi = abs(phi)) # Rows with negative weights and missing values filtered. set.seed(1) xw <- factor(sample(letters[1:3], 100, replace = TRUE)) yw <- factor(sample(LETTERS[1:4], 100, replace = TRUE)) w <- runif(100) w[c(5, 10)] <- NA_real_ w[20] <- -1 # dropped xw[30] <- NA # dropped yw[40] <- NA # dropped weighted_cancor(xw, yw, w = w) # Larger table where the value equals the dominant CA singular value x3 <- factor(sample(paste0("x", 1:5), 500, TRUE)) y3 <- factor(sample(paste0("y", 1:6), 500, TRUE)) weighted_cancor(x3, y3) # Degenerate cases # Single observed level in x after filtering weighted_cancor(factor(rep("only", 10)), factor(sample(c("a","b"), 10, TRUE))) # No usable data weighted_cancor(factor(c(NA, NA)), factor(c(NA, NA)), w = c(1, 1))#---------------------------------------------------------------------------- # weighted_cancor() examples #---------------------------------------------------------------------------- library(vclust) # 2x2 example: equals |phi| and Cramer's V x <- factor(rep(c("A", "B"), times = c(30, 70))) y <- factor(c(rep("U", 25), rep("V", 5), # mostly A-U rep("U", 20), rep("V", 50))) # mostly B-V weighted_cancor(x, y) # Compare to |phi| computed from weighted proportions tab <- prop.table(table(x, y)) px <- rowSums(tab) py <- colSums(tab) phi <- (tab[1,1]*tab[2,2] - tab[1,2]*tab[2,1]) / sqrt(px[1]*px[2]*py[1]*py[2]) phi <- as.numeric(phi) c(cc = weighted_cancor(x, y), abs_phi = abs(phi)) # Rows with negative weights and missing values filtered. set.seed(1) xw <- factor(sample(letters[1:3], 100, replace = TRUE)) yw <- factor(sample(LETTERS[1:4], 100, replace = TRUE)) w <- runif(100) w[c(5, 10)] <- NA_real_ w[20] <- -1 # dropped xw[30] <- NA # dropped yw[40] <- NA # dropped weighted_cancor(xw, yw, w = w) # Larger table where the value equals the dominant CA singular value x3 <- factor(sample(paste0("x", 1:5), 500, TRUE)) y3 <- factor(sample(paste0("y", 1:6), 500, TRUE)) weighted_cancor(x3, y3) # Degenerate cases # Single observed level in x after filtering weighted_cancor(factor(rep("only", 10)), factor(sample(c("a","b"), 10, TRUE))) # No usable data weighted_cancor(factor(c(NA, NA)), factor(c(NA, NA)), w = c(1, 1))
Computes the weighted Pearson correlation coefficient between two numeric vectors.
weighted_cor(x, y, w = NULL) weighted_cor_sq(x, y, w = NULL)weighted_cor(x, y, w = NULL) weighted_cor_sq(x, y, w = NULL)
x |
(numeric) |
y |
(numeric) |
w |
(numeric or |
This implementation uses the population definition of correlation (no Bessel () correction in variance).
Observations with any non-finite value in x, y, or w, or with negative weight, are excluded.
Zero weights are allowed and contribute nothing.
Let denote row weights for and be the sum-to-1 normalized weights.
For the pairwise complete subset, :
The helper weighted_cor_sq() returns , the weighted squared Pearson correlation.
weighted_cor(): Scalar numeric in or NA_real_.
weighted_cor_sq(): Scalar numeric or NA_real_.
Returns NA_real_ if all observations are removed during filtering or the total weight is nonpositive.
Returns 0 if either weighted variance is nonpositive.
#---------------------------------------------------------------------------- # weighted_cor() examples #---------------------------------------------------------------------------- library(vclust) # Basic usage with equal weights x <- c(1, 2, 3, 4, 5) y <- c(2, 1, 4, 3, 5) w <- rep(1, length(x)) weighted_cor(x, y, w) weighted_cor(x, y) # Heavier weight on the last two observations w2 <- c(1, 1, 1, 5, 5) weighted_cor(x, y, w2) weighted_cor_sq(x, y, w2) # Pairwise handling of missing values x_na <- c(1, NA, 3, 4, 5) y_na <- c(2, 1, 4, NA, 5) w_na <- c(1, 1, 1, NA, 1) weighted_cor(x_na, y_na, w_na) # Zero weights exclude observations without removing length alignment w_zero <- c(1, 0, 1, 0, 1) weighted_cor(x, y, w_zero) # Degenerate case: zero variance in x x_const <- c(3, 3, 3, 3) y_any <- c(1, 2, 3, 4) weighted_cor(x_const, y_any)#---------------------------------------------------------------------------- # weighted_cor() examples #---------------------------------------------------------------------------- library(vclust) # Basic usage with equal weights x <- c(1, 2, 3, 4, 5) y <- c(2, 1, 4, 3, 5) w <- rep(1, length(x)) weighted_cor(x, y, w) weighted_cor(x, y) # Heavier weight on the last two observations w2 <- c(1, 1, 1, 5, 5) weighted_cor(x, y, w2) weighted_cor_sq(x, y, w2) # Pairwise handling of missing values x_na <- c(1, NA, 3, 4, 5) y_na <- c(2, 1, 4, NA, 5) w_na <- c(1, 1, 1, NA, 1) weighted_cor(x_na, y_na, w_na) # Zero weights exclude observations without removing length alignment w_zero <- c(1, 0, 1, 0, 1) weighted_cor(x, y, w_zero) # Degenerate case: zero variance in x x_const <- c(3, 3, 3, 3) y_any <- c(1, 2, 3, 4) weighted_cor(x_const, y_any)
associationComputes the weighted and , a measure for the association between a numeric variable and a single categorical variable .
weighted_eta_sq(x, y, w = NULL) weighted_eta(x, y, w = NULL)weighted_eta_sq(x, y, w = NULL) weighted_eta(x, y, w = NULL)
x |
(numeric) |
y |
(factor) |
w |
(numeric or |
Let denote the numeric response for observation , denote the group membership, and be the weight for observation .
Internally, non-finite or , missing , and negative observations are removed pairwise prior to computation.
The weights are normalized to sum to one, i.e., with .
Define group weights and group means for groups with .
The weighted grand mean is .
The between-group sum of squares is
and the total sum of squares is
The weighted eta-squared is then
This quantity lies in and is truncated to that interval for numerical robustness.
Groups with zero total weight are dropped from the SSB computation.
If there are fewer than two nonempty groups, or if is non-finite, the function returns 0.
This implementation corresponds to the standard (fixed-effects, one-way) eta-squared effect size.
The helper weighted_eta() returns , the weighted association, which takes values in .
weighted_eta_sq(): Scalar numeric or NA_real_.
weighted_eta(): Scalar numeric in or NA_real_.
Returns NA_real_ if all observations are removed during filtering or the total weight is nonpositive.
Returns 0 if fewer than two nonempty groups remain or if the weighted total sum of squares is non-finite or effectively zero.
#---------------------------------------------------------------------------- # weighted_eta_sq() examples #---------------------------------------------------------------------------- library(vclust) set.seed(123) n <- 30 x <- rnorm(n, mean = rep(c(0, 0.5, 1), each = n/3), sd = 1) y <- factor(rep(LETTERS[1:3], each = n/3)) # Unweighted eta-squared (all weights equal) weighted_eta_sq(x, y) weighted_eta_sq(x, y, rep_len(1, n)) # With nonnegative weights that upweight group C w <- rep_len(1, n) w[y == "C"] <- 2 weighted_eta_sq(x, y, w) # Rows with NA/Inf in x or w are dropped automatically x_bad <- x x_bad[c(2, 10)] <- NA w_bad <- w w_bad[5] <- Inf weighted_eta_sq(x_bad, y, w_bad) # Degenerate cases ## Single nonempty group after zeroing weights returns 0 w_single <- as.numeric(y == "A") weighted_eta_sq(x, y, w_single) ## All rows removed (e.g., all weights negative) returns NA_real_ w_all_bad <- rep(-1, n) weighted_eta_sq(x, y, w_all_bad)#---------------------------------------------------------------------------- # weighted_eta_sq() examples #---------------------------------------------------------------------------- library(vclust) set.seed(123) n <- 30 x <- rnorm(n, mean = rep(c(0, 0.5, 1), each = n/3), sd = 1) y <- factor(rep(LETTERS[1:3], each = n/3)) # Unweighted eta-squared (all weights equal) weighted_eta_sq(x, y) weighted_eta_sq(x, y, rep_len(1, n)) # With nonnegative weights that upweight group C w <- rep_len(1, n) w[y == "C"] <- 2 weighted_eta_sq(x, y, w) # Rows with NA/Inf in x or w are dropped automatically x_bad <- x x_bad[c(2, 10)] <- NA w_bad <- w w_bad[5] <- Inf weighted_eta_sq(x_bad, y, w_bad) # Degenerate cases ## Single nonempty group after zeroing weights returns 0 w_single <- as.numeric(y == "A") weighted_eta_sq(x, y, w_single) ## All rows removed (e.g., all weights negative) returns NA_real_ w_all_bad <- rep(-1, n) weighted_eta_sq(x, y, w_all_bad)