| Title: | Statistical Modeling Utility Functions |
|---|---|
| Description: | A collection of statistical modeling utility functions. |
| Authors: | Brett Klamer [aut, cre] |
| Maintainer: | Brett Klamer <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1 |
| Built: | 2026-06-04 11:07:12 UTC |
| Source: | https://bitbucket.org/bklamer/bkmodel |
design_bootstrap() creates a lazy design-aware resampling object for bootstrap validation.
It computes the retained row universe from a design_spec() object and creates bootstrap split indices on those rows.
It does not fit a statistical model and does not build a full-data design matrix.
Use design_fold() to materialize one bootstrap split.
Each materialized split fits its design schema on the bootstrap analysis sample only and then encodes the full retained-row assessment set and out-of-bag rows with that analysis schema.
design_bootstrap( spec, data, n_boot = 1000L, strata = NULL, group = NULL, response_type = c("response", "ordinal_counts") )design_bootstrap( spec, data, n_boot = 1000L, strata = NULL, group = NULL, response_type = c("response", "ordinal_counts") )
spec |
(design_spec) |
data |
(data.frame) |
n_boot |
(numeric scalar: |
strata |
(formula: |
group |
(formula: |
response_type |
(Scalar character: |
A typical bootstrap validation workflow starts with design_spec() to declare the design rules.
Call design_bootstrap() after setting the random-number seed to create reproducible bootstrap samples.
Then iterate over seq_along(bs$splits), materialize each split with design_fold(), fit the model on fold$analysis, and evaluate predictions on fold$assessment, fold$oob, or both.
The full retained-row assessment input is useful for bootstrap optimism correction.
The oob input is useful for out-of-bag error estimation.
Keeping split construction separate from fold materialization prevents assessment and out-of-bag rows from determining factor levels, dropped categorical terms, column order, or other fitted design-schema details.
The retained row universe is computed once before bootstrap samples are drawn.
The na_action value stored in spec controls whether rows with missing predictor or response values are kept, removed, or rejected.
Weights and offsets are checked on retained rows using the same rules as design_fit().
Each split stores row identities from the retained universe.
When a split is materialized, row_index values in analysis, assessment, and oob refer to row numbers in the original data.
Assessment and OOB row identities are preserved after split construction so prediction errors can be aligned back to the original data.
Assessment and OOB predictors and responses are still encoded with the bootstrap-analysis schema.
n_boot controls the number of bootstrap replicates.
Each split contains:
analysis: retained row identities drawn with replacement.
Duplicate row identities are expected and their draw order is preserved.
assessment: the full retained-row universe in retained-row order.
oob: retained row identities not present in unique(analysis).
The OOB set can be empty for a bootstrap split.
seed: single positive integer for reproducible model fits on this split.
Bootstrap assignment is over retained rows when group is NULL and over retained groups when group is supplied.
When group is supplied, all rows from a drawn group enter analysis together.
If retained group sizes differ, the length of analysis can vary across bootstrap splits.
Each bootstrap split advances the current random-number generator state.
Call base::set.seed() immediately before design_bootstrap() for reproducible split assignment.
Each split carries a seed, a single positive integer for reproducible model fits on that split.
design_fold() copies this seed onto the materialized fold.
See resample_bootstrap() for the seed recipe and its rationale.
group and strata are optional operational controls for drawing bootstrap samples.
They are evaluated on the retained rows after the design specification computes the resampling row universe.
If the design formula uses ., variables used only for group or strata must not enter the predictor set through . expansion.
Put those variables in case_id, weights, offset, or exclude, or write an explicit predictor formula.
Operational variables may also be named explicitly as predictors when that is the intended statistical design.
Use strata = NULL for unstratified bootstrap sampling.
Use a one-sided formula such as strata = ~ treatment to draw samples within observed covariate levels.
Use strata = ~ 1 to stratify by the raw response.
Response stratification evaluates the response before any response_type conversion.
Each bootstrap stratum must contain at least two sampling units.
See resample_bootstrap() for additional stratification and grouping details.
response_typeresponse_type = "response" returns the ordinary evaluated response in each materialized split.
response_type = "ordinal_counts" returns weighted ordinal-count response matrices in analysis$y, assessment$y, and oob$y.
Ordinal-count splits require a two-sided bare-symbol factor or ordered-factor response with non-missing retained responses.
They also require weights when splits are materialized because the response matrix stores weighted counts.
The response type is fixed when the bootstrap object is created. Every materialized split uses the same response representation for analysis, assessment, and OOB inputs. Create a separate design bootstrap object when a different response representation is needed.
A list of class c("design_bootstrap_resamples", "design_resamples") with elements:
spec: the original design_spec() object.
data: the original analysis data.
row_index: retained row identities used for bootstrap construction.
response_type: response representation used by materialized splits.
response_levels: declared response levels for ordinal-count output, or NULL.
resamples: the underlying index-only bootstrap_resamples object.
splits: a named list of bootstrap split indices.
Each split contains analysis, assessment, oob, and seed.
Use design_fold() to convert a split into model-ready inputs.
design_spec(),
design_fold(),
design_fit(),
design_data(),
design_cv(),
resample_bootstrap(),
base::set.seed()
#---------------------------------------------------------------------------- # Estimate optimism-corrected prediction error #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:36), age = round(rnorm(36, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 18)), stage = factor(rep(c("I", "II", "III"), each = 12)), exposure = runif(36, min = 0.5, max = 2), inverse_prob_weight = runif(36, min = 0.8, max = 1.2), chart_score = rnorm(36) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(36) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) weighted_rmse <- function(observed, predicted, weights) { sqrt(stats::weighted.mean((observed - predicted)^2, weights)) } fit_lm <- function(inputs) { stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) } predict_lm <- function(fit, inputs) { predicted <- drop(inputs$x %*% fit$coefficients) predicted + inputs$offset } full_fit <- design_fit(spec, analysis) full_inputs <- design_data(full_fit) full_model <- fit_lm(full_inputs) apparent_rmse <- weighted_rmse( full_inputs$y, predict_lm(full_model, full_inputs), full_inputs$weights ) set.seed(42L) bs <- design_bootstrap( spec, analysis, n_boot = 25L, strata = ~ risk_group ) optimism <- vapply(seq_along(bs$splits), function(i) { fold <- design_fold(bs, i) model <- fit_lm(fold$analysis) bootstrap_error <- weighted_rmse( fold$analysis$y, predict_lm(model, fold$analysis), fold$analysis$weights ) original_error <- weighted_rmse( fold$assessment$y, predict_lm(model, fold$assessment), fold$assessment$weights ) original_error - bootstrap_error }, numeric(1)) apparent_rmse apparent_rmse + mean(optimism) #---------------------------------------------------------------------------- # Estimate out-of-bag prediction error #---------------------------------------------------------------------------- oob_rmse <- vapply(seq_along(bs$splits), function(i) { fold <- design_fold(bs, i) model <- fit_lm(fold$analysis) weighted_rmse( fold$oob$y, predict_lm(model, fold$oob), fold$oob$weights ) }, numeric(1)) mean(oob_rmse) #---------------------------------------------------------------------------- # Draw repeated patient observations as groups #---------------------------------------------------------------------------- repeated <- data.frame( patient_id = rep(sprintf("P%02d", 1:18), each = 2), visit = rep(c("baseline", "follow_up"), times = 18), treatment = factor(rep(rep(c("control", "active"), each = 9), each = 2)), age = rep(round(rnorm(18, mean = 60, sd = 8)), each = 2), outcome = rnorm(36) ) repeated_spec <- design_spec( outcome ~ age + treatment + visit, case_id = ~ patient_id, encoding = "dummy", intercept = TRUE ) set.seed(7L) grouped_bs <- design_bootstrap( repeated_spec, repeated, n_boot = 10L, group = ~ patient_id, strata = ~ treatment ) grouped_fold <- design_fold(grouped_bs, 1L) grouped_fold$analysis$case_id[duplicated(grouped_fold$analysis$case_id)]#---------------------------------------------------------------------------- # Estimate optimism-corrected prediction error #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:36), age = round(rnorm(36, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 18)), stage = factor(rep(c("I", "II", "III"), each = 12)), exposure = runif(36, min = 0.5, max = 2), inverse_prob_weight = runif(36, min = 0.8, max = 1.2), chart_score = rnorm(36) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(36) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) weighted_rmse <- function(observed, predicted, weights) { sqrt(stats::weighted.mean((observed - predicted)^2, weights)) } fit_lm <- function(inputs) { stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) } predict_lm <- function(fit, inputs) { predicted <- drop(inputs$x %*% fit$coefficients) predicted + inputs$offset } full_fit <- design_fit(spec, analysis) full_inputs <- design_data(full_fit) full_model <- fit_lm(full_inputs) apparent_rmse <- weighted_rmse( full_inputs$y, predict_lm(full_model, full_inputs), full_inputs$weights ) set.seed(42L) bs <- design_bootstrap( spec, analysis, n_boot = 25L, strata = ~ risk_group ) optimism <- vapply(seq_along(bs$splits), function(i) { fold <- design_fold(bs, i) model <- fit_lm(fold$analysis) bootstrap_error <- weighted_rmse( fold$analysis$y, predict_lm(model, fold$analysis), fold$analysis$weights ) original_error <- weighted_rmse( fold$assessment$y, predict_lm(model, fold$assessment), fold$assessment$weights ) original_error - bootstrap_error }, numeric(1)) apparent_rmse apparent_rmse + mean(optimism) #---------------------------------------------------------------------------- # Estimate out-of-bag prediction error #---------------------------------------------------------------------------- oob_rmse <- vapply(seq_along(bs$splits), function(i) { fold <- design_fold(bs, i) model <- fit_lm(fold$analysis) weighted_rmse( fold$oob$y, predict_lm(model, fold$oob), fold$oob$weights ) }, numeric(1)) mean(oob_rmse) #---------------------------------------------------------------------------- # Draw repeated patient observations as groups #---------------------------------------------------------------------------- repeated <- data.frame( patient_id = rep(sprintf("P%02d", 1:18), each = 2), visit = rep(c("baseline", "follow_up"), times = 18), treatment = factor(rep(rep(c("control", "active"), each = 9), each = 2)), age = rep(round(rnorm(18, mean = 60, sd = 8)), each = 2), outcome = rnorm(36) ) repeated_spec <- design_spec( outcome ~ age + treatment + visit, case_id = ~ patient_id, encoding = "dummy", intercept = TRUE ) set.seed(7L) grouped_bs <- design_bootstrap( repeated_spec, repeated, n_boot = 10L, group = ~ patient_id, strata = ~ treatment ) grouped_fold <- design_fold(grouped_bs, 1L) grouped_fold$analysis$case_id[duplicated(grouped_fold$analysis$case_id)]
design_cv() creates a lazy design-aware resampling object for K-fold cross-validation.
It computes the retained row universe from a design_spec() object and creates K-fold split indices on those rows.
It does not fit a statistical model and does not build a full-data design matrix.
Use design_fold() to materialize one split.
Each materialized fold fits its design schema on the analysis rows only and then encodes the assessment rows with that analysis schema.
design_cv( spec, data, n_folds = 5L, n_iters = 1L, strata = NULL, group = NULL, response_type = c("response", "ordinal_counts") )design_cv( spec, data, n_folds = 5L, n_iters = 1L, strata = NULL, group = NULL, response_type = c("response", "ordinal_counts") )
spec |
(design_spec) |
data |
(data.frame) |
n_folds |
(numeric scalar: |
n_iters |
(numeric scalar: |
strata |
(formula: |
group |
(formula: |
response_type |
(Scalar character: |
A typical validation workflow starts with design_spec() to declare the design rules.
Call design_cv() after setting the random-number seed to create reproducible folds.
Then iterate over seq_along(cv$splits), materialize each split with design_fold(), fit the model on fold$analysis, and evaluate predictions on fold$assessment.
Keeping split construction separate from fold materialization prevents assessment rows from determining factor levels, dropped categorical terms, column order, or other fitted design-schema details.
The retained row universe is computed once before folds are assigned.
The na_action value stored in spec controls whether rows with missing predictor or response values are kept, removed, or rejected.
Weights and offsets are checked on retained rows using the same rules as design_fit().
Each split stores row identities from the retained universe.
When a fold is materialized, fold$analysis$row_index and fold$assessment$row_index refer to row numbers in the original data.
Assessment row identities are preserved after split construction so prediction errors can be aligned back to the original data.
Assessment predictors and responses are still encoded with the fold-specific analysis schema.
n_folds controls the number of folds in each K-fold partition.
n_iters controls the number of independent K-fold partitions.
The total number of splits is n_folds * n_iters.
Within each iteration, every retained row appears in exactly one assessment set.
Across different iterations, the same row can appear in different assessment folds.
Call base::set.seed() immediately before design_cv() for reproducible split assignment.
Each split carries a seed, a single positive integer for a reproducible model fit on that split.
design_fold() copies this seed onto the materialized fold.
See resample_cv() for the seeding recipe and its rationale.
group and strata are optional operational controls for assigning folds.
They are evaluated on the retained rows after the design specification computes the resampling row universe.
If the design formula uses ., variables used only for group or strata must not enter the predictor set through . expansion.
Put those variables in case_id, weights, offset, or exclude, or write an explicit predictor formula.
Operational variables may also be named explicitly as predictors when that is the intended statistical design.
Use strata = NULL for unstratified folds.
Use a one-sided formula such as strata = ~ treatment to balance folds by observed covariate levels.
Use strata = ~ 1 to stratify by the raw response.
Response stratification evaluates the response before any response_type conversion.
For ordinal-count workflows with rare levels, strata = ~ 1 can reduce the chance that an analysis fold has an all-zero declared-level response column.
See resample_cv() for additional stratification and grouping details.
response_typeresponse_type = "response" returns the ordinary evaluated response in each materialized fold.
response_type = "ordinal_counts" returns weighted ordinal-count response matrices in fold$analysis$y and fold$assessment$y.
Ordinal-count folds require a two-sided bare-symbol factor or ordered-factor response with non-missing retained responses.
They also require evaluated weights because the response matrix stores weighted counts.
The response type is fixed when the CV object is created. Every materialized fold uses the same response representation for analysis and assessment inputs. Create a separate design CV object when a different response representation is needed.
A list of class c("design_cv_resamples", "design_resamples") with elements:
spec: the original design_spec() object.
data: the original analysis data.
row_index: retained row identities used for split construction.
response_type: response representation used by materialized folds.
response_levels: declared response levels for ordinal-count output, or NULL.
resamples: the underlying index-only cv_resamples object.
splits: a named list of CV split indices.
Each split contains analysis, assessment, iter, fold, and seed.
Use design_fold() to convert a split into model-ready inputs.
design_spec(),
design_fold(),
design_fit(),
design_data(),
resample_cv(),
base::set.seed()
#---------------------------------------------------------------------------- # Estimate prediction error with stratified K-fold cross-validation #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:24), age = round(rnorm(24, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 12)), stage = factor(rep(c("I", "II", "III"), each = 8)), exposure = runif(24, min = 0.5, max = 2), inverse_prob_weight = runif(24, min = 0.8, max = 1.2), chart_score = rnorm(24) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(24) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) set.seed(42L) cv <- design_cv( spec, analysis, n_folds = 4L, strata = ~ risk_group ) fold_rmse <- vapply(seq_along(cv$splits), function(i) { fold <- design_fold(cv, i) fit <- stats::lm.wfit( x = fold$analysis$x, y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop(fold$assessment$x %*% fit$coefficients) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) }, numeric(1)) fold_rmse mean(fold_rmse) #---------------------------------------------------------------------------- # Keep repeated patient observations together #---------------------------------------------------------------------------- repeated <- data.frame( patient_id = rep(sprintf("P%02d", 1:12), each = 2), visit = rep(c("baseline", "follow_up"), times = 12), treatment = factor(rep(rep(c("control", "active"), each = 6), each = 2)), age = rep(round(rnorm(12, mean = 60, sd = 8)), each = 2), outcome = rnorm(24) ) repeated_spec <- design_spec( outcome ~ age + treatment + visit, case_id = ~ patient_id, encoding = "dummy", intercept = TRUE ) set.seed(7L) grouped_cv <- design_cv( repeated_spec, repeated, n_folds = 4L, group = ~ patient_id, strata = ~ treatment ) grouped_fold <- design_fold(grouped_cv, 1L) intersect(grouped_fold$analysis$case_id, grouped_fold$assessment$case_id)#---------------------------------------------------------------------------- # Estimate prediction error with stratified K-fold cross-validation #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:24), age = round(rnorm(24, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 12)), stage = factor(rep(c("I", "II", "III"), each = 8)), exposure = runif(24, min = 0.5, max = 2), inverse_prob_weight = runif(24, min = 0.8, max = 1.2), chart_score = rnorm(24) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(24) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) set.seed(42L) cv <- design_cv( spec, analysis, n_folds = 4L, strata = ~ risk_group ) fold_rmse <- vapply(seq_along(cv$splits), function(i) { fold <- design_fold(cv, i) fit <- stats::lm.wfit( x = fold$analysis$x, y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop(fold$assessment$x %*% fit$coefficients) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) }, numeric(1)) fold_rmse mean(fold_rmse) #---------------------------------------------------------------------------- # Keep repeated patient observations together #---------------------------------------------------------------------------- repeated <- data.frame( patient_id = rep(sprintf("P%02d", 1:12), each = 2), visit = rep(c("baseline", "follow_up"), times = 12), treatment = factor(rep(rep(c("control", "active"), each = 6), each = 2)), age = rep(round(rnorm(12, mean = 60, sd = 8)), each = 2), outcome = rnorm(24) ) repeated_spec <- design_spec( outcome ~ age + treatment + visit, case_id = ~ patient_id, encoding = "dummy", intercept = TRUE ) set.seed(7L) grouped_cv <- design_cv( repeated_spec, repeated, n_folds = 4L, group = ~ patient_id, strata = ~ treatment ) grouped_fold <- design_fold(grouped_cv, 1L) intersect(grouped_fold$analysis$case_id, grouped_fold$assessment$case_id)
design_data() extracts the retained training inputs from a fitted design_fit() object.
It reuses cached x, y, weights, offset, case_id, and row_index.
It does not rebuild the design matrix and does not re-evaluate any formula.
Use this function after design_fit() when passing analysis data to a model fitter, scoring function, or diagnostic routine.
Use design_newdata() when encoding validation or prediction data.
design_data(fit, response_type = c("response", "ordinal_counts"))design_data(fit, response_type = c("response", "ordinal_counts"))
fit |
(design_fit) |
response_type |
(Scalar character: |
response_type = "response" returns the response and row roles as they were evaluated and retained by design_fit().
The returned x matrix already has the fitted predictor columns, categorical encoding, row filtering, and column order.
The returned y, weights, offset, case_id, and row_index are aligned row-for-row with x.
response_type = "ordinal_counts" returns a weighted count matrix in y and weights = NULL.
This output is useful for ordinal or categorical likelihoods that take a count matrix instead of a factor response plus case weights.
This mode requires the response formula to have a bare-symbol left-hand side that references a factor or ordered-factor column.
It also requires non-NULL weights, at least two declared response levels, and non-missing retained responses.
Ordinal-count rows place the row's observation weight in the observed response column and zero in all other columns.
Declared but unobserved response levels are retained as zero-count columns.
See ordinal_matrix_to_factor() for converting the count matrix back to factor responses and row weights.
A list of class "design_inputs" with elements:
x: retained training design matrix.
y: retained response, weighted ordinal-count matrix, or NULL for one-sided designs.
weights: retained analysis weights, or NULL.
offset: retained offset values, or NULL.
case_id: retained case identifiers, or NULL.
row_index: retained row identities from the fitted design.
All non-NULL elements are aligned to the rows of x.
design_fit(),
design_newdata(),
ordinal_matrix_to_factor()
#---------------------------------------------------------------------------- # Extract training inputs for a weighted linear model #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit" ) fit <- design_fit(spec, analysis) inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) model$coefficients #---------------------------------------------------------------------------- # Extract weighted ordinal-count responses #---------------------------------------------------------------------------- ordinal_data <- data.frame( response = ordered( c("low", "medium", "high", "low", "medium", "high"), levels = c("low", "medium", "high") ), marker = c(0.2, 0.4, 1.1, 0.3, 0.8, 1.4), sampling_weight = c(1.0, 0.5, 1.5, 1.2, 0.8, 1.1) ) ordinal_spec <- design_spec( response ~ marker, weights = ~ sampling_weight ) ordinal_fit <- design_fit(ordinal_spec, ordinal_data) ordinal_inputs <- design_data( ordinal_fit, response_type = "ordinal_counts" ) ordinal_inputs$y rowSums(ordinal_inputs$y)#---------------------------------------------------------------------------- # Extract training inputs for a weighted linear model #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit" ) fit <- design_fit(spec, analysis) inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) model$coefficients #---------------------------------------------------------------------------- # Extract weighted ordinal-count responses #---------------------------------------------------------------------------- ordinal_data <- data.frame( response = ordered( c("low", "medium", "high", "low", "medium", "high"), levels = c("low", "medium", "high") ), marker = c(0.2, 0.4, 1.1, 0.3, 0.8, 1.4), sampling_weight = c(1.0, 0.5, 1.5, 1.2, 0.8, 1.1) ) ordinal_spec <- design_spec( response ~ marker, weights = ~ sampling_weight ) ordinal_fit <- design_fit(ordinal_spec, ordinal_data) ordinal_inputs <- design_data( ordinal_fit, response_type = "ordinal_counts" ) ordinal_inputs$y rowSums(ordinal_inputs$y)
design_fit() fits a design schema to analysis data.
It does not fit a statistical model.
It evaluates the formula and role expressions from a design_spec() object, resolves ., learns categorical levels, applies the missing-value policy, and builds the training design matrix.
The fitted schema is the object to reuse when training data and validation or prediction data must have the same predictor columns and factor-level encoding.
design_fit(spec, data, row_index = NULL)design_fit(spec, data, row_index = NULL)
spec |
(design_spec) |
data |
(data.frame) |
row_index |
(integer: |
A typical modeling workflow starts with design_spec() to declare the design.
design_fit() then learns the concrete schema from one analysis dataset.
Use design_data() to extract training inputs from the fitted design.
Use design_newdata() to encode validation or prediction data with the same predictor terms, categorical levels, column order, and role handling.
Retained rows are computed once from the evaluated predictor, response, weight, and offset expressions before degenerate categorical predictors are dropped.
The na_action value stored in spec controls whether rows with missing predictor or response values are kept, removed, or rejected.
If no rows remain after applying na_action, design_fit() errors.
Weights and offsets are stricter than predictors and responses.
On every retained row, weights must be non-missing, finite, and non-negative.
On every retained row, offsets must be non-missing and finite.
Case IDs are metadata and do not participate in na_action.
After retained rows are fixed, categorical predictor columns with fewer than two observed retained levels are detected. Formula terms that reference any such categorical predictor are dropped in one pass. Interaction terms that reference a degenerate categorical predictor are dropped along with the main effect. Degenerate-term dropping does not add rows back and does not remove additional rows.
Constant numeric predictors are retained. They may produce rank-deficient predictor matrices. Downstream model functions are responsible for handling that rank deficiency.
Character predictors are treated as categorical predictors. Factor predictors store the observed retained levels in the fitted predictor schema. Declared but unobserved factor levels are not kept as design-matrix columns.
For a bare-symbol factor or ordered-factor response, response_levels stores the declared response levels from the analysis data.
These levels are needed for ordinal-count output in design_data() and design_newdata().
For one-sided specifications, transformed responses, and non-factor responses, response_levels = NULL.
row_index is the row identity vector for the supplied data.
row_index = NULL means seq_len(nrow(data)).
When supplied, row_index must be a whole-number vector with length(row_index) == nrow(data).
It may contain duplicates, which is required for bootstrap analysis data.
After NA filtering, returned row indices are row_index[retained_positions].
A list of class "design_fit".
Important elements include:
spec: the original design_spec() object.
formula, predictor_terms, and response_terms: fitted formula objects.
predictor_schema: learned predictor roles and categorical levels.
row_index: retained row identities after applying na_action.
dropped: degenerate categorical variables and terms dropped during fitting.
x: the retained training design matrix.
y: the retained evaluated response, or NULL for one-sided designs.
weights, offset, and case_id: retained role values, or NULL when the role was not specified.
The object also stores the role formulas, role source variables, encoding, intercept, na_action, novel_levels, response_name, and response_levels so the same schema can be applied to new data.
design_spec(),
design_data(),
design_newdata()
#---------------------------------------------------------------------------- # Fit a design schema and use it for weighted linear regression #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) fit <- design_fit(spec, analysis) colnames(fit$x) fit$row_index inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) model$coefficients #---------------------------------------------------------------------------- # Encode validation data with the fitted training schema #---------------------------------------------------------------------------- validation <- data.frame( patient_id = sprintf("V%02d", 1:3), age = c(55, 67, 61), treatment = factor(c("active", "control", "active")), stage = factor(c("II", "I", "III")), exposure = c(1.1, 0.9, 1.5) ) validation_inputs <- design_newdata( fit, validation, outcome = "ignore", weights = "ignore" ) linear_predictor <- drop(validation_inputs$x %*% model$coefficients) linear_predictor <- linear_predictor + validation_inputs$offset linear_predictor#---------------------------------------------------------------------------- # Fit a design schema and use it for weighted linear regression #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) fit <- design_fit(spec, analysis) colnames(fit$x) fit$row_index inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) model$coefficients #---------------------------------------------------------------------------- # Encode validation data with the fitted training schema #---------------------------------------------------------------------------- validation <- data.frame( patient_id = sprintf("V%02d", 1:3), age = c(55, 67, 61), treatment = factor(c("active", "control", "active")), stage = factor(c("II", "I", "III")), exposure = c(1.1, 0.9, 1.5) ) validation_inputs <- design_newdata( fit, validation, outcome = "ignore", weights = "ignore" ) linear_predictor <- drop(validation_inputs$x %*% model$coefficients) linear_predictor <- linear_predictor + validation_inputs$offset linear_predictor
design_fold() returns model-ready inputs for one split of a design-aware resampling object.
It fits the split-specific design schema on the split's analysis rows and encodes non-analysis rows with that fitted schema.
It does not fit a statistical model.
For ordinary CV folds, the result contains analysis and assessment.
For nested-CV outer folds, the result contains analysis, assessment, and inner.
For bootstrap splits, the result contains analysis, assessment, and oob.
design_fold(resample, i)design_fold(resample, i)
resample |
(design_resamples) |
i |
(integer or character scalar) |
Use design_fold() after creating a design-aware resampling object with design_cv(), design_nested_cv(), design_bootstrap(), or design_resample().
Select a split by integer position or by split name.
Fit the statistical model on fold$analysis.
Evaluate predictions on fold$assessment, fold$oob, or an inner fold depending on the resampling method.
Each split's design schema is fitted on its analysis rows only. Assessment and OOB rows are encoded with that fitted schema. No assessment or OOB schema is fitted.
This prevents assessment and OOB rows from determining factor levels, dropped categorical terms, column order, or other fitted design-schema details.
Fold-local novel levels are handled by the novel_levels value stored in the original design_spec() object.
The row_index element in each returned design_inputs object refers to row numbers in the original data supplied to the resampling constructor or wrapper.
For CV and nested-CV folds, analysis$row_index and assessment$row_index are disjoint.
For bootstrap splits, analysis$row_index can contain duplicates and preserves bootstrap draw order.
Bootstrap assessment$row_index is the full retained-row universe.
Bootstrap oob$row_index contains rows not present in unique(analysis$row_index).
Assessment and OOB row identities are preserved even when fold-local novel levels encode to NA.
Materializing a nested-CV outer fold returns an inner design-aware CV object.
Its splits are restricted to the outer analysis rows.
Call design_fold(outer$inner, j) to materialize an inner fold for model selection.
The outer fold's seed is used for the final refit, and each inner fold's seed is used for its candidate sweep.
Each fold carries the split's seed, so a fold is self-contained.
Call set.seed(fold$seed) immediately before fitting a random-using model on fold$analysis to make the fit reproducible regardless of run order.
Applying the seed sets the global random-number state.
See resample_cv() for the seeding recipe and its rationale.
The response representation is fixed when the design-aware resampling object is created.
With response_type = "response", materialized inputs contain the ordinary evaluated response.
With response_type = "ordinal_counts", materialized y components contain weighted ordinal-count response matrices.
A list of class c("design_<method>_fold", "design_fold").
The first class is one of "design_cv_fold", "design_nested_cv_fold", or "design_bootstrap_fold".
All returned folds contain:
id: split name.
seed: the split's single positive integer seed.
analysis: a "design_inputs" object for model fitting.
assessment: a "design_inputs" object for assessment.
Nested-CV outer folds also contain:
inner: a design-aware CV object restricted to the outer analysis rows.
Bootstrap folds also contain:
oob: a "design_inputs" object for out-of-bag assessment.
design_cv(),
design_nested_cv(),
design_bootstrap(),
design_resample(),
design_data(),
design_newdata()
#---------------------------------------------------------------------------- # Materialize a CV fold and estimate assessment error #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:24), age = round(rnorm(24, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 12)), stage = factor(rep(c("I", "II", "III"), each = 8)), exposure = runif(24, min = 0.5, max = 2), inverse_prob_weight = runif(24, min = 0.8, max = 1.2), chart_score = rnorm(24) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(24) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) set.seed(42L) cv <- design_cv( spec, analysis, n_folds = 4L, strata = ~ risk_group ) fold <- design_fold(cv, 1L) fit <- stats::lm.wfit( x = fold$analysis$x, y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop(fold$assessment$x %*% fit$coefficients) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) #---------------------------------------------------------------------------- # Materialize an inner fold for nested model selection #---------------------------------------------------------------------------- set.seed(7L) ncv <- design_nested_cv( spec, analysis, n_outer_folds = 3L, n_inner_folds = 2L, strata = ~ risk_group ) outer <- design_fold(ncv, 1L) inner <- design_fold(outer$inner, 1L) all(inner$analysis$row_index %in% outer$analysis$row_index) all(inner$assessment$row_index %in% outer$analysis$row_index) #---------------------------------------------------------------------------- # Materialize a bootstrap split for out-of-bag assessment #---------------------------------------------------------------------------- set.seed(11L) bs <- design_bootstrap( spec, analysis, n_boot = 10L, strata = ~ risk_group ) boot_fold <- design_fold(bs, 1L) c( analysis_rows = length(boot_fold$analysis$row_index), oob_rows = length(boot_fold$oob$row_index) )#---------------------------------------------------------------------------- # Materialize a CV fold and estimate assessment error #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:24), age = round(rnorm(24, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 12)), stage = factor(rep(c("I", "II", "III"), each = 8)), exposure = runif(24, min = 0.5, max = 2), inverse_prob_weight = runif(24, min = 0.8, max = 1.2), chart_score = rnorm(24) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(24) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) set.seed(42L) cv <- design_cv( spec, analysis, n_folds = 4L, strata = ~ risk_group ) fold <- design_fold(cv, 1L) fit <- stats::lm.wfit( x = fold$analysis$x, y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop(fold$assessment$x %*% fit$coefficients) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) #---------------------------------------------------------------------------- # Materialize an inner fold for nested model selection #---------------------------------------------------------------------------- set.seed(7L) ncv <- design_nested_cv( spec, analysis, n_outer_folds = 3L, n_inner_folds = 2L, strata = ~ risk_group ) outer <- design_fold(ncv, 1L) inner <- design_fold(outer$inner, 1L) all(inner$analysis$row_index %in% outer$analysis$row_index) all(inner$assessment$row_index %in% outer$analysis$row_index) #---------------------------------------------------------------------------- # Materialize a bootstrap split for out-of-bag assessment #---------------------------------------------------------------------------- set.seed(11L) bs <- design_bootstrap( spec, analysis, n_boot = 10L, strata = ~ risk_group ) boot_fold <- design_fold(bs, 1L) c( analysis_rows = length(boot_fold$analysis$row_index), oob_rows = length(boot_fold$oob$row_index) )
design_nested_cv() creates a lazy design-aware resampling object for nested K-fold cross-validation.
It computes the retained row universe from a design_spec() object and creates outer and inner K-fold split indices on those rows.
It does not fit a statistical model and does not build a full-data design matrix.
Use design_fold() to materialize one outer split.
The materialized outer fold contains analysis, assessment, and an inner design-aware CV object.
Use design_fold() again on outer$inner to materialize inner folds for model selection within the outer analysis rows.
design_nested_cv( spec, data, n_outer_folds = 5L, n_inner_folds = 5L, n_iters = 1L, strata = NULL, group = NULL, response_type = c("response", "ordinal_counts") )design_nested_cv( spec, data, n_outer_folds = 5L, n_inner_folds = 5L, n_iters = 1L, strata = NULL, group = NULL, response_type = c("response", "ordinal_counts") )
spec |
(design_spec) |
data |
(data.frame) |
n_outer_folds |
(numeric scalar: |
n_inner_folds |
(numeric scalar: |
n_iters |
(numeric scalar: |
strata |
(formula: |
group |
(formula: |
response_type |
(Scalar character: |
A typical nested validation workflow starts with design_spec() to declare the design rules.
Call design_nested_cv() after setting the random-number seed to create reproducible outer and inner folds.
For each outer split, use the inner folds to choose a candidate model or tuning parameter.
Then fit the selected candidate on the full outer analysis input and evaluate it on the outer assessment input.
Keeping split construction separate from fold materialization prevents assessment rows from determining factor levels, dropped categorical terms, column order, or other fitted design-schema details. Outer assessment rows are not used during inner model selection.
The retained row universe is computed once before folds are assigned.
The na_action value stored in spec controls whether rows with missing predictor or response values are kept, removed, or rejected.
Weights and offsets are checked on retained rows using the same rules as design_fit().
Each outer and inner split stores row identities from the retained universe.
When a fold is materialized, row_index values in analysis and assessment refer to row numbers in the original data.
Assessment row identities are preserved after split construction so prediction errors can be aligned back to the original data.
Assessment predictors and responses are still encoded with the fold-specific analysis schema.
n_outer_folds controls the number of outer folds in each outer K-fold partition.
n_inner_folds controls the number of inner folds within each outer analysis set.
n_iters controls the number of independent outer K-fold partitions.
The total number of outer splits is n_outer_folds * n_iters.
Fold assignment is over retained rows when group is NULL and over retained groups when group is supplied.
Within each iteration, every retained row appears in exactly one outer assessment set.
Within each outer split, every outer-analysis row appears in exactly one inner assessment set.
Across different iterations, the same row can appear in different outer and inner folds.
Call base::set.seed() immediately before design_nested_cv() for reproducible split assignment.
Each outer split carries a seed for its final refit, and each inner split carries a seed for its candidate sweep.
design_fold() copies the relevant seed onto the materialized outer and inner folds.
See resample_nested_cv() for the seeding recipe and its rationale.
group and strata are optional operational controls for assigning folds.
They are evaluated on the retained rows after the design specification computes the resampling row universe.
If the design formula uses ., variables used only for group or strata must not enter the predictor set through . expansion.
Put those variables in case_id, weights, offset, or exclude, or write an explicit predictor formula.
Operational variables may also be named explicitly as predictors when that is the intended statistical design.
Use strata = NULL for unstratified folds.
Use a one-sided formula such as strata = ~ treatment to balance folds by observed covariate levels.
Use strata = ~ 1 to stratify by the raw response.
Response stratification evaluates the response before any response_type conversion.
For ordinal-count workflows with rare levels, strata = ~ 1 can reduce the chance that an analysis fold has an all-zero declared-level response column.
See resample_nested_cv() for additional stratification and grouping details.
response_typeresponse_type = "response" returns the ordinary evaluated response in each materialized fold.
response_type = "ordinal_counts" returns weighted ordinal-count response matrices in analysis$y and assessment$y.
Ordinal-count folds require a two-sided bare-symbol factor or ordered-factor response with non-missing retained responses.
They also require weights when folds are materialized because the response matrix stores weighted counts.
The response type is fixed when the nested CV object is created. Every materialized outer and inner fold uses the same response representation for analysis and assessment inputs. Create a separate design nested-CV object when a different response representation is needed.
A list of class c("design_nested_cv_resamples", "design_resamples") with elements:
spec: the original design_spec() object.
data: the original analysis data.
row_index: retained row identities used for split construction.
response_type: response representation used by materialized folds.
response_levels: declared response levels for ordinal-count output, or NULL.
resamples: the underlying index-only nested_cv_resamples object.
splits: a named list of outer split indices.
Each outer split contains analysis, assessment, iter, outer_fold, seed, and inner.
Each inner split contains analysis, assessment, iter, outer_fold, inner_fold, and seed.
Use design_fold() to convert outer or inner splits into model-ready inputs.
design_spec(),
design_fold(),
design_fit(),
design_data(),
design_cv(),
resample_nested_cv(),
base::set.seed()
#---------------------------------------------------------------------------- # Estimate prediction error after inner-fold model selection #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:36), age = round(rnorm(36, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 18)), stage = factor(rep(c("I", "II", "III"), each = 12)), exposure = runif(36, min = 0.5, max = 2), inverse_prob_weight = runif(36, min = 0.8, max = 1.2), chart_score = rnorm(36) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(36) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) candidate_models <- c("age_only", "full") candidate_columns <- function(inputs, candidate) { if (candidate == "age_only") { return(c("(Intercept)", "age")) } colnames(inputs$x) } candidate_rmse <- function(fold, candidate) { cols <- candidate_columns(fold$analysis, candidate) fit <- stats::lm.wfit( x = fold$analysis$x[, cols, drop = FALSE], y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop( fold$assessment$x[, cols, drop = FALSE] %*% fit$coefficients ) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) } set.seed(42L) ncv <- design_nested_cv( spec, analysis, n_outer_folds = 3L, n_inner_folds = 2L, strata = ~ risk_group ) nested_results <- lapply(seq_along(ncv$splits), function(i) { outer <- design_fold(ncv, i) inner_rmse <- vapply(candidate_models, function(candidate) { fold_rmse <- vapply(seq_along(outer$inner$splits), function(j) { inner <- design_fold(outer$inner, j) candidate_rmse(inner, candidate) }, numeric(1)) mean(fold_rmse) }, numeric(1)) selected <- names(which.min(inner_rmse)) data.frame( outer_split = outer$id, selected_model = selected, rmse = candidate_rmse(outer, selected) ) }) nested_results <- do.call(rbind, nested_results) nested_results mean(nested_results$rmse) #---------------------------------------------------------------------------- # Keep repeated patient observations together #---------------------------------------------------------------------------- repeated <- data.frame( patient_id = rep(sprintf("P%02d", 1:18), each = 2), visit = rep(c("baseline", "follow_up"), times = 18), treatment = factor(rep(rep(c("control", "active"), each = 9), each = 2)), age = rep(round(rnorm(18, mean = 60, sd = 8)), each = 2), outcome = rnorm(36) ) repeated_spec <- design_spec( outcome ~ age + treatment + visit, case_id = ~ patient_id, encoding = "dummy", intercept = TRUE ) set.seed(7L) grouped_ncv <- design_nested_cv( repeated_spec, repeated, n_outer_folds = 3L, n_inner_folds = 2L, group = ~ patient_id, strata = ~ treatment ) outer_fold <- design_fold(grouped_ncv, 1L) inner_fold <- design_fold(outer_fold$inner, 1L) intersect(inner_fold$analysis$case_id, inner_fold$assessment$case_id)#---------------------------------------------------------------------------- # Estimate prediction error after inner-fold model selection #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:36), age = round(rnorm(36, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 18)), stage = factor(rep(c("I", "II", "III"), each = 12)), exposure = runif(36, min = 0.5, max = 2), inverse_prob_weight = runif(36, min = 0.8, max = 1.2), chart_score = rnorm(36) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(36) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) candidate_models <- c("age_only", "full") candidate_columns <- function(inputs, candidate) { if (candidate == "age_only") { return(c("(Intercept)", "age")) } colnames(inputs$x) } candidate_rmse <- function(fold, candidate) { cols <- candidate_columns(fold$analysis, candidate) fit <- stats::lm.wfit( x = fold$analysis$x[, cols, drop = FALSE], y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop( fold$assessment$x[, cols, drop = FALSE] %*% fit$coefficients ) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) } set.seed(42L) ncv <- design_nested_cv( spec, analysis, n_outer_folds = 3L, n_inner_folds = 2L, strata = ~ risk_group ) nested_results <- lapply(seq_along(ncv$splits), function(i) { outer <- design_fold(ncv, i) inner_rmse <- vapply(candidate_models, function(candidate) { fold_rmse <- vapply(seq_along(outer$inner$splits), function(j) { inner <- design_fold(outer$inner, j) candidate_rmse(inner, candidate) }, numeric(1)) mean(fold_rmse) }, numeric(1)) selected <- names(which.min(inner_rmse)) data.frame( outer_split = outer$id, selected_model = selected, rmse = candidate_rmse(outer, selected) ) }) nested_results <- do.call(rbind, nested_results) nested_results mean(nested_results$rmse) #---------------------------------------------------------------------------- # Keep repeated patient observations together #---------------------------------------------------------------------------- repeated <- data.frame( patient_id = rep(sprintf("P%02d", 1:18), each = 2), visit = rep(c("baseline", "follow_up"), times = 18), treatment = factor(rep(rep(c("control", "active"), each = 9), each = 2)), age = rep(round(rnorm(18, mean = 60, sd = 8)), each = 2), outcome = rnorm(36) ) repeated_spec <- design_spec( outcome ~ age + treatment + visit, case_id = ~ patient_id, encoding = "dummy", intercept = TRUE ) set.seed(7L) grouped_ncv <- design_nested_cv( repeated_spec, repeated, n_outer_folds = 3L, n_inner_folds = 2L, group = ~ patient_id, strata = ~ treatment ) outer_fold <- design_fold(grouped_ncv, 1L) inner_fold <- design_fold(outer_fold$inner, 1L) intersect(inner_fold$analysis$case_id, inner_fold$assessment$case_id)
design_newdata() applies a fitted design_fit() schema to validation, assessment, or prediction data.
It builds a design_inputs object whose predictor matrix has the same columns, categorical encoding, factor levels, and column order as the training design.
It does not refit the schema or learn new factor levels from newdata.
Use this function after design_fit() when new data must be encoded in the same way as the analysis data used for model training.
design_newdata( fit, newdata, row_index = NULL, outcome = c("auto", "require", "ignore"), weights = c("auto", "require", "ignore"), response_type = c("response", "ordinal_counts"), na_action = fit$na_action, novel_levels = fit$novel_levels )design_newdata( fit, newdata, row_index = NULL, outcome = c("auto", "require", "ignore"), weights = c("auto", "require", "ignore"), response_type = c("response", "ordinal_counts"), na_action = fit$na_action, novel_levels = fit$novel_levels )
fit |
(design_fit) |
newdata |
(data.frame) |
row_index |
(integer: |
outcome |
(Scalar character: |
weights |
(Scalar character: |
response_type |
(Scalar character: |
na_action |
(Scalar character: |
novel_levels |
(Scalar character: |
All predictor source variables required by the fitted schema must be present in newdata.
Character and factor predictors are constrained to the levels learned by design_fit().
novel_levels = "na" converts new categorical levels to NA before row filtering and encoding.
novel_levels = "fail" errors and reports the affected variables and values.
The returned x matrix has the same column names and order as fit$x.
Missing one-hot columns are filled with zeros when a fitted training level is absent from newdata.
na_action controls row retention for evaluated predictors, responses, weights, and offsets.
The default is fit$na_action.
na_action = "na.pass" keeps rows with missing predictor or response values.
na_action = "na.omit" and na_action = "na.exclude" keep only complete evaluated rows.
na_action = "na.fail" errors if any evaluated predictor or response value is missing.
Weights and offsets are stricter. On retained rows, evaluated weights must be non-missing, finite, and non-negative. On retained rows, evaluated offsets must be non-missing and finite. Case IDs are metadata and do not participate in row filtering.
outcome = "auto" evaluates y when all response source variables are present in newdata and returns NULL only when none are present.
outcome = "require" errors when any response source variable is absent from newdata.
outcome = "ignore" never evaluates the response.
For one-sided designs, y = NULL for all outcome modes.
Use outcome = "ignore" for pure prediction data without observed outcomes.
Use outcome = "require" for assessment data where the response should be present.
If the fitted design has no weights, all weights modes return weights = NULL.
Otherwise, weights = "auto" evaluates weights when all weight source variables are present and returns NULL when none are present.
weights = "require" errors when any weight source variable is absent.
weights = "ignore" never evaluates weights and returns NULL.
Use weights = "ignore" for prediction data where analysis weights are not needed.
Use weights = "require" for assessment data where weighted scoring or weighted likelihoods require weights.
Offsets are required whenever the fitted design has an offset.
If any required offset source variable is absent from newdata, design_newdata() errors.
Offset values must be non-missing and finite on retained rows.
When all fitted case ID source variables are present, case_id is returned aligned to the retained rows.
When none are present, case_id = NULL.
When only some are present, design_newdata() errors.
Ordinal-count output requires evaluated responses and evaluated weights.
Therefore outcome = "ignore" and weights = "ignore" error.
"auto" behaves like "require" for both response and weight source variables.
Missing retained responses error.
Response values must be contained in the fitted declared response levels.
A list of class "design_inputs" with elements:
x: encoded new-data design matrix with the same columns as fit$x.
y: evaluated response, weighted ordinal-count matrix, or NULL.
weights: evaluated weights, or NULL.
offset: evaluated offsets, or NULL.
case_id: evaluated case identifiers, or NULL.
row_index: retained row identities from newdata.
All non-NULL elements are aligned to the rows of x.
design_fit(),
design_data(),
ordinal_matrix_to_factor()
#---------------------------------------------------------------------------- # Encode prediction data with the training design schema #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) fit <- design_fit(spec, analysis) inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) prediction_data <- data.frame( patient_id = sprintf("V%02d", 1:3), age = c(55, 67, 61), treatment = factor(c("active", "control", "active")), stage = factor(c("II", "I", "III")), exposure = c(1.1, 0.9, 1.5) ) prediction_inputs <- design_newdata( fit, prediction_data, outcome = "ignore", weights = "ignore" ) linear_predictor <- drop(prediction_inputs$x %*% model$coefficients) linear_predictor <- linear_predictor + prediction_inputs$offset linear_predictor #---------------------------------------------------------------------------- # Encode assessment data with observed outcomes and weights #---------------------------------------------------------------------------- assessment_data <- data.frame( patient_id = sprintf("A%02d", 1:3), outcome = c(0.1, -0.4, 0.7), age = c(58, 63, 52), treatment = factor(c("control", "active", "control")), stage = factor(c("I", "III", "II")), exposure = c(1.0, 1.3, 0.8), inverse_prob_weight = c(1.0, 0.9, 1.1) ) assessment_inputs <- design_newdata( fit, assessment_data, outcome = "require", weights = "require" ) assessment_inputs$y assessment_inputs$weights#---------------------------------------------------------------------------- # Encode prediction data with the training design schema #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) fit <- design_fit(spec, analysis) inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) prediction_data <- data.frame( patient_id = sprintf("V%02d", 1:3), age = c(55, 67, 61), treatment = factor(c("active", "control", "active")), stage = factor(c("II", "I", "III")), exposure = c(1.1, 0.9, 1.5) ) prediction_inputs <- design_newdata( fit, prediction_data, outcome = "ignore", weights = "ignore" ) linear_predictor <- drop(prediction_inputs$x %*% model$coefficients) linear_predictor <- linear_predictor + prediction_inputs$offset linear_predictor #---------------------------------------------------------------------------- # Encode assessment data with observed outcomes and weights #---------------------------------------------------------------------------- assessment_data <- data.frame( patient_id = sprintf("A%02d", 1:3), outcome = c(0.1, -0.4, 0.7), age = c(58, 63, 52), treatment = factor(c("control", "active", "control")), stage = factor(c("I", "III", "II")), exposure = c(1.0, 1.3, 0.8), inverse_prob_weight = c(1.0, 0.9, 1.1) ) assessment_inputs <- design_newdata( fit, assessment_data, outcome = "require", weights = "require" ) assessment_inputs$y assessment_inputs$weights
design_resample() validates a user-supplied cv_resamples, nested_cv_resamples, or bootstrap_resamples object against a design spec and returns a design-aware resampling object.
It computes the retained row universe from a design_spec() object and checks that the supplied split indices are compatible with those rows.
It does not create new splits, fit a statistical model, or build a full-data design matrix.
Use design_fold() to materialize one validated split.
Materialized CV and bootstrap splits fit their design schema on their analysis rows only and then encode assessment or out-of-bag rows with that analysis schema.
Materialized nested-CV outer splits also return an inner design-aware CV object whose inner folds are materialized separately.
design_resample( spec, data, resamples, response_type = c("response", "ordinal_counts") )design_resample( spec, data, resamples, response_type = c("response", "ordinal_counts") )
spec |
(design_spec) |
data |
(data.frame) |
resamples |
(resamples) |
response_type |
(Scalar character: |
A typical workflow starts with an index-only resampling object created by resample_cv(), resample_nested_cv(), resample_bootstrap(), or another package function that returns the same classes and split structure.
Call design_resample() to attach a design_spec() object to those split indices.
Then iterate over seq_along(resample$splits), materialize each split with design_fold(), fit the model on fold$analysis, and evaluate predictions on the assessment or OOB inputs supplied by that resampling method.
Use design_resample() when the split indices already exist.
Use design_cv(), design_nested_cv(), or design_bootstrap() when the design-aware object should create the split indices.
The retained row universe is computed once before the supplied split indices are checked.
The na_action value stored in spec controls whether rows with missing predictor or response values are kept, removed, or rejected.
Weights and offsets are checked on retained rows using the same rules as design_fit().
All split indices must be contained in the design-retained row universe. If the design drops rows, create the resampling object on the retained rows or use one of the design-aware split constructors.
The supplied resampling object's n must equal nrow(data).
Split indices must be positive integer vectors.
For CV splits, analysis and assessment must be non-empty, unique, and disjoint within every split.
For nested-CV splits, outer and inner analysis and assessment vectors must be non-empty, unique, and disjoint.
Every inner analysis and inner assessment vector must be contained in the corresponding outer analysis vector.
For bootstrap splits, assessment must equal the retained row universe.
oob must be set-equal to the retained-row complement of unique(analysis).
Bootstrap analysis may contain duplicates.
Ungrouped bootstrap analysis must have the same length as the retained row universe.
response_typeresponse_type = "response" returns the ordinary evaluated response in each materialized split.
response_type = "ordinal_counts" returns weighted ordinal-count response matrices in the materialized y components supplied by the resampling method.
Ordinal-count splits require a two-sided bare-symbol factor or ordered-factor response with non-missing retained responses.
They also require weights when splits are materialized because the response matrix stores weighted counts.
The response type is fixed when the design-aware resampling object is created. Every materialized split uses the same response representation. Create a separate design-aware resampling object when a different response representation is needed.
Each split may carry a seed, a single positive integer for reproducible model fits on that split.
Objects built by resample_cv(), resample_nested_cv(), and resample_bootstrap() already carry seeds.
design_resample() preserves the seeds on the wrapped object and does not create or modify them, so design_fold() propagates whatever seed each split carries.
See resample_cv() for the seeding recipe and its rationale.
A list of class c("design_<method>_resamples", "design_resamples") with elements:
spec: the original design_spec() object.
data: the original analysis data.
row_index: retained row identities used to validate split indices.
response_type: response representation used by materialized splits.
response_levels: declared response levels for ordinal-count output, or NULL.
resamples: the index-only resampling object.
splits: the named list of split indices from resamples.
Each split carries a seed when the wrapped object carried one.
The first class is one of "design_cv_resamples", "design_nested_cv_resamples", or "design_bootstrap_resamples".
Use design_fold() to convert a split into model-ready inputs.
design_cv(),
design_nested_cv(),
design_bootstrap(),
design_fold(),
resample_cv(),
resample_nested_cv(),
resample_bootstrap()
#---------------------------------------------------------------------------- # Wrap existing CV splits and estimate prediction error #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:24), age = round(rnorm(24, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 12)), stage = factor(rep(c("I", "II", "III"), each = 8)), exposure = runif(24, min = 0.5, max = 2), inverse_prob_weight = runif(24, min = 0.8, max = 1.2), chart_score = rnorm(24) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(24) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) set.seed(42L) raw_cv <- resample_cv( analysis, outcome ~ age, n_folds = 4L, strata = ~ risk_group ) cv <- design_resample(spec, analysis, raw_cv) fold_rmse <- vapply(seq_along(cv$splits), function(i) { fold <- design_fold(cv, i) fit <- stats::lm.wfit( x = fold$analysis$x, y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop(fold$assessment$x %*% fit$coefficients) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) }, numeric(1)) fold_rmse mean(fold_rmse) #---------------------------------------------------------------------------- # Wrap existing bootstrap splits for out-of-bag scoring #---------------------------------------------------------------------------- set.seed(7L) raw_bs <- resample_bootstrap( analysis, outcome ~ age, n_boot = 10L, strata = ~ risk_group ) bs <- design_resample(spec, analysis, raw_bs) boot_fold <- design_fold(bs, 1L) c( analysis_rows = length(boot_fold$analysis$row_index), oob_rows = length(boot_fold$oob$row_index) )#---------------------------------------------------------------------------- # Wrap existing CV splits and estimate prediction error #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:24), age = round(rnorm(24, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 12)), stage = factor(rep(c("I", "II", "III"), each = 8)), exposure = runif(24, min = 0.5, max = 2), inverse_prob_weight = runif(24, min = 0.8, max = 1.2), chart_score = rnorm(24) ) analysis$risk_group <- interaction( analysis$treatment, analysis$stage, drop = TRUE ) analysis$outcome <- with( analysis, 5 + 0.25 * age + 1.5 * (treatment == "active") + 0.8 * (stage == "III") + log(exposure) + rnorm(24) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score + risk_group, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) set.seed(42L) raw_cv <- resample_cv( analysis, outcome ~ age, n_folds = 4L, strata = ~ risk_group ) cv <- design_resample(spec, analysis, raw_cv) fold_rmse <- vapply(seq_along(cv$splits), function(i) { fold <- design_fold(cv, i) fit <- stats::lm.wfit( x = fold$analysis$x, y = fold$analysis$y, w = fold$analysis$weights, offset = fold$analysis$offset ) predicted <- drop(fold$assessment$x %*% fit$coefficients) predicted <- predicted + fold$assessment$offset sqrt(mean((fold$assessment$y - predicted)^2)) }, numeric(1)) fold_rmse mean(fold_rmse) #---------------------------------------------------------------------------- # Wrap existing bootstrap splits for out-of-bag scoring #---------------------------------------------------------------------------- set.seed(7L) raw_bs <- resample_bootstrap( analysis, outcome ~ age, n_boot = 10L, strata = ~ risk_group ) bs <- design_resample(spec, analysis, raw_bs) boot_fold <- design_fold(bs, 1L) c( analysis_rows = length(boot_fold$analysis$row_index), oob_rows = length(boot_fold$oob$row_index) )
design_spec() defines how analysis data should be turned into model-ready inputs.
It records the model formula, optional row roles, and design-matrix options.
It does not inspect data, expand ., encode factors, remove missing rows, or fit a statistical model.
Use the returned object with design_fit() to learn a reproducible design schema from training data.
Use it with design_cv(), design_nested_cv(), design_bootstrap(), or design_resample() to build resampling plans that use the same design rules.
design_spec( formula, weights = NULL, offset = NULL, case_id = NULL, exclude = NULL, encoding = c("one_hot", "dummy"), intercept = FALSE, na_action = "na.pass", novel_levels = c("na", "fail") )design_spec( formula, weights = NULL, offset = NULL, case_id = NULL, exclude = NULL, encoding = c("one_hot", "dummy"), intercept = FALSE, na_action = "na.pass", novel_levels = c("na", "fail") )
formula |
(formula) |
weights |
(formula or |
offset |
(formula or |
case_id |
(formula or |
exclude |
(formula or |
encoding |
(Scalar character: |
intercept |
(Scalar logical: |
na_action |
(Scalar character: |
novel_levels |
(Scalar character: |
A typical workflow has three steps.
First, call design_spec() to declare the statistical design.
Second, call design_fit() on the analysis data to evaluate formulas, expand ., learn factor levels, and build the training design matrix.
Third, use design_data() for training inputs and design_newdata() to encode validation or prediction data with the same columns and factor levels.
Keeping these steps separate makes the analysis reproducible. The specification can be reused across cross-validation folds, bootstrap samples, and final model fitting without silently changing the predictor schema.
The weights, offset, and case_id arguments declare columns or expressions with special roles.
Each role may be NULL or a one-sided formula.
Role formulas may not use ..
Source variables used by these roles are removed from . expansion automatically and may not also be named as explicit predictors.
Response source variables may not also be used as predictors, weights, offsets, or case IDs.
This prevents outcome leakage during training, assessment, and resampling.
If an overlap is intentional, create a separate response column before calling design_spec().
exclude may be NULL or a one-sided formula whose right-hand side contains only bare variable names separated by +.
The exclusion set is all.vars(exclude).
Variables in exclude are removed from . expansion only.
Explicitly named predictor terms are kept even when they also appear in exclude.
Right-hand-side offset() terms inside formula are collected as offset source variables.
Variables referenced inside those calls are removed from . expansion and may not appear as explicit predictors.
If formula contains offset() terms and the offset argument is also supplied, the evaluated components are summed into one numeric offset vector.
encoding = "dummy" uses treatment contrasts.
encoding = "one_hot" creates one indicator column per categorical level.
intercept controls only whether fitted design matrices contain a numeric "(Intercept)" column.
It is separate from any intercept option supplied to a downstream model fitter.
If the formula explicitly suppresses the intercept with -1 or +0, intercept = TRUE errors.
With encoding = "dummy", categorical predictors are treatment-coded.
The reference level is represented by zeros in the treatment columns.
This is the usual choice for ordinary linear and generalized linear model workflows that include an intercept.
With encoding = "one_hot", categorical predictors use one column per level.
The default intercept = FALSE gives a design matrix similar to glmnet::makeX() factor encoding.
This is useful when a downstream fitter supplies its own intercept or when full level indicators are desired.
Some choices can produce rank-deficient design matrices.
For example, encoding = "one_hot" with intercept = TRUE is rank deficient
when categorical predictors are present.
Additive formulas with multiple one-hot categorical predictors can also be rank deficient when intercept = FALSE.
Choose encoding = "dummy" for reference-level contrasts.
Choose encoding = "one_hot" for full level indicators.
na_action is one of "na.pass", "na.omit", "na.exclude", or "na.fail".
It is applied when design_fit(), design_newdata(), and design-aware resampling functions evaluate predictors and responses.
"na.pass" keeps rows with missing predictor or response values.
"na.omit" and "na.exclude" keep only complete rows.
"na.fail" errors if any evaluated predictor or response value is missing.
Weights and offsets are validated more strictly.
On every retained row, weights must be non-missing, finite, and non-negative.
On every retained row, offsets must be non-missing and finite.
Case IDs are metadata and do not participate in na_action.
novel_levels controls categorical values in new data that were not observed when the design schema was fitted.
"na" converts those values to NA before encoding.
"fail" errors and reports the affected variables and values.
A list of class "design_spec".
It stores the unevaluated design contract and contains these elements:
formula
weights, offset, case_id, exclude
encoding, intercept, na_action, novel_levels
The returned object does not contain a design matrix, response vector, factor schema, retained-row index, or fitted model.
design_fit(),
design_data(),
design_newdata(),
design_cv(),
design_nested_cv(),
design_bootstrap(),
design_resample()
#---------------------------------------------------------------------------- # Declare a design, fit its schema, and model the analysis data #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) fit <- design_fit(spec, analysis) inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) model$coefficients #---------------------------------------------------------------------------- # Apply the same columns and factor levels to validation data #---------------------------------------------------------------------------- validation <- data.frame( patient_id = sprintf("V%02d", 1:3), age = c(55, 67, 61), treatment = factor(c("active", "control", "active")), stage = factor(c("II", "I", "III")), exposure = c(1.1, 0.9, 1.5) ) validation_inputs <- design_newdata( fit, validation, outcome = "ignore", weights = "ignore" ) linear_predictor <- drop(validation_inputs$x %*% model$coefficients) linear_predictor <- linear_predictor + validation_inputs$offset linear_predictor#---------------------------------------------------------------------------- # Declare a design, fit its schema, and model the analysis data #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) analysis <- data.frame( patient_id = sprintf("P%02d", 1:12), outcome = rnorm(12), age = round(rnorm(12, mean = 60, sd = 8)), treatment = factor(rep(c("control", "active"), 6)), stage = factor(rep(c("I", "II", "III"), each = 4)), exposure = runif(12, min = 0.5, max = 2), inverse_prob_weight = runif(12, min = 0.8, max = 1.2), chart_score = rnorm(12) ) spec <- design_spec( outcome ~ ., weights = ~ inverse_prob_weight, offset = ~ log(exposure), case_id = ~ patient_id, exclude = ~ chart_score, encoding = "dummy", intercept = TRUE, na_action = "na.omit", novel_levels = "fail" ) fit <- design_fit(spec, analysis) inputs <- design_data(fit) model <- stats::lm.wfit( x = inputs$x, y = inputs$y, w = inputs$weights, offset = inputs$offset ) model$coefficients #---------------------------------------------------------------------------- # Apply the same columns and factor levels to validation data #---------------------------------------------------------------------------- validation <- data.frame( patient_id = sprintf("V%02d", 1:3), age = c(55, 67, 61), treatment = factor(c("active", "control", "active")), stage = factor(c("II", "I", "III")), exposure = c(1.1, 0.9, 1.5) ) validation_inputs <- design_newdata( fit, validation, outcome = "ignore", weights = "ignore" ) linear_predictor <- drop(validation_inputs$x %*% model$coefficients) linear_predictor <- linear_predictor + validation_inputs$offset linear_predictor
Get data for one-sample or dependent two-sample tests using a formula. Three forms are supported.
y ~ x | z
y ~ x
~x
get_dep_data(data, formula, allow_one_sample = TRUE, agg_fun = "error")get_dep_data(data, formula, allow_one_sample = TRUE, agg_fun = "error")
data |
(data.frame) |
formula |
(formula) |
allow_one_sample |
(scalar logical: |
agg_fun |
(Scalar character or function: |
For dependent two-sample tests with data in tall format, the formula with form y ~ x | z is defined by
y is a numeric vector for the outcome.
x is a factor with exactly two observed levels representing the binary group variable.
z is a factor with n levels representing the blocking variable (subject/item index).
For factor x, the first observed level will be the reference group.
For example, data$x <- factor(data$x, levels = c("pre", "post")) makes pre the reference group and post the focal group.
A downstream comparison then takes the form mean(post - pre).
A non-factor x or z is converted with as.factor() and a warning is emitted.
as.factor() then orders the levels alphabetically or numerically,
so the reference group for x becomes its smallest value.
Pass x as a factor to control which group is the reference.
Declared-but-unobserved levels of x and z are dropped before the two-level check, so they cannot displace observed levels or seed all-NA rows or columns in the wide output.
For dependent two-sample tests with data in wide format, the formula with form y ~ x is defined by
y is a numeric vector for the group of interest.
x is a numeric vector for the reference group.
Row i of y and row i of x correspond to the same subject/item.
For returned object out, downstream differences would take the form out$data[[out$focal_group]] - out$data[[out$ref_group]].
For one-sample (or paired) tests, the formula has form ~x where
x is a numeric vector for the outcome (or differences).
For ~x and y ~ x, missing values are retained and returned as-is.
For y ~ x | z, a row with a missing value of x or z is dropped before reshaping.
Missing values of x are ignored when checking that it has exactly two observed levels.
See tall_to_wide() for missing value handling during cell-aggregation.
A missing value of y is retained.
A cell with no observation in the wide output becomes NA_real_.
I()
Any term in the formula may be wrapped in I().
Operators such as + and * carry special meaning inside a formula.
Any other expression, such as log(y) or x > 0, must be wrapped in I() so it is evaluated as ordinary R code against data.
On the left hand side, I() transforms the outcome and must yield a numeric result.
For example, I(log(y)) ~ x | z uses the log-transformed outcome.
On the group side, I() derives the group term.
For example, y ~ I(x > 0) | z groups by whether x is positive.
On the block side, I() derives the block term.
For example, y ~ x | I(interaction(subject, session)) blocks by a subject-session combination.
A named list with components:
data
The response values split by group.
For ~x a length-1 named list holding the outcome vector.
For y ~ x a length-2 named list with the focal group first and the reference group second.
For y ~ x | z a data.frame with the focal group in column 1, the reference group in column 2, and the block identifier in column 3, with one row per level of the block factor in levels(z) order.
focal_group
A scalar character naming the focal group.
The name of the response variable for y ~ x, the single term for ~ x, or the second factor level for y ~ x | z.
ref_group
A scalar character naming the reference group, or NULL for a ~x formula.
The name of the grouping variable for y ~ x, or the first factor level for y ~ x | z.
block
A scalar character naming the block column for y ~ x | z, or NULL otherwise.
parse_formula
The list returned by parse_formula() applied to formula.
parse_formula(),
get_ind_data()
#---------------------------------------------------------------------------- # get_dep_data() example #---------------------------------------------------------------------------- library(bkmodel) # A pre/post intervention study where each subject is measured twice # pre is the first factor level, so it is the reference group tall <- data.frame( score = c(5.1, 6.3, 5.8, 7.2, 6.4, 7.1, 6.9, 8.0), time = factor( rep(c("pre", "post"), times = 4), levels = c("pre", "post") ), subject = factor(rep(1:4, each = 2)) ) # Reshape tall data to one row per subject with focal and reference columns res <- get_dep_data(data = tall, formula = score ~ time | subject) res # Extract the paired vectors with the returned group names focal <- res$data[[res$focal_group]] reference <- res$data[[res$ref_group]] # Estimate the intervention effect as the mean within-subject difference mean(focal - reference) # The same pairing feeds a paired t-test t.test(focal, reference, paired = TRUE) # The same study already pivoted to one row per subject wide <- data.frame( post = c(6.3, 7.2, 7.1, 8.0), pre = c(5.1, 5.8, 6.4, 6.9) ) get_dep_data(data = wide, formula = post ~ pre) # Pre-computed within-subject differences for a one-sample test diffs <- data.frame( change = c(1.2, 1.4, 0.7, 1.1) ) get_dep_data(data = diffs, formula = ~change)#---------------------------------------------------------------------------- # get_dep_data() example #---------------------------------------------------------------------------- library(bkmodel) # A pre/post intervention study where each subject is measured twice # pre is the first factor level, so it is the reference group tall <- data.frame( score = c(5.1, 6.3, 5.8, 7.2, 6.4, 7.1, 6.9, 8.0), time = factor( rep(c("pre", "post"), times = 4), levels = c("pre", "post") ), subject = factor(rep(1:4, each = 2)) ) # Reshape tall data to one row per subject with focal and reference columns res <- get_dep_data(data = tall, formula = score ~ time | subject) res # Extract the paired vectors with the returned group names focal <- res$data[[res$focal_group]] reference <- res$data[[res$ref_group]] # Estimate the intervention effect as the mean within-subject difference mean(focal - reference) # The same pairing feeds a paired t-test t.test(focal, reference, paired = TRUE) # The same study already pivoted to one row per subject wide <- data.frame( post = c(6.3, 7.2, 7.1, 8.0), pre = c(5.1, 5.8, 6.4, 6.9) ) get_dep_data(data = wide, formula = post ~ pre) # Pre-computed within-subject differences for a one-sample test diffs <- data.frame( change = c(1.2, 1.4, 0.7, 1.1) ) get_dep_data(data = diffs, formula = ~change)
Get data for independent two-sample tests when defined by a formula of form y ~ x.
get_ind_data(data, formula)get_ind_data(data, formula)
data |
(data.frame) |
formula |
(formula) |
The formula with form y ~ x is defined by
y is a numeric vector for the outcome.
x is a factor with exactly two observed levels representing the binary group variable.
For factor x, the first observed level will be the reference group.
For example, data$x <- factor(data$x, levels = c("pre", "post")) makes pre the reference group and post the focal group.
A downstream comparison then takes the form mean(post) - mean(pre).
A non-factor x is converted with as.factor() and a warning is emitted.
The reference group is then the smallest value of x.
as.factor() orders the levels alphabetically or numerically.
Pass x as a factor to control which group is the reference.
A row with a missing value of x is dropped from both groups.
Missing values of x are ignored when checking that it has exactly two observed levels.
A missing value of y is kept in its group.
Either side of the formula may wrap a term in I().
Operators such as + and * carry special meaning inside a formula.
Any other expression, such as log(y) or x > 0, must be wrapped in I() so it is evaluated as ordinary R code against data.
On the left hand side, I() transforms the outcome and must yield a numeric result.
For example, I(log(y)) ~ x uses the log-transformed outcome.
On the right hand side, I() derives the grouping variable.
For example, y ~ I(x > 0) groups the outcome by whether x is positive.
A named list with components:
data
A length-2 named list of numeric response vectors, the focal group first and the reference group second, each named by its factor level.
focal_group
A scalar character giving the focal group name, the second level with observations in the grouping factor.
ref_group
A scalar character giving the reference group name, the first level with observations in the grouping factor.
parse_formula
The list returned by parse_formula() applied to formula.
parse_formula(),
get_dep_data()
#---------------------------------------------------------------------------- # get_ind_data() example #---------------------------------------------------------------------------- library(bkmodel) # A two-arm study comparing an outcome under control and treatment # control is the first factor level, so it is the reference group data <- data.frame( score = c(5.1, 6.3, 5.8, 7.2, 8.1, 7.6, 8.4, 7.9), group = factor( rep(c("control", "treatment"), each = 4), levels = c("control", "treatment") ) ) # Split the outcome into the focal and reference groups res <- get_ind_data(data = data, formula = score ~ group) res # Extract each group with the returned group names focal <- res$data[[res$focal_group]] reference <- res$data[[res$ref_group]] # Estimate the treatment effect as the focal minus reference mean mean(focal) - mean(reference) # The same split feeds an independent two-sample t-test t.test(focal, reference) # Wrap the response in I() to compare on the log scale get_ind_data(data = data, formula = I(log(score)) ~ group)#---------------------------------------------------------------------------- # get_ind_data() example #---------------------------------------------------------------------------- library(bkmodel) # A two-arm study comparing an outcome under control and treatment # control is the first factor level, so it is the reference group data <- data.frame( score = c(5.1, 6.3, 5.8, 7.2, 8.1, 7.6, 8.4, 7.9), group = factor( rep(c("control", "treatment"), each = 4), levels = c("control", "treatment") ) ) # Split the outcome into the focal and reference groups res <- get_ind_data(data = data, formula = score ~ group) res # Extract each group with the returned group names focal <- res$data[[res$focal_group]] reference <- res$data[[res$ref_group]] # Estimate the treatment effect as the focal minus reference mean mean(focal) - mean(reference) # The same split feeds an independent two-sample t-test t.test(focal, reference) # Wrap the response in I() to compare on the log scale get_ind_data(data = data, formula = I(log(score)) ~ group)
A unified interface to four optimization methods.
optim2(fun, parameters, method = "optim", lower = -Inf, upper = Inf, ...)optim2(fun, parameters, method = "optim", lower = -Inf, upper = Inf, ...)
fun |
(function) |
parameters |
(numeric) |
method |
(string:
|
lower |
(numeric: |
upper |
(numeric: |
... |
Additional arguments passed to the corresponding |
A named list with five elements:
estimate: numeric vector of optimized parameter values, with length equal to length(parameters).
minimum: scalar numeric giving the value of fun at estimate.
iterations: scalar integer.
For "nlm" and "nlminb", the number of optimizer iterations.
For "optim" and "optim_constrained", the number of function evaluations (counts["function"] from stats::optim()); optim() does not report iteration counts, so this quantity is not directly comparable to the other methods.
code: integer convergence code.
For "nlm", codes 1–2 indicate probable convergence and codes 3–5 indicate potential failure; see stats::nlm() for definitions.
For all other methods, 0 indicates success and a positive integer indicates failure; see stats::nlminb() and stats::optim() for method-specific codes.
message: character string describing the convergence outcome.
stats::optim(),
stats::nlm(),
stats::nlminb(),
stats4::mle()
#---------------------------------------------------------------------------- # optim2() example #---------------------------------------------------------------------------- library(bkmodel) # Linear regression # residuals should be normal with mean 0 and unknown standard deviation sigma x1 <- 1:100 set.seed(1) epsilon <- rnorm(100) y <- 4 + x1*2 + epsilon nll <- function(par, y, x1){ alpha <- par[1] beta1 <- par[2] sigma <- par[3] resid <- y - alpha - beta1 * x1 -sum(dnorm(resid, mean = 0, sigma, log = TRUE)) } res_lm <- lm(y ~ x1) res_lm res_optim <- optim( fn = nll, par = c(alpha = mean(y), beta1 = 0, sigma = 1), y = y, x1 = x1 ) |> suppressWarnings() res_optim res_optim2 <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "optim", y = y, x1 = x1 ) |> suppressWarnings() res_optim2 res_nlm <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "nlm", y = y, x1 = x1 ) |> suppressWarnings() res_nlm res_nlminb <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "nlminb", lower = c(-Inf, -Inf, 1e-6), y = y, x1 = x1 ) res_nlminb res_optim_constrained <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "optim_constrained", lower = c(-Inf, -Inf, 1e-6), y = y, x1 = x1 ) res_optim_constrained#---------------------------------------------------------------------------- # optim2() example #---------------------------------------------------------------------------- library(bkmodel) # Linear regression # residuals should be normal with mean 0 and unknown standard deviation sigma x1 <- 1:100 set.seed(1) epsilon <- rnorm(100) y <- 4 + x1*2 + epsilon nll <- function(par, y, x1){ alpha <- par[1] beta1 <- par[2] sigma <- par[3] resid <- y - alpha - beta1 * x1 -sum(dnorm(resid, mean = 0, sigma, log = TRUE)) } res_lm <- lm(y ~ x1) res_lm res_optim <- optim( fn = nll, par = c(alpha = mean(y), beta1 = 0, sigma = 1), y = y, x1 = x1 ) |> suppressWarnings() res_optim res_optim2 <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "optim", y = y, x1 = x1 ) |> suppressWarnings() res_optim2 res_nlm <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "nlm", y = y, x1 = x1 ) |> suppressWarnings() res_nlm res_nlminb <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "nlminb", lower = c(-Inf, -Inf, 1e-6), y = y, x1 = x1 ) res_nlminb res_optim_constrained <- optim2( fun = nll, parameters = c(alpha = mean(y), beta1 = 0, sigma = 1), method = "optim_constrained", lower = c(-Inf, -Inf, 1e-6), y = y, x1 = x1 ) res_optim_constrained
ordinal_matrix_to_factor() converts a weighted ordinal-count response matrix into a factor response and row weights.
It is the inverse of the ordinal-count response representation returned by design_data() and design_newdata() when response_type = "ordinal_counts".
Use this helper when a model, metric, or diagnostic needs a factor response plus case weights rather than a count matrix.
ordinal_matrix_to_factor(y)ordinal_matrix_to_factor(y)
y |
(numeric matrix) |
The matrix column names define the factor levels. Column order is preserved as the factor's level order. Column names must be non-missing, non-empty, and unique.
Values must be non-missing, finite, and non-negative.
Each row must have either zero positive cells or exactly one positive cell.
Rows with one positive cell return that column as the factor level and the row's positive cell value as its weight.
Rows with all zeros return NA and weight 0.
Rows with more than one positive cell error.
The function does not require integer counts. Fractional values are allowed because design-generated ordinal-count matrices use analysis weights.
A list with elements:
y: a factor with levels = colnames(y).
weights: a numeric vector equal to rowSums(y).
The returned factor and weights are aligned row-for-row with the input matrix.
design_data(),
design_newdata()
#---------------------------------------------------------------------------- # Convert a weighted ordinal-count matrix #---------------------------------------------------------------------------- library(bkmodel) counts <- matrix( c(0, 1.2, 0, 0.8, 0, 0, 0, 0, 1.5, 0, 0, 0), ncol = 3L, byrow = TRUE, dimnames = list(NULL, c("low", "medium", "high")) ) converted <- ordinal_matrix_to_factor(counts) converted$y converted$weights #---------------------------------------------------------------------------- # Round-trip from design_data(response_type = "ordinal_counts") #---------------------------------------------------------------------------- set.seed(1L) analysis <- data.frame( response = ordered( c("low", "medium", "high", "low", "medium", "high"), levels = c("low", "medium", "high") ), marker = c(0.2, 0.4, 1.1, 0.3, 0.8, 1.4), sampling_weight = c(1.0, 0.5, 1.5, 1.2, 0.8, 1.1) ) spec <- design_spec( response ~ marker, weights = ~ sampling_weight ) fit <- design_fit(spec, analysis) ordinal_inputs <- design_data( fit, response_type = "ordinal_counts" ) factor_inputs <- ordinal_matrix_to_factor(ordinal_inputs$y) factor_inputs$y factor_inputs$weights#---------------------------------------------------------------------------- # Convert a weighted ordinal-count matrix #---------------------------------------------------------------------------- library(bkmodel) counts <- matrix( c(0, 1.2, 0, 0.8, 0, 0, 0, 0, 1.5, 0, 0, 0), ncol = 3L, byrow = TRUE, dimnames = list(NULL, c("low", "medium", "high")) ) converted <- ordinal_matrix_to_factor(counts) converted$y converted$weights #---------------------------------------------------------------------------- # Round-trip from design_data(response_type = "ordinal_counts") #---------------------------------------------------------------------------- set.seed(1L) analysis <- data.frame( response = ordered( c("low", "medium", "high", "low", "medium", "high"), levels = c("low", "medium", "high") ), marker = c(0.2, 0.4, 1.1, 0.3, 0.8, 1.4), sampling_weight = c(1.0, 0.5, 1.5, 1.2, 0.8, 1.1) ) spec <- design_spec( response ~ marker, weights = ~ sampling_weight ) fit <- design_fit(spec, analysis) ordinal_inputs <- design_data( fit, response_type = "ordinal_counts" ) factor_inputs <- ordinal_matrix_to_factor(ordinal_inputs$y) factor_inputs$y factor_inputs$weights
Parse a formula into its terms lhs ~ rhs where rhs can have two components separated by |.
For example, y ~ x | z has rhs = ~ x | z where x is typically described as the grouping term and z is the blocking term.
A fast, unsafe alternative to modeltools::ParseFormula().
parse_formula(formula, specials = NULL, data = NULL, terms = FALSE)parse_formula(formula, specials = NULL, data = NULL, terms = FALSE)
formula |
(formula) |
specials |
(character: |
data |
(data frame) |
terms |
(scalar logical: |
A named list with components:
formula: the (possibly terms.formula-expanded) input formula.
lhs: a one-sided formula containing the response, or NULL if formula is one-sided.
rhs: a one-sided formula containing the full right-hand side.
rhs_group: a one-sided formula containing the term left of |, or rhs if no top-level | is present.
rhs_block: a one-sided formula containing the term right of |, or NULL if no top-level | is present.
get_ind_data(),
get_dep_data(),
tall_to_wide()
#---------------------------------------------------------------------------- # parse_formula() example #---------------------------------------------------------------------------- library(bkmodel) # One-sided formula parse_formula(~x) # Two-sided formula parse_formula(y ~ x) # Two-sided formula with a top-level `|` splitting the rhs into group and block parse_formula(y ~ x | z) # Dot expansion against the names in `data` parse_formula(y ~ ., data = data.frame(y = 1, a = 1, b = 1)) # Functions named in `specials` are preserved through term expansion parse_formula(y ~ s(x) + z, specials = "s")#---------------------------------------------------------------------------- # parse_formula() example #---------------------------------------------------------------------------- library(bkmodel) # One-sided formula parse_formula(~x) # Two-sided formula parse_formula(y ~ x) # Two-sided formula with a top-level `|` splitting the rhs into group and block parse_formula(y ~ x | z) # Dot expansion against the names in `data` parse_formula(y ~ ., data = data.frame(y = 1, a = 1, b = 1)) # Functions named in `specials` are preserved through term expansion parse_formula(y ~ s(x) + z, specials = "s")
Create an index-only resampling plan for bootstrap resampling with replacement.
Each split stores integer row positions from data for model fitting and assessment.
The function supports stratified bootstrap samples and grouped bootstrap samples.
resample_bootstrap(data, formula, n_boot = 1000L, strata = NULL, group = NULL)resample_bootstrap(data, formula, n_boot = 1000L, strata = NULL, group = NULL)
data |
(data.frame) |
formula |
(formula) |
n_boot |
(numeric scalar: |
strata |
(formula: |
group |
(formula: |
resample_bootstrap() does not copy data.
It returns row indices that can be used to subset data later.
The predictor side of formula is not used to build the splits.
It is accepted so the same formula can be passed to a model-fitting function.
Each element of splits contains:
analysis:
integer row indices drawn with replacement and used to fit a model.
Duplicate row indices are expected.
When group is NULL, length equals nrow(data).
When group is supplied, length equals the total row count of n_groups drawn groups, which can vary across bootstrap splits if group sizes differ.
assessment:
integer row indices for the full original data set, seq_len(nrow(data)).
This supports bootstrap optimism correction.
oob:
integer out-of-bag row indices.
These are computed as sort(setdiff(seq_len(nrow(data)), unique(analysis))).
seed:
single positive integer for reproducible model fits on this split.
Rows are the sampling units when group is NULL.
Groups are the sampling units when group is supplied.
Each bootstrap split advances the current random-number generator state.
Stratification is optional.
Pass NULL for no stratification.
strata = ~ 1 stratifies by the model outcome.
A formula such as strata = ~ a + b stratifies by interaction(data$a, data$b, drop = TRUE).
Variables referenced by strata must be columns of data.
NA stratum values are not allowed.
Empty factor levels are dropped.
Stratification variables are treated as categorical.
Pre-bin continuous variables before passing them to strata.
This also applies when strata = ~ 1 and the model outcome is continuous.
Expressions in strata are not evaluated as transformations.
Create transformed or composite strata columns in data before calling resample_bootstrap().
For row-level bootstrap resampling, stratification samples rows with replacement within each stratum.
It preserves the original row count of each stratum in each bootstrap split.
The analysis vector concatenates per-stratum draws in factor-level order.
Each bootstrap stratum must contain at least two rows.
Grouping is optional.
Pass NULL for ordinary row-level resampling.
Pass a one-sided formula such as group = ~ subject to draw whole groups with replacement.
The right-hand side of group must be a single bare column name in data.
NA group values are not allowed.
Empty factor levels are dropped.
When group is supplied, groups are drawn with replacement.
For each bootstrap split, n_groups groups are drawn and their rows are concatenated in draw order.
assessment remains seq_len(nrow(data)).
oob is the set of rows whose group was never drawn into analysis.
Stratified grouped bootstrap draws groups within each stratum to preserve group counts per stratum.
Strata are processed in resolved factor-level order.
Rows are concatenated in draw order within each stratum.
This preserves group counts per stratum rather than row counts per stratum.
When group sizes differ, total analysis length and per-stratum row counts can vary across bootstrap splits.
Each grouped bootstrap stratum must contain at least two groups.
When strata and group are both supplied, the resolved stratum vector must be constant within each group.
If strata = ~ 1 is used with group, the outcome must be constant within each group.
Split names are generated from n_boot.
For fewer than 10 splits, names are split1, ..., split9.
For 10 or more splits, names are zero-padded so lexical order matches numeric order.
Call base::set.seed() immediately before resample_bootstrap() for reproducible splits.
Split indices reproduce the resampled rows, but a downstream model fit that uses randomness depends on the global random-number state when it runs, so it does not reproduce on its own.
Each split therefore carries a seed, a single positive integer generated at construction.
Set the split seed immediately before fitting that split to make the fit reproducible regardless of run order.
# Single fit per split.
set.seed(split$seed)
fit <- fit_model(split$analysis)
# Several candidate models on one analysis set, each reproducible alone.
set.seed(split$seed)
candidate_seeds <- sample.int(.Machine$integer.max, K, replace = FALSE)
for (k in seq_len(K)) {
set.seed(candidate_seeds[k])
fit_k <- fit_candidate_k(split$analysis)
}
Seeds are created from the global generator and the generator state is then restored, so split indices and the caller's random-number stream are unchanged.
Reproducing a fit assumes the same base::RNGkind() on each run.
An S3 list of class c("bootstrap_resamples", "resamples") with elements:
call:
the matched call.
method:
"bootstrap".
n:
nrow(data).
formula:
the input formula.
outcome_var:
the resolved outcome column name.
strata:
the input strata formula or NULL.
strata_vars:
the resolved strata column names or NULL.
group:
the input group formula or NULL.
group_vars:
the resolved group column name or NULL.
n_groups:
the number of unique groups when group is supplied, otherwise NULL.
n_boot:
the resolved bootstrap count.
splits:
a named list of split entries.
Each split contains analysis, assessment, oob, and seed.
For ungrouped bootstrap resampling, length(analysis) equals n.
For grouped bootstrap resampling, length(analysis) can differ from n when group sizes differ.
resample_cv(),
resample_nested_cv(),
base::set.seed()
#---------------------------------------------------------------------------- # resample_bootstrap() examples #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) data <- data.frame( outcome = c(rep("event", 12L), rep("none", 12L)), age = c(34L, 52L, 41L, 63L, 29L, 48L, 37L, 55L, 46L, 31L, 60L, 43L, 39L, 50L, 28L, 44L, 57L, 36L, 62L, 33L, 47L, 53L, 40L, 58L), clinic = rep(c("north", "south", "west"), each = 8L), patient = rep(seq_len(12L), each = 2L) ) bs <- resample_bootstrap(data, outcome ~ age, n_boot = 10L) bs$splits[[1L]]$analysis bs$splits[[1L]]$oob # Preserve the outcome class counts in each bootstrap sample. resample_bootstrap(data, outcome ~ age, n_boot = 10L, strata = ~ outcome) # Draw whole patients with replacement. resample_bootstrap(data, outcome ~ age, n_boot = 10L, group = ~ patient)#---------------------------------------------------------------------------- # resample_bootstrap() examples #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) data <- data.frame( outcome = c(rep("event", 12L), rep("none", 12L)), age = c(34L, 52L, 41L, 63L, 29L, 48L, 37L, 55L, 46L, 31L, 60L, 43L, 39L, 50L, 28L, 44L, 57L, 36L, 62L, 33L, 47L, 53L, 40L, 58L), clinic = rep(c("north", "south", "west"), each = 8L), patient = rep(seq_len(12L), each = 2L) ) bs <- resample_bootstrap(data, outcome ~ age, n_boot = 10L) bs$splits[[1L]]$analysis bs$splits[[1L]]$oob # Preserve the outcome class counts in each bootstrap sample. resample_bootstrap(data, outcome ~ age, n_boot = 10L, strata = ~ outcome) # Draw whole patients with replacement. resample_bootstrap(data, outcome ~ age, n_boot = 10L, group = ~ patient)
Create an index-only resampling plan for K-fold cross-validation.
Each split stores integer row positions from data for model fitting and model assessment.
The function supports repeated K-fold cross-validation, stratified folds, and grouped folds.
resample_cv( data, formula, n_folds = 5L, n_iters = 1L, strata = NULL, group = NULL )resample_cv( data, formula, n_folds = 5L, n_iters = 1L, strata = NULL, group = NULL )
data |
(data.frame) |
formula |
(formula) |
n_folds |
(numeric scalar: |
n_iters |
(numeric scalar: |
strata |
(formula: |
group |
(formula: |
resample_cv() does not copy data.
It returns row indices that can be used to subset data later.
The predictor side of formula is not used to build the splits.
It is accepted so the same formula can be passed to a model-fitting function.
Each element of splits contains:
analysis:
integer row indices used to fit a model.
These are all rows not in assessment.
They are sorted in ascending order.
assessment:
integer row indices used to assess the fitted model.
They are sorted in ascending order.
iter:
integer iteration number.
fold:
integer fold number within iter.
seed:
single positive integer for reproducible model fits on this split.
Within each iteration, the assessment indices partition seq_len(nrow(data)).
A row appears in exactly one assessment set per iteration.
Across different iterations, the same row can appear in different folds.
Folds are assigned over sampling units.
Rows are the sampling units when group is NULL.
Groups are the sampling units when group is supplied.
Without stratification, fold sizes are balanced over the sampling units.
The target fold sizes are floor(unit_n / n_folds), with unit_n %% n_folds folds one unit larger.
The larger folds are selected at random.
When strata is supplied, strata are processed from smallest to largest, so rare strata are allocated first.
Within each stratum, sampling units are shuffled and distributed across folds as evenly as possible.
Per-stratum fold counts differ by at most one.
Ties are broken with the current random-number generator state.
Stratification is optional.
Pass NULL for no stratification.
strata = ~ 1 stratifies by the model outcome.
A formula such as strata = ~ a + b stratifies by interaction(data$a, data$b, drop = TRUE).
Variables referenced by strata must be columns of data.
NA stratum values are not allowed.
Empty factor levels are dropped.
Stratification variables are treated as categorical.
Pre-bin continuous variables before passing them to strata.
This also applies when strata = ~ 1 and the model outcome is continuous.
Expressions in strata are not evaluated as transformations.
Create transformed or composite strata columns in data before calling resample_cv().
Grouping is optional.
Pass NULL for ordinary row-level resampling.
Pass a one-sided formula such as group = ~ subject to keep all rows from each subject together.
The right-hand side of group must be a single bare column name in data.
NA group values are not allowed.
Empty factor levels are dropped.
Rows that share a group value are kept together.
All rows of any one group land entirely in one fold's assessment per iteration.
Because groups are the sampling units, n_folds must be less than or equal to the number of unique groups.
Fold balance is defined over groups, not rows.
When group sizes differ, row counts across folds may differ by more than one.
When strata and group are both supplied, the resolved stratum vector must be constant within each group.
If strata = ~ 1 is used with group, the outcome must be constant within each group.
When n_iters > 1, the function creates n_iters independent K-fold partitions and concatenates the splits.
The total number of splits is n_folds * n_iters.
Each iteration advances the current random-number generator state.
Split names are generated from the total number of splits.
For fewer than 10 splits, names are split1, ..., split9.
For 10 or more splits, names are zero-padded so lexical order matches numeric order.
Call base::set.seed() immediately before resample_cv() for reproducible splits.
Split indices reproduce the fold rows, but a downstream model fit that uses randomness depends on the global random-number state when it runs, so it does not reproduce on its own.
Each split therefore carries a seed, a single positive integer generated at construction.
Set the split seed immediately before fitting that split to make the fit reproducible regardless of run order.
# Single fit per split.
set.seed(split$seed)
fit <- fit_model(split$analysis)
# Several candidate models on one analysis set, each reproducible alone.
set.seed(split$seed)
candidate_seeds <- sample.int(.Machine$integer.max, K, replace = FALSE)
for (k in seq_len(K)) {
set.seed(candidate_seeds[k])
fit_k <- fit_candidate_k(split$analysis)
}
Seeds are created from the global generator and the generator state is then restored, so split indices and the caller's random-number stream are unchanged.
Reproducing a fit assumes the same base::RNGkind() on each run.
An S3 list of class c("cv_resamples", "resamples") with elements:
call:
the matched call.
method:
"cv".
n:
nrow(data).
formula:
the input formula.
outcome_var:
the resolved outcome column name.
strata:
the input strata formula or NULL.
strata_vars:
the resolved strata column names or NULL.
group:
the input group formula or NULL.
group_vars:
the resolved group column name or NULL.
n_groups:
the number of unique groups when group is supplied, otherwise NULL.
n_folds:
the resolved fold count.
n_iters:
the resolved iteration count.
splits:
a named list of split entries.
Each split contains analysis, assessment, iter, fold, and seed.
resample_nested_cv(),
resample_bootstrap(),
base::set.seed()
#---------------------------------------------------------------------------- # resample_cv() examples #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) data <- data.frame( outcome = c(rep("event", 12L), rep("none", 12L)), age = c(34L, 52L, 41L, 63L, 29L, 48L, 37L, 55L, 46L, 31L, 60L, 43L, 39L, 50L, 28L, 44L, 57L, 36L, 62L, 33L, 47L, 53L, 40L, 58L), clinic = rep(c("north", "south", "west"), each = 8L), patient = rep(seq_len(12L), each = 2L) ) cv <- resample_cv(data, outcome ~ age, n_folds = 4L) cv$splits[[1L]]$analysis cv$splits[[1L]]$assessment # Keep outcome classes balanced across folds. resample_cv(data, outcome ~ age, n_folds = 4L, strata = ~ outcome) # Repeat the full K-fold partition three times. resample_cv(data, outcome ~ age, n_folds = 4L, n_iters = 3L) # Keep repeated observations from the same patient together. resample_cv(data, outcome ~ age, n_folds = 4L, group = ~ patient)#---------------------------------------------------------------------------- # resample_cv() examples #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) data <- data.frame( outcome = c(rep("event", 12L), rep("none", 12L)), age = c(34L, 52L, 41L, 63L, 29L, 48L, 37L, 55L, 46L, 31L, 60L, 43L, 39L, 50L, 28L, 44L, 57L, 36L, 62L, 33L, 47L, 53L, 40L, 58L), clinic = rep(c("north", "south", "west"), each = 8L), patient = rep(seq_len(12L), each = 2L) ) cv <- resample_cv(data, outcome ~ age, n_folds = 4L) cv$splits[[1L]]$analysis cv$splits[[1L]]$assessment # Keep outcome classes balanced across folds. resample_cv(data, outcome ~ age, n_folds = 4L, strata = ~ outcome) # Repeat the full K-fold partition three times. resample_cv(data, outcome ~ age, n_folds = 4L, n_iters = 3L) # Keep repeated observations from the same patient together. resample_cv(data, outcome ~ age, n_folds = 4L, group = ~ patient)
Create an index-only resampling plan for nested K-fold cross-validation. Each outer split stores integer row positions for final model assessment. Each outer split also contains inner cross-validation splits for model selection within the outer analysis set. The function supports repeated outer cross-validation, stratified folds, and grouped folds.
resample_nested_cv( data, formula, n_outer_folds = 5L, n_inner_folds = 5L, n_iters = 1L, strata = NULL, group = NULL )resample_nested_cv( data, formula, n_outer_folds = 5L, n_inner_folds = 5L, n_iters = 1L, strata = NULL, group = NULL )
data |
(data.frame) |
formula |
(formula) |
n_outer_folds |
(numeric scalar: |
n_inner_folds |
(numeric scalar: |
n_iters |
(numeric scalar: |
strata |
(formula: |
group |
(formula: |
resample_nested_cv() does not copy data.
It returns row indices that can be used to subset data later.
The predictor side of formula is not used to build the splits.
It is accepted so the same formula can be passed to a model-fitting function.
Each outer split separates the original data rows into:
analysis:
integer row indices used for model selection and final model fitting within that outer split.
These are all rows not in the outer assessment.
They are sorted in ascending order.
assessment:
integer row indices held out for final assessment of that outer split.
They are sorted in ascending order.
iter:
integer iteration number.
outer_fold:
integer outer-fold number within iter.
seed:
single positive integer for reproducible model fits on this split.
inner:
a named list of inner cross-validation splits.
Inner splits are created only from the outer analysis rows.
Each inner split contains:
analysis:
integer row indices used to fit a candidate model.
They are restricted to the parent outer analysis set.
They are sorted in ascending order.
assessment:
integer row indices used to compare candidate models.
They are restricted to the parent outer analysis set.
They are sorted in ascending order.
iter:
integer iteration number inherited from the parent outer split.
outer_fold:
integer outer-fold number inherited from the parent outer split.
inner_fold:
integer inner-fold number.
seed:
single positive integer for reproducible candidate model fits on this inner analysis set.
Within each iteration, the outer assessment indices partition seq_len(nrow(data)).
Within each outer split, the inner assessment indices partition the parent outer analysis indices.
Outer and inner folds use the same fold-assignment rules as resample_cv().
Rows are the sampling units when group is NULL.
Groups are the sampling units when group is supplied.
Without stratification, fold sizes are balanced over the sampling units.
When strata is supplied, strata are processed from smallest to largest, so rare strata are allocated first.
Within each stratum, sampling units are shuffled and distributed across folds as evenly as possible.
Per-stratum fold counts differ by at most one.
Ties are broken with the current random-number generator state.
Stratification is optional.
Pass NULL for no stratification.
strata = ~ 1 stratifies by the model outcome.
A formula such as strata = ~ a + b stratifies by interaction(data$a, data$b, drop = TRUE).
Variables referenced by strata must be columns of data.
NA stratum values are not allowed.
Empty factor levels are dropped.
Stratification variables are treated as categorical.
Pre-bin continuous variables before passing them to strata.
This also applies when strata = ~ 1 and the model outcome is continuous.
Expressions in strata are not evaluated as transformations.
Create transformed or composite strata columns in data before calling resample_nested_cv().
Grouping is optional.
Pass NULL for ordinary row-level resampling.
Pass a one-sided formula such as group = ~ subject to keep all rows from each subject together.
The right-hand side of group must be a single bare column name in data.
NA group values are not allowed.
Empty factor levels are dropped.
Rows that share a group value are kept together in both outer and inner splits.
No group is split between analysis and assessment.
Because groups are the sampling units, n_outer_folds must be less than or equal to the number of unique groups.
n_inner_folds must be no larger than the smallest possible outer-analysis sampling-unit count.
This upper bound is unit_n - ceiling(unit_n / n_outer_folds).
Fold balance is defined over groups, not rows.
When group sizes differ, row counts across folds may differ by more than one.
When strata and group are both supplied, the resolved stratum vector must be constant within each group.
If strata = ~ 1 is used with group, the outcome must be constant within each group.
When n_iters > 1, the function repeats the outer K-fold partitioning.
Inner partitioning is generated separately for each outer split.
The total number of outer splits is n_outer_folds * n_iters.
Each iteration advances the current random-number generator state.
Outer split names are generated from the total number of outer splits.
For fewer than 10 outer splits, names are split1, ..., split9.
For 10 or more outer splits, names are zero-padded so lexical order matches numeric order.
Inner split names are generated separately within each outer split from n_inner_folds.
Call base::set.seed() immediately before resample_nested_cv() for reproducible splits.
Split indices reproduce the fold rows, but a downstream model fit that uses randomness depends on the global random-number state when it runs, so it does not reproduce on its own.
Each outer split carries a seed for its final refit, and each inner split carries a seed for its candidate sweep.
Set the relevant seed immediately before fitting to make the fit reproducible regardless of run order.
# Inner candidate sweep on inner analysis set j of outer split i.
inner <- ncv$splits[[i]]$inner[[j]]
set.seed(inner$seed)
candidate_seeds <- sample.int(.Machine$integer.max, K, replace = FALSE)
for (k in seq_len(K)) {
set.seed(candidate_seeds[k])
fit_k <- fit_candidate_k(inner$analysis)
}
# Final refit on outer analysis set i.
set.seed(ncv$splits[[i]]$seed)
final <- fit_model(ncv$splits[[i]]$analysis)
Seeds are created from the global generator and the generator state is then restored, so split indices and the caller's random-number stream are unchanged.
Reproducing a fit assumes the same base::RNGkind() on each run.
An S3 list of class c("nested_cv_resamples", "resamples") with elements:
call:
the matched call.
method:
"nested_cv".
n:
nrow(data).
formula:
the input formula.
outcome_var:
the resolved outcome column name.
strata:
the input strata formula or NULL.
strata_vars:
the resolved strata column names or NULL.
group:
the input group formula or NULL.
group_vars:
the resolved group column name or NULL.
n_groups:
the number of unique groups when group is supplied, otherwise NULL.
n_outer_folds:
the resolved outer-fold count.
n_inner_folds:
the resolved inner-fold count.
n_iters:
the resolved iteration count.
splits:
a named list of outer split entries.
Each outer split contains analysis, assessment, iter, outer_fold, seed, and inner.
Each inner split contains analysis, assessment, iter, outer_fold, inner_fold, and seed.
resample_cv(),
resample_bootstrap(),
base::set.seed()
#---------------------------------------------------------------------------- # resample_nested_cv() examples #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) data <- data.frame( outcome = c(rep("event", 12L), rep("none", 12L)), age = c(34L, 52L, 41L, 63L, 29L, 48L, 37L, 55L, 46L, 31L, 60L, 43L, 39L, 50L, 28L, 44L, 57L, 36L, 62L, 33L, 47L, 53L, 40L, 58L), clinic = rep(c("north", "south", "west"), each = 8L), patient = rep(seq_len(12L), each = 2L) ) ncv <- resample_nested_cv( data, outcome ~ age, n_outer_folds = 4L, n_inner_folds = 3L ) ncv$splits[[1L]]$assessment ncv$splits[[1L]]$inner[[1L]]$assessment # Keep outcome classes balanced in outer and inner folds. resample_nested_cv( data, outcome ~ age, n_outer_folds = 4L, n_inner_folds = 3L, strata = ~ outcome ) # Keep repeated observations from the same patient together. resample_nested_cv( data, outcome ~ age, n_outer_folds = 4L, n_inner_folds = 3L, group = ~ patient )#---------------------------------------------------------------------------- # resample_nested_cv() examples #---------------------------------------------------------------------------- library(bkmodel) set.seed(1L) data <- data.frame( outcome = c(rep("event", 12L), rep("none", 12L)), age = c(34L, 52L, 41L, 63L, 29L, 48L, 37L, 55L, 46L, 31L, 60L, 43L, 39L, 50L, 28L, 44L, 57L, 36L, 62L, 33L, 47L, 53L, 40L, 58L), clinic = rep(c("north", "south", "west"), each = 8L), patient = rep(seq_len(12L), each = 2L) ) ncv <- resample_nested_cv( data, outcome ~ age, n_outer_folds = 4L, n_inner_folds = 3L ) ncv$splits[[1L]]$assessment ncv$splits[[1L]]$inner[[1L]]$assessment # Keep outcome classes balanced in outer and inner folds. resample_nested_cv( data, outcome ~ age, n_outer_folds = 4L, n_inner_folds = 3L, strata = ~ outcome ) # Keep repeated observations from the same patient together. resample_nested_cv( data, outcome ~ age, n_outer_folds = 4L, n_inner_folds = 3L, group = ~ patient )
Reshape a tall data frame to a wide layout following y ~ x | z.
Optionally aggregates duplicate (z, x) cells.
tall_to_wide(data, formula, agg_fun = "error", drop_unused_levels = FALSE) ftall_to_wide(data, y, x, z, agg_fun = "error", drop_unused_levels = FALSE)tall_to_wide(data, formula, agg_fun = "error", drop_unused_levels = FALSE) ftall_to_wide(data, y, x, z, agg_fun = "error", drop_unused_levels = FALSE)
data |
(data.frame) |
formula |
(formula) |
agg_fun |
(scalar character or function: |
drop_unused_levels |
(scalar logical: |
y |
(scalar character) |
x |
(scalar character) |
z |
(scalar character) |
Inputs are interpreted as:
y is a numeric vector for the outcome.
x is a factor representing the grouping variable.
z is a factor representing the blocking variable (subject or item index).
Pre-processing rules:
If x or z are not factors, they are coerced via base::as.factor().
Factor levels with no observations appear as all-NA columns (x) or rows (z) in the output.
Set drop_unused_levels = TRUE to drop factor levels with no observations after the NA filter so they do not appear in the output.
Cell aggregation:
A cell with no observations becomes NA_real_.
A cell with one observation returns that value (which may itself be NA).
A cell with two or more observations requires agg_fun to be set to something other than "error", otherwise an error is raised.
User-supplied agg_fun:
Called once per non-empty cell with a length >= 1 numeric vector, which may contain NA.
Must return a single numeric value.
NA handling:
Rows with NA in x or z are dropped before aggregation.
Rows with NA in y are retained and reach the aggregator.
Built-in aggregators apply na.rm = TRUE and return NA_real_ when every value in a cell is NA.
"first" and "last" return the first or last non-NA value in row order.
Any NaN produced by an aggregator (Inf + (-Inf) in sum, (-Inf + Inf) / 2 in median, 0 / 0 in mean, or returned by a user function) is normalized to NA_real_.
The operation is analogous to:
reshape( data = data, direction = "wide", idvar = "z", timevar = "x", v.names = "y" )
A data.frame with nlevels(z) rows and nlevels(x) + 1 columns.
The first column is a factor named after z, holding the levels of z in levels(z) order, one per row.
The remaining columns are numeric, one per level of x in levels(x) order, each named by its level string.
Each numeric cell holds the (possibly aggregated) y value for its (z, x) combination, or NA_real_ if the cell is empty.
#---------------------------------------------------------------------------- # Basic reshape: one observation per (z, x) cell #---------------------------------------------------------------------------- library(bkmodel) tall <- data.frame( y = c(1, 2, 3, 4), x = factor(c("A", "B", "A", "B")), z = factor(c("1", "1", "2", "2")) ) tall # Wide layout: one row per level of z, one column per level of x tall_to_wide(tall, y ~ x | z) # Equivalent (fast) character interface ftall_to_wide(tall, y = "y", x = "x", z = "z") # Equivalent base R interface reshape( data = tall, direction = "wide", idvar = "z", timevar = "x", v.names = "y" ) #---------------------------------------------------------------------------- # Aggregating duplicate (z, x) cells #---------------------------------------------------------------------------- dup <- data.frame( y = c(10, 20, 3, 4), x = factor(c("A", "A", "A", "B")), z = factor(c("1", "1", "2", "2")) ) # Built-in aggregator tall_to_wide(dup, y ~ x | z, agg_fun = "mean") # User-supplied aggregator, called once per non-empty cell tall_to_wide(dup, y ~ x | z, agg_fun = function(v) max(abs(v)))#---------------------------------------------------------------------------- # Basic reshape: one observation per (z, x) cell #---------------------------------------------------------------------------- library(bkmodel) tall <- data.frame( y = c(1, 2, 3, 4), x = factor(c("A", "B", "A", "B")), z = factor(c("1", "1", "2", "2")) ) tall # Wide layout: one row per level of z, one column per level of x tall_to_wide(tall, y ~ x | z) # Equivalent (fast) character interface ftall_to_wide(tall, y = "y", x = "x", z = "z") # Equivalent base R interface reshape( data = tall, direction = "wide", idvar = "z", timevar = "x", v.names = "y" ) #---------------------------------------------------------------------------- # Aggregating duplicate (z, x) cells #---------------------------------------------------------------------------- dup <- data.frame( y = c(10, 20, 3, 4), x = factor(c("A", "A", "A", "B")), z = factor(c("1", "1", "2", "2")) ) # Built-in aggregator tall_to_wide(dup, y ~ x | z, agg_fun = "mean") # User-supplied aggregator, called once per non-empty cell tall_to_wide(dup, y ~ x | z, agg_fun = function(v) max(abs(v)))