missForest
missForest is a nonparametric imputation method for mixed-type tabular data in R. It handles numeric and categorical variables simultaneously by iteratively training random forests to predict missing entries from the observed ones. No explicit modeling assumptions, no matrix factorizations—just strong predictive baselines that work well out of the box.
Two backends are available: ranger (default) and randomForest (legacy/compat).
The package also includes utilities to measure imputation error, generate missingness for experiments, and inspect variable types.
Installation
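missForest is distributed on CRAN, so a standard install should be enough. The sketch below assumes a CRAN mirror is configured. Installing ranger and randomForest explicitly is only a precaution, since they may already be pulled in as dependencies.

```r
# Install missForest from CRAN (assumes a configured CRAN mirror).
install.packages("missForest")

# The two backends named in this README; installing them explicitly is a
# precaution, since they may already arrive as dependencies.
install.packages(c("ranger", "randomForest"))

# Attach the package for the current session.
library(missForest)
```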
Quick start
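A minimal sketch of a first run, using the built-in iris data as a stand-in for your own data frame. The seed value and the 10% missingness rate are arbitrary choices for illustration.

```r
library(missForest)

# iris has four numeric columns and one factor column (mixed-type data).
data(iris)

set.seed(81)                          # arbitrary seed, for illustration only
iris_mis <- prodNA(iris, noNA = 0.1)  # blank out 10% of entries at random

# Impute with default settings.
imp <- missForest(iris_mis)

head(imp$ximp)  # the imputed data frame
imp$OOBerror    # out-of-bag estimate of the imputation error
```

The imputed data live in `imp$ximp`. `imp$OOBerror` gives an error estimate that does not require knowing the true values.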
Choosing a backend
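The backend is selected with the backend argument described in the API overview. The sketch below assumes an installed missForest version that exposes this argument and that both ranger and randomForest are installed. Older releases may only support randomForest and would reject the argument.

```r
library(missForest)

set.seed(1)                       # arbitrary seed, for illustration only
x <- prodNA(iris, noNA = 0.1)     # toy data with 10% missing entries

# Default backend, stated explicitly here.
imp_ranger <- missForest(x, backend = "ranger")

# Legacy/compat backend.
imp_rf <- missForest(x, backend = "randomForest")

# Compare the out-of-bag error estimates of the two runs.
imp_ranger$OOBerror
imp_rf$OOBerror
```

The two backends use different forest implementations, so the imputed values and error estimates should be expected to differ slightly between runs.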
Parallelization
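A hedged sketch of the parallelize argument. It assumes doParallel as the foreach backend (any registered foreach backend should serve the same purpose) and uses two workers, which is an arbitrary choice.

```r
library(missForest)
library(doParallel)  # assumption: doParallel provides the foreach backend

# Register two workers; keep this at or below the number of columns.
registerDoParallel(cores = 2)

set.seed(42)                      # arbitrary seed, for illustration only
x <- prodNA(iris, noNA = 0.1)     # toy data with 10% missing entries

# Build the forests for different variables in parallel.
imp_vars <- missForest(x, parallelize = "variables")

# Parallelize within each variable's forest instead.
imp_forests <- missForest(x, parallelize = "forests")

# Release the implicitly created workers.
stopImplicitCluster()
```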
Two modes are available via parallelize:
"variables": build forests for different variables in parallel (register a foreach backend).
"forests": parallelize within a single variable’s forest (ranger threads; or foreach sub-forests for randomForest).
API overview
missForest(xmis, ...)
Core imputation function.
Key arguments:
xmis — data frame/matrix with missing values (columns must be numeric or factor).
maxiter — maximum iterations (default 10).
ntree — trees per forest (default 100).
mtry — variables tried at each split (default sqrt(p)).
nodesize — length-2 numeric: minimum node size for c(numeric, factor). Default c(5, 1).
variablewise — return per-variable OOB error if TRUE.
parallelize — "no", "variables", or "forests".
num.threads — threads for ranger (ignored by randomForest).
backend — "ranger" (default) or "randomForest".
xtrue — optional complete data for benchmarking (adds $error).
Some argument mappings for backend = "ranger":
ntree → num.trees
nodesize → min.bucket (separately for regression/classification; default c(5,1))
sampsize (counts) → sample.fraction (fractions; overall or per-class)
classwt → class.weights
cutoff handled by fitting probability forests and post-thresholding
Utilities
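The utilities in this section can be combined into a small benchmarking loop: start from complete data, inject missingness, impute, and score against the truth. The sketch below uses iris as the complete data. The seed and the missingness rate are arbitrary.

```r
library(missForest)

set.seed(99)                           # arbitrary seed, for illustration only
x_true <- iris                         # complete reference data
x_mis  <- prodNA(x_true, noNA = 0.1)   # inject 10% MCAR missingness

# Inspect the column types as missForest sees them.
varClass(x_mis)

# Impute, then score only the entries that were artificially removed.
imp <- missForest(x_mis)
err <- mixError(imp$ximp, x_mis, x_true)
err  # NRMSE for the numeric columns, PFC for the factor column
```

Lower values are better for both measures. A PFC of 0 means every missing factor entry was recovered exactly.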
mixError(ximp, xmis, xtrue) — computes NRMSE (numeric) and PFC (factor) over true missing entries.
nrmse(ximp, xmis, xtrue) — NRMSE for continuous-only data.
prodNA(x, noNA = 0.1) — injects MCAR missingness into a data frame.
varClass(x) — returns "numeric"/"factor" per column.
Tips & best practices
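For the tip in this section about character columns, a minimal base-R sketch of the conversion. The data frame df and its columns are hypothetical and exist only for illustration.

```r
# Hypothetical data frame with one character column and missing values.
df <- data.frame(
  city = c("Bern", NA, "Zurich", "Bern"),
  temp = c(12.5, 14.1, NA, 11.8),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# city is now a factor; NA entries stay NA and temp is untouched.
str(df)
```

After this step every column is numeric or factor, which is the input format the xmis argument expects.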
Convert character columns to factors before calling missForest.
For wide data, consider parallelize = "variables". For deep/expensive trees, consider parallelize = "forests".
Set a seed for quasi-reproducible results:
set.seed(123); imp <- missForest(x)
You can lower ntree during prototyping to speed up iteration.
Citation
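If you use missForest, please cite:
Stekhoven, D. J. & Bühlmann, P. (2012). MissForest—nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118. https://doi.org/10.1093/bioinformatics/btr597
You can also cite the package:
citation("missForest")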
Contributing
Issues and pull requests are welcome. Please include a minimal reproducible example when reporting bugs. For performance discussions, share small benchmarks and session info.
License
GPL (≥ 2)
Contact
Daniel J. Stekhoven — stekhoven@nexus.ethz.ch