1. Overview
The naivebayes package provides an efficient implementation of the
widely used Naïve Bayes classifier. It follows three core principles:
efficiency, user-friendliness, and reliance solely on Base R. The last
principle keeps the package stable and reliable because no external
dependencies are introduced[^1]. It also supports efficiency, since the
package builds on the optimized routines in Base R, many of which are
written in high-performance languages such as C/C++ or FORTRAN.
Together, these principles make naivebayes a dependable and fast tool
for Naïve Bayes classification tasks.
The naive_bayes() function detects the class (data type) of each
feature in a dataset and, depending on user specifications, can assume
a different distribution for each feature. It currently supports the
following class conditional distributions:
- categorical distribution for discrete features (with the Bernoulli
  distribution as a special case for binary outcomes)
- Poisson distribution for non-negative integer features
- Gaussian distribution for continuous features
- non-parametrically estimated densities via Kernel Density Estimation
  for continuous features
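To illustrate how a class conditional distribution enters the classifier, here is a minimal base R sketch for a single Gaussian feature. The toy values and all object names are made up for illustration, and the code is not the package's internal implementation.

```r
# Toy data: one continuous feature and two classes (values are made up)
x <- c(4.9, 5.1, 5.0, 6.3, 6.5, 6.1)
y <- factor(c("A", "A", "A", "B", "B", "B"))

# Class priors estimated as relative class frequencies
prior <- sapply(split(x, y), length) / length(x)

# Gaussian class conditional parameters: per-class mean and standard deviation
mu    <- sapply(split(x, y), mean)
sigma <- sapply(split(x, y), sd)

# Unnormalized posterior for a new observation: prior times likelihood
x_new  <- 5.4
unnorm <- prior * dnorm(x_new, mean = mu, sd = sigma)

# Normalize so that the class probabilities sum to one
posterior <- unnorm / sum(unnorm)
posterior
```

With several features, the per-feature likelihoods are multiplied together, which is the "naive" conditional independence assumption.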
In addition, specialized functions are available that implement:
- Bernoulli Naive Bayes via bernoulli_naive_bayes()
- Multinomial Naive Bayes via multinomial_naive_bayes()
- Poisson Naive Bayes via poisson_naive_bayes()
- Gaussian Naive Bayes via gaussian_naive_bayes()
- Non-Parametric Naive Bayes via nonparametric_naive_bayes()
These specialized functions are optimized for efficiency: they rely on
linear algebra operations, which makes them fast on dense matrices. They
can also exploit the sparsity of matrices for better performance, and
they work in the presence of missing data. The package also includes
several helper functions to improve the user experience. Moreover, the
general naive_bayes() function can be accessed through the excellent
caret package, which adds further versatility.
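As a rough sketch of how one of the specialized functions might be called, the example below fits gaussian_naive_bayes() on a dense numeric matrix. It assumes the built-in iris data as a stand-in dataset; the object names are illustrative only.

```r
library(naivebayes)

# Numeric feature matrix and class vector from the built-in iris data
M <- as.matrix(iris[, 1:4])
y <- iris$Species

# Fit Gaussian Naive Bayes on the dense matrix
gnb <- gaussian_naive_bayes(x = M, y = y)

# Predicted classes and posterior probabilities for the first five rows
pred_class <- predict(gnb, newdata = M[1:5, ], type = "class")
pred_prob  <- predict(gnb, newdata = M[1:5, ], type = "prob")
```

Unlike the general naive_bayes() function, the specialized functions expect a matrix rather than a data.frame.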
2. Installation
The naivebayes package can be installed from the CRAN repository by
running the following line in the console:
install.packages("naivebayes")
# Or the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")
3. Usage
The naivebayes package provides a user-friendly implementation of the
Naïve Bayes algorithm via a formula interface and via the classical
combination of a matrix/data.frame containing the features and a vector
with the class labels. All functions recognize missing values, give an
informative warning and, more importantly, know how to handle them. The
following sections demonstrate the basic usage of the main function
naive_bayes(). Examples with the specialized Naive Bayes classifiers can
be found in the extended documentation:
https://majkamichal.github.io/naivebayes/
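The handling of missing values described above can be tried out with a small sketch. It assumes the built-in iris data with a few NAs introduced by hand; warnings are suppressed only to keep the output short, and the object names are illustrative.

```r
library(naivebayes)

# Copy of iris with a few missing feature values (introduced for illustration)
train <- iris
train$Sepal.Length[c(1, 60, 120)] <- NA

# Fit the model; any warnings about NAs are muted here
nb_na <- suppressWarnings(naive_bayes(Species ~ ., data = train))

# New observations, one of which has a missing feature value
newdata <- iris[c(10, 70, 130), -5]
newdata$Petal.Width[2] <- NA

# Posterior probabilities are still produced for every row
post <- suppressWarnings(predict(nb_na, newdata, type = "prob"))
post
```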
3.1 Example data
library(naivebayes)
#> naivebayes 1.0.0 loaded
#> For more information please visit:
#> https://majkamichal.github.io/naivebayes/
# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5, 15)))
train <- data[1:95, ]
test <- data[96:100, -1]
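3.2 Formula interface
The formula interface mentioned in the usage overview can be sketched as follows. The block recreates the simulated data from section 3.1 so that it runs on its own; the object names nb, pred_class and pred_prob are illustrative.

```r
library(naivebayes)

# Recreate the example data from section 3.1
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5, 15)))
train <- data[1:95, ]
test <- data[96:100, -1]

# Fit the model: class is the response, all other columns are features
nb <- naive_bayes(class ~ ., data = train)

# Predicted classes and posterior probabilities for the test rows
pred_class <- predict(nb, test, type = "class")
pred_prob  <- predict(nb, test, type = "prob")

# The infix operators are shorthand for the two predict() calls above
nb %class% test
nb %prob% test
```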
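3.3 Matrix/data.frame and class vector
# Features without the class column, and the class labels as a separate vector
X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
# Posterior probabilities for the test observations
nb2 %prob% test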
#> classA classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974
3.4 Non-parametric estimation for continuous features
Kernel density estimation can be used to estimate the class conditional
densities of continuous features. It has to be requested explicitly via
the parameter usekernel=TRUE; otherwise a Gaussian distribution is
assumed. The estimation is performed with the built-in R function
density(). By default, the Gaussian smoothing kernel and Silverman’s
rule of thumb as bandwidth selector are used.
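A minimal sketch of requesting kernel density estimation, assuming the built-in iris data as a stand-in (all four features are continuous). The object names are illustrative.

```r
library(naivebayes)

# Fit with kernel density estimation for all continuous (numeric) features
nb_kde <- naive_bayes(Species ~ ., data = iris, usekernel = TRUE)

# Helper that reports which distribution was assigned to each feature
get_cond_dist(nb_kde)

# Posterior probabilities for one observation from each species
post <- predict(nb_kde, iris[c(1, 51, 101), -5], type = "prob")
post
```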
3.4.1 Changing kernel
In general, there are 7 different smoothing kernels available:
- gaussian
- epanechnikov
- rectangular
- triangular
- biweight
- cosine
- optcosine
They can be specified in naive_bayes() via the additional parameter
kernel. The Gaussian kernel is the default smoothing kernel. Please see
density() and bw.nrd() for further details.
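Switching the kernel might look like the following sketch, again assuming the iris data as a stand-in. The choice of the biweight kernel is arbitrary.

```r
library(naivebayes)

# Same kind of model as with the default kernel, but using the biweight kernel
nb_kde_biweight <- naive_bayes(Species ~ ., data = iris,
                               usekernel = TRUE, kernel = "biweight")

# Posterior probabilities for one observation from each species
post <- predict(nb_kde_biweight, iris[c(1, 51, 101), -5], type = "prob")
post
```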
3.4.3 Adjusting bandwidth
The parameter adjust rescales the estimated bandwidth and thus
introduces more flexibility into the estimation process. The default
value 1 means no rescaling. For values below 1 the density becomes
“wigglier” and for values above 1 it tends to be “smoother”.
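The effect of adjust can be checked directly with the base R function density(), which the text names as the underlying estimator. This sketch uses simulated data and does not call the package itself.

```r
set.seed(1)
x <- rnorm(200)

d_default <- density(x)                # adjust = 1: bandwidth left unchanged
d_wiggly  <- density(x, adjust = 0.5)  # half the selected bandwidth
d_smooth  <- density(x, adjust = 2)    # twice the selected bandwidth

# The bandwidth actually used is adjust times the selected bandwidth
c(default = d_default$bw, wiggly = d_wiggly$bw, smooth = d_smooth$bw)
```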
3.5 Model non-negative integers with Poisson distribution
Class conditional distributions of non-negative integer predictors can
be modelled with the Poisson distribution. This is achieved by setting
usepoisson=TRUE in the naive_bayes() function and by making sure
that the variables representing counts in the dataset are of class
integer.
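A minimal sketch of the two requirements above, using simulated data. The data frame d and its columns are made up for illustration.

```r
library(naivebayes)

# Simulated data (illustrative): rpois() already returns integer values
set.seed(1)
n <- 100
d <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                count = rpois(n, lambda = 5),
                norm = rnorm(n))

# Counts must be of class integer; counts stored as doubles would
# need as.integer() before fitting
stopifnot(is.integer(d$count))

# Request the Poisson distribution for integer features
nb_pois <- naive_bayes(class ~ ., data = d, usepoisson = TRUE)
get_cond_dist(nb_pois)

# Posterior probabilities for the first five observations
post <- predict(nb_pois, d[1:5, -1], type = "prob")
post
```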
[^1]: Specialized Naïve Bayes functions within the package may
optionally utilize sparse matrices if the Matrix package is
installed. However, the Matrix package is not a dependency, and
users are not required to install or use it.
Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/
3.4.2 Changing bandwidth selector
The density() function offers 5 different bandwidth selectors, which can
be specified via the bw parameter:
- nrd0 (Silverman’s rule-of-thumb)
- nrd (variation of the rule-of-thumb)
- ucv (unbiased cross-validation)
- bcv (biased cross-validation)
- SJ (Sheather & Jones method)
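The five selectors can be compared with base R alone. The sketch below uses simulated data and calls density() directly. Per the text above, the same bw value can presumably be given to naive_bayes() together with usekernel=TRUE.

```r
set.seed(1)
x <- rnorm(200)

selectors <- c("nrd0", "nrd", "ucv", "bcv", "SJ")

# Bandwidth chosen by each selector; the cross-validation selectors can warn
# when the optimum lies at the edge of their search range, so warnings are muted
bws <- suppressWarnings(
  sapply(selectors, function(s) density(x, bw = s)$bw)
)
bws
```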