janitor
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can perform many of these tasks already, but with janitor they can do it faster and save their thinking for the fun stuff.
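The main janitor functions:
- clean data.frame column names;
- tabulate one, two, or three variables, like an improved table(); and
- help examine and clean data.frames, for example by finding duplicate records.

The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.
Installation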
You can install:
Using janitor
A full description of each function, organized by topic, can be found in janitor’s catalog of functions vignette. There you will find functions not mentioned in this README, like compare_df_cols(), which provides a summary of differences in column names and types when given a set of data.frames.
Below are quick examples of how janitor tools are commonly used.
Cleaning dirty data
Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx:
Dirtiness includes:
Here’s that data after being read in to R:
Now, to clean it up, starting with the column names.
Name cleaning comes in two flavors. make_clean_names() operates on character vectors and can be used during data import:
clean_names() is a convenience version of make_clean_names() that can be used for piped data.frame workflows. The equivalent steps with clean_names() would be:
The data.frame now has clean names. Let’s tidy it up further:
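As a minimal sketch of these steps, the block below cleans names both ways and then drops empty rows and columns. The `messy` data.frame and its column names are invented for illustration; they are not the contents of dirty_data.xlsx.

```r
library(janitor)

# Hypothetical messy data, invented for illustration
messy <- data.frame(
  `First Name` = c("Ann", NA, "Bo"),
  `% Allocated` = c(0.5, NA, 1),
  Notes = c(NA, NA, NA),
  check.names = FALSE
)

# Flavor 1: clean a character vector of names
clean_vec <- make_clean_names(names(messy))

# Flavor 2: clean the names of a data.frame directly
cleaned <- clean_names(messy)

# Tidy further: drop rows and columns that are entirely NA
tidy <- remove_empty(cleaned, which = c("rows", "cols"))
names(tidy)
```

Both flavors produce the same names here; the data.frame version is simply easier to drop into a pipeline.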
Examining dirty data
Finding duplicates
Use get_dupes() to identify and examine duplicate records during data cleaning. Let’s see if any teachers are listed more than once:
Yes, some teachers appear twice. We ought to address this before counting employees.
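The duplicate check described above might look like the following sketch. The small `roster` here is a hypothetical stand-in with made-up teachers, and the choice of `first_name` and `last_name` as the identifying columns is an assumption.

```r
library(janitor)

# Hypothetical roster, invented for illustration
roster <- data.frame(
  first_name = c("Ann", "Bo", "Ann", "Cy"),
  last_name  = c("Lee", "Diaz", "Lee", "Ng"),
  subject    = c("Math", "Art", "Science", "Music")
)

# Return only the rows whose first_name/last_name combination repeats,
# with a dupe_count column giving the number of occurrences
dupes <- get_dupes(roster, first_name, last_name)
dupes
```

Only the repeated teacher is returned, once per occurrence, which makes it easy to inspect how the duplicate rows differ.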
Tabulating tools
A variable (or combinations of two or three variables) can be tabulated with tabyl(). The resulting data.frame can be tweaked and formatted with the suite of adorn_ functions for quick analysis and printing of pretty results in a report. adorn_ functions can be helpful with non-tabyls, too.
tabyl()
Like table(), but pipe-able, data.frame-based, and fully featured.
tabyl() can be called two ways:
- On a vector, to tabulate a single variable: tabyl(roster$subject)
- On a data.frame, naming the variables to tabulate: roster %>% tabyl(subject, employee_status). Here the data.frame is passed in with the %>% pipe; this allows tabyl to be used in an analysis pipeline.

One variable:
Two variables:
Three variables:
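The one-, two-, and three-variable calls can be sketched as below. The `roster` data and the `full_time` column are hypothetical, chosen only so that each call has something to count.

```r
library(janitor)

# Hypothetical roster, invented for illustration
roster <- data.frame(
  subject         = c("Math", "Math", "Art", "Music"),
  employee_status = c("Teacher", "Coach", "Teacher", "Teacher"),
  full_time       = c("Yes", "No", "Yes", "Yes")
)

# One variable: counts and percentages per subject
one_way <- tabyl(roster, subject)

# Two variables: a crosstab of subject by employee_status
two_way <- tabyl(roster, subject, employee_status)

# Three variables: a list of two-way tabyls, one per value of full_time
three_way <- tabyl(roster, subject, employee_status, full_time)

one_way
```

Note that the three-variable form returns a list rather than a single data.frame, so downstream code usually handles each element in turn.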
Adorning tabyls
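As a rough sketch of what this section covers, the block below layers several adorn_ functions on a two-way tabyl. The `roster` data is hypothetical, and the particular combination of adornments is just one plausible choice, not a prescribed recipe.

```r
library(janitor)

# Hypothetical roster, invented for illustration
roster <- data.frame(
  subject         = c("Math", "Math", "Art", "Music"),
  employee_status = c("Teacher", "Coach", "Teacher", "Teacher")
)

report <- roster %>%
  tabyl(subject, employee_status) %>%
  adorn_totals("row") %>%            # append a Total row
  adorn_percentages("row") %>%       # convert counts to row-wise proportions
  adorn_pct_formatting(digits = 1) %>%  # format proportions as percentages
  adorn_ns()                         # add the underlying counts back in

report
```

Each adornment takes a tabyl and returns one, so steps can be added or removed without rewriting the rest of the pipeline.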
The adorn_ functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:
Pipe that right into knitr::kable() in your RMarkdown report.
These modular adornments can be layered to reduce R’s deficit against Excel and SPSS when it comes to quick, informative counts. Learn more about tabyl() and the adorn_ functions from the tabyls vignette.
Contact me
You are welcome to: