All matches
Extract all names, and also first names and last names:
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 × 4
#>   first     last      .text                               .match
#>   <list>    <list>    <chr>                               <list>
#> 1 <chr [2]> <chr [2]> " Ben Franklin and Jefferson Davis" <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore"                <chr [1]>
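Each capture-group column in this result is a list column holding one character vector per input string, so individual matches can be pulled out with ordinary list indexing. A minimal sketch, assuming rematch2 is installed; the variable name all_names is only illustrative:

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)

# all_names is a hypothetical name for the re_match_all() result
all_names <- re_match_all(notables, name_rex)

# First names found in the first input string
all_names$first[[1]]
#> [1] "Ben"       "Jefferson"

# Number of full-name matches per input string
lengths(all_names$.match)
#> [1] 2 1
```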
Match positions
re_exec and re_exec_all are similar to re_match and
re_match_all, but they also return match positions. These functions
return match records. A match record has three components (match,
start and end), and each component can be a vector, which makes a
record similar to a data frame.
pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 × 4
#>   first            last             .text                          .match
#>   <rmtch_rc>       <rmtch_rc>       <chr>                          <rmtch_rc>
#> 1 <named list [3]> <named list [3]> " Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"           <named list>
R has only limited support for hierarchical data frames (i.e. a data
frame used as a column of another data frame), so rematch2 defines
some special classes and a $ operator that make it easier to extract
parts of re_exec and re_exec_all matches. You simply query the
match, start or end part of a column:
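The queries described here can be sketched as follows. This assumes rematch2 is installed and rebuilds the pos object from the same inputs so that the block runs on its own; the positions shown apply to the strings exactly as written in this block.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Matched text of the "first" group, one entry per input string
pos$first$match
#> [1] "Ben"     "Millard"

# 1-based start and end positions of that group within each input string
pos$first$start
#> [1] 2 2
pos$first$end
#> [1] 4 8
```

Since re_exec reports only the first match in each string, every query returns a plain vector with one element per input.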
re_exec_all is very similar, but these queries return lists, with an
arbitrary number of matches per input string:
allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 × 4
#>   first            last             .text                          .match
#>   <rmtch_ll>       <rmtch_ll>       <chr>                          <rmtch_ll>
#> 1 <named list [3]> <named list [3]> " Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"           <named list>
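As a sketch of what these list-valued queries look like, the following block queries the re_exec_all result with the same $ syntax. It assumes rematch2 is installed and redefines the inputs so that it is self-contained.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
allpos <- re_exec_all(notables, name_rex)

# A list with one character vector of first names per input string
allpos$first$match

# A list with the start positions of every full-name match
allpos$.match$start

# Number of full-name matches found in each input string
lengths(allpos$.match$match)
#> [1] 2 1
```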
rematch2
A small wrapper on regular expression matching functions regexpr and gregexpr to return the results in tidy data frames.
Installation
Stable version:
Development version:
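The installation commands are not shown here. A plausible sketch for both versions follows; the stable release is on CRAN, while the development command assumes that the remotes package is installed and that the sources live in the r-lib/rematch2 repository on GitHub.

```r
# Stable version from CRAN
install.packages("rematch2")

# Development version (assumption: remotes is installed and the
# repository is r-lib/rematch2 on GitHub)
remotes::install_github("r-lib/rematch2")
```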
Rematch vs rematch2
Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:
- In rematch2 the text vector is the first argument, and pattern is the second.
- .match is the last column instead of the first.
- rematch2 returns tibble data frames. See https://github.com/tidyverse/tibble.
Usage
First match
With capture groups:
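A minimal sketch of a first-match query with unnamed capture groups, assuming rematch2 is installed. The dates vector and the isodate pattern are illustrative inputs chosen for this example.

```r
library(rematch2)

# Illustrative input: two ISO dates and two strings that do not match
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"

# One row per input string: three unnamed group columns, then .text and .match
res <- re_match(text = dates, pattern = isodate)

# Strings without a match yield NA
res$.match
#> [1] "2016-04-20" "1977-08-08" NA           NA
```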
Named capture groups:
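With named groups, the group names become the column names of the result. A minimal sketch, assuming rematch2 is installed; the dates vector and the isodaten pattern are illustrative inputs chosen for this example.

```r
library(rematch2)

# Illustrative input: two ISO dates and two strings that do not match
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

res <- re_match(text = dates, pattern = isodaten)

# Group names are used as column names, followed by .text and .match
names(res)
#> [1] "year"   "month"  "day"    ".text"  ".match"

# Each group column is a character vector with NA for non-matching strings
res$year
#> [1] "2016" "1977" NA     NA
```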
A slightly more complex example:
License
MIT © Mango Solutions, Gábor Csárdi