rematch2

Match Regular Expressions with a Nicer ‘API’

A small wrapper around the regular expression matching functions regexpr and gregexpr that returns the results in tidy data frames.
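For comparison, here is a rough sketch of what extracting the first match and its capture groups looks like with base R alone. It is not how rematch2 is implemented internally, only an illustration of the bookkeeping that a wrapper saves; the sample text, the pattern and all variable names are made up for this example.

```r
# Base R only: pull the first match and its capture groups by hand.
text <- c("2016-04-20", "not a date")
pattern <- "(?<year>[0-9]{4})-(?<month>[0-9]{2})"

m <- regexpr(pattern, text, perl = TRUE)
matched <- as.vector(m) != -1

# regmatches() drops non-matching elements, so pre-fill with NA
# to keep the result aligned with `text`.
whole <- rep(NA_character_, length(text))
whole[matched] <- regmatches(text, m)

# Capture group positions are stored as attributes of the regexpr() result.
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")
groups <- matrix(
  substring(text, starts, starts + lens - 1),
  nrow = length(text),
  dimnames = list(NULL, attr(m, "capture.names"))
)
groups[!matched, ] <- NA_character_

whole
groups
```

The result is a character vector plus a separate character matrix, which then still have to be combined by hand if a single table is wanted.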


Installation

Stable version:

install.packages("rematch2")

Development version:

pak::pak("r-lib/rematch2")

Rematch vs rematch2

Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:

  • The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
  • In the result, .match is the last column instead of the first.
  • rematch2 returns tibble data frames. See https://github.com/tidyverse/tibble.
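A short sketch of the first two points, using a made-up version pattern and input. Passing the arguments by name is one way to guard against the changed order when porting code from rematch.

```r
library(rematch2)

# Hypothetical sample data, for illustration only.
versions <- c("curl 7.88.1", "no version here")
version_rx <- "(?<major>[0-9]+)\\.(?<minor>[0-9]+)"

# rematch2 order: text first, pattern second.
by_position <- re_match(versions, version_rx)

# Naming the arguments makes the call independent of their order.
by_name <- re_match(pattern = version_rx, text = versions)

# Capture groups come first, then .text, and .match is the last column.
names(by_position)
names(by_name)
```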

Usage

First match

library(rematch2)

With capture groups:

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 × 5
#>   ``    ``    ``    .text            .match    
#>   <chr> <chr> <chr> <chr>            <chr>     
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>      
#> 4 <NA>  <NA>  <NA>  2016             <NA>      
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>      
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21

Named capture groups:

isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 × 5
#>   year  month day   .text            .match    
#>   <chr> <chr> <chr> <chr>            <chr>     
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>      
#> 4 <NA>  <NA>  <NA>  2016             <NA>      
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>      
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21
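Because the result is an ordinary tibble, the usual data frame tools apply to it. The sketch below, which uses a shortened copy of the input above and illustrative variable names, keeps only the rows that matched and converts one capture group to an integer.

```r
library(rematch2)

dates <- c("2016-04-20", "not a date", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
m <- re_match(text = dates, pattern = isodaten)

# Inputs that did not match have NA in .match, so this keeps only the hits.
hits <- m[!is.na(m$.match), ]

# Capture groups are character columns; convert them as needed.
hits$year <- as.integer(hits$year)

# The full match is an ISO date string here, so as.Date() can parse it.
parsed <- as.Date(hits$.match)
```

Note that the pattern only checks the shape of the string, so a value such as a month of 19 would still match and would need separate validation.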

A slightly more complex example:

github_repos <- c(
    "metacran/crandb",
    "jeroenooms/curl@v0.9.3",
    "jimhester/covr#47",
    "hadley/dplyr@*release",
    "r-lib/remotes@550a3c7d3f9e1493a2ba",
    "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 × 9
#>   owner        repo      subdir ref          pull  release catchall .text .match
#>   <chr>        <chr>     <chr>  <chr>        <chr> <chr>   <chr>    <chr> <chr>
#> 1 "metacran"   "crandb"  ""     ""           ""    ""      ""       meta… metac…
#> 2 "jeroenooms" "curl"    ""     "v0.9.3"     ""    ""      ""       jero… jeroe…
#> 3 "jimhester"  "covr"    ""     ""           "47"  ""      ""       jimh… jimhe…
#> 4 "hadley"     "dplyr"   ""     ""           ""    "*rele… ""       hadl… hadle…
#> 5 "r-lib"      "remotes" ""     "550a3c7d3f… ""    ""      ""       r-li… r-lib…
#> 6 ""           ""        ""     ""           ""    ""      "/$&@R6… /$&@… /$&@R…

All matches

Extract all names, and also first names and last names:

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 × 4
#>   first     last      .text                                .match   
#>   <list>    <list>    <chr>                                <list>   
#> 1 <chr [2]> <chr [2]> "  Ben Franklin and Jefferson Davis" <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore"                 <chr [1]>
not$first
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"   
#> 
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#> 
#> [[2]]
#> [1] "Millard Fillmore"
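The list columns can be flattened into one row per match with base R. This is only one possible approach, sketched with illustrative names (all_names, long); the pattern and input are the ones used above.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
all_names <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n_matches <- lengths(all_names$.match)

# One row per match: repeat the input index, then unlist the list columns.
long <- data.frame(
  input = rep(seq_along(notables), n_matches),
  first = unlist(all_names$first, use.names = FALSE),
  last  = unlist(all_names$last, use.names = FALSE)
)
long
```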

Match positions

re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components, match, start and end, each of which can be a vector. In this respect it is similar to a data frame.

pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match      
#>   <rmtch_rc>       <rmtch_rc>       <chr>                           <rmtch_rc>  
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:

pos$first$match
#> [1] "Ben"     "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8
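As the output above suggests, start and end are 1-based and inclusive, so they can be passed straight to substring() to recover the matched text. A small sketch of that check, with recovered as an illustrative variable name:

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Cut the original strings at the reported positions.
recovered <- substring(notables, pos$first$start, pos$first$end)

# Compare with the match component reported by re_exec().
all(recovered == pos$first$match)
```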

re_exec_all is very similar, but these queries return lists, with an arbitrary number of matches per input string:

allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match      
#>   <rmtch_ll>       <rmtch_ll>       <chr>                           <rmtch_ll>  
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>
allpos$first$match
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1]  3 20
#> 
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1]  5 28
#> 
#> [[2]]
#> [1] 8
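Since start and end come back as parallel lists, they can be processed pairwise. The following sketch computes the width of every first-name match with Map(); the name widths is illustrative.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
allpos <- re_exec_all(notables, name_rex)

# Positions are inclusive, so the width of a match is end - start + 1.
widths <- Map(
  function(s, e) e - s + 1L,
  allpos$first$start,
  allpos$first$end
)
widths
```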

License

MIT © Mango Solutions, Gábor Csárdi
