splitstackshape::cSplit relies on base strsplit under the hood, and strsplit uses Extended Regular Expressions by default. This type of regex has slightly different syntax and fewer features than the JavaScript implementation. strsplit can also use Perl-like (PCRE) regular expressions via the perl = TRUE parameter, but cSplit doesn't expose this option, so you're stuck with Extended Regular Expression syntax.
A valid R extended regex string that matches what you want it to match is:
"\\n[[:digit:]]+\\. "
BUT: strsplit (and therefore cSplit) does not include the delimiter in the split output. So even with the right syntax, you're going to lose the step numbers in every split line after the first one:
txt <- c(
"1. line one
2. line two
3. line three",
"24. line one
25. 1) first point
2) second point"
)
strsplit(txt, "\\n[[:digit:]]+\\. ")
#> [[1]]
#> [1] "1. line one" "line two" "line three"
#>
#> [[2]]
#> [1] "24. line one" "1) first point\n2) second point"
In R's Perl-like regex, you could solve this problem with a lookahead, but R's extended regex doesn't support lookaheads. I'm afraid I can't think of a way to keep the delimiters using only R extended regex — maybe another helper can?
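One possible workaround, as a rough sketch only: pre-process the text so that the newline before each step number is swapped for a marker string, then split on that marker as a literal. This needs nothing beyond a capture group and a backreference in the replacement, both of which extended regex supports. The marker "<<STEP>>" is an assumption of mine; it only works if that string never appears in your data.

```r
txt <- c(
  "1. line one\n2. line two\n3. line three",
  "24. line one\n25. 1) first point\n2) second point"
)

# Hypothetical marker: any string that never occurs in the data would do.
marker <- "<<STEP>>"

# Replace only the newline in front of each step number with the marker.
# The step number itself is captured and put back via the \\1 backreference.
marked <- gsub("\\n([[:digit:]]+\\. )", paste0(marker, "\\1"), txt)

# Split on the marker as a literal string, so no regex features are needed.
steps <- strsplit(marked, marker, fixed = TRUE)
steps
```

Because the split itself is on a plain string, the marked-up column could in principle be handed to any splitting function that accepts a literal separator.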
Otherwise, I can see a couple of options:
- Use more string processing to restore the missing step numbers after the splitting is done
- Fork cSplit and add a perl = TRUE parameter to its call to strsplit, then use a Perl-like regex. For example, the following R Perl-like regex gives the splits you want:
strsplit(txt, "\\n(?=\\d+\\. )", perl = TRUE)
#> [[1]]
#> [1] "1. line one" "2. line two" "3. line three"
#>
#> [[2]]
#> [1] "24. line one"
#> [2] "25. 1) first point\n2) second point"
If you did fork cSplit to support the perl = TRUE option, you might consider opening an issue (or even a pull request!) on splitstackshape's GitHub to see if the maintainer wants to add this functionality to the package.
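If forking feels like too much, a smaller stopgap might be a base R helper that does the Perl-like split and pads the pieces into columns. This is only a sketch: split_wide is a hypothetical name of mine, not part of splitstackshape, and it returns a plain character matrix rather than everything cSplit gives you. It also assumes a non-empty input vector.

```r
# Hypothetical helper (not part of splitstackshape): split each string with a
# Perl-like regex and pad the pieces into a wide character matrix.
split_wide <- function(x, pattern) {
  pieces <- strsplit(x, pattern, perl = TRUE)
  width <- max(lengths(pieces))
  # Indexing past the end of a vector yields NA, which pads the short rows.
  do.call(rbind, lapply(pieces, function(p) p[seq_len(width)]))
}

txt <- c(
  "1. line one\n2. line two\n3. line three",
  "24. line one\n25. 1) first point\n2) second point"
)

# The lookahead keeps each step number attached to its own line.
wide <- split_wide(txt, "\\n(?=\\d+\\. )")
wide
```

Each input string becomes one row, and rows with fewer steps are filled with NA on the right.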
About your regex in JavaScript...
Btw, I'm not sure that the regex you tried would have given you the desired output in JavaScript, either. The lookahead was around the whole expression, so the match itself consumes no characters, and the \d and \. were over-escaped. I find RegExr to be a great tool for debugging (and learning about) regular expressions, especially because of its "explain" feature. Here's a test of the regex you tried: https://regexr.com/3oat4