splitstackshape::cSplit relies on base strsplit under the hood, and strsplit uses Extended Regular Expressions (ERE) by default. ERE has slightly different syntax and fewer features than JavaScript's regex implementation. strsplit can also use Perl-like (PCRE) regular expressions via the perl = TRUE parameter, but cSplit doesn't expose this option, so you're stuck with ERE syntax.
A valid R extended regex string that matches what you want it to match is:
"\\n[[:digit:]]+\\. "
But: strsplit (and therefore cSplit) does not include the delimiter in the split output. So even with the right syntax, you're going to lose the step number on every piece after the first one:
txt <- c(
"1. line one
2. line two
3. line three",
"24. line one
25. 1) first point
2) second point"
)
strsplit(txt, "\\n[[:digit:]]+\\. ")
#> [[1]]
#> [1] "1. line one" "line two" "line three"
#>
#> [[2]]
#> [1] "24. line one" "1) first point\n2) second point"
In R's Perl-like regex, you could solve this problem with a lookahead, but R's extended regex doesn't support lookaheads. I'm afraid I can't think of a way to keep the delimiters in the split pattern itself using only R extended regex; maybe another helper can?
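One possible workaround that stays within extended regex is to do the work before the split rather than in the split pattern: use gsub with a capture group to swap each newline for a marker string while keeping the step number, then split on the marker as a fixed string. This is only a sketch. The marker "<<SPLIT>>" is a made-up placeholder and assumes that string never occurs in your data.

```r
txt <- c(
  "1. line one\n2. line two\n3. line three",
  "24. line one\n25. 1) first point\n2) second point"
)

# Hypothetical marker: any string that is guaranteed not to appear in the data
marker <- "<<SPLIT>>"

# Replace the newline with the marker, but keep the step number (capture group 1).
# This uses R's default extended regex, no perl = TRUE needed.
marked <- gsub("\\n([[:digit:]]+\\. )", paste0(marker, "\\1"), txt)

# Split on the marker as a literal string, so no regex is involved at this step
pieces_marked <- strsplit(marked, marker, fixed = TRUE)
```

Since the marker is a plain string, the marked column could presumably be handed to cSplit with the marker as sep and fixed = TRUE, though I haven't tested that part.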
Otherwise, I can see a couple of options:
- Use more string processing to restore the missing step numbers after the splitting is done
- Fork cSplit and add a perl = TRUE parameter to its call to strsplit, then use a Perl-like regex. For example, the following R Perl-like regex gives the splits you want:
strsplit(txt, "\\n(?=\\d+\\. )(?m)", perl = TRUE)
#> [[1]]
#> [1] "1. line one" "2. line two" "3. line three"
#>
#> [[2]]
#> [1] "24. line one"
#> [2] "25. 1) first point\n2) second point"
If you did fork cSplit to support the perl = TRUE option, you might consider opening an issue (or even a pull request!) on splitstackshape's GitHub to see if the maintainer wants to add this functionality into the package.
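For the first option (restoring the step numbers after the split), a rough base R sketch might look like the following. It pulls out the delimiters that strsplit discarded with regmatches and gregexpr, then pastes each one back onto the piece that followed it. The variable names are just for illustration, and it assumes no string ends with a delimiter (strsplit drops a trailing empty piece, which would throw off the alignment).

```r
txt <- c(
  "1. line one\n2. line two\n3. line three",
  "24. line one\n25. 1) first point\n2) second point"
)

pattern <- "\\n[[:digit:]]+\\. "

# The pieces with delimiters stripped, as in the strsplit call earlier
pieces <- strsplit(txt, pattern)

# The delimiters strsplit threw away, one character vector per input string
delims <- regmatches(txt, gregexpr(pattern, txt))

# Drop the leading newline from each delimiter and glue it onto the piece
# that followed it; the first piece of each string gets nothing prepended
restored <- unname(Map(
  function(p, d) paste0(c("", substring(d, 2)), p),
  pieces, delims
))
```

Here restored should hold the same pieces as the perl = TRUE version above, but you'd still need to get them back into your data frame yourself, which is why forking cSplit may end up being less work.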
About your regex in JavaScript...
By the way, I'm not sure that the regex you tried would have given you the desired output in JavaScript, either. The lookahead was around the whole expression, so nothing is matched, and the \d and \. were over-escaped. I find RegExr to be a great tool for debugging (and learning about) regular expressions, especially because of its "explain" feature. Here's a test of the regex you tried: https://regexr.com/3oat4