How to split a column

Consider this type of column:

df<-data.frame(variable = c("CH2 Frais de personnel (72 287) (65 828) (132 064)"))

Created on 2021-03-17 by the reprex package (v1.0.0.9002)

How can I split it so that it looks like this?
[Image: Capture 3]

Thank you.

Using extract() from the tidyr package (part of the tidyverse), you can use a regex to grab the parts of the string you want and store them as columns in your data.frame. Regexes can be tricky to understand, though.

library(tidyverse)
extract(
  df, 
  variable,
  into=c("variable","year_2020","year_2019","year_2018"),
  regex = "(.*?) \\((.*?)\\) \\((.*?)\\) \\((.*?)\\)"
)

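The regex in the above function has four parts, separated by spaces. The first is (.*?). The parentheses form a capture group, and .*? matches any characters, as few as possible, so this part stops at the first space followed by an opening bracket and therefore matches "CH2 Frais de personnel" in your example.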

After that there are three repetitions of \\((.*?)\\). The \\( and \\) around the outside mean we want to match something in between literal brackets. In regex a bracket is a special character (it opens or closes a capture group), so we escape it with a backslash to say we want an actual bracket in the text and not a regex command. The backslash is doubled because R also treats a backslash as special inside a string. Inside the escaped brackets, we again have (.*?), which again captures anything. So each of these parts means: capture whatever is inside a pair of brackets.
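To make the escaping concrete, here is a small sketch in base R. The test strings are made up for illustration; the point is only that "\\(" in R source is the two-character regex \( which matches a literal opening bracket.

```r
# In R source, "\\(" is a two-character string: a backslash followed by "("
escaped <- "\\("
n_chars <- nchar(escaped)   # 2
writeLines(escaped)         # prints \(

# The escaped pattern matches a literal opening bracket in the text
has_bracket <- grepl(escaped, "(72 287)")   # TRUE
no_bracket  <- grepl(escaped, "72 287")     # FALSE
```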

We're saying match anything, then anything inside brackets, then anything inside brackets, then anything inside brackets, all separated by spaces. So overall, this regex captures the four pieces of your string. This means that the into= argument must also have four names.
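If you want to check what the four groups capture before using the pattern in extract(), one option is a quick base R test on a single string. This is only a sketch: it uses regexec() and regmatches() with perl = TRUE as a stand-in for inspection, not as a description of how extract() works internally.

```r
# Same string and pattern as in the extract() call above
x <- "CH2 Frais de personnel (72 287) (65 828) (132 064)"
pattern <- "(.*?) \\((.*?)\\) \\((.*?)\\) \\((.*?)\\)"

# regexec() locates the full match and each capture group;
# regmatches() pulls them out as a character vector
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Element 1 is the whole match; the remaining four are the capture groups
groups <- parts[-1]
print(groups)
# "CH2 Frais de personnel" "72 287" "65 828" "132 064"
```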

If you want to also keep the original variable, you can use the argument remove = FALSE, but you'll need to change the variable name in the into= argument to avoid clashes.
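As a rough sketch of that variant, the call could look like the following. The column name "label" is just a placeholder chosen here so that it does not clash with the original "variable" column; use whatever name suits your data.

```r
library(tidyr)

df <- data.frame(variable = c("CH2 Frais de personnel (72 287) (65 828) (132 064)"))

# remove = FALSE keeps the original column, so the first new column
# needs a different name ("label" is a hypothetical choice)
result <- tidyr::extract(
  df,
  variable,
  into = c("label", "year_2020", "year_2019", "year_2018"),
  regex = "(.*?) \\((.*?)\\) \\((.*?)\\) \\((.*?)\\)",
  remove = FALSE
)

# The result has the original column plus the four new ones
names(result)
```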
