How to split a column's elements with a custom regular expression in R

I want to do some string processing.

The raw data frame looks like this:

raw <- data.frame(id = c(1,2,1), column = "APP School.  APP2 School. May 5 Scholl, type2.", 
              column2 = " abc school, type2, type3. aaa university, type3.",
              column3 =" abc school, type2, type3. aaa university, type3.")

I want to use "group_by(id)" to get the frequency of "APP School", "APP2 School", "abc school", and so on.

I also want to transform the raw data into other formats.

The first is:

format1 <- data.frame(id = c(1,2,1), name_1 = c("APP School","abc school", "abc school"),
                      class_1 = c(NA, "type2", "type2"), class_2 = c(NA, "type3", "type3"),
                      name_2 = c("APP2 School", "aaa university", "aaa university"),
                      class_3 = c(NA, "type3","type3"),
                      name_3 = c("May 5 Scholl", NA, NA), class_4 = c("type2", NA, NA))

format2 <- data.frame(id = c(1,2,1) , APP_School = c(1,0,0), APP2_School = c(1,0,0),  May_5_Scholl = c(1,0,0),
                        type2 = c(1,1,1), type3  = c(0,2,2), abc_school = c(0,1,1), aaa_university = c(0,1,1))


format3 <- data.frame(id = c(1,2), APP_School = c(1,0), APP2_School = c(1,0),  May_5_Scholl = c(1,0),
                        type2 = c(2,1), type3  = c(2,2), abc_school = c(1,1), aaa_university = c(1,1))

How can I get these?

Thanks a lot.

Hi,

Welcome to the RStudio community!

Your data format is a bit unusual: it has 3 text columns, so all of the text is repeated for every row. To make this work, I assumed that column, column2 and column3 are actually the values of a single column, one for each row.

I then used a list of variables to check for, and the str_detect function to find them in each string.

library(tidyverse)

#Data (note that there is only one column not 3)
raw <- data.frame(
  id = c(1,2,1), 
  column = c("APP School.  APP2 School. May 5 Scholl, type2.", 
  " abc school, type2, type3. aaa university, type3.",
  " abc school, type2, type3. aaa university, type3."))

raw
#>   id                                            column
#> 1  1    APP School.  APP2 School. May 5 Scholl, type2.
#> 2  2  abc school, type2, type3. aaa university, type3.
#> 3  1  abc school, type2, type3. aaa university, type3.

#Variables to check
toCheck = c("APP School", "APP2 School", "May 5 Scholl",
            "type2", "type3", "aaa university", "abc school")


#Add a new column for every variable
new = bind_cols(
  raw, 
  #Check if variable is present in the text
  sapply(toCheck, function(x){
    str_detect(raw$column, x) %>% as.integer()
  })
)

#Show the result (ignore text for better display)
new %>% select(-column)
#>   id APP School APP2 School May 5 Scholl type2 type3 aaa university abc school
#> 1  1          1           1            1     1     0              0          0
#> 2  2          0           0            0     1     1              1          1
#> 3  1          0           0            0     1     1              1          1

#Summarise
new2 = new %>% group_by(id) %>% summarise(across(-column, sum))

new2
#> # A tibble: 2 × 8
#>      id `APP School` `APP2 School` `May 5 Scholl` type2 type3 aaa univ…¹ abc s…²
#>   <dbl>        <int>         <int>          <int> <int> <int>      <int>   <int>
#> 1     1            1             1              1     2     1          1       1
#> 2     2            0             0              0     1     1          1       1
#> # … with abbreviated variable names ¹​`aaa university`, ²​`abc school`

Created on 2022-08-07 by the reprex package (v2.0.1)

The results are not identical to your format data frames, but I guess you built those by hand, so there may be some counting errors on your side?
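One possible explanation for the difference: str_detect only reports whether a term is present, while format2 and format3 look like they count every occurrence (for example, "type3" appears twice in the second string). A minimal sketch of that variant, assuming occurrence counts are what is wanted, swaps in str_count and matches the terms literally with fixed(). The names counts, per_row and per_id are just illustrative.

```r
library(dplyr)
library(stringr)

# Same single-column data as above
raw <- data.frame(
  id = c(1, 2, 1),
  column = c("APP School.  APP2 School. May 5 Scholl, type2.",
             " abc school, type2, type3. aaa university, type3.",
             " abc school, type2, type3. aaa university, type3.")
)

toCheck <- c("APP School", "APP2 School", "May 5 Scholl",
             "type2", "type3", "aaa university", "abc school")

# Count how many times each term occurs in each string.
# fixed() treats the term as literal text rather than as a regular expression.
counts <- sapply(toCheck, function(x) str_count(raw$column, fixed(x)))

# Replace spaces so the column names look like those in format2/format3
colnames(counts) <- gsub(" ", "_", toCheck)

# One row per original row (compare with format2)
per_row <- bind_cols(raw["id"], as.data.frame(counts))

# One row per id (compare with format3)
per_id <- per_row %>%
  group_by(id) %>%
  summarise(across(everything(), sum))

per_row
per_id
```

If a term contains characters that are special in regular expressions (such as "." or "("), fixed() avoids having to escape them by hand.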

Hope this helps,
PJ


Thank you very much. I learned a lot.
