Hi,
I have a text with no punctuation marks that needs to be converted to a data frame. Is it possible to do so by turning, e.g., every five words into one sentence for the first row, the next five for the second row, and so on?
Of course that can be done. The code below puts each group of five words in one row, one word per column.
The question is: what do you want to do with that data.frame?
You can find some ideas in Text Mining with R by Julia Silge and David Robinson.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
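# Example input: two strings, 18 words in total
mytext <- c("This is a text. It contains rather a lot of words",
            "Another text with more than 5 words.")
# Split each string on spaces and flatten the result into one vector of words
splitted <- strsplit(mytext, ' ')
combined <- do.call(c, splitted)
# Number the words, then derive each word's column (col) and row (line)
text_df <- tibble(line = seq_along(combined), text = combined) %>%
  mutate(col  = 1 + (line - 1) %% 5,    # position within the group of five
         line = 1 + (line - 1) %/% 5)   # index of the group of five
# Spread each group of five words across the columns 1..5
textdf2 <- pivot_wider(text_df, values_from = text, names_from = col, id_cols = line)
as.data.frame(textdf2)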
#>   line        1       2      3     4    5
#> 1    1     This      is      a text.   It
#> 2    2 contains  rather      a   lot   of
#> 3    3    words Another   text  with more
#> 4    4     than       5 words.  <NA> <NA>
Created on 2021-07-21 by the reprex package (v2.0.0)
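If you would rather have each group of five words as a single string in one column (one "sentence" per row, as in the question), a minimal base R sketch could look like the following. The helper name `words_to_rows` and its argument `n` are made up for illustration and are not part of the answer above; the sketch also assumes that words are separated by whitespace only.

```r
# Hypothetical helper (not from the original answer): collapse every
# `n` consecutive words into one string, giving one row per chunk.
words_to_rows <- function(text, n = 5) {
  # Split on runs of whitespace and flatten into one character vector
  words <- unlist(strsplit(text, "\\s+"))
  # Drop empty strings, which can appear if a string starts with a space
  words <- words[nzchar(words)]
  # Chunk index: words 1..n get 1, words n+1..2n get 2, and so on
  chunk <- (seq_along(words) - 1) %/% n + 1
  # Paste the words of each chunk back together with single spaces
  sentences <- tapply(words, chunk, paste, collapse = " ")
  data.frame(line = seq_along(sentences),
             text = as.vector(sentences),
             stringsAsFactors = FALSE)
}

mytext <- c("This is a text. It contains rather a lot of words",
            "Another text with more than 5 words.")
words_to_rows(mytext)
```

For the example text this should give four rows, with the last row holding the three leftover words. Change `n` if you want a different number of words per row.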