How to get a df out of a text without any punctuation?

I have a text with no punctuation marks that I need to convert to a data frame. Is it possible to do so by turning, e.g., every five words into one sentence for the first row, the next five for the second row, and so on?

Thank you in advance for your advice!

Of course that can be done. See the code below.
The question is: what do you want to do with that data.frame?
You can find some ideas in Text Mining with R by Julia Silge and David Robinson.

library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
library(tidyr)

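mytext <- c("This is a text. It contains rather a lot of words",
            "Another text with more than 5 words.")
# Split each string on spaces and flatten the result into one vector of words
splitted <- strsplit(mytext, ' ')
combined <- do.call(c, splitted)
# Number the words, then derive the column (1 to 5) and the row for each one
text_df <- tibble(line = seq_along(combined), text = combined) %>%
  mutate(col  = 1 + (line - 1) %% 5,
         line = 1 + (line - 1) %/% 5)
# Spread each group of five words across columns 1 to 5
textdf2 <- pivot_wider(text_df, values_from = text, names_from = col, id_cols = line)
as.data.frame(textdf2)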
#>   line        1       2      3     4    5
#> 1    1     This      is      a text.   It
#> 2    2 contains  rather      a   lot   of
#> 3    3    words Another   text  with more
#> 4    4     than       5 words.  <NA> <NA>
Created on 2021-07-21 by the reprex package (v2.0.0)
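
For anyone puzzling over the index arithmetic in the `mutate()` call, here is a minimal sketch of what `%%` and `%/%` do with a chunk size of 5. The vector `i` and the 12 word positions are made up for illustration; only base R is used.

```r
# Twelve hypothetical word positions (illustration only)
i <- 1:12

# Position within the chunk: counts 1 to 5, then wraps around
col <- 1 + (i - 1) %% 5

# Chunk (row) number: increases by one after every five words
line <- 1 + (i - 1) %/% 5

data.frame(i, line, col)
```

Subtracting 1 before dividing and adding 1 afterwards is what keeps both counters 1-based, so word 5 still lands in row 1 and word 6 starts row 2.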

This is genius! Thank you so much. The only thing I'd mention is that I actually need each chunk of 5 words in a single column:

1 "This is a text It"
2 "contains rather a lot of"

Also, on line 3 there was a '\n': 3 words\n.

Maybe that is because I used the text without any punctuation in it:

mytext <- c("This is a text It contains rather a lot of words
Another text with more than 5 words")
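
One possible way to get each chunk into a single column, sketched here in base R rather than taken from the reply above: split on any run of whitespace instead of a single space, so the line break inside the string no longer sticks to a word, then paste every five words back together. The names `words`, `chunk_size`, `chunk_id` and `chunks` are chosen just for this example.

```r
mytext <- c("This is a text It contains rather a lot of words
Another text with more than 5 words")

# Split on any run of whitespace (spaces, tabs, newlines), so that
# "words\nAnother" becomes two separate words
words <- unlist(strsplit(mytext, "\\s+"))
words <- words[words != ""]  # drop empty strings caused by leading whitespace

chunk_size <- 5
# Chunk number for each word: 1 for the first five words, 2 for the next five, ...
chunk_id <- (seq_along(words) - 1) %/% chunk_size + 1

# Paste the words of each chunk back together into one string
chunks <- tapply(words, chunk_id, paste, collapse = " ")

text_df <- data.frame(line = as.integer(names(chunks)),
                      text = as.vector(chunks),
                      stringsAsFactors = FALSE)
text_df
```

The last row simply holds whatever words are left over, so it can be shorter than five words. Change `chunk_size` to use a different number of words per row.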

