How to get a df out of a text without any punctuation?

Nile · July 21, 2021, 2:16am

Hi,
I have a text that has no punctuation marks, and I need to convert it to a df. Is it possible to do so by turning e.g. every five words into one sentence for the first row, the next five for the second row, and so on?

Thank you in advance for your advice!

HanOostdijk · July 21, 2021, 8:45am

Of course that can be done; see the code below.
The question is: what do you want to do with that data.frame?
You can find some ideas in "Welcome to Text Mining with R" by Julia Silge and David Robinson.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

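mytext <- c("This is a text. It contains rather a lot of words",
            "Another text with more than 5 words.")
# split each string on single spaces and flatten into one vector of words
splitted <- strsplit(mytext, ' ')
combined <- unlist(splitted)
# number the words, then derive a column (1-5) and a row number for each word
text_df <- tibble(line = seq_along(combined), text = combined) %>%
      mutate( col = 1 + ((line - 1) %% 5) ,
              line = 1 + (line - 1) %/% 5)
# spread into one row per group of five words
textdf2 <- pivot_wider(text_df, values_from = text, names_from = col, id_cols = line)
as.data.frame(textdf2)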
#>   line        1       2      3     4    5
#> 1    1     This      is      a text.   It
#> 2    2 contains  rather      a   lot   of
#> 3    3    words Another   text  with more
#> 4    4     than       5 words.  <NA> <NA>
Created on 2021-07-21 by the reprex package (v2.0.0)

Nile · July 22, 2021, 5:10am

This is genius! Thank you so much. The only thing I'd mention is that I actually need each chunk of 5 words in a single column:

line
1 "This is a text It"
2 "contains rather a lot of"

Also, on line 3 there was a '\n' in the output: 3 words\n.

Maybe because I used the text without any punctuation in it:

mytext <- c("This is a text It contains rather a lot of words
Another text with more than 5 words")

.....
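
A minimal sketch of one way to get both things asked for above, in base R. It assumes the text is a single string with a literal line break in it, as in the last snippet, and that a chunk size of 5 is wanted; the names `words`, `chunk_size`, `chunk_id` and `chunks` are only illustrative. Splitting on `"\\s+"` instead of `' '` treats the newline as an ordinary word separator, so no `\n` ends up glued to a word, and `paste(..., collapse = " ")` puts each group of five words into one string.

```r
# Text without punctuation; the line break inside the string is a literal "\n".
mytext <- c("This is a text It contains rather a lot of words
Another text with more than 5 words")

# Split on runs of any whitespace (spaces, tabs, newlines), not only single spaces.
words <- unlist(strsplit(trimws(mytext), "\\s+"))

# Give every word the number of the chunk it belongs to: 1,1,1,1,1,2,2,...
chunk_size <- 5
chunk_id <- (seq_along(words) - 1) %/% chunk_size + 1

# Paste the words of each chunk back together into a single string.
chunks <- vapply(split(words, chunk_id), paste, character(1), collapse = " ")

# One row per chunk, with all of its words in a single column.
text_df <- data.frame(
  line = seq_along(chunks),
  text = unname(chunks),
  stringsAsFactors = FALSE
)
text_df
```

With this input the result has four rows; the third is "words Another text with more" and the last one holds the remaining three words, "than 5 words".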
