Loops - create a random string and add to a data frame

DarthPathos · November 21, 2019, 6:04pm

Hi all

I'm sure this is a ridiculously easy question but I've been stuck on it for a week and keep getting lost in the code. Basically, I'm creating a generic simulated data set (for the purposes of testing clinical trial protocols) and I have 95% of the data set done. The last component I need is a "Comments" field, using random words (so we can test out sentiment analysis etc.).

I am able to create a string of random words using the janeaustenr package

install.packages("janeaustenr")
library(janeaustenr)
sdata <- sample(words[1:15], size = sample(1:15, 1), replace = TRUE)
paste(sdata, collapse = ' ')

but I can't figure out how to a) create 10,000 rows of these random strings and b) merge that into my existing dataframe.

Any thoughts would be greatly appreciated!
Chris

valeri · November 21, 2019, 6:18pm

Hi @DarthPathos,

when I run this code, the object words is not recognized. So, how do you generate the random words?

DarthPathos · November 21, 2019, 6:31pm

stupid copy / paste error, sorry about that! (code below modified from this thread)

install.packages(c("janeaustenr","tidytext"))

library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

tidy_books <- original_books %>% unnest_tokens(word, text)
words <- tidy_books$word
# draw one length between 1 and 15, then sample that many of the first 15 words
sdata <- sample(words[1:15], size = sample(1:15, 1), replace = TRUE)
paste(sdata, collapse = ' ')

valeri · November 21, 2019, 6:51pm

OK, maybe I don't get it, but why can't you just do

sdata <- sample(words, size = 10000)

and then assign sdata to a column in your, as you say, "existing data frame"? Or do you need more than one random word per row? If so, should it be an equal number of words per row, or a random number of words within some range?
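
A minimal sketch of that one-word-per-row idea, under stated assumptions: `trial_data` is a hypothetical stand-in for the existing data frame, and the short `words` vector stands in for `tidy_books$word` so the example runs without any packages. Because the stand-in vocabulary is smaller than the number of rows, the draw uses `replace = TRUE`; with the full Austen word list that would not be strictly necessary.

```r
# Stand-in vocabulary (assumption): in the thread this is tidy_books$word.
words <- c("sense", "and", "sensibility", "by", "jane", "austen", "1811",
           "chapter", "1", "the", "family", "of", "dashwood", "had", "long")

# Hypothetical existing data frame, one row per simulated subject.
trial_data <- data.frame(subject_id = 1:10000)

set.seed(42)  # only so the sketch is repeatable

# One random word per row; replace = TRUE allows more draws than words.
trial_data$comment <- sample(words, size = nrow(trial_data), replace = TRUE)

head(trial_data)
```

Assigning to `trial_data$comment` works here because the sampled vector has exactly one element per row.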

DarthPathos · November 21, 2019, 7:03pm

@valeri - Thanks for your time - I do want more than 1 random word - the sdata that my code creates correctly creates a random string between 1 and 15 words long (I want to mimic a comments field in a database). So for example, I'd like

[1] long chapter austen the sensibility by
[2] family sensibility long of jane 1811
.
.
.
[1000] of dashwood austen jane sensibility and family sense had sense and 1
.
.

being merged with my dataset. Hoping this makes sense - running on very little sleep (2 1/2 month old son was up at 3am...).
Chris

valeri · November 21, 2019, 7:13pm

How about this:

tidy_books <- original_books %>% unnest_tokens(word, text)
words <- tidy_books$word
sdata <- purrr::map_chr(.x = 1:10000, .f = function(x) {
  # draw a single length between 1 and 15, then sample that many words
  n_words <- sample(1:15, size = 1)
  paste(sample(words[1:15], size = n_words, replace = TRUE), collapse = ' ')
})
sdata <- data.frame(comment = sdata)
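
For the second half of the question (getting the strings into an existing data frame), here is a hedged base R sketch of the same idea. `trial_data` and `make_comment()` are hypothetical names introduced for illustration, and the short `words` vector is a stand-in for `tidy_books$word`. Note that `words[1:15]` in the code above limits the pool to the first 15 tokens of the corpus; the sketch samples from whatever vocabulary it is given.

```r
# Stand-in vocabulary (assumption): in the thread this is tidy_books$word.
words <- c("sense", "and", "sensibility", "by", "jane", "austen", "1811",
           "chapter", "1", "the", "family", "of", "dashwood", "had", "long")

# Hypothetical existing data frame, one row per simulated subject.
trial_data <- data.frame(subject_id = 1:10000)

set.seed(2019)  # only so the sketch is repeatable

# Hypothetical helper: one comment made of 1 to max_words random words.
make_comment <- function(vocab, max_words = 15) {
  n_words <- sample.int(max_words, size = 1)  # a single length in 1..max_words
  paste(sample(vocab, size = n_words, replace = TRUE), collapse = " ")
}

# vapply() plays the role of purrr::map_chr(): one string per row.
comments <- vapply(seq_len(nrow(trial_data)),
                   function(i) make_comment(words),
                   character(1))

# The comments carry no key, so they are attached by position, not merged.
trial_data$comment <- comments

str(trial_data)
```

Since the random comments have no identifier to join on, a positional column assignment is enough; a keyed merge would only be needed if the comments had to line up with specific subjects.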

... and I feel you reg. the baby - been there twice

technocrat · November 21, 2019, 7:16pm

Hi, couple of prelims

It's considered bad form to include

install.packages(c("janeaustenr","tidytext"))

in sample code.

Putting the example in reproducible-example form, called a reprex, cuts down on cut-and-paste errors and makes it possible to follow the code without having to run it.

My suggestion is that the approach being taken is too granular and doesn't need to be random, merely representative of values on which sentiment analysis of your real data can be run.

Consider this example, starting from help(unnest_tokens)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)
d <- tibble(txt = prideprejudice)
placeholder <- d %>% unnest_tokens(sentence, txt, token = "paragraphs")
placeholder <- placeholder[1:2000,]
placeholder <- rbind(placeholder,placeholder,placeholder,placeholder, placeholder)
placeholder
#> # A tibble: 10,000 x 1
#>    sentence                                                                     
#>    <chr>                                                                        
#>  1 pride and prejudice                                                          
#>  2 by jane austen                                                               
#>  3 chapter 1                                                                    
#>  4 " it is a truth universally acknowledged, that a single man in possession of…
#>  5 however little known the feelings or views of such a man may be on his first…
#>  6 "\"my dear mr. bennet,\" said his lady to him one day, \"have you heard that…
#>  7 mr. bennet replied that he had not.                                          
#>  8 "\"but it is,\" returned she; \"for mrs. long has just been here, and she to…
#>  9 mr. bennet made no answer.                                                   
#> 10 "\"do you not want to know who has taken it?\" cried his wife impatiently."  
#> # … with 9,990 more rows

Created on 2019-11-21 by the reprex package (v0.3.0)

It gets you a 10,000 row tibble filled with paragraphs (each repeated 5 times).
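
If the target row count is not a neat multiple of the source, the same recycling can be sketched with base R's `rep_len()` instead of stacking copies with `rbind()`. This is an alternative under stated assumptions, not the code above: the four-element `paragraphs` vector is a stand-in for the paragraph column, and `n_rows` is a hypothetical name.

```r
# Stand-in for the paragraph text (assumption): in the thread this comes
# from unnest_tokens(..., token = "paragraphs") on prideprejudice.
paragraphs <- c("pride and prejudice",
                "by jane austen",
                "chapter 1",
                "mr. bennet made no answer.")

n_rows <- 10000  # hypothetical target size

# rep_len() cycles through the vector in order until n_rows values exist.
placeholder <- data.frame(sentence = rep_len(paragraphs, n_rows),
                          stringsAsFactors = FALSE)

head(placeholder)
```

As with the `rbind()` version, the text is repeated in its original order rather than shuffled.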

DarthPathos · November 21, 2019, 7:21pm

Thanks for the tip re: install.packages; in other forums (non-R), it's recommended to include it so everyone's on the same page. I'll make sure that doesn't happen again.

As a (relatively) new R user, I'll have to take a closer look at your code - I think I see what you're doing but need a large cup of tea and some heavy metal haha. For what it's worth, @valeri 's code works for me and as I'm somewhat familiar with purrr, I can sort of see his logic.

Thank you both. This has saved me a lot of time and headache. I hope to repay your kindness!
Chris

DarthPathos · November 21, 2019, 7:25pm

thanks for this......have tested and works, shall add it into the larger code and see how it all fits.

And as I said to my mom, why the heck would anyone go back for another kid?? (I have a younger brother) - she said repressed memories

technocrat · November 21, 2019, 7:51pm

The rationale is that users may not want to install libraries, for a variety of reasons. Including the line commented out is OK, since running it then takes a deliberate choice, and so is bringing a package in with require(). The beauty of a reprex is being able to see exactly what happens without the need to install anything.
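
One way to follow that convention, sketched under assumptions: check which packages are available and leave the install call commented out. `pkgs` and `missing_pkgs` are hypothetical names; `requireNamespace()` only checks for (and loads) a package's namespace and never installs anything.

```r
# Packages the example depends on.
pkgs <- c("janeaustenr", "tidytext")

# TRUE/FALSE per package; nothing is installed by this check.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Left commented out so that installing stays a deliberate choice:
  # install.packages(missing_pkgs)
}
```

A reader who wants the packages can uncomment the last line; everyone else can run the snippet without side effects on their library.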

@valeri's purrr approach is, indeed, better from a conciseness standpoint and can easily be modified to bring in paragraphs rather than words. I tend to start out more ponderously to see each step in bite-sized pieces.
