I'm sure this is a ridiculously easy question but I've been stuck on it for a week and keep getting lost in the code. Basically, I'm creating a generic simulated data set (for the purposes of testing clinical trial protocols) and I have 95% of the data set done. The last component I need is a "Comments" field, using random words (so we can test out sentiment analysis etc.).
I am able to create a string of random words using the janeaustenr package
OK, maybe I don't get it, but why can't you just do
sdata <- sample(words, size = 10000, replace = TRUE)
and then assign sdata to a column in your, as you say, "existing data frame"? Or do you need more than one random word per row? If so, should it be an equal number of words per row, or a random number of words within some range?
@valeri - Thanks for your time - I do want more than 1 random word. The sdata that my code creates is, correctly, a random string between 1 and 15 words long (I want to mimic a comments field in a database). So for example, I'd like
[1] long chapter austen the sensibility by
[2] family sensibility long of jane 1811
.
.
.
[1000] of dashwood austen jane sensibility and family sense had sense and 1
.
.
being merged with my dataset. Hoping this makes sense - running on very little sleep (2 1/2 month old son was up at 3am...).
Chris
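For anyone following along, here is a minimal base R sketch of the kind of result described above: one comment of 1 to 15 random words per row, attached to an existing data frame. The `words` vector and the `trial_data` data frame are hypothetical stand-ins, not the objects from the original code, and drawing words with replacement is an assumption.

```r
set.seed(42)  # only so the sketch is repeatable

# Hypothetical stand-ins: a small vocabulary and an existing data set.
words <- c("sense", "and", "sensibility", "by", "jane", "austen",
           "chapter", "family", "dashwood", "long", "of", "the")
trial_data <- data.frame(subject_id = 1:1000)

# One random comment length (1 to 15 words) per row.
n_words <- sample(1:15, size = nrow(trial_data), replace = TRUE)

# Draw that many words (with replacement) and collapse them into one string.
comments <- vapply(
  n_words,
  function(k) paste(sample(words, size = k, replace = TRUE), collapse = " "),
  character(1)
)

# Attach as a new column; row order is preserved, so no merge key is needed.
trial_data$Comments <- comments
head(trial_data, 3)
```

Because the comments are generated in row order, a plain column assignment is enough here; a join would only be needed if the comments carried their own key.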
Putting the example in reproducible example form, called a reprex, cuts down on cut-and-paste errors and makes it possible to follow the code without having to run it.
My suggestion is that the approach being taken is too granular and doesn't need to be random, merely representative of values on which sentiment analysis of your real data can be run.
Consider this example, starting from help(unnest_tokens)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)
d <- tibble(txt = prideprejudice)
placeholder <- d %>% unnest_tokens(sentence, txt, token = "paragraphs")
placeholder <- placeholder[1:2000,]
placeholder <- rbind(placeholder,placeholder,placeholder,placeholder, placeholder)
placeholder
#> # A tibble: 10,000 x 1
#> sentence
#> <chr>
#> 1 pride and prejudice
#> 2 by jane austen
#> 3 chapter 1
#> 4 " it is a truth universally acknowledged, that a single man in possession of…
#> 5 however little known the feelings or views of such a man may be on his first…
#> 6 "\"my dear mr. bennet,\" said his lady to him one day, \"have you heard that…
#> 7 mr. bennet replied that he had not.
#> 8 "\"but it is,\" returned she; \"for mrs. long has just been here, and she to…
#> 9 mr. bennet made no answer.
#> 10 "\"do you not want to know who has taken it?\" cried his wife impatiently."
#> # … with 9,990 more rows
Thanks for the tip re: install.packages; in other (non-R) forums, it's recommended to include it so everyone's on the same page. Shall make sure that doesn't happen again.
As a (relatively) new R user, I'll have to take a closer look at your code - I think I see what you're doing but need a large cup of tea and some heavy metal haha. For what it's worth, @valeri's code works for me, and as I'm somewhat familiar with purrr, I can sort of see his logic.
Thank you both. This has saved me a lot of time and headache. I hope to repay your kindness!
Chris
The rationale is that users may not want to install packages, for a variety of reasons. Including the line commented out is OK, since running it then requires a deliberate choice; the same goes for bringing a package in with require(). The beauty of a reprex is being able to see exactly what happens without the need to install anything.
@valeri's purrr approach is, indeed, better from a conciseness standpoint and can easily be modified to bring in paragraphs, rather than words. I tend to start out more ponderously to see each step in bite-sized pieces.
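Since the purrr code itself is not reproduced in this thread, the following is only a guess at the general shape of such an approach, adapted to draw paragraphs instead of single words. The `pool` vector and the `random_comments()` helper are hypothetical names introduced for illustration; in practice the pool could be the `sentence` column of the `placeholder` tibble built earlier.

```r
library(purrr)

set.seed(1)  # only so the sketch is repeatable

# Hypothetical pool of text; this could instead be placeholder$sentence.
pool <- c("mr. bennet made no answer.",
          "mr. bennet replied that he had not.",
          "by jane austen",
          "chapter 1")

# Hypothetical helper: build n comments, each made of 1..max_pieces
# items drawn (with replacement) from the pool and pasted together.
random_comments <- function(pool, n, max_pieces = 3) {
  sizes <- sample(seq_len(max_pieces), size = n, replace = TRUE)
  map_chr(sizes, ~ paste(sample(pool, size = .x, replace = TRUE), collapse = " "))
}

comments <- random_comments(pool, n = 10)
length(comments)
```

Swapping the pool between a vector of words and a vector of paragraphs is the only change needed to switch granularity, which is probably what makes this style concise.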