Hi all!
I am pulling text from a podcast transcript. Most lines are prefixed with the speaker's name, but some lines have no prefix and are continuations of the previous speaker's turn.
For example:
JON SMITH: How are you all doing today?
The weather is pretty cold I think
JANE DOE you are right about that Jon
library(tidyverse)

# Example transcript lines
df <- tibble(
  quotes = c("JON SMITH: How are you all doing today?",
             "The weather is pretty cold I think",
             "JANE DOE you are right about that Jon"),
  line = seq_along(quotes)
)
# Create speaker column from the name at the start of each line
df <- df %>%
  mutate(speaker = case_when(
    str_detect(quotes, "^JO") ~ "Jon",
    str_detect(quotes, "^JA") ~ "Jane"
  ))
The resulting table would look like this:
line | quotes | speaker
1 | JON SMITH: How are you all doing today? | Jon
2 | The weather is pretty cold I think | NA
3 | JANE DOE you are right about that Jon | Jane
I am able to create a new column, speaker_na, by adding the following to the mutate() call:
speaker_na = ifelse(is.na(speaker), lag(speaker), NA)
Which results in:
line | quotes | speaker | speaker_na
1 | JON SMITH: How are you all doing today? | Jon | NA
2 | The weather is pretty cold I think | NA | Jon
3 | JANE DOE you are right about that Jon | Jane | NA
I can't figure out (a) how to collapse these two columns into one, or (b) how to handle cases where a speaker says three or four lines in a row, since lag() only reaches back one line:
line | quotes | speaker | speaker_na
4 | JON SMITH: But you already knew that | Jon | NA
5 | the typical turn around waiting could be | NA | Jon
6 | anywhere from 2 to 6 hours | NA | NA
7 | and that is being generous! | NA | NA
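One possible way to cover both (a) and (b), offered as a minimal sketch rather than a confirmed solution: skip the speaker_na column and carry the last known speaker downward with fill() from tidyr. The sketch assumes every unprefixed line belongs to the most recent named speaker, and it reuses the example lines from this post (renumbered from 1; the object name filled is just for illustration).

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Example lines taken from the post; line numbers restart at 1 here
df <- tibble(
  quotes = c("JON SMITH: But you already knew that",
             "the typical turn around waiting could be",
             "anywhere from 2 to 6 hours",
             "and that is being generous!",
             "JANE DOE you are right about that Jon"),
  line = seq_along(quotes)
)

filled <- df %>%
  # Lines without a recognised prefix get NA from case_when()
  mutate(speaker = case_when(
    str_detect(quotes, "^JO") ~ "Jon",
    str_detect(quotes, "^JA") ~ "Jane"
  )) %>%
  # fill() copies the last non-missing value downward (its default direction),
  # so any number of continuation lines inherit the same speaker
  fill(speaker)

print(filled)
```

If the two-column route is kept instead, coalesce(speaker, speaker_na) from dplyr would merge the columns, but that still only covers a single continuation line.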
Thank you for any help! I tried to provide enough information, but if anything else is needed I will happily supply it.