Create a column with value from cell in a different row, based on another column

rorytodd98 · March 1, 2023, 4:28pm

Hello! My data looks like below (it is partially cleaned output from using udpipe). There is a token for each row, drawn from the sentence held in the column 'sentence', with its position in the sentence held in the column 'token_id'. The position of the sentence in each document is held in 'sentence_id', and the number of the document in the overall corpus is held in 'doc_id'. The 'head_token_id' hold the 'token_id' for the word which the 'token' depends on. For example, in 'The red dog barked', the row with 'red' as the token would have a token_id of 2, but a head_token_id of 2, since red is the adjective for 'dog' in that sentence. 'dep_rel' gives the exact relationship (so the row with 'red' would be 'adj' for adjective).

I would like to create a new column, called 'head_token', which contains the value in the column 'token' - but from the row described by 'head_token_id' (as well as doc_id and sentence_id). So in the 'the red dog barked' example, the row with the token 'red' should have a 'head_token' value of 'head'. As an added complication, head_token_id = 0 means that the token is the root of the sentence, so 'head_token' should be NA. Hope that makes sense and appreciate any ideas!

Example data:

data <- structure(list(doc_id = c("doc1", "doc1", "doc1", "doc1", "doc1", 
                          "doc1", "doc1", "doc1", "doc1", "doc1"), sentence_id = c(1L, 
                                                                                   1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sentence = c("THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity", 
                                                                                                                                     "THE LEARNERS' LINK $ NOT FOR SALE www.moe.gov.bz YOUR WINDOW TO CONTINUED LEARNING Volume I/Issue 2: May 8, 2020 Learning Activity"
                                                                                   ), token = c("THE", "LEARNERS", "'", "LINK", "$", "NOT", "FOR", 
                                                                                                "SALE", "www.moe.gov.bz", "YOUR"), token_id = c("1", "2", "3", 
                                                                                                                                                "4", "5", "6", "7", "8", "9", "10"), head_token_id = c("2", "0", 
                                                                                                                                                                                                       "5", "5", "2", "5", "8", "2", "11", "11"), dep_rel = c("det", 
                                                                                                                                                                                                                                                              "root", "case", "advmod", "nmod", "advmod", "case", "nmod", "mark", 
                                                                                                                                                                                                                                                              "nsubj")), row.names = c(NA, 10L), class = "data.frame")

scottyd22 · March 1, 2023, 9:54pm

Does this get to the desired outcome?

library(tidyverse)

out = data %>%
  group_by(doc_id, sentence_id) %>%
  mutate(mylist = list(token)) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(head_token = case_when(
    head_token_id == 0 ~ NA_character_,
    TRUE ~ mylist[as.numeric(head_token_id)][1])
  ) %>%
  ungroup() %>%
  select(-mylist)

out
#> # A tibble: 10 × 8
#>    doc_id sentence_id sentence             token token…¹ head_…² dep_rel head_…³
#>    <chr>        <int> <chr>                <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 doc1             1 THE LEARNERS' LINK … THE   1       2       det     LEARNE…
#>  2 doc1             1 THE LEARNERS' LINK … LEAR… 2       0       root    <NA>   
#>  3 doc1             1 THE LEARNERS' LINK … '     3       5       case    $      
#>  4 doc1             1 THE LEARNERS' LINK … LINK  4       5       advmod  $      
#>  5 doc1             1 THE LEARNERS' LINK … $     5       2       nmod    LEARNE…
#>  6 doc1             1 THE LEARNERS' LINK … NOT   6       5       advmod  $      
#>  7 doc1             1 THE LEARNERS' LINK … FOR   7       8       case    SALE   
#>  8 doc1             1 THE LEARNERS' LINK … SALE  8       2       nmod    LEARNE…
#>  9 doc1             1 THE LEARNERS' LINK … www.… 9       11      mark    <NA>   
#> 10 doc1             1 THE LEARNERS' LINK … YOUR  10      11      nsubj   <NA>

Created on 2023-03-01 with reprex v2.0.2.9000

system · March 8, 2023, 9:55pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.