Here are some pieces, because the question is missing a reprex.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
library(readr)
# rename source file with tsv extension
# and use readr::read_tsv()
dat <- read_tsv("/Users/ro/projects/demo/grist.tsv")
#> Rows: 2 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): col1, col2, col3
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# the file above doesn't exist on anyone else's
# system, so recreate the data inline
dat <- data.frame(
  stringsAsFactors = FALSE,
  col1 = c("document_1", "document_1", "document_2"),
  col2 = c("London Book Fair", "London", "London"),
  col3 = c("EVENT", "LOCATION", "foo")
)
dat
#>         col1             col2     col3
#> 1 document_1 London Book Fair    EVENT
#> 2 document_1           London LOCATION
#> 3 document_2           London      foo
# count occurrences of docs; those with more
# than one are the rows to check for overlapping
# col2 values
docs <- dat %>% group_by(col1) %>% count()
to_check <- docs[which(docs$n > 1), ]$col1
to_check
#> [1] "document_1"
# single-document case; will need to be
# adapted to apply to all cases
# where more than a single document has
# multiple occurrences, which is the
# example provided in the question
# (%in% keeps the filter valid if to_check
# ever holds more than one document)
to_check <- dat %>% filter(col1 %in% to_check)
to_check
#>         col1             col2     col3
#> 1 document_1 London Book Fair    EVENT
#> 2 document_1           London LOCATION
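# consider tokenizing col2 and normalizing case
# to avoid problems with inconsistent capitalization;
# this check is similarly limited to the data presented
# str_detect() returns TRUE when the second value
# occurs, as a literal string, inside the first
str_detect(to_check$col2[1], fixed(to_check$col2[2]))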
#> [1] TRUE
# match found, but no rule provided in
# question to decide which to keep
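The single-document check above could be extended to every document along the following lines. This is only a sketch under assumptions: `contained_in_other()` is a hypothetical helper (not part of dplyr or stringr), "overlap" is taken to mean that one `col2` value appears as a literal, case-insensitive substring of another `col2` value in the same document, and the data frame is the stand-in built above rather than the real file.

```r
library(dplyr)
library(stringr)

# stand-in data, same as in the reprex above
dat <- data.frame(
  stringsAsFactors = FALSE,
  col1 = c("document_1", "document_1", "document_2"),
  col2 = c("London Book Fair", "London", "London"),
  col3 = c("EVENT", "LOCATION", "foo")
)

# hypothetical helper: for each element of x, is it contained
# (ignoring case) in at least one OTHER element of x?
contained_in_other <- function(x) {
  x_lower <- str_to_lower(x)
  vapply(
    seq_along(x_lower),
    function(i) any(str_detect(x_lower[-i], fixed(x_lower[i]))),
    logical(1)
  )
}

# apply the check within each document (col1)
flagged <- dat %>%
  group_by(col1) %>%
  mutate(overlaps = contained_in_other(col2)) %>%
  ungroup()

flagged
```

With the stand-in data, only the `London` row of `document_1` should be flagged, since it is contained in `London Book Fair`. Note that exact duplicates within a document would flag each other, and, as above, the sketch does not decide which of the overlapping rows to keep.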
The key to solving R challenges like this is to shift focus from how to do something to what needs to be done to get from what is at hand to what is desired.