Problem: writing code to remove partial duplicate rows from csv file

technocrat · December 12, 2022, 10:17pm

Here are some pieces, because the question is missing a reprex.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(readr)

# rename source file with tsv extension
# and use readr::read_tsv()
dat <- read_tsv("/Users/ro/projects/demo/grist.tsv")
#> Rows: 2 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): col1, col2, col3
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# recreate the data inline, because the file
# doesn't exist on anyone else's system

dat <- data.frame(
  stringsAsFactors = FALSE,
  col1 = c("document_1", "document_1","document_2"),
  col2 = c("London Book Fair", "London","London"),
  col3 = c("EVENT", "LOCATION","foo")
)

dat
#>         col1             col2     col3
#> 1 document_1 London Book Fair    EVENT
#> 2 document_1           London LOCATION
#> 3 document_2           London      foo

# count occurrences of docs; those with more
# than one are the rows to check for overlapping
# col2 values

docs <- dat %>% group_by(col1) %>% count()
to_check <- docs[which(docs$n > 1), ]$col1
to_check
#> [1] "document_1"
# single-document case; this will need to be
# adapted where more than one document has
# multiple occurrences, as in the example
# provided in the question
# (%in% rather than == so the filter still works
# if to_check holds several document names)

to_check <- dat %>% filter(col1 %in% to_check)
to_check
#>         col1             col2     col3
#> 1 document_1 London Book Fair    EVENT
#> 2 document_1           London LOCATION

# consider tokenizing col2 to avoid
# problems with inconsistent capitalization

# similarly limited to data presented

# str_detect() returns TRUE/FALSE; fixed() treats
# the second value as a literal string, not a regex
str_detect(to_check$col2[1], fixed(to_check$col2[2]))
#> [1] TRUE
# match found, but no rule provided in 
# question to decide which to keep
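
One way the pieces above might be generalized to every document at once is sketched below. It is not a rule taken from the question: the helper `is_contained()` is a hypothetical name, and the decision to keep the longer `col2` value and drop the row it contains is purely an assumption made for illustration. Matching is done on lower-cased text to sidestep inconsistent capitalization.

```r
library(dplyr)
library(stringr)

# same toy data as above
dat <- data.frame(
  stringsAsFactors = FALSE,
  col1 = c("document_1", "document_1", "document_2"),
  col2 = c("London Book Fair", "London", "London"),
  col3 = c("EVENT", "LOCATION", "foo")
)

# hypothetical helper: for each element of x, TRUE if it occurs
# (ignoring case) inside some OTHER element of x
is_contained <- function(x) {
  lower <- str_to_lower(x)
  vapply(seq_along(lower), function(i) {
    any(str_detect(lower[-i], fixed(lower[i])))
  }, logical(1))
}

# ASSUMED rule: within each document, drop a row when its col2
# is contained in another row's col2 (i.e. keep the longer value)
deduped <- dat %>%
  group_by(col1) %>%
  filter(!is_contained(col2)) %>%
  ungroup()

deduped
```

On the example data this keeps the `London Book Fair` row for `document_1` and leaves `document_2` untouched. Two rows of the same document with identical `col2` values would each count as contained in the other and both be dropped, so exact duplicates would need handling first, for example with `distinct()`.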

The key to solving R challenges like this is to shift focus from how to do something to what needs to be done: which steps move you from the data at hand to the result you want.