Hello,
First, I should stress that I do not have much experience with R or with programming in general.
The situation: I have a CSV file of annotated data to be used for aspect-based sentiment analysis (ABSA) and Automatic Aspect Term Extraction. The first column ("col1") contains the name of the annotated file, the second column ("col2") contains the tagged aspect, and the third column ("col3") contains the category tag given to the aspect. The columns are separated by tabs (because the aspect sometimes contains a comma).
Rather frequently, the same target words are tagged with different aspects or there is at least some overlap.
Example:
sentence = "It's time for the annual London Book Fair"
col1|col2|col3
document_1|London Book Fair|EVENT
document_1|London|LOCATION
To train the system for Automatic Aspect Term Extraction on the annotated data, there can be no overlap.
Goal: I want to remove the rows whose value in the first column is an exact duplicate of another row's and whose value in the second column is, at the same time, a partial duplicate of that row's (the third column can be ignored).
Failing code: This is the script I wrote using the dplyr package to try to accomplish this, but unfortunately it does not work. I only get a nearly empty CSV file consisting of a single line, namely "","col1","col2".
This is my code:
#load package
library(dplyr)
#Next read in the csv file and store it as a dataframe.
#Columns are separated by tabs and sometimes contain quotes.
df_aspectcategories <- read.csv("C:/Users/…annotations_aspectcategory_tab.csv", sep ="\t", quote = "")
# First I want to filter the data.
#I want to select only those rows where the value in column 1 is an exact duplicate and where the value of column 2 is simultaneously a partial duplicate
filtered_data <- filter(df_aspectcategories, col1 == duplicated(col1) & col2 %in% duplicated(col2))
#Now I try to use the distinct() function to remove the duplicate rows.
#This function takes the dataframe, followed by the unquoted names of the columns to consider when identifying duplicates.
#I only want them to consider the "col1" and the "col2".
dedup_aspectcategories <- distinct(filtered_data, col1, col2)
#finally, I want to export this dataframe containing the deduplicated data and store it as a new csv.
write.csv(dedup_aspectcategories, "C:/Users/…Dedup_aspectcategories.csv")
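For reference, a minimal base-R sketch of the behaviour described in the goal, run on the two example rows. It is only one possible reading of the problem: it assumes that "partial duplicate" means an aspect that occurs as literal text inside a different aspect of the same document, and that the shorter, contained aspect is the row to drop. The helper name drop_contained_aspects is hypothetical and not part of dplyr or base R. The first lines also show why the filter() call above matches nothing: duplicated() returns TRUE/FALSE flags rather than the duplicated values.

```r
# Toy data taken from the example in the question.
df <- data.frame(
  col1 = c("document_1", "document_1"),
  col2 = c("London Book Fair", "London"),
  col3 = c("EVENT", "LOCATION"),
  stringsAsFactors = FALSE
)

# duplicated() returns logical flags, not the duplicated values, so
# comparing a text column against it matches no rows.
flags <- duplicated(df$col1)        # FALSE TRUE
n_matched <- sum(df$col1 == flags)  # 0

# Hypothetical helper (an assumption, not an existing function): within each
# document, drop an aspect that occurs inside a different aspect.
drop_contained_aspects <- function(data) {
  keep <- rep(TRUE, nrow(data))
  for (i in seq_len(nrow(data))) {
    same_doc <- data$col1 == data$col1[i]
    # Other, non-identical aspects tagged in the same document.
    others <- data$col2[same_doc & data$col2 != data$col2[i]]
    # fixed = TRUE treats the aspect as literal text, not as a regex.
    contained <- grepl(data$col2[i], others, fixed = TRUE)
    if (any(contained)) keep[i] <- FALSE
  }
  data[keep, , drop = FALSE]
}

dedup <- drop_contained_aspects(df)
print(dedup)
```

On the example, this keeps the "London Book Fair" row and drops the "London" row. Note that plain substring matching would also treat "Book" as contained in "Notebook", and that rows with identical col1 and col2 are left untouched here; those could be handled separately, for instance with distinct().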
Any advice (preferably in very simple layman's terms) on how to solve this problem, or concrete help, would be very much appreciated! Thank you in advance!