Hello
I have imported an Excel spreadsheet into R with 2000 rows and 22 columns. I would like to find and display the duplicate records from this table.
So far I have only found the 1500 unique rows from the table, with
unique_records <- unique_records %>% distinct(last_name, first_name, .keep_all = TRUE)
But I would also like to see the remaining 500 records in a different tab or table. If they do not appear in the tab with the unique records, they are duplicates, and I want to see them as well.
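For context, the import and the distinct step look roughly like this (the file name is a placeholder, and I am assuming the readxl package for the import):

library(readxl)
library(dplyr)

# read the Excel table: 2000 rows, 22 columns
my_data <- read_excel("records.xlsx")

# keep one row per (last_name, first_name) pair, retaining all 22 columns
unique_records <- my_data %>% distinct(last_name, first_name, .keep_all = TRUE)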
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# toy data: the pairs (B, Y) and (C, Z) each appear twice
DF <- data.frame(FirstName = c("A", "B", "C", "A", "B", "C"),
                 LastName = c("X", "Y", "Z", "D", "Y", "Z"),
                 Value1 = rnorm(6), Value2 = rnorm(6))
# one row per (FirstName, LastName) pair, first occurrence kept
distinct(DF, FirstName, LastName, .keep_all = TRUE)
#> FirstName LastName Value1 Value2
#> 1 A X -0.19443993 -1.495899
#> 2 B Y 0.76503945 -0.422182
#> 3 C Z -0.07138707 -2.119441
#> 4 A D 0.13985389 1.048779
# count how many times each (FirstName, LastName) pair occurs
NameCount <- DF |> group_by(FirstName, LastName) |> summarize(COUNT = n())
#> `summarise()` has grouped output by 'FirstName'. You can override using the `.groups` argument.
NameCount
#> # A tibble: 4 x 3
#> # Groups: FirstName [3]
#> FirstName LastName COUNT
#> <chr> <chr> <int>
#> 1 A D 1
#> 2 A X 1
#> 3 B Y 2
#> 4 C Z 2
# attach the counts to the original rows, then keep only pairs that occur more than once
DUPS <- inner_join(DF, NameCount, by = c("FirstName", "LastName")) |>
  filter(COUNT > 1)
DUPS
#> FirstName LastName Value1 Value2 COUNT
#> 1 B Y 0.76503945 -0.4221820 2
#> 2 C Z -0.07138707 -2.1194407 2
#> 3 B Y 0.40575094 0.5006497 2
#> 4 C Z 0.53240029 -0.5305830 2
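For what it's worth, the separate count table and the join can be collapsed into a single pipeline with dplyr's add_count(), which appends the group size as a column; a minimal equivalent sketch:

# add_count() attaches the per-pair row count directly,
# so filtering for COUNT > 1 keeps every copy of each repeated pair
DUPS2 <- DF |> add_count(FirstName, LastName, name = "COUNT") |> filter(COUNT > 1)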
Thank you for this answer.
But I think this is unfortunately not the solution. The first issue is that I have 22 columns with 2000 records in them (like an Excel table); with this proposed solution, I would apparently have to type in the data column by column as vectors. Also, I don't understand why I would need the rnorm() function, since I am only dealing with text records. The solution can't be this complicated. I've already got the 1500 unique records; how do I find and display the rest?
How can I search my table (22 columns and 2000 rows) for duplicates?
I just found out that with the following command I can get the remaining 500 records, which are my duplicates.
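Roughly this, assuming the same key columns as in my distinct() call above:

# TRUE for every row whose name pair has already appeared once before;
# which() turns the logical vector into row indexes
duplicate_indexes <- which(duplicated(my_data_file[, c("last_name", "first_name")]))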
The problem is that I only see the row indexes here, and I want to see all the columns of data. I did random checks with the indexes, and they are indeed the duplicates from my data table.
Thank you. Does anyone have another solution in mind?
I have never used these functions before, but it looks like duplicated() works the same way as distinct(): the first occurrence of a row is kept as the distinct row, and all subsequent rows with the same key fields are flagged as the duplicates.
So my_data_file[duplicate_indexes, ] gives the whole duplicated rowset with all of its columns.
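If you want to see every copy of a duplicated record (the first occurrences together with the repeats), a common base R idiom is to scan with duplicated() in both directions via its fromLast argument; a minimal sketch, assuming the same key columns as before:

# flag a row if its name pair occurs anywhere else in the table,
# scanning forwards and backwards so first occurrences are included too
keys <- my_data_file[, c("last_name", "first_name")]
all_dupes <- my_data_file[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]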