Hello
I have imported an Excel spreadsheet into R with 2000 rows and 22 columns. I would like to find and display the duplicate records from this table.
So far I have only found the 1500 unique rows from the table, with
unique_records <- unique_records %>% distinct(last_name, first_name, .keep_all = TRUE)
But I would also like to see the remaining 500 records in a different tab or table. If they do not appear in the tab with the unique records, they are duplicates, and I want to see them as well.
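For context, the import and the distinct step look roughly like this (the file name is a placeholder, and I am assuming the readxl package for the import):

library(readxl)
library(dplyr)

# read the Excel table: 2000 rows, 22 columns
my_data <- read_excel("records.xlsx")

# keep one row per (last_name, first_name) pair, retaining all 22 columns
unique_records <- my_data %>% distinct(last_name, first_name, .keep_all = TRUE)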
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# toy data: the pairs (B, Y) and (C, Z) each appear twice
DF <- data.frame(FirstName = c("A", "B", "C", "A", "B", "C"),
                 LastName = c("X", "Y", "Z", "D", "Y", "Z"),
                 Value1 = rnorm(6), Value2 = rnorm(6))
# one row per (FirstName, LastName) pair, first occurrence kept
distinct(DF, FirstName, LastName, .keep_all = TRUE)
#> FirstName LastName Value1 Value2
#> 1 A X -0.19443993 -1.495899
#> 2 B Y 0.76503945 -0.422182
#> 3 C Z -0.07138707 -2.119441
#> 4 A D 0.13985389 1.048779
# count how many times each (FirstName, LastName) pair occurs
NameCount <- DF |> group_by(FirstName, LastName) |> summarize(COUNT = n())
#> `summarise()` has grouped output by 'FirstName'. You can override using the `.groups` argument.
NameCount
#> # A tibble: 4 x 3
#> # Groups: FirstName [3]
#> FirstName LastName COUNT
#> <chr> <chr> <int>
#> 1 A D 1
#> 2 A X 1
#> 3 B Y 2
#> 4 C Z 2
# attach the counts to the original rows, then keep only pairs that occur more than once
DUPS <- inner_join(DF, NameCount, by = c("FirstName", "LastName")) |>
  filter(COUNT > 1)
DUPS
#> FirstName LastName Value1 Value2 COUNT
#> 1 B Y 0.76503945 -0.4221820 2
#> 2 C Z -0.07138707 -2.1194407 2
#> 3 B Y 0.40575094 0.5006497 2
#> 4 C Z 0.53240029 -0.5305830 2
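For what it's worth, the separate count table and the join can be collapsed into a single pipeline with dplyr's add_count(), which appends the group size as a column; a minimal equivalent sketch:

# add_count() attaches the per-pair row count directly,
# so filtering for COUNT > 1 keeps every copy of each repeated pair
DUPS2 <- DF |> add_count(FirstName, LastName, name = "COUNT") |> filter(COUNT > 1)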
Thank you for this answer.
But I think this is unfortunately not the solution. The first issue is that I have 22 columns with 2000 records in them (like an Excel table); with this proposed solution, I would apparently have to type in the data column by column as vectors. Also, I don't understand why I would need the rnorm() function, since I am only dealing with text records. The solution can't be this complicated. I've already got the 1500 unique records; how do I find and display the rest?
How can I search my table (22 columns and 2000 rows) for duplicates?
I just found out that with the following command I can get the remaining 500 records, which are my duplicates.
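Roughly this, assuming the same key columns as in my distinct() call above:

# TRUE for every row whose name pair has already appeared once before;
# which() turns the logical vector into row indexes
duplicate_indexes <- which(duplicated(my_data_file[, c("last_name", "first_name")]))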
The problem is that I only see the row indexes here, and I want to see all the columns of data. I did random checks with the indexes, and they are indeed the duplicates from my data table.
Thank you. Does anyone have another solution in mind?
I have never used these functions before, but it looks like duplicated() works the same way as distinct(): the first occurrence of a row is kept as the distinct row, and all subsequent rows with the same key fields are flagged as the duplicates.
So my_data_file[duplicate_indexes, ] gives the whole duplicated rowset with all of its columns.
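If you want to see every copy of a duplicated record (the first occurrences together with the repeats), a common base R idiom is to scan with duplicated() in both directions via its fromLast argument; a minimal sketch, assuming the same key columns as before:

# flag a row if its name pair occurs anywhere else in the table,
# scanning forwards and backwards so first occurrences are included too
keys <- my_data_file[, c("last_name", "first_name")]
all_dupes <- my_data_file[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]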