Help with removing duplicates rows but keep certain rows in a data.frame

re123 · April 23, 2022, 5:56pm

Hi.
I need help removing duplicate row except certain row. My large data is similiar to the following:

Animal<-c("Unknow","Dog","cat","Lion","unknow","horse","dog","unknow","cat")
A_date<-c("12-08-2020","20-06-2018","01-01-2015","10-07-2021","15-08-2019","05-08-2013","15-11-2016","22-03-2022","15-05-2019")
Mydata<-data.frame(Animal, A_date)

 Mydata
  Animal     A_date
1 Unknow 12-08-2020
2    Dog 20-06-2018
3    cat 01-01-2015
4   Lion 10-07-2021
5 unknow 15-08-2019
6  horse 05-08-2013
7    dog 15-11-2016
8 unknow 22-03-2022
9    cat 15-05-2019

If the column Animal has duplicate row then it should be remove based on the date. For example in my data dog appear twice and I want to keep the row with the oldest date.
While removing the duplicate row I want to keep all the row with name Unknow even if it appears multiple time.
This is what I have tried so fare

library(data.table)
#lldata$D_soum<-as.Date(alldata$D_soum)
test1<-setDT(Mydata)[order(Animal, -as.IDate(A_date, "%Y-%m-%d"))][!duplicated(Animal)]

And the result is not what I'm excepting.
How can I do this?
Thanks in advance

FJCC · April 23, 2022, 6:29pm

Here is one approach.

Animal<-c("Unknow","Dog","cat","Lion","unknow","horse","dog","unknow","cat")
A_date<-c("12-08-2020","20-06-2018","01-01-2015","10-07-2021","15-08-2019","05-08-2013","15-11-2016","22-03-2022","15-05-2019")
Mydata<-data.frame(Animal, A_date)
library(dplyr)

Mydata <- Mydata %>% mutate(Animal = toupper(Animal), 
                            A_date = lubridate::dmy(A_date))

Mydata %>% arrange(A_date) %>%
  group_by(Animal) %>% 
  mutate(ROW = row_number()) %>% 
  filter(ROW == 1 | Animal == "UNKNOW") %>% 
  select(-ROW)
#> # A tibble: 7 × 2
#> # Groups:   Animal [5]
#>   Animal A_date    
#>   <chr>  <date>    
#> 1 HORSE  2013-08-05
#> 2 CAT    2015-01-01
#> 3 DOG    2016-11-15
#> 4 UNKNOW 2019-08-15
#> 5 UNKNOW 2020-08-12
#> 6 LION   2021-07-10
#> 7 UNKNOW 2022-03-22

^{Created on 2022-04-23 by the reprex package (v0.2.1)}

jrmuirhead · April 23, 2022, 7:00pm

A variation built off of what @FJCC had posted which uses the base::duplicated function.

Animal<-c("Unknow","Dog","cat","Lion","unknow","horse","dog","unknow","cat")
A_date<-c("12-08-2020","20-06-2018","01-01-2015","10-07-2021","15-08-2019",
  "05-08-2013","15-11-2016","22-03-2022","15-05-2019")

Mydata<-data.frame(Animal, A_date)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Mydata <- Mydata %>%
  mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
  arrange(A_date)

Mydata %>%
  filter(Animal == "UNKNOW" | !duplicated(Animal, fromLast = TRUE))
#>   Animal     A_date
#> 1  HORSE 2013-08-05
#> 2    DOG 2018-06-20
#> 3    CAT 2019-05-15
#> 4 UNKNOW 2019-08-15
#> 5 UNKNOW 2020-08-12
#> 6   LION 2021-07-10
#> 7 UNKNOW 2022-03-22

^{Created on 2022-04-23 by the reprex package (v2.0.1)}

re123 · April 23, 2022, 8:43pm

Thanks. This is the result that I get:

Animal<-c("Unknow"," ","Dog","cat","Lion","unknow","horse","dog","unknow","cat", "")
A_date<-c("12-08-2020","20-06-2018","01-01-2015","10-07-2021","20-06-2018","15-08-2019","05-08-2013","20-06-2010","15-11-2016","22-03-2022","15-05-2019")
Mydata<-data.frame(Animal, A_date)


library(lubridate)
library(dplyr)
Mydata <- Mydata %>%
  mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
  arrange(A_date)

Mydata %>%
  filter(Animal == "UNKNOW" | !duplicated(Animal, fromLast = TRUE)) %>% 
  select(Animal, A_date)

 Animal A_date
1  UNKNOW   <NA>
2           <NA>
3   CAT F   <NA>
4    LION   <NA>
5  UNKNOW   <NA>
6   HORSE   <NA>
7     DOG   <NA>
8  UNKNOW   <NA>
9     CAT   <NA>
10          <NA>

The date are not showing.

Also is there another way to do this without changing animal to uppercase?

jrmuirhead · April 23, 2022, 9:27pm

Is "Unknow" different than "unknow"? R is case-sensitive and so these would be considered as 2 separate animal categories.

Changing all to uppercase was standardizing the animal names in case that "dog" is to be treated the same as "Dog".

I've also explicitly coded the Animals with names that were zero-length strings or just composed of spaces as NA.

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("tidyr")
library("lubridate")
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

Animal<-c("Unknow"," ","Dog","cat","Lion","unknow","horse","dog","unknow","cat", "")
A_date<-c("12-08-2020","20-06-2018","01-01-2015","10-07-2021","20-06-2018","15-08-2019","05-08-2013","20-06-2010","15-11-2016","22-03-2022","15-05-2019")
Mydata<-data.frame(Animal, A_date)


# Issue # 1. Sanitize empty strings and spaces of different lengths and
# code them specifically as NA.

Mydata <- Mydata %>%
  mutate(Animal = gsub("^\\s*$", NA_character_, Animal))

# Convert character to datetime to sort

Mydata <- Mydata %>%
  mutate(A_date = dmy(A_date)) %>%
  arrange(A_date)

Mydata %>%
  filter(Animal %in% c("Unknow", "unknow") | is.na(Animal) | 
      !duplicated(Animal, fromLast = TRUE))
#>    Animal     A_date
#> 1     dog 2010-06-20
#> 2   horse 2013-08-05
#> 3     Dog 2015-01-01
#> 4  unknow 2016-11-15
#> 5    <NA> 2018-06-20
#> 6    Lion 2018-06-20
#> 7    <NA> 2019-05-15
#> 8  unknow 2019-08-15
#> 9  Unknow 2020-08-12
#> 10    cat 2022-03-22

^{Created on 2022-04-23 by the reprex package (v2.0.1)}

You will have to run the code starting at the top with the raw data, as Mydata gets updated along the way.

system · April 30, 2022, 9:28pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.