Filter data based on lowest value


Hi!

I have a very large dataset with patients that performed different number of examinations during a specific timeframe, so that each patient (Patient_ID) can have several "Study_ID", but I want to include only the one with the accompanying lowest "exam_count", or earliest date for this examination. Any good ideas on how I can achieve that? A fake version of some of the variables would look like this

Patient_ID study_ID Exam_date exam_count age other

1 1 100 2016-06-20 00:00:00 1 77 10
2 1 200 2016-06-21 00:00:00 2 77 22
3 1 300 2016-06-22 00:00:00 3 77 34
4 1 400 2016-06-23 00:00:00 4 77 46
5 2 500 2016-06-24 00:00:00 2 55 58
6 2 600 2016-06-25 00:00:00 3 55 70

So, I want to include only one study for each Patient_ID, and that should be the one with either the earliest "Exam_date" or the lowest "exam_count" (that would be same thing). Btw the lowest exam_count is not necessarily "1" due to prior exclusions made. I upload the fake data if needed.

Thank you in advance
Thomas

First of all, don't post screenshots of your data. Provide reproducible examples. It makes it easier for people to answer your question:

Here is an example of what you are after using the iris dataset. You can adapt it to your problem.

library(dplyr)
iris %>% 
  group_by(Species) %>% 
  arrange(Petal.Length) %>% 
  slice(1)

For you, group by the PatientID, then arrange by the date or exam count.

1 Like

Hi,

Welcome to the RStudio community!

I agree with @williaml on the reprex, but since it's your first post and I was working on it while he posted as well, I shall share my reply too with your data in the form of a reprex

library(dplyr)

myData = data.frame(
  stringsAsFactors = FALSE,
               pID = c(1L, 1L, 1L, 1L, 2L, 2L),
               sID = c(100L, 200L, 300L, 400L, 500L, 600L),
                                   date = c("2016-06-20 00:00:00",
                                            "2016-06-21 00:00:00",
                                            "2016-06-22 00:00:00","2016-06-23 00:00:00",
                                            "2016-06-24 00:00:00",
                                            "2016-06-25 00:00:00"),
             count = c(1L, 2L, 3L, 4L, 2L, 3L),
               age = c(77L, 77L, 77L, 77L, 55L, 55L),
             other = c(10L, 22L, 34L, 46L, 58L, 70L)
                     )
#Make date a date format
myData = myData %>% 
  mutate(date = as.POSIXct(date))

#Filter
myData = myData %>% 
  group_by(pID) %>% 
  filter(date == min(date) | count == min(count))

myData
#> # A tibble: 2 x 6
#> # Groups:   pID [2]
#>     pID   sID date                count   age other
#>   <int> <int> <dttm>              <int> <int> <int>
#> 1     1   100 2016-06-20 00:00:00     1    77    10
#> 2     2   500 2016-06-24 00:00:00     2    55    58

Created on 2021-08-08 by the reprex package (v2.0.0)

You should read more about all of the Tidyverse functions, in this case especially dplyr.

Good luck
PJ

1 Like

Thank you both! Sorry about the Reprex thing.

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.