Filter data based on lowest value

passahasse · August 8, 2021, 11:28am

Hi!

I have a very large dataset with patients that performed different number of examinations during a specific timeframe, so that each patient (Patient_ID) can have several "Study_ID", but I want to include only the one with the accompanying lowest "exam_count", or earliest date for this examination. Any good ideas on how I can achieve that? A fake version of some of the variables would look like this

Patient_ID study_ID Exam_date exam_count age other

1 1 100 2016-06-20 00:00:00 1 77 10
2 1 200 2016-06-21 00:00:00 2 77 22
3 1 300 2016-06-22 00:00:00 3 77 34
4 1 400 2016-06-23 00:00:00 4 77 46
5 2 500 2016-06-24 00:00:00 2 55 58
6 2 600 2016-06-25 00:00:00 3 55 70

So, I want to include only one study for each Patient_ID, and that should be the one with either the earliest "Exam_date" or the lowest "exam_count" (that would be same thing). Btw the lowest exam_count is not necessarily "1" due to prior exclusions made. I upload the fake data if needed.

Thank you in advance
Thomas

williaml · August 8, 2021, 11:47am

First of all, don't post screenshots of your data. Provide reproducible examples. It makes it easier for people to answer your question:

FAQ: What's a reproducible example (`reprex`) and how do I create one? meta

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution. What's in a Reproducible Example? Parts of a reproducible example: background information - Describe what you are trying to do. What have you already done? complete set up - include any library() calls and data to reproduce your issue. data for a reprex: Here's a discussion on setting up data for a reprex make it run - include the minimal code required to reproduce your error on the data…

Here is an example of what you are after using the iris dataset. You can adapt it to your problem.

library(dplyr)
iris %>% 
  group_by(Species) %>% 
  arrange(Petal.Length) %>% 
  slice(1)

For you, group by the PatientID, then arrange by the date or exam count.

pieterjanvc · August 8, 2021, 11:53am

Hi,

Welcome to the RStudio community!

I agree with @williaml on the reprex, but since it's your first post and I was working on it while he posted as well, I shall share my reply too with your data in the form of a reprex

library(dplyr)

myData = data.frame(
  stringsAsFactors = FALSE,
               pID = c(1L, 1L, 1L, 1L, 2L, 2L),
               sID = c(100L, 200L, 300L, 400L, 500L, 600L),
                                   date = c("2016-06-20 00:00:00",
                                            "2016-06-21 00:00:00",
                                            "2016-06-22 00:00:00","2016-06-23 00:00:00",
                                            "2016-06-24 00:00:00",
                                            "2016-06-25 00:00:00"),
             count = c(1L, 2L, 3L, 4L, 2L, 3L),
               age = c(77L, 77L, 77L, 77L, 55L, 55L),
             other = c(10L, 22L, 34L, 46L, 58L, 70L)
                     )
#Make date a date format
myData = myData %>% 
  mutate(date = as.POSIXct(date))

#Filter
myData = myData %>% 
  group_by(pID) %>% 
  filter(date == min(date) | count == min(count))

myData
#> # A tibble: 2 x 6
#> # Groups:   pID [2]
#>     pID   sID date                count   age other
#>   <int> <int> <dttm>              <int> <int> <int>
#> 1     1   100 2016-06-20 00:00:00     1    77    10
#> 2     2   500 2016-06-24 00:00:00     2    55    58

^{Created on 2021-08-08 by the reprex package (v2.0.0)}

You should read more about all of the Tidyverse functions, in this case especially dplyr.

Good luck
PJ

passahasse · August 8, 2021, 11:55am

Thank you both! Sorry about the Reprex thing.

system · August 15, 2021, 11:56am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.