Calculate Median Age - Considering Distinct Count of Variable in another Variable

I have a dataset and am considering three of 20 variables in order to calculate median values. The three variables are id, Top 50, and Age. An id can have only one Age but it can have more than one observation from Top 50 attached to it because there are many Top 50 'types'. I need to find the median Age associated with each Top 50 observation 'type' which means I need the median Age of the distinct count of id associated with a given Top 50 observation 'type'.

I tried to create a replicable df that you all could use as an example but I CANT SEEM TO MAKE IT WORK IN RSTUDIO (super rookie over here, obv) but I do have that "df creation attempt" below so that you have an idea of the type of variables and observations that I am working with:

df<-data.frame(id = c("AXT123,"AXR456","AXV678","AXC789","AXB123","AXT124","AXV345","AXC890","AXC123","AXC345","AXV643","AXQ876","AXW345","AXE987","AXY678","AXT098","AXE345","AXT945","AXN267","AXP345"),(Top50 = c("Congenital heart conditions","Abdominal Pain","Chest Pain", "Cough","Viral infection","Cough","Cough","Viral infection","Dehydration","Abdominal Pain","Congenital heart conditions","Diabetes","Diabetes","Infection","Infection","Infection","Infection","Infection","Cancer","Congenital heart conditions"),Age = c("5","54","23","44","12","15","83","18","22","61","2","55","8","37","71","11","1","1","28","3"))

Let me know what else you need - I think I am just not considering the correct layering of commands for group_by and median but just really need some guidance.

Not bad. To do it more easily you can use a reprex. See the FAQ. Cut and paste the output from your data.frame, say d

# install .packages("reprex") if you don't already have it
# create or read in the data frame, then

and then paste and result in a post. That's it.

I'll come back to address the main question.

structure(list(id = c("AXT123", "AXR456", "AXV678", "AXC789", 
"AXB123", "AXT124", "AXV345", "AXC890", "AXC123", "AXC345", "AXV643", 
"AXQ876", "AXW345", "AXE987", "AXY678", "AXT098", "AXE345", "AXT945", 
"AXN267", "AXP345"), Top50 = c("Congenital_heart_conditions", 
"Abdominal_Pain", "Chest_Pain", "Cough", "Viral_infection", "Cough", 
"Cough", "Viral_infection", "Dehydration", "Abdominal_Pain", 
"Congenital_heart_conditions", "Diabetes", "Diabetes_2", "Infection_1", 
"Infection_2", "Infection_3", "Infection_4", "Infection_5", "Cancer", 
"ngenital_heart_conditions"), Age = c(5, 54, 23, 44, 12, 15, 
83, 18, 22, 61, 2, 55, 8, 37, 71, 11, 1, 1, 28, 3)), class = "data.frame", row.names = c(NA, 
d <- data.frame(
  id = c(
    "AXT123", "AXR456", "AXV678",
    "AXC789", "AXB123", "AXT124", "AXV345", "AXC890",
    "AXC123", "AXC345", "AXV643", "AXQ876", "AXW345",
    "AXE987", "AXY678", "AXT098", "AXE345", "AXT945",
    "AXN267", "AXP345"
  Top50 = c(
    "Abdominal_Pain", "Chest_Pain", "Cough",
    "Viral_infection", "Cough", "Cough", "Viral_infection",
    "Dehydration", "Abdominal_Pain",
    "Congenital_heart_conditions", "Diabetes",
    "Diabetes", "Infection", "Infection",
    "Infection", "Infection", "Infection",
    "Cancer", "Congenital_heart_conditions"
  Age = c(
    5, 54, 23, 44, 12, 15, 83,
    18, 22, 61, 2, 55, 8, 37, 71, 11,
    1, 1, 28, 3

diag <- unique(d$Top50)

get_md <- function(x) median(d[which(d[2] == x),3])

#> Congenital_heart_conditions              Abdominal_Pain 
#>                         3.0                        57.5 
#>                  Chest_Pain                       Cough 
#>                        23.0                        44.0 
#>             Viral_infection                 Dehydration 
#>                        15.0                        22.0 
#>                    Diabetes                   Infection 
#>                        31.5                        11.0 
#>                      Cancer 
#>                        28.0
#> Error in eval(expr, envir, enclos): object 'results' not found

Created on 2023-02-20 with reprex v2.0.2

Oh this is a helpful start, thank you! Any thoughts on calculating the median?

That's what the last expression does. For example, the ages for Infections are 1,1, 11, 37 and 71

> median(c(1,1,11,37,71))
[1] 11
1 Like

Jeez - obviously, thank you. The values of 2 and 3 in get_md <- function(x) median(d[which(d[2] == x),3])are those the 'column number counting from the left'? If not, what do those values represent?

Right, the subset operator goes row,column, the first selects condition rows based on matching the diag index and the second is the age column

1 Like

Alright - so I applied it to my dataset and it comes back with this error:
I did try converting 'Age' from 'integer' to 'numeric' and that resulted in the same error message. For reference, in my dataset, Top 50 is column 4 and Age is column 8. Any thoughts? I really appreciate your guidance.

Use my reprex. I converted age from character to numeric. Then adapt to your data.

1 Like

Ok I tried that and I moved data into columns 1-3 so that I could just 'copy/paste' but this is what I get back:

you most likely did not succeed in transitioning from character age to numeric age.
Perhaps show the code you used, so we can inspect it ?


seems reasonable; verify ?



so far so good;
I'll suggest that you try this variation of Technocrats approach; the difference being using the column names of interest; rather than column positions; just in case the columns are moved around, it may be safer.

get_md <- function(x) median(A[which(A["Top50"] == x),"Age"])
1 Like

Ugh, it's still saying it needs numeric data :frowning:

At this point you should make a reprex.
use dput() on A and share that here, formatted as code by writing triple backticks into the post

my text formatted as code 

I re ran with Debug and received this information - does this help?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.