Subsetting dataframe by changes in values by a threshold from one column to another

amjass12 · January 16, 2020, 11:28pm

Hi all, I have a question regarding subsetting a data frame based on a threshold value between different sets of columns and I am finding this surprisingly difficult to achieve. I would really appreciate some help!

Ultimately, the question is as follows: given a dataframe of (for example two columns), can i subset the data frame if the value in a row goes from a certain threshold in one column to another.

the result of the column above if i specified a threshold change of 1 would be:

rows 1,2 and 4:

1 2
2 3
5 4

Note that I specify the threshold to be a change between values of columns of 1. That means it could go up by 1 but also down by 1 (as above). I need the threshold to be specified like this if possible as i need to subset the changes in either direction... I am really struggling to figure out the code for this. I have an r object of genomic data. its a matrix of around 30 columns and many thousand rows. the columns are grouped in to conditions.. in this example two conditions, so 15 columns could be condition 1, and 15 columns could be condition 2.

I would like to know how count values change from condition 1 to 2 based on a threshold as above. The code i have so far is:

#Create rownames of individual groups samples = rownames(rObject@colData)[mSetSqFlt@colData[, "condition1"]== 1] samplesOne=rownames(rObject@colData)[mSetSqFlt@colData[, "condition2"]== 0]

I can then specify from the counts data which group of columns belongs to condition1 and which to condition2. This then splits my counts data as follows:

counts[samples,]
counts[samplesOne,]

Now what i would like to do is say either in all samples per group, identify rows that change from samples to sampleOne by a a threshold of counts (0.3 for example as they are scaled): so counts[samples,] > counts [samples2,] = 0.3 and then the output is a data frame where all counts from samples increase or decrease in value to sampleOne by at least 0.3.

Alternatively i would also like to explore the same thing but instead of all samples in each group, it could be, at least 4 samples from group 1 and 4 samples from group 2 have to have a threshold change.

i hope this makes sense and any help is much appreciated.

thanks!

andresrcs · January 17, 2020, 12:47am

Your description is not completely clear to me, is this what you mean?

library(dplyr)

sample_df <- data.frame(
           x = c(1, 2, 4, 5, 6),
           y = c(2, 3, 4, 4, 6)
)

sample_df %>% 
    filter(abs(y - x) >= 1)
#>   x y
#> 1 1 2
#> 2 2 3
#> 3 5 4

^{Created on 2020-01-17 by the reprex package (v0.3.0.9000)}

if not, please provide a proper REPRoducible EXample (reprex) illustrating your issue.

amjass12 · January 17, 2020, 9:00am

Thank you! yes this does answer the question if there are only two columns!

However, I have an unequal number of columns in each condition and when specifying these columns i get an error that the data frames are not of equal size. Here is my reproducible example:

#>   condition 1 condition 2 condition1 condition 1 condition 2
#> 1          5.1         3.5          6         5  3.1
#> 2          4.9         3.0          5         5.2  3
#> 3          4.7         3.2          4.2        4.8  2.7
#> 4          4.6         3.1          4.9         4.9  3.5
#> 5          5.0         3.6          6         5  3.9
#> 6          5.4         3.9          4.2         5.2  4

Condition 1 are all linked and conditions 2 are all linked. I could get the average by rowMeans(df) of condition 1 and the same for condition 2 and then carry out the code above.

I would also like to be able to to avoid the averaging of columns approach as well if possible by specifying that if there is for example, a change in at least 2 columns of condition 1 to at least 2 columns of condition 2 of a value of 'x', then subset the dataframe to the row meeting this criteria.

I have subset the data frame above in the example as follows:

entire dataframe= df

condition1<- "col1", "col3" condition2<- "col2", "col4", "col5" ...etc
please note that the total number of columns for condition 1 is not equal to the total number of columns for conditions. condition 1 may have 20 columns and condition 2 15. This is just the nature of the experiment.

df[,condition1]
df[,condition2]

each correctly subset the appropriate columns....

It is from here i would like to specifcy, for example, if the value of condition 1 changes in at least for example, 4 samples in condition 1 relative to 4 columns of condition 2 (or any other number of columns) with a change value of 'x', subset the data frame.

Please note that the value of x should be a change in any direction as in the original post, so if it is a value of 1, it could change by a value of 1 up from condition one to two (for example 1-2) or down by a value of 1 (for example 2-1).. again the solution you provided is perfect, but for is only working for individual or equal sized columns!

hope this makes sense?

thank you in advance!!

andresrcs · January 17, 2020, 11:48am

Sorry but that is not a reproducible example nor copy/paste friendly, please read the guide I gave you and try to make a proper reproducible example. That would make things way easier for people willing to help and increase your chances of getting help.

amjass12 · January 17, 2020, 1:06pm

apologies: code to recreate data frame:

df <- data.frame(
con1_a = c(1, 2, 4, 5, 6),
con2_a = c(2, 3, 4, 4, 6),
con1_b = c(1.1, 2.2, 4.2, 5.2, 6.2),
con2_b = c(2.2, 3.2, 4.2, 4.2, 6.2),
con1_c = c(1.3, 2.5, 4.1, 5.7, 6.9)

)

is this ok?

andresrcs · January 17, 2020, 3:16pm

Well, that is better as sample data but since you are not including any code or sample output it doesn't clarify things.
Is this closer to what you want?

library(tidyverse)

df <- data.frame(
    con1_a = c(1, 2, 4, 5, 6),
    con2_a = c(2, 3, 4, 4, 6),
    con1_b = c(1.1, 2.2, 4.2, 5.2, 6.2),
    con2_b = c(2.2, 3.2, 4.2, 4.2, 6.2),
    con1_c = c(1.3, 2.5, 4.1, 5.7, 6.9)
    
)

rows <- df %>%
    rownames_to_column("row_number") %>% 
    pivot_longer(
        cols = -row_number,
        names_to = c("condition", "set"),
        names_sep = "_",
        values_to = "value"
        ) %>% 
    group_by(row_number, set) %>% 
    summarise(dif = max(value) - min(value)) %>% 
    filter(sum(dif >= 1) > 1) %>% # More than 1 set of conditions have a difference greater or equal to 1
    distinct(row_number) %>% 
    pull()
    
df[rows,]
#>   con1_a con2_a con1_b con2_b con1_c
#> 1      1      2    1.1    2.2    1.3
#> 2      2      3    2.2    3.2    2.5
#> 4      5      4    5.2    4.2    5.7

amjass12 · January 17, 2020, 4:47pm

thank you! actually what i meant was, instead of # More than 1 set of conditions have a difference greater or equal to 1

more than or equal to x number of samples within a condition have to have a difference greater than 1.

so an expected output might be: at least 2 samples in each condition have to have a difference greater than 1.

So from the table above:

there are 3 con_1 and 2_con2. but in the last column con1_c row 1 has a value of 1.3. so the change in this value compared to con2_a and con2_b (2 and 2.2), the difference is not greater than 1, similarly for con1_b. in fact, con1_a is the only one that passes this threshold and so this row would not be included. does this make sense?

so the expected output if the aim was: at least 2 samples in each condition have to have a difference greater than 1. would be:


df
con1_a con2_a con1_b con2_b con1_c
1      1      2    1.0    2.0    1.0
2      2      3    2.2    3.2    2.5
3      4      4    4.2    4.2    4.1
4      5      4    5.2    4.0    5.7
5      6      6    6.2    6.2    6.9

output
con1_a con2_a con1_b con2_b con1_c
1      1      2    1.0    2.0    1.0
4      5      4    5.2    4.0    5.7

system · February 7, 2020, 4:47pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.