How to apply conditional filtering using different thresholds in R?

mtoufiq · June 17, 2023, 3:22am

Hi,

I have a question about applying conditional filtering using different thresholds in R. I have a dataframe with each columns denoting with suffix "_Diff", and "_FC". I am interested in filtering genes/rows based on these columns only specifically using different thresholds. What I am currently performing is subsetting the dataframe, one into the columns specific with "_Diff", and another with "_FC" columns, following this filtering on each dataframe. Instead If I would like to use the primary dataframe (ie., Diff_FC) and apply filters together for columns with "Diff" and "FC", it will be very helpful.

Thank you,
Toufiq

Diff_FC <- read.table("Diff.FC.data.tsv", sep = "\t", header = TRUE, row.names = 1)
colnames(Diff_FC)

FC <- Diff_FC[, c(2,4,6,8)]
# Extract genes/rows with FC >= 3 in atleast 1 (>= 1) column
keep.FC <- rowSums(abs(FC) >= 3) >= 1
FC.filtered <- FC[keep.FC,]


Diff <- Diff_FC[, c(1,3,5,7)]
# Extract genes/rows with Diff >= 200 in atleast 1 (>= 1) column
keep.Diff <- rowSums(abs(Diff) >= 200) >= 1
Diff.filtered <- Diff[keep.Diff,]

dput(Diff_FC)

#>         Sample_1_Diff Sample_1_FC   D12_Diff    D12_FC D1_2H_Diff D1_2H_FC
#> Gene_1          19.55   2.1540732   19.44110 1.7192711   13.97766 1.689234
#> Gene_2          -0.17   0.9825462  -28.11032 0.3793305   16.65163 2.551876
#> Gene_3           0.58   1.0111175  -14.67011 0.5353165   32.60851 1.919845
#> Gene_4         -23.46   0.7133081  -10.26711 0.6520463   58.62822 1.757373
#> Gene_5          73.63   1.3072269  141.15807 1.7474932  171.63016 2.001168
#> Gene_6         -27.77   0.5633648  -91.28413 0.3487805   42.60241 2.036303
#> Gene_7         -93.29   0.6439585 -345.97381 0.3603074  176.26676 2.175896
#> Gene_8         -35.10   0.5650019  -59.03095 0.3191430   76.90281 2.012012
#> Gene_9         -49.94   0.6891572 -233.95100 0.3726863  127.00796 2.139187
#> Gene_10        -16.30   0.8868291  -94.28914 0.3391526   85.18115 1.872311
#> Gene_11        -29.89   0.6463977  -75.25740 0.3349317   59.44275 2.160085
#> Gene_12       -119.42   0.5709564 -477.95820 0.3137795  195.64042 2.089433
#> Gene_13        -31.79   0.6444072  -47.98992 0.4058441   54.24593 2.047421
#> Gene_14        -84.66   0.4557027  -38.40021 0.1885030   94.01289 1.979301
#> Gene_15        -62.65   0.3736253  -61.64238 0.2593943   78.92820 2.051535
#> Gene_16        -41.80   0.5574847  -55.03278 0.2996840   56.62705 1.949481
#>         LI_D2_Diff  LI_D2_FC
#> Gene_1       27.40 2.0686427
#> Gene_2      -15.78 0.6371580
#> Gene_3       58.30 2.4303238
#> Gene_4      147.94 4.0185676
#> Gene_5      327.78 3.6519417
#> Gene_6       32.97 1.3159559
#> Gene_7       66.17 1.1563822
#> Gene_8      149.07 2.9327110
#> Gene_9       80.12 1.2806108
#> Gene_10     193.13 2.4029493
#> Gene_11      67.92 1.7149474
#> Gene_12     -42.08 0.9356634
#> Gene_13      81.73 2.2533354
#> Gene_14     216.73 6.2706712
#> Gene_15     130.93 2.8890492
#> Gene_16      98.11 2.5821642

dput(FC.filtered)

#>         Sample_1_FC    D12_FC D1_2H_FC LI_D2_FC
#> Gene_4    0.7133081 0.6520463 1.757373 4.018568
#> Gene_5    1.3072269 1.7474932 2.001168 3.651942
#> Gene_14   0.4557027 0.1885030 1.979301 6.270671

dput(Diff.filtered)

#>         Sample_1_Diff   D12_Diff D1_2H_Diff LI_D2_Diff
#> Gene_5          73.63  141.15807  171.63016     327.78
#> Gene_7         -93.29 -345.97381  176.26676      66.17
#> Gene_9         -49.94 -233.95100  127.00796      80.12
#> Gene_12       -119.42 -477.95820  195.64042     -42.08
#> Gene_14        -84.66  -38.40021   94.01289     216.73

FJCC · June 17, 2023, 2:56pm

Is this the sort of thing you want? I set the threshold on the FC columns to 2 because I did not use all of your columns and a threshold of 3 eliminated all rows.

library(dplyr)

Diff_FC <- read.csv("~/R/Play/Dummy.csv")
Diff_FC
#>         Sample_1_Diff Sample_1_FC   D12_Diff    D12_FC D1_2H_Diff D1_2H_FC
#> Gene_1          19.55   2.1540732   19.44110 1.7192711   13.97766 1.689234
#> Gene_2          -0.17   0.9825462  -28.11032 0.3793305   16.65163 2.551876
#> Gene_3           0.58   1.0111175  -14.67011 0.5353165   32.60851 1.919845
#> Gene_4         -23.46   0.7133081  -10.26711 0.6520463   58.62822 1.757373
#> Gene_5          73.63   1.3072269  141.15807 1.7474932  171.63016 2.001168
#> Gene_6         -27.77   0.5633648  -91.28413 0.3487805   42.60241 2.036303
#> Gene_7         -93.29   0.6439585 -345.97381 0.3603074  176.26676 2.175896
#> Gene_8         -35.10   0.5650019  -59.03095 0.3191430   76.90281 2.012012
#> Gene_9         -49.94   0.6891572 -233.95100 0.3726863  127.00796 2.139187
#> Gene_10        -16.30   0.8868291  -94.28914 0.3391526   85.18115 1.872311
#> Gene_11        -29.89   0.6463977  -75.25740 0.3349317   59.44275 2.160085
#> Gene_12       -119.42   0.5709564 -477.95820 0.3137795  195.64042 2.089433
#> Gene_13        -31.79   0.6444072  -47.98992 0.4058441   54.24593 2.047421
#> Gene_14        -84.66   0.4557027  -38.40021 0.1885030   94.01289 1.979301
#> Gene_15        -62.65   0.3736253  -61.64238 0.2593943   78.92820 2.051535
#> Gene_16        -41.80   0.5574847  -55.03278 0.2996840   56.62705 1.949481
Filtered <- Diff_FC |> rowwise() |> 
  mutate(MaxFC = max(abs(c_across(ends_with("FC")))),
         MaxDiff = max(abs(c_across(ends_with("Diff"))))) |> 
  filter(MaxFC >= 2, MaxDiff >= 200)
Filtered
#> # A tibble: 3 × 8
#> # Rowwise: 
#>   Sample_1_Diff Sample_1_FC D12_Diff D12_FC D1_2H_Diff D1_2H_FC MaxFC MaxDiff
#>           <dbl>       <dbl>    <dbl>  <dbl>      <dbl>    <dbl> <dbl>   <dbl>
#> 1         -93.3       0.644    -346.  0.360       176.     2.18  2.18    346.
#> 2         -49.9       0.689    -234.  0.373       127.     2.14  2.14    234.
#> 3        -119.        0.571    -478.  0.314       196.     2.09  2.09    478.

^{Created on 2023-06-17 with reprex v2.0.2}

AlexisW · June 17, 2023, 2:59pm

In some way or another, you'll always be subsetting the data frame columns (even though it might not be explicit). So, unless you want to switch to the tidyverse (which will make the subsetting implicit), I think your code is essentially the right one.

You can assemble the filtering steps if you prefer.

Diff_FC <- read.table(text = "  Sample_1_Diff   Sample_1_FC D12_Diff    D12_FC  D1_2H_Diff  D1_2H_FC    LI_D2_Diff  LI_D2_FC
Gene_1  19.55   2.1540732   19.4411 1.7192711   13.97766    1.689234    27.4    2.0686427
Gene_2  -0.17   0.9825462   -28.11032   0.3793305   16.65163    2.551876    -15.78  0.637158
Gene_3  0.58    1.0111175   -14.67011   0.5353165   32.60851    1.919845    58.3    2.4303238
Gene_4  -23.46  0.7133081   -10.26711   0.6520463   58.62822    1.757373    147.94  4.0185676
Gene_5  73.63   1.3072269   141.15807   1.7474932   171.63016   2.001168    327.78  3.6519417
Gene_6  -27.77  0.5633648   -91.28413   0.3487805   42.60241    2.036303    32.97   1.3159559
Gene_7  -93.29  0.6439585   -345.97381  0.3603074   176.26676   2.175896    66.17   1.1563822
Gene_8  -35.1   0.5650019   -59.03095   0.319143    76.90281    2.012012    149.07  2.932711
Gene_9  -49.94  0.6891572   -233.951    0.3726863   127.00796   2.139187    80.12   1.2806108
Gene_10 -16.3   0.8868291   -94.28914   0.3391526   85.18115    1.872311    193.13  2.4029493
Gene_11 -29.89  0.6463977   -75.2574    0.3349317   59.44275    2.160085    67.92   1.7149474
Gene_12 -119.42 0.5709564   -477.9582   0.3137795   195.64042   2.089433    -42.08  0.9356634
Gene_13 -31.79  0.6444072   -47.98992   0.4058441   54.24593    2.047421    81.73   2.2533354
Gene_14 -84.66  0.4557027   -38.40021   0.188503    94.01289    1.979301    216.73  6.2706712
Gene_15 -62.65  0.3736253   -61.64238   0.2593943   78.9282 2.051535    130.93  2.8890492
Gene_16 -41.8   0.5574847   -55.03278   0.299684    56.62705    1.949481    98.11   2.5821642",
header = TRUE)


# find the columns in each category
cols <- colnames(Diff_FC)
cols_FC <- endsWith(cols, "_FC")
cols_Diff <- endsWith(cols, "_Diff")

# create filters
keep.FC <- rowSums(abs(Diff_FC[,cols_FC]) >= 3) >= 1
keep.Diff <- rowSums(abs(Diff_FC[,cols_Diff]) >= 200) >= 1

# assemble filters
keep.all <- keep.FC & keep.Diff

# Final filtered data frame
Diff_FC[keep.all, ]
#>         Sample_1_Diff Sample_1_FC  D12_Diff   D12_FC D1_2H_Diff D1_2H_FC
#> Gene_5          73.63   1.3072269 141.15807 1.747493  171.63016 2.001168
#> Gene_14        -84.66   0.4557027 -38.40021 0.188503   94.01289 1.979301
#>         LI_D2_Diff LI_D2_FC
#> Gene_5      327.78 3.651942
#> Gene_14     216.73 6.270671

^{Created on 2023-06-17 with reprex v2.0.2}

mtoufiq · June 17, 2023, 4:47pm

@AlexisW thank you very much. This seems to be helpful.

Another question, In my dataset, additionally, there are some values “NaN” and “Inf”, suppose, lets's assume, I input these values in the same dataframe and modify your code (see below). Is this a right way to handle these type of values?

Diff_FC <- read.table(text = "  Sample_1_Diff   Sample_1_FC D12_Diff    D12_FC  D1_2H_Diff  D1_2H_FC    LI_D2_Diff  LI_D2_FC
Gene_1  NaN   2.1540732   19.4411 1.7192711   13.97766    1.689234    27.4    2.0686427
Gene_2  -0.17   0.9825462   -28.11032   0.3793305   16.65163    2.551876    -15.78  0.637158
Gene_3  0.58    1.0111175   -14.67011   0.5353165   32.60851    1.919845    58.3    2.4303238
Gene_4  -23.46  0.7133081   -10.26711   0.6520463   58.62822    1.757373    147.94  4.0185676
Gene_5  NaN   1.3072269   141.15807   1.7474932   NaN   2.001168    327.78  Inf
Gene_6  -27.77  0.5633648   -91.28413   0.3487805   42.60241    2.036303    32.97   1.3159559
Gene_7  -93.29  0.6439585   -345.97381  0.3603074   176.26676   2.175896    66.17   1.1563822
Gene_8  -35.1   0.5650019   -59.03095   0.319143    76.90281    2.012012    149.07  2.932711
Gene_9  -49.94  0.6891572   -233.951    0.3726863   127.00796   2.139187    80.12   1.2806108
Gene_10 -16.3   0.8868291   -94.28914   0.3391526   85.18115    1.872311    193.13  2.4029493
Gene_11 -29.89  0.6463977   -75.2574    0.3349317   59.44275    2.160085    67.92   1.7149474
Gene_12 -119.42 0.5709564   -477.9582   0.3137795   195.64042   2.089433    -42.08  0.9356634
Gene_13 -31.79  0.6444072   -47.98992   0.4058441   54.24593    2.047421    81.73   2.2533354
Gene_14 -84.66  0.4557027   -38.40021   0.188503    94.01289    1.979301    216.73  Inf
Gene_15 -62.65  0.3736253   -61.64238   0.2593943   Inf 2.051535    130.93  2.8890492
Gene_16 -41.8   0.5574847   -55.03278   0.299684    56.62705    1.949481    98.11   2.5821642",
                      header = TRUE)


# Replace "NaN" and "Inf" with NA
Diff_FC[Diff_FC == "NaN"] <- NA
Diff_FC[Diff_FC == "Inf"] <- NA

# find the columns in each category
cols <- colnames(Diff_FC)
cols_FC <- endsWith(cols, "_FC")
cols_Diff <- endsWith(cols, "_Diff")

# create filters
keep.FC <- rowSums(abs(Diff_FC[,cols_FC]) >= 3, na.rm = TRUE) >= 1
keep.Diff <- rowSums(abs(Diff_FC[,cols_Diff]) >= 200, na.rm = TRUE) >= 1

# assemble filters
keep.all <- keep.FC & keep.Diff

# Final filtered data frame
Diff_FC[keep.all, ]

AlexisW · June 19, 2023, 1:49pm

It's not necessarily wrong, but whether it's correct depends on the biology and on the downstream statistical questions.

First, you should be sure you understand where these values come from. Let's imagine that for each condition, FC is defined as FC = log( treatment / control )then you could have NaN if treatment or control is NaN (common for mass spec, not RNA-Seq), if treatment == 0, if control == Inf. In the context of your experiement, do those make sense?

Then if you just have a few NA sprinkled around, as in this example, you can probably ignore them safely (though you could also consider an imputation method). But are you sure there aren't some rows with lots of NA? Then your sum would be biased. Maybe you can check:

hist(rowSums(is.na(Diff_FC[,cols_FC])), breaks = 50)

Then you also need to keep that in mind when doing the next steps. For example, here, you're treating the Inf as an NA. So you might be removing the biggest values, is that really correct?

mtoufiq · June 19, 2023, 2:29pm

@AlexisW , thank you.

Yes, thats correct. We have designed this based on the downstream statistical question. Instead of working with log2 fold changes, we want to work the other way typically, in addition to using a linear FC (Treatment/Control) thresholds a second Difference (Treatment - Control) thresholds is applied, so that differences in gene raw count are say more than 100 or 200 or ?

It will eliminate genes that show relatively high FC, but which may not be robust or reproducible because the differences are small (which means the genes are probably expressed at low levels).

AlexisW · June 19, 2023, 3:48pm

To be clear with what I mean (maybe it was, and I'm just repeating), let's say gene i has a count of 0 in the control, and 5000 in the treatment, and say gene j has a count of 5000 in the control, and 0 in the treatment. Gene i is greatly increased by treatment, while gene j is greatly decreased by treatment, but in both cases you get FC = NaN, these two genes give you the same result.

Whether the Difference is meaningful depends on the type of data. Is it RNA-Seq? If so, are you aware that e.g. DESeq2 can analyze paired samples?

One other thing is that both FC and Diff are measures of the effect size, but don't take into account the variability; do you have some way of computing a p-value or FDR that takes it into account? In RNA-Seq, that is the usual way to to account for "robustness".

Also, note that if FC is defined by FC = log(Treatment / Control), then it can be rewritten as FC = log(Treatment) - log(Control), in other words, FC is a difference in log space. So it's not obvious to me why the "Difference" in natural space will be more robust than the FC, or difference in log space. If you make a scatter plot of FC vs Difference, do you have a clear, monotonous, relationship? If so, thresholding one is just equivalent to thresholding the other.

mtoufiq · June 21, 2023, 12:41am

@AlexisW thank you for your comments. Yes, at the same time we have been using another pipeline edgeR for paired analysis. We are using doing fold changes and difference calculation for different downstream purpose.

system · August 2, 2023, 12:42am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.