How to apply conditional filtering using different thresholds in R?
Hi,
I have a question about applying conditional filtering using different thresholds in R. I have a data frame whose columns carry the suffix "_Diff" or "_FC". I want to filter genes (rows) on these columns only, using a different threshold for each suffix. At the moment I split the data frame in two, one part with the "_Diff" columns and one with the "_FC" columns, and then filter each part separately. It would be very helpful if I could instead use the primary data frame (i.e., Diff_FC) and apply the "_Diff" and "_FC" filters together.
Is this the sort of thing you want? I set the threshold on the FC columns to 2 because I did not use all of your columns and a threshold of 3 eliminated all rows.
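A minimal sketch of this kind of filter in base R. The data frame contents, the condition names (A, B), the Diff threshold of 100, and the rule that every column of a group must pass are all assumptions for illustration; only the FC threshold of 2 comes from the reply above.

```r
# Hypothetical example data: two conditions, each with a _Diff and an _FC column
Diff_FC <- data.frame(
  Gene   = c("g1", "g2", "g3", "g4"),
  A_Diff = c(250, 40, 300, 120),
  A_FC   = c(2.5, 4.0, 1.2, 3.1),
  B_Diff = c(180, 10, 220, 90),
  B_FC   = c(2.2, 5.0, 1.1, 2.8)
)

# Pick the two column groups by suffix
diff_cols <- grep("_Diff$", names(Diff_FC), value = TRUE)
fc_cols   <- grep("_FC$",   names(Diff_FC), value = TRUE)

diff_threshold <- 100  # assumed value
fc_threshold   <- 2

# Count, per row, how many columns of each group pass their threshold,
# then keep rows where all of them pass (assumed rule)
keep <- rowSums(Diff_FC[, diff_cols, drop = FALSE] >= diff_threshold) == length(diff_cols) &
        rowSums(Diff_FC[, fc_cols,   drop = FALSE] >= fc_threshold)   == length(fc_cols)

filtered <- Diff_FC[keep, ]
filtered
```

Changing `== length(...)` to `>= 1` would instead keep rows where at least one column of each group passes.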
In some way or another, you'll always be subsetting the data frame columns (even though it might not be explicit). So, unless you want to switch to the tidyverse (which will make the subsetting implicit), I think your code is essentially the right one.
You can assemble the filtering steps if you prefer.
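One way the steps could be assembled, sketched in base R. The helper name `filter_by_suffix`, the example data, and the thresholds are hypothetical, and the sketch assumes that every column with a given suffix must pass and that the columns contain no missing values.

```r
# Hypothetical helper: one threshold per column suffix, applied in a single call
filter_by_suffix <- function(df, thresholds) {
  keep <- rep(TRUE, nrow(df))
  for (suffix in names(thresholds)) {
    cols   <- grep(paste0(suffix, "$"), names(df), value = TRUE)
    passes <- df[, cols, drop = FALSE] >= thresholds[[suffix]]
    # A row survives only if all columns with this suffix pass
    keep   <- keep & (rowSums(passes) == length(cols))
  }
  df[keep, , drop = FALSE]
}

# Hypothetical example data
Diff_FC <- data.frame(
  Gene   = c("g1", "g2", "g3", "g4"),
  A_Diff = c(250, 40, 300, 120),
  A_FC   = c(2.5, 4.0, 1.2, 3.1),
  B_Diff = c(180, 10, 220, 90),
  B_FC   = c(2.2, 5.0, 1.1, 2.8)
)

filtered <- filter_by_suffix(Diff_FC, c("_Diff" = 100, "_FC" = 2))
filtered
```

The column subsetting still happens inside the helper; it is just no longer visible at the call site.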
@AlexisW thank you very much. This seems to be helpful.
Another question: my dataset also contains some “NaN” and “Inf” values. Suppose I put these values into the same data frame and modify your code (see below). Is this the right way to handle this type of value?
It's not necessarily wrong, but whether it's correct depends on the biology and on the downstream statistical questions.
First, you should be sure you understand where these values come from. Let's imagine that for each condition, FC is defined as FC = log(treatment / control). Then you get NaN if treatment or control is NaN (common for mass spec, not RNA-Seq) or if both are 0. You get -Inf if treatment == 0 (or control == Inf) while the other value is finite and nonzero, and Inf if control == 0 with a nonzero treatment. In the context of your experiment, do those make sense?
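How R evaluates these cases can be checked directly. This is a small sketch assuming the FC = log(treatment / control) definition above; the helper name `fc` is made up for the example.

```r
# Hypothetical helper implementing FC = log(treatment / control)
fc <- function(treatment, control) log(treatment / control)

fc(NaN, 10)   # NaN: a missing measurement propagates
fc(0, 0)      # NaN: 0/0 is undefined
fc(Inf, Inf)  # NaN: Inf/Inf is undefined
fc(0, 10)     # -Inf: log(0)
fc(10, Inf)   # -Inf: 10/Inf is 0, then log(0)
fc(10, 0)     # Inf: 10/0 is Inf, then log(Inf)
```

So NaN signals an undefined ratio, while Inf and -Inf carry a direction (all-or-nothing increase or decrease).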
Then if you just have a few NA sprinkled around, as in this example, you can probably ignore them safely (though you could also consider an imputation method). But are you sure there aren't some rows with lots of NA? Then your sum would be biased. Maybe you can check:
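A sketch of such a check, counting unusable values per row. The data frame and its values are invented for illustration, and the "more than half" cutoff is an arbitrary assumption.

```r
# Hypothetical example data with NA, NaN and Inf in the FC columns
Diff_FC <- data.frame(
  Gene = c("g1", "g2", "g3"),
  A_FC = c(2.5, NaN, Inf),
  B_FC = c(2.2, NA,  3.0),
  C_FC = c(3.1, NaN, 2.4)
)

fc_cols <- grep("_FC$", names(Diff_FC), value = TRUE)
fc_mat  <- as.matrix(Diff_FC[, fc_cols])

# is.na() is TRUE for both NA and NaN; is.infinite() catches Inf and -Inf
n_missing <- rowSums(is.na(fc_mat) | is.infinite(fc_mat))
names(n_missing) <- Diff_FC$Gene

table(n_missing)                             # how many rows have 0, 1, 2, ... unusable values
Diff_FC[n_missing > length(fc_cols) / 2, ]   # rows where more than half the values are unusable
```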
Then you also need to keep that in mind when doing the next steps. For example, here, you're treating the Inf as an NA. So you might be removing the biggest values. Is that really correct?
Yes, that's correct. We designed this around the downstream statistical question. Instead of working with log2 fold changes, we work the other way around: in addition to a threshold on the linear FC (Treatment / Control), we apply a second threshold on the Difference (Treatment - Control), so that the difference in raw gene counts is more than, say, 100, 200, or some other value.
It will eliminate genes that show relatively high FC, but which may not be robust or reproducible because the differences are small (which means the genes are probably expressed at low levels).
To be clear about what I mean (maybe it was, and I'm just repeating), let's say gene i has a count of 0 in the control and 5000 in the treatment, and gene j has a count of 5000 in the control and 0 in the treatment. Gene i is greatly increased by treatment, while gene j is greatly decreased by treatment. With FC = log(treatment / control), gene i gives Inf and gene j gives -Inf, so once Inf is treated as NA, these two genes give you the same result.
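A small sketch of this two-gene example, assuming FC = log(treatment / control) and assuming non-finite values are replaced by NA (the variable names are made up for the example):

```r
# The two genes from the example above
control   <- c(gene_i = 0,    gene_j = 5000)
treatment <- c(gene_i = 5000, gene_j = 0)

log_fc     <- log(treatment / control)  # gene_i: Inf, gene_j: -Inf
difference <- treatment - control       # gene_i: 5000, gene_j: -5000

# Replacing every non-finite value by NA erases the direction of change
log_fc_clean <- log_fc
log_fc_clean[!is.finite(log_fc_clean)] <- NA

log_fc
log_fc_clean
difference
```

The Difference still separates the two genes by sign, whereas the cleaned FC no longer does.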
Whether the Difference is meaningful depends on the type of data. Is it RNA-Seq? If so, are you aware that e.g. DESeq2 can analyze paired samples?
One other thing is that both FC and Diff are measures of the effect size, but they don't take the variability into account. Do you have some way of computing a p-value or FDR that does? In RNA-Seq, that is the usual way to account for "robustness".
Also, note that if FC is defined by FC = log(Treatment / Control), then it can be rewritten as FC = log(Treatment) - log(Control). In other words, FC is a difference in log space. So it's not obvious to me why the "Difference" in natural space would be more robust than the FC, or difference in log space. If you make a scatter plot of FC vs Difference, do you see a clear, monotonic relationship? If so, thresholding one is equivalent to thresholding the other.
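A sketch of that check on simulated counts. The Poisson counts, the seed, and the +1 offset (to avoid zeros) are assumptions made only so the example runs; real data would replace `control` and `treatment`.

```r
set.seed(1)

# Hypothetical counts; +1 avoids zeros so that the log ratio stays finite
control   <- rpois(200, lambda = 50) + 1
treatment <- rpois(200, lambda = 80) + 1

fc         <- log(treatment / control)
difference <- treatment - control

# FC is the same quantity as a difference in log space
all.equal(fc, log(treatment) - log(control))

# A Spearman correlation close to 1 indicates a nearly monotonic relationship
rho <- cor(fc, difference, method = "spearman")
rho

# Scatter plot of FC vs Difference (only drawn in an interactive session)
if (interactive()) {
  plot(difference, fc, xlab = "Difference (Treatment - Control)",
       ylab = "FC = log(Treatment / Control)")
}
```

If the points fall on a single increasing curve, the two thresholds select essentially the same genes; a wide cloud means the Difference threshold is adding information, typically about expression level.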
@AlexisW thank you for your comments. Yes, in parallel we have been using another pipeline, edgeR, for the paired analysis. We are doing the fold change and difference calculations for a different downstream purpose.