Features are arranged row-wise. Having extracted the top 1000 features, we know which rows have at least one occurrence of the feature. Samples are arranged column-wise. Any given column either does or does not contain an occurrence of the feature; we know only that at least one column does. This is a binary 1/0, TRUE/FALSE outcome, for which we want a test. The test for occurrence is the magnitude of the variance returned by `rowVars()`. If the variance falls below some value, the feature was not present among the top 1000 features for that column. That value is the minimum variance over all features in the top 1000. If a feature has a variance of zero, it was not present at all, and if it has a variance below the threshold, it was not among the top 1000 for that column. `result` provides a truth table for the entire matrix in a single pass. Because TRUE/FALSE evaluate to 1/0 in the application of `sum()`, the `rowSums` total represents the number of occurrences of a feature, and that number can be compared to 18/10. The application of `>=` returns another logical. If the logical is TRUE, the row is retained; otherwise it is discarded. This satisfies case 3.
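As a rough illustration of the truth-table idea, here is a minimal sketch on a made-up 4 x 4 matrix. The matrix values, `threshold`, and `min_samples` are all hypothetical stand-ins (the real cutoff would come from the `rowVars()` minimum and the 18/10 counts discussed above), and the cell-wise comparison is only an assumption about how a logical matrix such as `result` might be built.

```r
# Toy data: rows are features, columns are samples (values are invented).
m <- matrix(
  c(0, 0, 0, 0,   # feature1: never present
    5, 0, 6, 7,   # feature2: present in three samples
    0, 3, 0, 0,   # feature3: present in one sample
    2, 4, 1, 9),  # feature4: present in all four samples
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("feature", 1:4), paste0("sample", 1:4))
)

threshold   <- 1  # hypothetical cutoff for "present in this sample"
min_samples <- 2  # hypothetical minimum number of samples required

result <- m >= threshold                       # logical matrix: the truth table
counts <- rowSums(result)                      # TRUE counts as 1, FALSE as 0
kept   <- m[counts >= min_samples, , drop = FALSE]  # rows that pass the test

print(counts)
print(rownames(kept))
```

With these invented numbers, only `feature2` and `feature4` are retained. The `drop = FALSE` keeps the result a matrix even if a single row survives.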
The alternative I saw was iteration, which is no more direct, is more involved, and is less efficient.
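To make the comparison concrete, here is a sketch of what an iterative version might look like next to the one-line vectorized form. The data and both cutoffs are hypothetical, chosen only so that the two approaches can be checked against each other.

```r
# Invented data: rows are features, columns are samples.
m <- matrix(c(0, 0, 0, 0,
              5, 0, 6, 7,
              0, 3, 0, 0,
              2, 4, 1, 9), nrow = 4, byrow = TRUE)
threshold   <- 1  # hypothetical presence cutoff
min_samples <- 2  # hypothetical minimum number of samples

# Iterative version: visit every cell and count by hand.
keep_loop <- logical(nrow(m))
for (i in seq_len(nrow(m))) {
  n_present <- 0
  for (j in seq_len(ncol(m))) {
    if (m[i, j] >= threshold) n_present <- n_present + 1
  }
  keep_loop[i] <- n_present >= min_samples
}

# Vectorized version: one expression over the whole matrix.
keep_vec <- rowSums(m >= threshold) >= min_samples

print(identical(keep_loop, keep_vec))
```

Both produce the same logical vector; the loop simply spells out each step that the vectorized expression performs implicitly.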
Whenever possible, I try to frame a problem in R as an instance of a truth table such as this, for two reasons. First, it permits vectors, arrays, and matrices to be treated as single objects amenable to linear algebra. Second, such objects are amenable to functional, rather than procedural, programming.
Every R problem can be thought of with advantage as the interaction of three objects: an existing object, x; a desired object, y; and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites. This has the considerable advantage of directing focus on what rather than how. If a result is outside the range to be expected theoretically, it means the wrong f was selected: focus was lost on what the result was supposed to be, compared with what f actually returns. This is functional programming, and it is why `help(f)` has sections on Arguments and Value.
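The f(x) = y framing, including the point that f can itself be a composite, might be sketched as follows. The helper names `is_present`, `enough`, and `f`, the default cutoffs, and the data are all hypothetical and serve only to illustrate the idea.

```r
# x: the existing object (invented feature-by-sample matrix).
x <- matrix(c(0, 0, 0, 0,
              5, 0, 6, 7,
              0, 3, 0, 0,
              2, 4, 1, 9), nrow = 4, byrow = TRUE)

# Two small functions, each with a clear argument and a clear value.
is_present <- function(x, threshold) x >= threshold          # matrix -> logical matrix
enough     <- function(tt, min_samples) rowSums(tt) >= min_samples  # logical matrix -> logical vector

# f is a composite of the two: it takes x and returns y.
f <- function(x, threshold = 1, min_samples = 2) {
  x[enough(is_present(x, threshold), min_samples), , drop = FALSE]
}

# y: the desired object.
y <- f(x)
print(y)
```

If `y` looks wrong, the question becomes which function returns the wrong kind of value for its argument, rather than which step of a long procedure misfired.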
By contrast, in a procedural language an unreasonable result means some step was applied inappropriately, which requires an examination of each step: a how question. That process is potentially more involved, and it requires being able to deduce not only what a reasonable final result looks like but also what every intermediate result looks like. This is imperative/procedural programming, in which the same general-purpose steps are available at each point and a choice among them must be made without direct consideration of the end result.