Filtering by columns in a large dataframe

Hello,

I'm having difficulty filtering out columns which contain a value between 0.1 and 26.
I have tried the following code, which looked close, but it didn't remove anything.

test_filter <- df %>%
  filter(if_any(everything() ~.(!=0 & <26)))
view(test_filter)

Test_filter <- df2 %>%
  filter(if_any(everything(), function(x) {0 < x & x <= 26}))

Here is what a small section of the dataframe looks like. Each row and each column corresponds to an individual in a large population, so the 0s are where an individual is compared against itself. Each row/column therefore has a different ID code.
Any tips?

df |> dplyr::select(where(function(x, min = 0.1 , max = 26) sum(dplyr::between(x, min, max)) == 0))

worked for me
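For instance, on a made-up toy data frame (the column names here are invented for illustration):

```r
library(dplyr)

# toy data: column b contains a value between 0.1 and 26, column a does not
df <- data.frame(a = c(0, 30, 50),
                 b = c(0, 5, 40))

# keep only the columns with no value in (0.1, 26): drops b, keeps a
df |>
  select(where(function(x, min = 0.1, max = 26) sum(between(x, min, max)) == 0))
```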


A fundamental question here is how to best represent the data, matrix or tidy. Your current format, with individuals in rows and columns, corresponds to a matrix format, and you can use matrix functions for it. To use the tidyverse functions, it's easier if you convert your data to a "tidy" format.

Let's make up some fake data for illustration:

set.seed(123)
# create 5 individuals
x <- matrix(runif(5),
            nrow = 5)

# make a data.frame of distances
dat <- (100 * dist(x)) |>
  as.matrix() |>
  as.data.frame()

dat
#>          1         2        3         4         5
#> 1  0.00000 50.072762 12.13994 59.543988 65.288976
#> 2 50.07276  0.000000 37.93282  9.471227 15.216215
#> 3 12.13994 37.932821  0.00000 47.404048 53.149036
#> 4 59.54399  9.471227 47.40405  0.000000  5.744988
#> 5 65.28898 15.216215 53.14904  5.744988  0.000000

Created on 2025-03-12 with reprex v2.1.0

So this dat looks like your starting data frame.

Matrix format

We make this into a matrix:

mat <- as.matrix(dat)

Then we ask, for each column, does it contain a value between a and b (where a and b can be set to 0.1 and 26; here I use different thresholds for illustration).

"for each column" can be done with apply( , 2, ) (the 2 means column here). apply() can apply a function to each column, so we will create our function first:

fails_threshold <- function(x, a, b){
  ! any(x > a & x < b)
}

my_filter <- apply(mat, 2, fails_threshold, a = 0.1, b = 10)
my_filter
#>     1     2     3     4     5 
#>  TRUE FALSE  TRUE FALSE FALSE 

So out of 5 columns, the 2nd, 4th and 5th contain values between 0.1 and 10.

Finally, we can subset the matrix, keeping only the 2 columns that pass the filter:

mat_subset <- mat[, my_filter]
mat_subset
#>          1        3
#> 1  0.00000 12.13994
#> 2 50.07276 37.93282
#> 3 12.13994  0.00000
#> 4 59.54399 47.40405
#> 5 65.28898 53.14904

Or we can use the same filter on both rows and columns:

mat_subset <- mat[my_filter, my_filter]
mat_subset
#>          1        3
#> 1  0.00000 12.13994
#> 3 12.13994  0.00000

Tidyverse approach

To best use this approach, you need "tidy" data: where each row is an observation, each column a variable. So your current format is "too wide", we first need to make it long:

library(tidyverse)

dat_long <- dat |>
  rownames_to_column("id_1") |>
  pivot_longer(cols = -id_1,
               names_to = "id_2",
               values_to = "distance")
dat_long
#> # A tibble: 25 × 3
#>    id_1  id_2  distance
#>    <chr> <chr>    <dbl>
#>  1 1     1         0   
#>  2 1     2        50.1 
#>  3 1     3        12.1 
#>  4 1     4        59.5 
#>  5 1     5        65.3 
#>  6 2     1        50.1 
#>  7 2     2         0   
#>  8 2     3        37.9 
#>  9 2     4         9.47
#> 10 2     5        15.2 
#> # ℹ 15 more rows

Since dat is symmetric, note that we have duplicates (rows where id_1 and id_2 are swapped); you could keep only the cases where id_1 < id_2, for example.
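For example, keeping each unordered pair only once (comparing the IDs as strings is fine here since they are single characters):

```r
# keep each pair of individuals once, and drop the self-comparisons
dat_long |>
  filter(id_1 < id_2)
```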

Now, note that filter() is to filter rows, and select() to select columns. In this case we only want to filter rows. And we want to do it by group, so we'll use group_by():

dat_long |>
  group_by(id_1) |>
  filter(! any(distance > .1 & distance < 10))
#> # A tibble: 10 × 3
#> # Groups:   id_1 [2]
#>    id_1  id_2  distance
#>    <chr> <chr>    <dbl>
#>  1 1     1          0  
#>  2 1     2         50.1
#>  3 1     3         12.1
#>  4 1     4         59.5
#>  5 1     5         65.3
#>  6 3     1         12.1
#>  7 3     2         37.9
#>  8 3     3          0  
#>  9 3     4         47.4
#> 10 3     5         53.1

So just as with the matrix approach, we kept the rows where id_1 is 1 or 3: we get the same filter.

Alexis, the first matrix format seemed to work, although it deleted the first column, which houses the IDs for those rows.

With the tidyverse approach I was met with this error:

dat_long <- ind_dist_map2 |>
  rownames_to_column("ID") |>
  pivot_longer(cols = -ID,
               names_to = "id_2",
               values_to = "distance")

Error in `pivot_longer()`:
! Can't combine `Column1` <character> and `19F-1526` <double>.
Run `rlang::last_trace()` to see where the error occurred.

I should restate that: the filtering could be done either by columns or by rows. I just need to keep the individual ID as, let's say, a column when I filter out that individual's row.

I apologize, I didn't include the column headers and column one in the first picture.

In the cols argument, you can indicate which columns should (or not) be pivoted into longer format. In your case, it looks like your ID column has the title Column1, so you need to indicate:

# for my fake data.frame which had rownames
ind_dist_map2 <- dat |>
  rownames_to_column("Column1")

# this should work directly with your data where the ID is saved in `Column1`
dat_long <- ind_dist_map2 |>
  pivot_longer(cols = -Column1,
               names_to = "id_2",
               values_to = "distance")
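From there, the same group_by()/filter() step as above should work, now grouping by Column1 (a sketch using the 0.1 and 26 thresholds from your original question):

```r
# drop every individual (group) that has any distance strictly
# between 0.1 and 26 to another individual
dat_long |>
  group_by(Column1) |>
  filter(! any(distance > 0.1 & distance < 26))
```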