Any tips/techniques for downsampling data in R?

cwright1 · November 2, 2020, 10:46pm

Apologies for the very general question, but I am new to downsampling of data.

I'm working with a matrix of data and need techniques for removing parts of the matrix while still capturing the highest signal (I think this is called downsampling). For example, some people take just the diagonal and check whether the highest signal is still captured.

Example with built-in volcano dataset:

volcano <- volcano

library(pheatmap)
pheatmap(volcano, cluster_rows = FALSE, cluster_cols = FALSE)

We get this plot:

Say I'm most interested in the values over 170. What are some techniques I can use to test whether I can downsample this data and still capture these values?

In reality I'm working with a few hundred of these matrices.

AlexisW · November 3, 2020, 10:51pm

I'd say it depends a lot on what information you'll want to extract at the end, and why you want to downsample.

With unordered data it's common to take a subset of the data using sample() to see what would happen with a smaller sample; to me that's the most common definition of "downsampling". But that seems very inappropriate for spatial data: you would randomly select/drop pixels, totally changing the properties of the image.
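
As a minimal sketch of that kind of downsampling, here the volcano heights are treated as an unordered vector. The 25% fraction and the seed are arbitrary choices for illustration only:

```r
# Treat the volcano heights as an unordered vector of values
set.seed(42)                       # arbitrary seed, only for reproducibility
values <- as.vector(volcano)

# Keep a random 25% of the values (the fraction is an arbitrary choice)
n_keep <- round(0.25 * length(values))
subsample <- sample(values, size = n_keep)

# Compare the share of values over 170 before and after downsampling
share_full <- mean(values > 170)
share_sub  <- mean(subsample > 170)
c(full = share_full, subsample = share_sub)
```

This only makes sense when positions don't matter: the sampled values no longer carry their row and column positions.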

If your goal is just to crop the image to reduce storage space, you could do some thresholding and crop the rows and columns that don't contain information, for example:

image(volcano)
threshold <- EBImage::otsu(volcano, c(min(volcano), max(volcano)))
thresholded_volcano <- volcano >= threshold
image(thresholded_volcano)
image(EBImage::erode(volcano >= threshold))

cropped_volcano <- volcano[rowSums(thresholded_volcano)>0,
                           colSums(thresholded_volcano)>0]

dim(volcano)
dim(cropped_volcano)
image(cropped_volcano)
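
Since there are a few hundred matrices, the same crop could be wrapped in a function and applied to a list. This is only a sketch: crop_to_signal is a hypothetical helper name, the list is a stand-in for real data, and a fixed threshold of 170 replaces the Otsu step so the example runs with base R only:

```r
# Hypothetical helper: crop a matrix to the rows and columns that contain
# at least one value at or above `threshold`
crop_to_signal <- function(m, threshold) {
  keep <- m >= threshold
  m[rowSums(keep) > 0, colSums(keep) > 0, drop = FALSE]
}

# Stand-in for a real collection of matrices: the volcano data and a shifted copy
matrices <- list(a = volcano, b = volcano + 5)

# Apply the same crop to every matrix and compare dimensions before and after
cropped <- lapply(matrices, crop_to_signal, threshold = 170)
sapply(matrices, dim)
sapply(cropped, dim)
```

Every value at or above the threshold survives the crop, because a row or column is only dropped when it contains none of them.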

If you're specifically interested in the values over 170, you can threshold your matrix on 170, then use functions from EBImage to make a nice mask and apply it:

# make binary matrix (mask)
mask_thres <- volcano >= 170
# fill holes
mask_filled <- EBImage::closing(mask_thres)
mask_filled <- EBImage::fillHull(mask_filled)
# increase a bit the size of the mask to capture the surroundings
mask_dilated <- EBImage::dilate(mask_filled)
# Now take the original values in the mask, discarding the values outside
thresholded_volcano <- volcano*mask_dilated
image(thresholded_volcano)
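
One thing to watch with volcano*mask: the discarded cells become 0, which could be confused with real values if your data can contain zeros. A possible variant, sketched here with a plain threshold mask and without the EBImage clean-up steps, marks them as NA instead:

```r
# Plain threshold mask, without the EBImage clean-up steps
mask <- volcano >= 170

# Alternative to volcano * mask: mark discarded cells as NA instead of 0
masked_volcano <- volcano
masked_volcano[!mask] <- NA

# Fraction of cells retained, and the range of the retained values only
mean(mask)
range(masked_volcano, na.rm = TRUE)
```

Functions such as range(), mean() and sum() then need na.rm = TRUE to ignore the discarded cells.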

Working with images is a particular art form; the right approach depends heavily on what your goal is.

cwright1 · November 4, 2020, 3:29pm

This is extremely helpful and has taught me a lot already, thank you! My setup is analogous, so I will describe it in the context of this volcano dataset and ask your opinion.

I have treated some samples with DrugA and DrugB in a matrix. The columns are the concentrations used of DrugA, rows are concentrations of DrugB. The values of the matrix represent what are called 'synergy scores'.

All my matrices are 12x12 and the question is: Is there some concentration range that you can extract and get more-or-less the same answer?

So if I imagine the volcano dataset as a synergy score matrix like mine, how can I get an idea of how many rows and columns to cut out? I worked through your example, and the resulting cropped_volcano is very similar to what I'm looking for.

My question to you is: Is there some way to set the thresholding to the top 10% of values, and also to keep the resulting matrix square? (I tested this on my data but the resulting matrix was no longer square, it had more columns than rows).
My dataset is at least like the volcano dataset in that the highest values are near each other (you don't have the highest value in the middle and the second highest value on the edge).

Thanks so much

AlexisW · November 4, 2020, 4:13pm

Oh, I've never worked with that kind of data, so I may not have the best answers. I would also try taking the difference between neighbors to find edges; I'm not sure it will work better, but it could.
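
A rough sketch of the neighbor-difference idea, using base R's diff() on the volcano matrix as a stand-in. Combining the two directions into a gradient magnitude is my own assumption about how to use the differences, not an established recipe for synergy data:

```r
# Differences between vertically adjacent cells (row i+1 minus row i)
d_rows <- diff(volcano)            # dimensions: (nrow - 1) x ncol
# Differences between horizontally adjacent cells (column j+1 minus column j)
d_cols <- t(diff(t(volcano)))      # dimensions: nrow x (ncol - 1)

# Combine both on the common (nrow - 1) x (ncol - 1) grid into a rough
# gradient magnitude; large values mark steep changes, i.e. candidate edges
nr <- nrow(volcano)
nc <- ncol(volcano)
edge_strength <- sqrt(d_rows[, -nc]^2 + d_cols[-nr, ]^2)

dim(edge_strength)
```

Calling image(edge_strength) then shows where the values change fastest.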

set the thresholding to the top 10% of values

That's easy enough:

threshold <- quantile(my_matrix, probs = 0.90)
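
For instance, with volcano standing in for my_matrix, the quantile threshold slots into the same crop as before (a sketch, not tested on synergy data):

```r
# The value below which 90% of the entries fall
threshold <- quantile(volcano, probs = 0.90)

# Logical mask of the top 10% of values
top_mask <- volcano >= threshold

# Roughly 10% of cells pass; ties at the threshold can push it slightly higher
mean(top_mask)

# Crop to the rows and columns that contain at least one top-10% value
top_crop <- volcano[rowSums(top_mask) > 0, colSums(top_mask) > 0]
dim(top_crop)
```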

to keep the resulting matrix square?

Not as easy: I don't know of a function that already does it, and there are some assumptions to make. I'd start by finding the minimum and maximum x and y (or drugA and drugB) of the thresholded region, keep the wider of the two extents, and grow the narrower one from its center to match. Then you would still have to figure out what to do if you're close to the border. Something along these lines:

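# thresholded_matrix: logical matrix, TRUE where the value passes the threshold
rows_hit <- which(rowSums(thresholded_matrix) > 0)
cols_hit <- which(colSums(thresholded_matrix) > 0)
xmin <- min(rows_hit)
xmax <- max(rows_hit)
ymin <- min(cols_hit)
ymax <- max(cols_hit)

xwidth <- xmax - xmin
xcenter <- xmin + xwidth/2

ywidth <- ymax - ymin
ycenter <- ymin + ywidth/2

if (xwidth > ywidth){
  # grow the narrower (y) extent from its center to match the x extent
  new_ymin <- ycenter - xwidth/2
  new_ymax <- ycenter + xwidth/2

  # R indices start at 1: if the window falls off the edge, shift it back in
  if(new_ymin < 1){
    new_ymax <- new_ymax + (1 - new_ymin)
    new_ymin <- 1
  }
}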

(Of course this is just to illustrate; you would still have to think hard about how to round the values correctly, handle the other border, etc.)
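
One possible way to fill in the rounding and border handling is sketched below. square_bbox is a hypothetical helper, not an existing function, and it assumes the mask contains at least one TRUE cell. If the signal spans more rows or columns than the shorter matrix dimension, the side length is capped and some signal can be cut off:

```r
# Hypothetical helper: square bounding box around the TRUE cells of a logical mask
square_bbox <- function(mask) {
  rows <- range(which(rowSums(mask) > 0))
  cols <- range(which(colSums(mask) > 0))

  # Side length: the larger extent, capped so the square fits in the matrix
  side <- max(diff(rows), diff(cols)) + 1
  side <- min(side, nrow(mask), ncol(mask))

  # Centre a window of length `side` on the extent `rng`, then shift it
  # back inside 1..n if it sticks out past either border
  place <- function(rng, n) {
    start <- round(mean(rng) - (side - 1) / 2)
    start <- min(max(start, 1), n - side + 1)
    start:(start + side - 1)
  }

  list(rows = place(rows, nrow(mask)), cols = place(cols, ncol(mask)))
}

# Usage on the volcano data, thresholded at the top 10% of values
mask <- volcano >= quantile(volcano, probs = 0.90)
box <- square_bbox(mask)
square_crop <- volcano[box$rows, box$cols]
dim(square_crop)
```

For 12x12 matrices the cap never applies, so the square always contains every cell that passed the threshold.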
