How to extract top 5% number from matrix?

Xiuju · August 29, 2019, 8:22am

Hello,
I heve a matrix(4260*34), a part as follow:
brown grey yellow green white blue
A -1.149255e-03 2.860265e-03 9.198340e-04 5.144036e-03 0.0026969686 5.746249e-03
B -9.174336e-04 8.317152e-03 -1.360653e-03 6.653484e-03 0.0065084797 1.006139e-02
C -7.864003e-03 -1.925244e-02 -1.097666e-03 1.098523e-02 0.0322852911 -1.666910e-02
D -7.039214e-03 -1.279118e-02 -1.351395e-03 6.403375e-03 0.0063110152 -9.466401e-03
E -7.040558e-03 -1.679522e-02 5.303559e-03 2.541649e-02 0.0249588512 -1.357130e-02
F 3.566300e-04 1.973854e-03 6.582043e-03 1.015424e-02 -0.0035802723 3.627824e-03
G 3.268834e-03 -4.372850e-03 -6.720695e-03 2.718662e-03 -0.0045350115 3.312930e-03
H -1.655860e-04 6.670511e-03 1.830364e-03 9.082381e-03 0.0256364955 2.357959e-03
I 1.185147e-02 -2.999770e-04 -1.264101e-03 3.508225e-02 0.0210897963 -1.898114e-04
J 4.564741e-03 7.082577e-03 4.027961e-03 1.801338e-02 -0.0012343277 4.510357e-03
K -8.806285e-03 -1.694895e-02 -1.134117e-01 -1.529373e-01 -0.0765283194 -1.916946e-01
M 1.241366e-03 3.154813e-02 4.881584e-02 -1.303083e-02 -0.0133092729 -2.146324e-02

I want to extract top 5% number informations, the result as follow:
0.0026969686 A white
0.0065084797 B white
0.0322852911 C white
0.006311015 D white

But I don't know how to do this use R code. Can you help me?

Cesc · August 29, 2019, 10:13am

Hi Xiuju, through your brief description I can't know your question(even the structure of the matrix) very much and I think it's a good idea to give more information here:

what's your column names of the matrix(Are they the color names)?
Are these letters your row names of matrix or are they just one of the attributes(i.e., a column)?
Are all the values in this matrix numbers?

Tony_Duan · August 29, 2019, 10:20am


install.packages('tidyverse')


library(tidyverse)


# 100 by 2 matrix 
matrx001=matrix(seq(1,200),100)
matrx001

# select top 5 row
top_percent=5

# add row_id
matrx002=as.data.frame(matrx001) %>% mutate(row_id=row_number())
matrx002

# get top 5 row
matrx003=matrx002 %>% filter(row_id<=nrow(matrx002)/100*top_percent)
matrx003

mdzorn · August 29, 2019, 7:38pm

There is a special function in dplyr called top_frac:

df %>% top_frac(.05)

You may have to select the columns you want first to achieve the desired result.

Xiuju · August 30, 2019, 2:15am

Hello, Tony, thank you for your answer. I think there are some things I didn't make clear. I only need to rank in the top 5% of the matrix, not the first 5 rows. For your example, the result which I need is as follow:
V1 row_id
196 96
197 97
198 98
199 99
200 100
If you know how to do this you can tell me,thanks very much!

Xiuju · August 30, 2019, 2:17am

Hello, mdzom, I do what you said, but I can't get I want.

df <- matrix(rnorm(1:20),nrow=5,ncol=4)
rownames(df)=c('A','B','C','D','E')
colnames(df)=c('M','N','H','K')
df
M N H K
A 0.4427897 1.410904366 -0.2201473 -0.8736758
B -0.1011622 -1.367856345 -1.7859678 -1.3119912
C 0.3071763 -0.216317135 0.8903342 -1.2456845
D -1.0168394 -0.006008248 -1.6123174 0.5252658
E 0.5484111 0.409088964 0.7684069 -0.4176306
df %>% top_frac(.05)
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "c('matrix', 'double', 'numeric')"

Xiuju · August 30, 2019, 2:22am

Hello, Cesc, you are right! My matrix as follow:

df <- matrix(rnorm(1:20),nrow=5,ncol=4)
rownames(df)=c('A','B','C','D','E')
colnames(df)=c('M','N','H','K')
df
M N H K
A 0.4427897 1.410904366 -0.2201473 -0.8736758
B -0.1011622 -1.367856345 -1.7859678 -1.3119912
C 0.3071763 -0.216317135 0.8903342 -1.2456845
D -1.0168394 -0.006008248 -1.6123174 0.5252658
E 0.5484111 0.409088964 0.7684069 -0.4176306
I only need to rank in the top 5% of the matrix. For this matrix, the result which I need is as follow:
1.410904366 A N
I don't know how to do this use R code. Do you know?

Tony_Duan · August 30, 2019, 3:00am

the one I share is to select top 5% row, the example I gave have total 100 rows . so top 5% will be 5 row, If your data have total 200 rows. the top 5% will be 10 rows.

Cesc · August 30, 2019, 3:35am

Hi Xiuju, here's my codes. It's a non-tidyverse version. Hope this helps

set.seed(111)

df <- matrix(round(rnorm(1:40),2),nrow=10,ncol=4)
rownames(df)=c('A','B','C','D','E','F','G','H','I','J')
colnames(df)=c('M','N','H','K')

# how many values we need to draw from the matrix
# 0.05 represents 5%
n <- length(df)*0.05
# the location of these values in the matrix
# we use "-df" to get decresing rank
loc <- matrix(rank(-df), nrow = nrow(df), ncol = ncol(df)) <= n
# the rownames and colnames of these values
rol_name <- rownames(df)[which(loc) %% nrow(df)]
col_name <- colnames(df)[(which(loc) %/% nrow(df)) + 1 ]
# get the values
value <- diag(df[rol_name, col_name])
# bind the results
rlt <- data.frame(value, rol_name, col_name,
                  stringsAsFactors = F)

finally rlt contains the result in the format you want.

joels · August 30, 2019, 5:22am

If you want the top 5% of values in the matrix, here is one way to do it:

# Fake data
set.seed(2)
x = matrix(rnorm(100), nrow=10)
rownames(x) = colors()[seq(1,100, length=10)]
colnames(x) = LETTERS[1:10]

round(x, 2)

                    A     B     C     D     E     F     G     H     I     J
white           -0.90  0.42  2.09  0.74 -0.38 -0.84 -1.79 -0.92  1.00  1.60
aquamarine4      0.18  0.98 -1.20  0.32 -1.96  2.07  2.03  0.33 -1.70  1.68
bisque4          1.59 -0.39  1.59  1.08 -0.84 -0.56 -0.70 -0.14 -0.53 -1.18
brown2          -1.13 -1.04  1.95 -0.28  1.90  1.28  0.16  0.43 -1.37 -1.36
cadetblue3      -0.08  1.78  0.00 -0.78  0.62 -1.05  0.51 -0.05 -2.21 -1.51
chocolate4       0.13 -2.31 -2.45 -0.60  1.99 -1.97 -0.82 -0.91  1.82 -1.25
cornsilk4        0.71  0.88  0.48 -1.73 -0.31 -0.32 -2.00  1.30 -0.65  1.96
darkgoldenrod3  -0.24  0.04 -0.60 -0.90 -0.09  0.94 -0.48  0.77 -0.28  0.01
darkolivegreen4  1.98  1.01  0.79 -0.56 -0.18  1.14  0.08  1.05 -0.39 -0.84
darkred         -0.14  0.43  0.29 -0.25 -1.20  1.67 -0.90 -1.41  0.39 -0.60

Return the top 5% of values in the matrix:

x[x > quantile(x, prob=0.95)]

[1] 1.984474 2.090819 1.990920 2.066301 2.031243

Return the indices of the top 5% of values:

idx = which(x > quantile(x, prob=0.95), arr.ind=TRUE)
idx

                row col
darkolivegreen4   9   1
white             1   3
chocolate4        6   5
aquamarine4       2   6
aquamarine4       2   7

x[idx]

[1] 1.984474 2.090819 1.990920 2.066301 2.031243

If you want to keep the values and the index names together you could do:

idx.names = cbind(as.data.frame(idx), 
                  rownames=rownames(x)[idx[,1]], 
                  colnames=colnames(x)[idx[,2]])
rownames(idx.names) = 1:nrow(idx.names)

idx.names = cbind(idx.names, values=x[as.matrix(idx.names[, 1:2])])

idx.names

  row col        rownames colnames   values
1   9   1 darkolivegreen4        A 1.984474
2   1   3           white        C 2.090819
3   6   5      chocolate4        E 1.990920
4   2   6     aquamarine4        F 2.066301
5   2   7     aquamarine4        G 2.031243

Xiuju · August 30, 2019, 10:58am

Hi Cesc, you are so good! Your reply is I need! Thank you,Thank you!

Xiuju · August 30, 2019, 11:00am

Hi joels, your answer is I need. Thank you!

Xiuju · August 30, 2019, 12:18pm

Hi Cesc, I have a difficult problem to solve.

df <- read.csv("x.csv", head=T)
df
X M N H K
1 A -0.17337708 -1.47562557 0.48054349 -1.14083398
2 S 0.56560834 0.26986868 0.21414174 0.73769042
3 D 0.21880017 -0.62837209 0.61618021 -0.07402473
4 F -0.18875043 0.76027917 -0.82245489 0.02695310
5 G -0.36401747 0.92628928 1.40450018 1.32389502
6 H -0.82097074 0.93342682 -0.42148351 0.29391919
7 J 0.47496027 0.79466831 -0.63601189 -0.93417474
8 K 0.23366316 -0.75559961 -1.25027516 -1.30956657
9 L 0.45825475 1.12597993 0.68329038 -0.01052644
10 Z -0.80406907 0.88412186 -0.56832841 0.08793940
11 X -0.09304866 -0.39683868 -0.26384998 -0.91758480
12 C 1.15424946 0.21091290 0.58233304 0.54414017
13 V 1.11158894 -0.54699294 -2.20494755 -0.63227544
14 B 0.27842947 -0.37175272 -1.49597546 0.74006791
15 N -1.21412210 0.65449700 -0.78654542 -0.67358232
16 M -2.37475968 1.02638772 0.05970742 0.61636767
17 Q -0.76512893 -1.28531565 -0.41649300 0.69400024
18 W -0.69010155 0.41606084 -1.32883692 -1.19228179
19 E -0.13251960 1.06921415 -0.69657158 0.50787640
20 R 0.75092858 -0.53609374 0.44786898 -0.05484746
21 T 0.54822210 2.46805224 0.23809158 1.31612971
22 Y 0.23780419 -0.22920283 0.97582058 0.68382231
23 U -2.33601789 -1.15880502 -1.14675045 -0.14582799
24 O -2.21027932 0.05173626 -1.64546006 0.64885458
25 I -1.31013945 1.01669225 1.01012223 1.86679267

how many values we need to draw from the matrix

0.05 represents 5%

n <- length(df)*0.05
n
[1] 5

the location of these values in the matrix

we use "-df" to get decresing rank

loc <- (matrix(rank(-df), nrow = nrow(df), ncol = ncol(df)) )<= n
Error in FUN(left) : invalid argument to unary operator

I think the wrong was made by read.csv, the first column is generated automatically. What should I do?

Cesc · August 30, 2019, 1:57pm

That's fine, just let the row.names equal to the first column and delete the first column.

df <- read.csv('df.csv')
# change row names
row.names(df) <- df[,1]
# delete the first column
df <- df[,-1]

I guess the problem results from when you use write.csv you didn't set the parameter that row.names = F.

Xiuju · August 30, 2019, 2:00pm

Thank you very much!

Yarnabrina · August 30, 2019, 2:06pm

@Cesc, I think this is a better way to do the same:

df <- read.csv('df.csv', row.names = 1)

Cesc · August 30, 2019, 3:00pm

wow, that's cool. I've never used it before and I just try it and it works. Thank you for making us know that

mdzorn · August 30, 2019, 4:46pm

top_frac is a data.frame function, so you need to convert the matrix into a data.frame:

as.data.frame(df) %>% select(M) %>% top_frac(0.25)

[You only have 5 rows in the example, so 5% does not yield anything, Top 25% yields the highest number in the column.]

For the overall highest numbers, Joels' suggestion is the way to go.

system · September 6, 2019, 4:46pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.