list out empty variable from data frame

str_guru · September 9, 2020, 10:43am

I have a sample data frame like the one below; my original data can have many columns.

DD	AA	CC	BB	CC
				
1				
				
		1		
				
1				
				
				
				
1

Now I am looking for an algorithm that lists the names of the blank columns in my data frame.
So, according to the data frame above, the output should look like:

Blank_cols
AA
BB
CC

pieterjanvc · September 9, 2020, 11:08am

Hi,

Welcome to the RStudio community!

Here is one example of how to do this:

set.seed(4000)

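# Generate data: a 200 x 20 character matrix of empty strings
myData = matrix("", nrow = 200, ncol = 20)
# Put a value in 10 random cells (the 1 is stored as the text "1")
myData[sample(1:4000, 10)] = 1
myData = as.data.frame(myData)

# For each column: TRUE if it contains at least one "1", FALSE if it is empty
result = apply(myData, 2, function(myCol){
  sum(myCol == "1") > 0
})

# Names of the columns that contain no value
names(result)[!result]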
#>  [1] "V1"  "V3"  "V6"  "V8"  "V9"  "V10" "V12" "V14" "V15" "V16" "V17"

Created on 2020-09-09 by the reprex package (v0.3.0)

I assumed that the empty values in your data frame were in text format "", as I didn't see any NA or 0 in your sample. You can change this in the sum(myCol == "1") > 0 part if needed.

The trick is to apply a function to each column and check whether that column contains the value of interest (myCol == "1"). The comparison returns TRUE for every cell that matches. TRUE counts as 1, so if we sum them up and the column total is 0, the column contains no values of interest.
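As a minimal sketch of that logic on a small hand-made data frame (the name toy and its values are made up for illustration, loosely following the example in the question, and blanks are assumed to be empty strings):

```r
# Hypothetical toy data: blanks stored as empty strings, values stored as text
toy <- data.frame(
  DD = c("1", "", "1"),
  AA = c("", "", ""),
  BB = c("", "", ""),
  stringsAsFactors = FALSE
)

# myCol == "1" gives one TRUE/FALSE per row; sum() counts the TRUEs
has_value <- apply(toy, 2, function(myCol) sum(myCol == "1") > 0)

# Keep the names of the columns where no "1" was found
blank_cols <- names(has_value)[!has_value]
print(blank_cols)
```

Here DD contains a "1", so only AA and BB end up in blank_cols.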

Hope this helps,
PJ

str_guru · September 9, 2020, 11:11am

Thanks for the solution. The explanation is appreciated, as I am new to R.
Thanks again.

str_guru · September 9, 2020, 11:14am

What is set.seed?
I mean, how does it work and what is it used for?

pieterjanvc · September 9, 2020, 11:17am

Hi,

That is solely for reproducibility of my results. My code generates a random data frame, so if you ran it without a seed your results would differ from mine (this does not matter for your real data, of course). To prevent that, I set the seed: the random number generator is pseudo-random and will produce the same stream of numbers whenever it starts from the same seed.

You can safely remove this part of course, but it's always handy to use if you are generating stuff at random and want someone to exactly reproduce what you did.

The seed value (in my case 4000) is just a number you choose. The same number always produces the same sequence of random values, while different numbers produce different sequences with no clear pattern between them.
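A small sketch of what that means in practice, reusing the sample() call from the earlier code (the second seed, 4001, is an arbitrary choice for illustration):

```r
# Same seed, same "random" draw
set.seed(4000)
first_draw <- sample(1:4000, 10)

set.seed(4000)
second_draw <- sample(1:4000, 10)

# TRUE: resetting the seed restarts the same pseudo-random stream
identical(first_draw, second_draw)

# A different seed starts a different stream
set.seed(4001)
third_draw <- sample(1:4000, 10)
```

Without the second set.seed(4000) call, second_draw would simply continue the stream and give different numbers.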

PJ

str_guru · September 9, 2020, 11:30am

Thanks. For my own knowledge: if my data frame had NAs, what would the solution be?

pieterjanvc · September 9, 2020, 11:37am

Hi,

In that case sum(myCol == "1") > 0 would become sum(!is.na(myCol)) > 0.

The output of this line is TRUE if a value is found and FALSE if not. In the NA version, is.na() checks whether a value is NA, and the '!' inverts the answer, so every non-NA value counts as 1.
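For illustration, here is a sketch of the NA version on a small made-up data frame (toy_na and its values are hypothetical, not the real data):

```r
# Hypothetical toy data with NA marking the missing values
toy_na <- data.frame(
  DD = c(1, NA, 1),
  AA = c(NA, NA, NA),
  BB = c(NA, 1, NA)
)

# !is.na(myCol) is TRUE for every non-missing cell; sum() counts them
has_value <- apply(toy_na, 2, function(myCol) sum(!is.na(myCol)) > 0)

# Columns that contain only NA
blank_cols_na <- names(has_value)[!has_value]
print(blank_cols_na)
```

Only AA is entirely NA here, so it is the only name returned.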

Hope this helps,
PJ

str_guru · September 9, 2020, 11:45am

Thanks a lot, one last question:
Why do we use "1" in myCol == "1" here? Does it denote blank?

pieterjanvc · September 9, 2020, 11:55am

Hi,

I don't understand that question.
The "1" refers to the value you're looking for in a column. The fact that I wrote it as a string is because your original example suggested that empty values were not NA but blank text, so that would mean that numbers would also be stored as text, hence "1" instead of 1.

If the values are numbers and NAs, then you'll have to use the sum(!is.na(myCol)) > 0 part, because NA is tricky: NA == 1 returns NA instead of FALSE. Alternatively, some functions such as sum() have a built-in way of handling this, for example sum(myCol == 1, na.rm = TRUE).
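A quick sketch of that behaviour in the console (the vector x is made up for illustration):

```r
# Comparing NA with anything gives NA, not FALSE
NA == 1

# Hypothetical numeric column with one missing value
x <- c(1, NA, 2, 1)

x == 1                      # TRUE NA FALSE TRUE
sum(x == 1)                 # NA, because one of the comparisons is NA
sum(x == 1, na.rm = TRUE)   # 2, the NA comparison is dropped before summing
```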

Please elaborate if your question was not answered.

PJ
