How transform "<X" values to "X/2" in a dataframe with several variables

DavideM · December 21, 2020, 10:22am

Hi all,

I am dealing with a dataframe containing several variables and some of them are non-mesurable and reported as "<X" - for example <10.
How can I convert <X values in (X/2)?

I tried the following, using pacman + rio + tidyverse, but it is very long and seems to give some issues:

df2 <- df %>% #CREATE NEW DATA FRAME WITH NO NON-DETECTS
mutate(Benzene = ifelse(str_detect(
string = Benzene,
pattern = "<"
),
parse_number(gsub(
pattern = "<",
replacement = "", x = Benzene
)) / 2, parse_number(Benzene)
)) %>%
mutate(C6_C10 = ifelse(str_detect( #SUBSTITUTE C6-C10
string = C6_C10,
pattern = "<"
),
parse_number(gsub(
pattern = "<",
replacement = "", x = C6_C10
)) / 2, parse_number(C6_C10)
)) %>% #AND SO ON FOR ALL VARIABLES, WHICH ARE 20

Regards,

AlexisW · December 21, 2020, 8:02pm

The approach seems valid overall, but can be simplified since parse_number() already removes any non-string characters.

To apply it to many columns, you can simply put the operations in a function that you call with across().

You don't provide data to test, I would expect something like that to work:


library(tidyverse)
df <- data.frame(col1=c(1,4,7,2,"<56",8,3,8,"<10",7,2),
                 col2=c(65,"<7",1,5,"<99",3,6,8,876,"<34",1))

process_numbers <- function(x){
  if_else(str_detect(x, "^<"),
          parse_number(x)/2,
          parse_number(x))
}

df %>%
  mutate(across(col1:col2, process_numbers))
#>    col1  col2
#> 1     1  65.0
#> 2     4   3.5
#> 3     7   1.0
#> 4     2   5.0
#> 5    28  49.5
#> 6     8   3.0
#> 7     3   6.0
#> 8     8   8.0
#> 9     5 876.0
#> 10    7  17.0
#> 11    2   1.0

^{Created on 2020-12-21 by the reprex package (v0.3.0)}

Of course you can consider using round() or floor() to have only integer results.

DavideM · December 25, 2020, 6:54am

Hi AlexisW,

Thank you so much for answering. It is appreciated very much.
I have tried your code many times, but I cannot make it work. This is the error I get:

Error: Problem with mutate() input ..1.
x is.character(x) is not TRUE
i Input ..1 is across(C6_C10:Sulphate, process_numbers).
Run rlang::last_error() to see where the error occurred.

Note that columns in my "df", , which are 20, are named C6_C10 throught to Sulphate.

nirgrahamuk · December 25, 2020, 1:30pm

are they all character columns or a mix of character columns with the artifacts your describe, but some purely numeric with no such artifacts ? I think its likely a matter of identifying the variables to process (and which to not), or to add conditional tests into the process function, so it can work with either (i.e. the full set)

DavideM · December 29, 2020, 10:24am

They are all value, comprised between 0.01 to 11,000. Some data point were non-detects and reported as less-than a certain value, for example <2 or <0.01.
However, some columns are "dbl" while others are "chr".

I tried this:

INSTALL AND LOAD PACKAGES

pacman::p_load(pacman, rio, tidyverse)

LOAD AND PREPARE DATA

df <- import("data/Dubbo_One_event.xlsx") %>%
as_tibble() %>%
mutate(Well_ID = as.factor(Well_ID)) %>%
rename(y = Well_ID) %>%
print()

#CONVERT CHARACTERS INTO NUMBERS

df <- lapply(df, function(x) as.numeric(as.character(x)))
warnings()
sapply(df, class)

#CONVERT 'LESS-THAN X' INTO (X/2)

library(tidyverse)

process_numbers <- function(x){
if_else(str_detect(x, "<"),
parse_number(x)/2,
parse_number(x))
}

df %>%
mutate(across(C6_C10:Sulphate, process_numbers))

IT STILL GIVES AN ERROR:

rlang::last_error()
<error/dplyr:::mutate_error>
Problem with mutate() input ..1.
x is.character(x) is not TRUE
i Input ..1 is across(C6_C10:Sulphate, process_numbers).
Backtrace:
Run rlang::last_trace() to see the full context.

nirgrahamuk · December 29, 2020, 10:48am

process_numbers <- function(x){
if(is.character(x){
if_else(str_detect(x, "<"),
parse_number(x)/2,
parse_number(x))} else{x}
}

MyKo101 · December 29, 2020, 12:40pm

As an alternative to changing process_numbers(), the across() function can be used in conjunction with where():

process_numbers <- function(x){
  if_else(str_detect(x, "<"),
  parse_number(x)/2,
  parse_number(x))
}

df %>%
  mutate(
    across(
      C6_C10:Sulphate & where(is.character),
      process_numbers
    )
  )

I add this as an alternative, because in this example, you have defined process_numbers() yourself, sometimes we may wish to apply a function in this way that we don't have control over and using where() can add that additional caveat.

DavideM · January 2, 2021, 9:47am

Hi guys,
thanks for all your answers. It still does not work - it seems hard to conver characters '<X" into X/2. I thought with R it would have been easy but I might do it using excel.

I still got:

Error: Problem with mutate() input ..1.
x object 'process_numbers' not found
i Input ..1 is across(C6_C10:Sulphate & where(is.character), process_numbers).
Run rlang::last_error() to see where the error occurred.

nirgrahamuk · January 2, 2021, 9:50am

This means that in your session, you have not run code that defines the process_numbers function.

system · February 10, 2021, 3:08pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.