Is there a way to create a numeric variable from a numeric set of measurements with different ranges?

PatriciaNunes · November 10, 2021, 8:22pm

I am working with a dataset that contains limnological parameters. One of them is turbidity-related measures in different units and consequently different ranges. Is there a way to create a numerical variable called "Turbidity" that comprises all the measurements in different units? I want to use this variable in a linear regression to assess its relationship with plankton parameters.

I can not just use the measures together in different units because it will create a range that goes from 0.1 to 1000 in absolute numbers and some turbidity units are high at 10. I also don´t want to use them separately because each has only a few measurements, so I´m trying to make my dataset more robust.

Reprex

values	unit	abundance
0.5	x	500
10	x	30
50	y	50
100	y	100
30	z	20
60	z	60
500	z	80

GreyMerchant · November 10, 2021, 8:50pm

If you can tell us what the turbidity value should be for the various rows in your example we can put a formula or piece of code together which will do that conversion for you while taking into consideration the specific constraints you specified.

PatriciaNunes · November 10, 2021, 8:56pm

I am not sure if I understand your request. The "values" column are my measurements in X , Y or Z unit. All of the units correspond to turbidity just as american dollar, euro, pound sterling and Yen correspond to "currency".

Hope it helped.

GreyMerchant · November 10, 2021, 8:59pm

I get that values is the measured value. I need to know how to approach unit and how to use abundance to do the conversion correctly since as you say it is like a dollar, euro, pound and they have varying strengths with each other.

PatriciaNunes · November 10, 2021, 9:15pm

I don´t actually have a numerical kind of approach of the variables´ relationship with abundance, that´s actually my goal after I get to summarize all the values into one variable. What I know is that different units have wide varying ranges but they influence abundance the same way (higher unit = lower abundance).

I thought that was a simple R command that could do this like standardize or similar.

GreyMerchant · November 10, 2021, 9:16pm

There are a lot of normalisations and standardisations we can perform but they mean nothing without proper context for your data. That is sort of the problem here...

PatriciaNunes · November 10, 2021, 9:20pm

Is there a comparable example of the possibilities? A standardisation sounds like something that would help. I think I am not being able to provide the context because I don´t know what has to be explained.

GreyMerchant · November 10, 2021, 9:26pm

I just know that a row corresponds with something but how column 1 corresponds to column 2 I don't know. Essentially you're saying there is data here and you have some issue to rescale/adjust it but there is no way for us to know really how or in what way to perform it...

Have a look here: https://www.datanovia.com/en/blog/how-to-normalize-and-standardize-data-in-r-for-great-heatmap-visualization/ and here: https://www.pluralsight.com/guides/normalizing-data-r

More in depth look at: https://cran.r-project.org/web/packages/standardize/standardize.pdf / https://github.com/CDEager/standardize and Using the bestNormalize Package

PatriciaNunes · November 10, 2021, 9:39pm

I´ll try to give an example based on this dataset:

  Dependents     	Income     	Loan_amount   	Term_months  
3 Min.   :0.0000   Min.   : 173200   Min.   :  18600   Min.   : 36.0 
4 1st Qu.:0.0000   1st Qu.: 389550   1st Qu.:  61500   1st Qu.:384.0 
5 Median :0.0000   Median : 513050   Median :  76500   Median :384.0 
6 Mean   :0.7561   Mean   : 715589   Mean   : 333702   Mean   :365.5 
7 3rd Qu.:1.0000   3rd Qu.: 774800   3rd Qu.: 136250   3rd Qu.:384.0 
8 Max.   :6.0000   Max.   :8444900   Max.   :7780000   Max.   :504.0 
9 
10approval_status   	Age      
11 Min.   :0.0000   Min.   :22.00 
12 1st Qu.:0.0000   1st Qu.:37.00 
13 Median :1.0000   Median :51.00 
14 Mean   :0.6955   Mean   :49.71 
15 3rd Qu.:1.0000   3rd Qu.:61.75 
16 Max.   :1.0000   Max.   :76.00

My case is like all "Dependents", "Income" and "Age" data were stacked at the same column (value) and beside it there was a column with rows "Dependents" "Dependents" "Dependents" "Income" "Income" "Income" "Age""Age" "Age" stacked according to the data it relates.
The values have different ranges but in my example they refer to a wider classification, which is the variable I´m trying to create.

I need to standardize this column indicating to the function that "variables" have different ranges.

Is that even possible??

Hope it is easier to comprehend now

GreyMerchant · November 10, 2021, 9:54pm

It is definitely possible. You can coerce any structure in a specific way and adjust it. This is just overall not what most would consider a "tidy" data format as you have this sort of weird overlap or columns don't seem to be truly independent.

To be honest, it sounds like someone https://discourse.datamethods.org here might be able to tell you how you need to transform the data and then come back to here when you have more clarity on how it should look. If we have A your data and we know what B the output should be we can help put together that process but we need to know exactly what to do. Sort of doesn't really cut it here because it is a specific field and there is some nuance there (there might be a good reason on this shape and I wouldn't know).

PatriciaNunes · November 10, 2021, 9:56pm

Ok, I am looking for further help. Thank you so much!

mdzorn · November 18, 2021, 2:33am

I believe we have been hung up on the different units and were looking for a way to convert them into one unit.

But looking at your example with Dependents, etc. I think you may just be looking to split the data set into separate columns for each unit. You can do that with pivot_wider (in tidyr) (or a spread function available in other packages). Then once you have the data separated you can plot them and figure out the relationship between different units (as you stated as goal).

Old_FORTRAN_Dog · November 19, 2021, 7:34pm

Hello @PatriciaNunes,

I believe @GreyMerchant is asking about what your expected ranges (especially the maximum values) are for each of your turbidity units. Then you would just normalize to each of those values. Below is an example bit of code to do that, but instead of a known maximum, it normalized to the maximum observed value for each unit type. Is that helpful? If you actually know your maximum and they are not in the observed data, e.g., x = 15 is as turbid as y = 200 and z = 1000, then you would just normalize by those values instead.

Cheers,
Dave

library(tidyverse)

df <- tribble(~values,  ~unit,  ~abundance,
0.5,    "x",    500,
10, "x",    30,
50, "y",    50,
100,    "y",    100,
30, "z",    20,
60, "z",    60,
500,    "z",    80
)

df_rescaled <- df %>% 
  group_by(unit) %>% 
  mutate(rescaled_values = values/max(values)) %>% 
  ungroup()

df_rescaled
#> # A tibble: 7 x 4
#>   values unit  abundance rescaled_values
#>    <dbl> <chr>     <dbl>           <dbl>
#> 1    0.5 x           500            0.05
#> 2   10   x            30            1   
#> 3   50   y            50            0.5 
#> 4  100   y           100            1   
#> 5   30   z            20            0.06
#> 6   60   z            60            0.12
#> 7  500   z            80            1

Created on 2021-11-19 by the reprex package (v2.0.1)

PatriciaNunes · November 19, 2021, 7:50pm

Thank you very much!

Old_FORTRAN_Dog · November 19, 2021, 9:17pm

Glad that helped. When I put on my phyto scientist hat, instead of my R-coder hat, I have one more related observation. The validity of the previous approach relies on you measuring the same relative maximum amount of turbidity with each of your instruments. That may, or may not, be a good assumption, depending on where and how they were deployed. If that is a concern, I suggest instead of the above approach, you check each instrument for its maximum value, or the value it get when reading a very turbid standard, and normalize to those readings. Then I think your use of the normalized turbidity readings in subsequent analyses would be more valid.

Below is another bit of code to use predefined values that you get from the maximum scale or maximum turbidity standard. I've used the example values I mentioned in my previous post. Note that if you want, you can drop the max_scale variable by piping the dataframe through select(-max_scale). That might give you a cleaner data set.

library(tidyverse)

df <- tribble(~values,  ~unit,  ~abundance,
0.5,    "x",    500,
10, "x",    30,
50, "y",    50,
100,    "y",    100,
30, "z",    20,
60, "z",    60,
500,    "z",    80
)
scale_max <- c(15, 200, 1000) # Maximum possible in units of x, y and z, respectively

df_rescaled2 <- df %>% 
  mutate(max_scale = case_when(
    unit == "x" ~ scale_max[1],
    unit == "y" ~ scale_max[2],
    unit == "z" ~ scale_max[3] ),
    rescaled_values = values/max_scale)

df_rescaled2
#> # A tibble: 7 x 5
#>   values unit  abundance max_scale rescaled_values
#>    <dbl> <chr>     <dbl>     <dbl>           <dbl>
#> 1    0.5 x           500        15          0.0333
#> 2   10   x            30        15          0.667 
#> 3   50   y            50       200          0.25  
#> 4  100   y           100       200          0.5   
#> 5   30   z            20      1000          0.03  
#> 6   60   z            60      1000          0.06  
#> 7  500   z            80      1000          0.5

Created on 2021-11-19 by the reprex package (v2.0.1)

PatriciaNunes · November 19, 2021, 9:35pm

Ok, I see it. Nice observation, certainly very helpful. Thanks again!

system · November 26, 2021, 9:35pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.