using tibble with several models

supermarco · April 17, 2025, 2:20pm

Hello

I was looking to organise data with tiblle but got mistakes. Please any help?
I give an example with a data set with inputs and one ouput: i want as in the table to arrange the output according
to the inputs for each model to ease my interpretation.

I did something like: DF %>% group_by(Models) %>% filter(any(0< var_M < 20%), any(20%< var_M < 20%)) etc. but yet no reslts.

Thanks a lot.

what i have is below dataset

Models	input_M	input_T	output_X

m1	20%	0,1	50
m1	25%	0,15	60
m1	80%	0,2	40
m2	75%	0,8	35
m2	15%	1	25
m3	55%	1,2	18
m4	65%	1,5	85
m4	45%	0,6	125
m4	35%	0,8	10
m4	15%	0,6	1
m5	10%	0,1	50
m5	120%	0,9	20

what i want is below table with output averages

		var_M
var_T	[0-20%]	[>20% - < 80%]	[80%-100%]

[0,1 - 0,5]

[0,5 - 1]

[>1]

AlexisW · April 17, 2025, 7:40pm

Not sure what you mean with"output averages". Here is with the counts.

First, let's read the data:

library(tidyverse)


data <- read.table(text = "Models   input_M     input_T     output_X
m1  20%     0,1     50
m1  25%     0,15    60
m1  80%     0,2     40
m2  75%     0,8     35
m2  15%     1   25
m3  55%     1,2     18
m4  65%     1,5     85
m4  45%     0,6     125
m4  35%     0,8     10
m4  15%     0,6     1
m5  10%     0,1     50
m5  120%    0,9     20",
           header = TRUE,dec = ",")


data
#>    Models input_M input_T output_X
#> 1      m1     20%    0.10       50
#> 2      m1     25%    0.15       60
#> 3      m1     80%    0.20       40
#> 4      m2     75%    0.80       35
#> 5      m2     15%    1.00       25
#> 6      m3     55%    1.20       18
#> 7      m4     65%    1.50       85
#> 8      m4     45%    0.60      125
#> 9      m4     35%    0.80       10
#> 10     m4     15%    0.60        1
#> 11     m5     10%    0.10       50
#> 12     m5    120%    0.90       20

Then, we need input_M to be numeric, right now it's text because of the "%" character.

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric()
)

We can then create "bins", categories based on the values of M:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)))
#>    Models input_M input_T output_X input_M_numeric  M_binned
#> 1      m1     20%    0.10       50              20    (0,20]
#> 2      m1     25%    0.15       60              25   (20,80]
#> 3      m1     80%    0.20       40              80   (20,80]
#> 4      m2     75%    0.80       35              75   (20,80]
#> 5      m2     15%    1.00       25              15    (0,20]
#> 6      m3     55%    1.20       18              55   (20,80]
#> 7      m4     65%    1.50       85              65   (20,80]
#> 8      m4     45%    0.60      125              45   (20,80]
#> 9      m4     35%    0.80       10              35   (20,80]
#> 10     m4     15%    0.60        1              15    (0,20]
#> 11     m5     10%    0.10       50              10    (0,20]
#> 12     m5    120%    0.90       20             120 (100,Inf]

You can rename the bins with the labels argument, see ?cut.

We also need T to be binned, we can do it the same way:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  )

Now we can count the number of observations in each combination of bins of M and T. This can be replaced with summarize() with appropriate groups if you want a mean or something else:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  count(M_binned, T_binned)
#>    M_binned  T_binned n
#> 1    (0,20]   (0,0.1] 2
#> 2    (0,20]   (0.5,1] 2
#> 3   (20,80] (0.1,0.5] 2
#> 4   (20,80]   (0.5,1] 3
#> 5   (20,80]   (1,Inf] 2
#> 6 (100,Inf]   (0.5,1] 1

And finally to get a table with the same format/shape as in your post, we can pivot it to be wider.

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  count(M_binned, T_binned) |>
  pivot_wider(id_cols = T_binned,
              names_from = "M_binned",
              values_from = "n")

A problem here, is that some combinations of bins have no counts, and will end up with NA. But in our case we know NA means a count of 0. So we will use values_fill:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  count(M_binned, T_binned) |>
  pivot_wider(id_cols = T_binned,
              names_from = "M_binned",
              values_from = "n",
              values_fill = 0)
#> # A tibble: 4 × 4
#>   T_binned  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <int>     <int>       <int>
#> 1 (0,0.1]          2         0           0
#> 2 (0.5,1]          2         3           1
#> 3 (0.1,0.5]        0         2           0
#> 4 (1,Inf]          0         2           0

supermarco · April 18, 2025, 12:47pm

Hello,
Thanks a lot for responding.
I still wonder if it is the output i was looking for. Imagine i want the output table
for, say model "m1": how do i get it from your final code where i don't see any model name?
Remember the output for each model should look like (var_T in left column, var_M in top row) and output_X fill the blanks in below table for each model.:

               var_M	  [0-20%]	[>20% - < 80%]	[80%-100%]

var_T
[0,1 - 0,5]
[0,5 - 1]
[>1]

AlexisW · April 18, 2025, 3:28pm

I don't understand what you want.

If the goal is an averaged output by bins across models, then you can simply replace count() by summarize(), for example:

summarize(mean_X = mean(output_X),
          .by = c(M_binned, T_binned))

but if you want an output by model, I don't know what it is you're trying to get. Can you build it by hand for your example data?

supermarco · April 18, 2025, 5:00pm

Thank you for your response. I'm sorry i was not clear. Of course i can come back with a small real example (below is just a small fake example just to figure out).
What I was trying to say is to summarize the results of a test by averaging the output for each model (i), distilled between variable M (per interval) et variable T (per interval).

Model1 var_M [0-20%] [>20% - < 80%] [80%-100%]
var_T
[0,1 - 0,5] .53 0.47 0.34
[0,5 - 1] 1.1 1.4 1.9
[>1] 2.2 3.2 3.9

AlexisW · April 18, 2025, 5:27pm

Can you explain how you obtained these values (0.53, 0.47, ...)?

supermarco · April 19, 2025, 10:04am

Hello, sorry i wrote fake figures just to show an example but it was confusing so please
find below tables i did on excel to show from the 'data' base which type of output i would like to build. Many thanks.

supermarco · April 19, 2025, 10:04am

Models	input_M	input_T	output_X
m1	20%	0,1	50
m1	25%	0,15	60
m1	80%	0,2	40
m2	75%	0,8	35
m2	15%	1	25
m3	55%	1,2	18
m4	65%	1,5	85
m4	45%	0,6	125
m4	35%	0,8	10
m4	15%	0,6	1
m5	10%	0,1	50
m5	120%	0,9	20

supermarco · April 19, 2025, 10:04am

	m1	M
T		0-20%	>20 <80%	>80 < 120%
	0.1-0.5	50	60	40
	0.5-1	0	0	0
	>1	0	0	0

	m2	M
T		0-20%	>20 <80%	>80 < 120%
	0.1-0.5	0	0	0
	0.5-1	25	35	0
	>1	0	0	0

	m3	M
T		0-20%	>20 <80%	>80 < 120%
	0.1-0.5	0	0	0
	0.5-1	0	0	0
	>1	0	18	0

	m4	M
T		0-20%	>20 <80%	>80 < 120%
	0.1-0.5	0	0	0
	0.5-1	1	average(10;125)	0
	>1	0	85	0

	m5	M
T		0-20%	>20 <80%	>80 < 120%
	0.1-0.5	50	0	0
	0.5-1	0	0	20
	>1	0	0	0

supermarco · April 19, 2025, 10:06am

Above are the output tables for each model (if i did not a mistake) and finally a table that average all above tables (m1,m2,m3,m4,m5) that i did not report. Many thanks.

AlexisW · April 21, 2025, 2:54pm

OK, so you want to obtain a list of tables, one for each model. There are several ways that can work: we can fist split by model, then average; or first average, then split. Let's start with splitting:

split(data, data$Models)
#> $m1
#>   Models input_M input_T output_X
#> 1     m1     20%    0.10       50
#> 2     m1     25%    0.15       60
#> 3     m1     80%    0.20       40
#> 
#> $m2
#>   Models input_M input_T output_X
#> 4     m2     75%     0.8       35
#> 5     m2     15%     1.0       25
#> 
#> $m3
#>   Models input_M input_T output_X
#> 6     m3     55%     1.2       18
#> 
#> $m4
#>    Models input_M input_T output_X
#> 7      m4     65%     1.5       85
#> 8      m4     45%     0.6      125
#> 9      m4     35%     0.8       10
#> 10     m4     15%     0.6        1
#> 
#> $m5
#>    Models input_M input_T output_X
#> 11     m5     10%     0.1       50
#> 12     m5    120%     0.9       20

Then we can apply the previous workflow to each element of the list:

split(data, data$Models) |>
  map(
    ~ .x |>
      mutate(
        input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
        M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
        T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
      ) |>
      summarize(mean_X = mean(output_X),
                .by = c(M_binned, T_binned)) |>
      pivot_wider(id_cols = T_binned,
                  names_from = "M_binned",
                  values_from = "mean_X")
  )

If you run that, you will see a problem: the rows and columns of each Model table only have the values of M and T for this Model. If we want to ensure we get all existing combinations, we first need to transform M and T to factors, so R is aware of all possible levels, and then use complete() to fill in the combinations of M and T not present in a particular model:

split(data_factors, data_factors$Models) |>
  map(~ .x |> complete(M_factor, T_factor))

So we can now build the averaged data.frames and pivot them to wider format:

split(data_factors, data_factors$Models) |>
  map(
    ~ .x |>
      complete(M_factor, T_factor) |>
      summarize(mean_X = mean(output_X),
                .by = c(M_factor, T_factor)) |>
      pivot_wider(id_cols = T_factor,
                  names_from = "M_factor",
                  values_from = "mean_X")
  )
#> $m1
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        50          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m2
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         25        35          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m3
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        18          NA
#> 
#> $m4
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA      NA            NA
#> 2 (0.1,0.5]       NA      NA            NA
#> 3 (0.5,1]          1      67.5          NA
#> 4 (1,Inf]         NA      85            NA
#> 
#> $m5
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          20
#> 4 (1,Inf]         NA        NA          NA

^{Created on 2025-04-21 with reprex v2.1.1}

supermarco · April 22, 2025, 11:24am

Thanks a lot ! I was not able to do such a code. Please, where is the data_factors in your code? I guess you did as.factor (data$M)?

supermarco · April 22, 2025, 12:58pm

By the way, i have a small question please?
Imagine I want to sum input_M with input_T with one lag using your piece of code, how to handle?
mutate(
input_MT _n = input_M_n + input_T_n-1
I ask this question because in my code i have a variable that is computed with lagged variables (like T). Thanks a lot,

AlexisW · April 22, 2025, 1:35pm

Yes, I didn't repeat the code in the previous posts to keep it compact, but here is the full reprex:

library(tidyverse)


data <- read.table(text = "Models   input_M     input_T     output_X
m1  20%     0,1     50
m1  25%     0,15    60
m1  80%     0,2     40
m2  75%     0,8     35
m2  15%     1   25
m3  55%     1,2     18
m4  65%     1,5     85
m4  45%     0,6     125
m4  35%     0,8     10
m4  15%     0,6     1
m5  10%     0,1     50
m5  120%    0,9     20",
           header = TRUE,dec = ",")


# data

data_factors <- data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  mutate(M_factor = factor(M_binned),
         T_factor = factor(T_binned)) |>
  select(Models, M_factor, T_factor, output_X)

split(data_factors, data_factors$Models) |>
  map(
    ~ .x |>
      complete(M_factor, T_factor) |>
      summarize(mean_X = mean(output_X),
                .by = c(M_factor, T_factor)) |>
      pivot_wider(id_cols = T_factor,
                  names_from = "M_factor",
                  values_from = "mean_X")
  )
#> $m1
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        50          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m2
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         25        35          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m3
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        18          NA
#> 
#> $m4
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA      NA            NA
#> 2 (0.1,0.5]       NA      NA            NA
#> 3 (0.5,1]          1      67.5          NA
#> 4 (1,Inf]         NA      85            NA
#> 
#> $m5
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          20
#> 4 (1,Inf]         NA        NA          NA

^{Created on 2025-04-22 with reprex v2.1.1}

AlexisW · April 22, 2025, 1:39pm

I would use dplyr::lag(). For example:

library(tidyverse)


data <- read.table(text = "Models   input_M     input_T     output_X
m1  20%     0,1     50
m1  25%     0,15    60
m1  80%     0,2     40
m2  75%     0,8     35
m2  15%     1   25
m3  55%     1,2     18
m4  65%     1,5     85
m4  45%     0,6     125
m4  35%     0,8     10
m4  15%     0,6     1
m5  10%     0,1     50
m5  120%    0,9     20",
           header = TRUE,dec = ",")


# data

data |>
  mutate(
    input_M_percent = input_M |> str_remove("%") |> as.numeric(),
    input_M_proportions = input_M_percent / 100,
    input_MT = input_M_proportions + lag(input_T)
  )
#>    Models input_M input_T output_X input_M_percent input_M_proportions input_MT
#> 1      m1     20%    0.10       50              20                0.20       NA
#> 2      m1     25%    0.15       60              25                0.25     0.35
#> 3      m1     80%    0.20       40              80                0.80     0.95
#> 4      m2     75%    0.80       35              75                0.75     0.95
#> 5      m2     15%    1.00       25              15                0.15     0.95
#> 6      m3     55%    1.20       18              55                0.55     1.55
#> 7      m4     65%    1.50       85              65                0.65     1.85
#> 8      m4     45%    0.60      125              45                0.45     1.95
#> 9      m4     35%    0.80       10              35                0.35     0.95
#> 10     m4     15%    0.60        1              15                0.15     0.95
#> 11     m5     10%    0.10       50              10                0.10     0.70
#> 12     m5    120%    0.90       20             120                1.20     1.30

^{Created on 2025-04-22 with reprex v2.1.1}

Indeed on row 2, input_MT = .25 + .10 = .35

You can make it even more explicit:

data |>
  mutate(
    input_M_percent = input_M |> str_remove("%") |> as.numeric(),
    input_M_proportions = input_M_percent / 100,
    lag_T = lag(input_T),
    input_MT = input_M_proportions + lag_T
  )

supermarco · April 22, 2025, 2:20pm

Thank you very much, please let me try the code and come back.

supermarco · April 23, 2025, 12:26pm

Thanks a lot. It looks good.

supermarco · April 23, 2025, 12:26pm

Thank you very much. Really interesting and simplied view.
Please, two small responses:
-If we look simply at m1 model i did on excel...
T 0-20% >20 <80% >80 < 120%
0.1-0.5 50 60 40
0.5-1 0 0 0

1 0 0 0
and what appear from the code, it looks different:
T_factor (0,20] (20,80] (100,Inf]
#>
#> 1 (0,0.1] 50 NA NA
#> 2 (0.1,0.5] NA 50 NA
#> 3 (0.5,1] NA NA NA
#> 4 (1,Inf] NA NA NA
I think it could be linked to the cut off function but not sure.

-I wrote that "finally a table that average all above tables (m1,m2,m3,m4,m5) " which mean a global table that summarize the five models. I tried but could not find a simple way.
Many thanks.

AlexisW · April 23, 2025, 1:54pm

Yes, here the interval is (0,0.1], meaning x > 0 && x <= 0.1. If you want instead x < 0.1 you can check the right = TRUE argument of cut(), or replace cut() with case_when() to manually control the categories.

Isn't it what the code I gave previously does? Or do you mean something different?

AlexisW:

If the goal is an averaged output by bins across models, then you can simply replace count() by summarize(), for example:
summarize(mean_X = mean(output_X),
          .by = c(M_binned, T_binned))

supermarco · April 23, 2025, 3:59pm

Thanks a lot.
-I ll see about controlling the cut because i did not see a perfect match between my excel and the code.
-Yes, you may be right, the goal is to average across the models as an final test (but what you done was perfect, i was looking after each model).
Please let me try and come back.