using tibble with several models

Hello

I was looking to organise data with tiblle but got mistakes. Please any help?
I give an example with a data set with inputs and one ouput: i want as in the table to arrange the output according
to the inputs for each model to ease my interpretation.

I did something like: DF %>% group_by(Models) %>% filter(any(0< var_M < 20%), any(20%< var_M < 20%)) etc. but yet no reslts.

Thanks a lot.

  • what i have is below dataset
Models input_M input_T output_X
m1 20% 0,1 50
m1 25% 0,15 60
m1 80% 0,2 40
m2 75% 0,8 35
m2 15% 1 25
m3 55% 1,2 18
m4 65% 1,5 85
m4 45% 0,6 125
m4 35% 0,8 10
m4 15% 0,6 1
m5 10% 0,1 50
m5 120% 0,9 20
  • what i want is below table with output averages
var_M
var_T [0-20%] [>20% - < 80%] [80%-100%]
[0,1 - 0,5]
[0,5 - 1]
[>1]

Not sure what you mean with"output averages". Here is with the counts.

First, let's read the data:

library(tidyverse)


data <- read.table(text = "Models   input_M     input_T     output_X
m1  20%     0,1     50
m1  25%     0,15    60
m1  80%     0,2     40
m2  75%     0,8     35
m2  15%     1   25
m3  55%     1,2     18
m4  65%     1,5     85
m4  45%     0,6     125
m4  35%     0,8     10
m4  15%     0,6     1
m5  10%     0,1     50
m5  120%    0,9     20",
           header = TRUE,dec = ",")


data
#>    Models input_M input_T output_X
#> 1      m1     20%    0.10       50
#> 2      m1     25%    0.15       60
#> 3      m1     80%    0.20       40
#> 4      m2     75%    0.80       35
#> 5      m2     15%    1.00       25
#> 6      m3     55%    1.20       18
#> 7      m4     65%    1.50       85
#> 8      m4     45%    0.60      125
#> 9      m4     35%    0.80       10
#> 10     m4     15%    0.60        1
#> 11     m5     10%    0.10       50
#> 12     m5    120%    0.90       20

Then, we need input_M to be numeric, right now it's text because of the "%" character.

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric()
)

We can then create "bins", categories based on the values of M:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)))
#>    Models input_M input_T output_X input_M_numeric  M_binned
#> 1      m1     20%    0.10       50              20    (0,20]
#> 2      m1     25%    0.15       60              25   (20,80]
#> 3      m1     80%    0.20       40              80   (20,80]
#> 4      m2     75%    0.80       35              75   (20,80]
#> 5      m2     15%    1.00       25              15    (0,20]
#> 6      m3     55%    1.20       18              55   (20,80]
#> 7      m4     65%    1.50       85              65   (20,80]
#> 8      m4     45%    0.60      125              45   (20,80]
#> 9      m4     35%    0.80       10              35   (20,80]
#> 10     m4     15%    0.60        1              15    (0,20]
#> 11     m5     10%    0.10       50              10    (0,20]
#> 12     m5    120%    0.90       20             120 (100,Inf]

You can rename the bins with the labels argument, see ?cut.

We also need T to be binned, we can do it the same way:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) 

Now we can count the number of observations in each combination of bins of M and T. This can be replaced with summarize() with appropriate groups if you want a mean or something else:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  count(M_binned, T_binned)
#>    M_binned  T_binned n
#> 1    (0,20]   (0,0.1] 2
#> 2    (0,20]   (0.5,1] 2
#> 3   (20,80] (0.1,0.5] 2
#> 4   (20,80]   (0.5,1] 3
#> 5   (20,80]   (1,Inf] 2
#> 6 (100,Inf]   (0.5,1] 1

And finally to get a table with the same format/shape as in your post, we can pivot it to be wider.

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  count(M_binned, T_binned) |>
  pivot_wider(id_cols = T_binned,
              names_from = "M_binned",
              values_from = "n")

A problem here, is that some combinations of bins have no counts, and will end up with NA. But in our case we know NA means a count of 0. So we will use values_fill:

data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  count(M_binned, T_binned) |>
  pivot_wider(id_cols = T_binned,
              names_from = "M_binned",
              values_from = "n",
              values_fill = 0)
#> # A tibble: 4 × 4
#>   T_binned  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <int>     <int>       <int>
#> 1 (0,0.1]          2         0           0
#> 2 (0.5,1]          2         3           1
#> 3 (0.1,0.5]        0         2           0
#> 4 (1,Inf]          0         2           0

Hello,
Thanks a lot for responding.
I still wonder if it is the output i was looking for. Imagine i want the output table
for, say model "m1": how do i get it from your final code where i don't see any model name?
Remember the output for each model should look like (var_T in left column, var_M in top row) and output_X fill the blanks in below table for each model.:

               var_M	  [0-20%]	[>20% - < 80%]	[80%-100%]

var_T
[0,1 - 0,5]
[0,5 - 1]
[>1]

I don't understand what you want.

If the goal is an averaged output by bins across models, then you can simply replace count() by summarize(), for example:

summarize(mean_X = mean(output_X),
          .by = c(M_binned, T_binned))

but if you want an output by model, I don't know what it is you're trying to get. Can you build it by hand for your example data?

Thank you for your response. I'm sorry i was not clear. Of course i can come back with a small real example (below is just a small fake example just to figure out).
What I was trying to say is to summarize the results of a test by averaging the output for each model (i), distilled between variable M (per interval) et variable T (per interval).

Model1 var_M [0-20%] [>20% - < 80%] [80%-100%]
var_T
[0,1 - 0,5] .53 0.47 0.34
[0,5 - 1] 1.1 1.4 1.9
[>1] 2.2 3.2 3.9

Can you explain how you obtained these values (0.53, 0.47, ...)?

Hello, sorry i wrote fake figures just to show an example but it was confusing so please
find below tables i did on excel to show from the 'data' base which type of output i would like to build. Many thanks.

Models input_M input_T output_X
m1 20% 0,1 50
m1 25% 0,15 60
m1 80% 0,2 40
m2 75% 0,8 35
m2 15% 1 25
m3 55% 1,2 18
m4 65% 1,5 85
m4 45% 0,6 125
m4 35% 0,8 10
m4 15% 0,6 1
m5 10% 0,1 50
m5 120% 0,9 20
m1 M
T 0-20% >20 <80% >80 < 120%
0.1-0.5 50 60 40
0.5-1 0 0 0
>1 0 0 0
m2 M
T 0-20% >20 <80% >80 < 120%
0.1-0.5 0 0 0
0.5-1 25 35 0
>1 0 0 0
m3 M
T 0-20% >20 <80% >80 < 120%
0.1-0.5 0 0 0
0.5-1 0 0 0
>1 0 18 0
m4 M
T 0-20% >20 <80% >80 < 120%
0.1-0.5 0 0 0
0.5-1 1 average(10;125) 0
>1 0 85 0
m5 M
T 0-20% >20 <80% >80 < 120%
0.1-0.5 50 0 0
0.5-1 0 0 20
>1 0 0 0

Above are the output tables for each model (if i did not a mistake) and finally a table that average all above tables (m1,m2,m3,m4,m5) that i did not report. Many thanks.

OK, so you want to obtain a list of tables, one for each model. There are several ways that can work: we can fist split by model, then average; or first average, then split. Let's start with splitting:

split(data, data$Models)
#> $m1
#>   Models input_M input_T output_X
#> 1     m1     20%    0.10       50
#> 2     m1     25%    0.15       60
#> 3     m1     80%    0.20       40
#> 
#> $m2
#>   Models input_M input_T output_X
#> 4     m2     75%     0.8       35
#> 5     m2     15%     1.0       25
#> 
#> $m3
#>   Models input_M input_T output_X
#> 6     m3     55%     1.2       18
#> 
#> $m4
#>    Models input_M input_T output_X
#> 7      m4     65%     1.5       85
#> 8      m4     45%     0.6      125
#> 9      m4     35%     0.8       10
#> 10     m4     15%     0.6        1
#> 
#> $m5
#>    Models input_M input_T output_X
#> 11     m5     10%     0.1       50
#> 12     m5    120%     0.9       20

Then we can apply the previous workflow to each element of the list:

split(data, data$Models) |>
  map(
    ~ .x |>
      mutate(
        input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
        M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
        T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
      ) |>
      summarize(mean_X = mean(output_X),
                .by = c(M_binned, T_binned)) |>
      pivot_wider(id_cols = T_binned,
                  names_from = "M_binned",
                  values_from = "mean_X")
  )

If you run that, you will see a problem: the rows and columns of each Model table only have the values of M and T for this Model. If we want to ensure we get all existing combinations, we first need to transform M and T to factors, so R is aware of all possible levels, and then use complete() to fill in the combinations of M and T not present in a particular model:

split(data_factors, data_factors$Models) |>
  map(~ .x |> complete(M_factor, T_factor))

So we can now build the averaged data.frames and pivot them to wider format:

split(data_factors, data_factors$Models) |>
  map(
    ~ .x |>
      complete(M_factor, T_factor) |>
      summarize(mean_X = mean(output_X),
                .by = c(M_factor, T_factor)) |>
      pivot_wider(id_cols = T_factor,
                  names_from = "M_factor",
                  values_from = "mean_X")
  )
#> $m1
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        50          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m2
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         25        35          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m3
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        18          NA
#> 
#> $m4
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA      NA            NA
#> 2 (0.1,0.5]       NA      NA            NA
#> 3 (0.5,1]          1      67.5          NA
#> 4 (1,Inf]         NA      85            NA
#> 
#> $m5
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          20
#> 4 (1,Inf]         NA        NA          NA

Created on 2025-04-21 with reprex v2.1.1

Thanks a lot ! I was not able to do such a code. Please, where is the data_factors in your code? I guess you did as.factor (data$M)?

By the way, i have a small question please?
Imagine I want to sum input_M with input_T with one lag using your piece of code, how to handle?
mutate(
input_MT _n = input_M_n + input_T_n-1
I ask this question because in my code i have a variable that is computed with lagged variables (like T). Thanks a lot,

Yes, I didn't repeat the code in the previous posts to keep it compact, but here is the full reprex:

library(tidyverse)


data <- read.table(text = "Models   input_M     input_T     output_X
m1  20%     0,1     50
m1  25%     0,15    60
m1  80%     0,2     40
m2  75%     0,8     35
m2  15%     1   25
m3  55%     1,2     18
m4  65%     1,5     85
m4  45%     0,6     125
m4  35%     0,8     10
m4  15%     0,6     1
m5  10%     0,1     50
m5  120%    0,9     20",
           header = TRUE,dec = ",")


# data

data_factors <- data |>
  mutate(
    input_M_numeric = input_M |> str_remove("%") |> as.numeric(),
    M_binned = cut(input_M_numeric, breaks = c(0,20,80, 100, Inf)),
    T_binned = cut(input_T, breaks = c(0, .1, .5, 1, Inf))
  ) |>
  mutate(M_factor = factor(M_binned),
         T_factor = factor(T_binned)) |>
  select(Models, M_factor, T_factor, output_X)

split(data_factors, data_factors$Models) |>
  map(
    ~ .x |>
      complete(M_factor, T_factor) |>
      summarize(mean_X = mean(output_X),
                .by = c(M_factor, T_factor)) |>
      pivot_wider(id_cols = T_factor,
                  names_from = "M_factor",
                  values_from = "mean_X")
  )
#> $m1
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        50          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m2
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         25        35          NA
#> 4 (1,Inf]         NA        NA          NA
#> 
#> $m3
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          NA
#> 4 (1,Inf]         NA        18          NA
#> 
#> $m4
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         NA      NA            NA
#> 2 (0.1,0.5]       NA      NA            NA
#> 3 (0.5,1]          1      67.5          NA
#> 4 (1,Inf]         NA      85            NA
#> 
#> $m5
#> # A tibble: 4 × 4
#>   T_factor  `(0,20]` `(20,80]` `(100,Inf]`
#>   <fct>        <dbl>     <dbl>       <dbl>
#> 1 (0,0.1]         50        NA          NA
#> 2 (0.1,0.5]       NA        NA          NA
#> 3 (0.5,1]         NA        NA          20
#> 4 (1,Inf]         NA        NA          NA

Created on 2025-04-22 with reprex v2.1.1

I would use dplyr::lag(). For example:

library(tidyverse)


data <- read.table(text = "Models   input_M     input_T     output_X
m1  20%     0,1     50
m1  25%     0,15    60
m1  80%     0,2     40
m2  75%     0,8     35
m2  15%     1   25
m3  55%     1,2     18
m4  65%     1,5     85
m4  45%     0,6     125
m4  35%     0,8     10
m4  15%     0,6     1
m5  10%     0,1     50
m5  120%    0,9     20",
           header = TRUE,dec = ",")


# data

data |>
  mutate(
    input_M_percent = input_M |> str_remove("%") |> as.numeric(),
    input_M_proportions = input_M_percent / 100,
    input_MT = input_M_proportions + lag(input_T)
  )
#>    Models input_M input_T output_X input_M_percent input_M_proportions input_MT
#> 1      m1     20%    0.10       50              20                0.20       NA
#> 2      m1     25%    0.15       60              25                0.25     0.35
#> 3      m1     80%    0.20       40              80                0.80     0.95
#> 4      m2     75%    0.80       35              75                0.75     0.95
#> 5      m2     15%    1.00       25              15                0.15     0.95
#> 6      m3     55%    1.20       18              55                0.55     1.55
#> 7      m4     65%    1.50       85              65                0.65     1.85
#> 8      m4     45%    0.60      125              45                0.45     1.95
#> 9      m4     35%    0.80       10              35                0.35     0.95
#> 10     m4     15%    0.60        1              15                0.15     0.95
#> 11     m5     10%    0.10       50              10                0.10     0.70
#> 12     m5    120%    0.90       20             120                1.20     1.30

Created on 2025-04-22 with reprex v2.1.1

Indeed on row 2, input_MT = .25 + .10 = .35

You can make it even more explicit:

data |>
  mutate(
    input_M_percent = input_M |> str_remove("%") |> as.numeric(),
    input_M_proportions = input_M_percent / 100,
    lag_T = lag(input_T),
    input_MT = input_M_proportions + lag_T
  )

Thank you very much, please let me try the code and come back.

Thanks a lot. It looks good.

Thank you very much. Really interesting and simplied view.
Please, two small responses:
-If we look simply at m1 model i did on excel...
T 0-20% >20 <80% >80 < 120%
0.1-0.5 50 60 40
0.5-1 0 0 0

1 0 0 0
and what appear from the code, it looks different:
T_factor (0,20] (20,80] (100,Inf]
#>
#> 1 (0,0.1] 50 NA NA
#> 2 (0.1,0.5] NA 50 NA
#> 3 (0.5,1] NA NA NA
#> 4 (1,Inf] NA NA NA
I think it could be linked to the cut off function but not sure.

-I wrote that "finally a table that average all above tables (m1,m2,m3,m4,m5) " which mean a global table that summarize the five models. I tried but could not find a simple way.
Many thanks.

Yes, here the interval is (0,0.1], meaning x > 0 && x <= 0.1. If you want instead x < 0.1 you can check the right = TRUE argument of cut(), or replace cut() with case_when() to manually control the categories.

Isn't it what the code I gave previously does? Or do you mean something different?

Thanks a lot.
-I ll see about controlling the cut because i did not see a perfect match between my excel and the code.
-Yes, you may be right, the goal is to average across the models as an final test (but what you done was perfect, i was looking after each model).
Please let me try and come back.