Column mean based on values in another data set (function/loop)

mzoelck · February 8, 2023, 5:21pm

I have two data sets, lets call them A and B (dput of the first 5 rows of each below):

A: structure(list(Location = c(3960.82823, 3923.691, 3919.40593, 
3907.97909, 3886.55377), Height = c(0.163744751, 0.231555472, 
0.232150996, 0.192475738, 0.162966924), Start = c(3963.68494, 
3946.54468, 3920.83429, 3909.40745, 3895.1239), End = c(3953.68645, 
3920.83429, 3909.40745, 3895.1239, 3883.69706)), row.names = c(NA, 
5L), class = "data.frame")

B: structure(list(Wavenumber..cm.1. = c(3997.96546, 3996.5371, 3995.10875, 
3993.68039, 3992.25204), M100 = c(0.00106, 0.00105, 0.00095, 
0.00075, 0.00053), M101 = c(0.00081, 0.00092, 0.00102, 0.001, 
0.00082), M102 = c(0.00099, 0.00109, 0.00105, 9e-04, 0.00072), 
    M103 = c(0.00101, 0.00111, 0.0012, 0.00129, 0.00133), M104 = c(0.00081, 
    0.00083, 0.00084, 0.00086, 0.00089), M105 = c(0.00139, 0.00113, 
    0.00092, 0.00089, 0.00102), M106 = c(0.00095, 0.00103, 0.00095, 
    0.00074, 0.00058), M107 = c(0.00054, 0.00058, 0.00059, 0.00049, 
    0.00032), M108 = c(0.00042, 5e-04, 5e-04, 0.00034, 0.00011
    ), M109 = c(0.00069, 0.00051, 0.00043, 0.00051, 0.00065), 
    M110 = c(0.00113, 0.00121, 0.00124, 0.00116, 0.00099), M111 = c(0.00039, 
    0.00056, 0.00068, 0.00068, 0.00056), M112 = c(0.0011, 0.00112, 
    0.00112, 0.00108, 0.00099), M113 = c(3e-04, 3e-04, 3e-04, 
    0.00027, 0.00019), M114 = c(0.00029, 6e-05, -2e-05, 9e-05, 
    0.00028), M115 = c(0.00091, 0.00079, 0.00061, 0.00038, 2e-04
    ), M116 = c(0.00117, 0.00105, 0.00096, 0.00092, 0.00092), 
    M117 = c(0.00039, 2e-04, 6e-05, 6e-05, 0.00018), M118 = c(0.00096, 
    0.00073, 0.00055, 0.00047, 0.00049), M119 = c(0.00037, 0.00031, 
    0.00024, 0.00018, 0.00018), M120 = c(0.00116, 0.00098, 0.00084, 
    0.00076, 0.00067), M121 = c(0.00039, 0.00024, 0.00011, 7e-05, 
    0.00011), M122 = c(0.00032, 0.00038, 0.00045, 0.00044, 0.00035
    ), M123 = c(9e-04, 0.00097, 0.00108, 0.0012, 0.00128), M124 = c(-0.00082, 
    -0.00065, -0.00049, -0.00037, -0.00036), M125 = c(0.00053, 
    0.00054, 0.00055, 6e-04, 0.00071), M126 = c(7e-05, 0.00022, 
    0.00022, 0.00011, 2e-05), M127 = c(0.00086, 9e-04, 0.00086, 
    0.00073, 0.00058), M128 = c(0.00089, 0.00078, 0.00069, 0.00057, 
    0.00043), M129 = c(0.00094, 0.00097, 0.00106, 0.00114, 0.00105
    ), M130 = c(0.0013, 0.00118, 0.00115, 0.00116, 0.00111), 
    M131 = c(0.00029, 0.00033, 0.00033, 3e-04, 0.00022), M132 = c(0, 
    0.00026, 0.00048, 6e-04, 0.00063), M133 = c(3e-05, -6e-05, 
    -6e-05, 5e-05, 0.00019), M134 = c(0.00056, 0.00054, 0.00052, 
    0.00054, 0.00057), M135 = c(2e-05, -4e-05, 6e-05, 0.00031, 
    0.00057), M136 = c(0.00083, 0.00075, 0.00068, 0.00068, 0.00073
    ), M137 = c(0.00064, 0.00074, 0.00084, 0.00095, 0.00105), 
    M139 = c(0.00044, 0.00044, 0.00042, 0.00043, 0.00047), M140 = c(0.00138, 
    0.00113, 0.00102, 0.0011, 0.00121), M141 = c(0.00062, 0.00043, 
    2e-04, 2e-05, 0), M142 = c(-0.00022, -0.00017, -0.00014, 
    -1e-04, 0), M143 = c(0.00109, 0.00108, 0.00103, 0.00093, 
    0.00087), M144 = c(0.00104, 0.00116, 0.00117, 0.00105, 0.00085
    ), M145 = c(7e-04, 0.00096, 0.00109, 0.00098, 0.00069), M146 = c(0.0014, 
    0.00158, 0.00165, 0.00154, 0.0013), M147 = c(6e-04, 0.00071, 
    0.00075, 0.00072, 0.00065), M148 = c(0.00098, 0.00093, 0.00091, 
    9e-04, 0.00088), M149 = c(0.00055, 0.00058, 0.00054, 0.00037, 
    0.00017), M150 = c(7e-04, 0.00068, 8e-04, 0.00107, 0.00132
    ), M151 = c(0.00037, 0.00042, 0.00046, 0.00047, 0.00046), 
    M152 = c(0.00047, 0.00042, 0.00043, 0.00045, 0.00045), M153 = c(0.00095, 
    0.00088, 0.00083, 8e-04, 0.00072), M154 = c(6e-05, 0.00013, 
    0.00032, 0.00054, 0.00062), M155 = c(0.00061, 0.00057, 0.00043, 
    0.00022, 4e-05), M156 = c(0.00077, 0.00078, 0.00071, 0.00052, 
    0.00025), M157 = c(0.00088, 0.00078, 0.00069, 0.00063, 0.00058
    ), M158 = c(0.00091, 0.00085, 0.00082, 0.00081, 8e-04), M159 = c(0.00078, 
    0.00076, 0.00073, 0.00074, 0.00079), M160 = c(0.00068, 7e-04, 
    0.00075, 8e-04, 0.00079), M161 = c(0.00055, 0.00073, 0.00082, 
    0.00085, 9e-04), M162 = c(0.00104, 0.00111, 0.0011, 0.00104, 
    0.00102), M163 = c(0.00076, 0.00071, 0.00069, 0.00068, 0.00067
    ), M164 = c(0.0012, 0.00133, 0.00154, 0.00174, 0.00177), 
    M165 = c(0.00072, 0.00073, 0.00072, 0.00074, 0.00083), M166 = c(0.00067, 
    0.00055, 0.00035, 0.00012, -2e-05), M167 = c(0.00068, 0.00053, 
    0.00047, 0.00051, 0.00059), M168 = c(0.00067, 0.00092, 0.001, 
    0.00087, 0.00067), M169 = c(0.00124, 0.00107, 0.00101, 0.00108, 
    0.00118), M170 = c(0.00054, 0.00064, 0.00069, 0.00066, 0.00053
    ), M171 = c(0.00029, 3e-04, 3e-04, 0.00031, 3e-04), M172 = c(0.00085, 
    0.00091, 0.00082, 0.00063, 0.00052), M173 = c(0.00022, 0.00036, 
    0.00053, 0.00061, 0.00056), M174 = c(5e-04, 0.00031, 0.00021, 
    0.00023, 0.00031), M175 = c(0.00074, 0.00066, 0.00059, 0.00051, 
    0.00043), M176 = c(9e-04, 0.00062, 0.00044, 0.00039, 0.00039
    ), M177 = c(0.00045, 0.00038, 0.00033, 0.00035, 0.00043), 
    M178 = c(0.00075, 0.00092, 0.00097, 0.00086, 0.00067), M179 = c(0.00047, 
    0.00033, 0.00026, 3e-04, 0.00037), M180 = c(0.00083, 0.00077, 
    0.00074, 0.00074, 7e-04), M181 = c(0.0013, 0.00138, 0.00137, 
    0.00127, 0.00109), M182 = c(0.00062, 0.00049, 0.00043, 0.00042, 
    0.00038), M183 = c(0.00056, 4e-04, 0.00034, 0.00046, 0.00065
    ), M184 = c(0.00122, 0.00116, 0.00096, 0.00067, 0.00039), 
    M185 = c(0.00045, 0.00026, 0.00012, 1e-04, 0.00024), M187 = c(0.00078, 
    0.00038, 8e-05, 0, 0.00014)), row.names = c(NA, 5L), class = "data.frame")

I want to be able to calculate the means of the M columns in data set B, based on the Start and End columns in data set A (which correspond to the Wavenumber cm-1 column in data set B). So that for each Start and End set of values you have a corresponding mean for each M column in data set B.

So for example for the Start and End values in the first row of data set A:

Start: 3963.68494 End: 3953.68645 you would calculate the mean of each M column in data set B using the absorbance values corresponding to the Wavenumber cm-1 range of 3963.6849 to 3953.68645, which would then be stored in a separate data frame (with all the M column names) called meanData or something.

I can quite figure out how to write a function/loop that would do that, going and taking the Start and End values in dataset A, looking at dataset B getting the corresponding Absorbance values that fall into that Start and End range, calculate their mean and write it into a new data frame under its corresponding M column name and repeating this for each row of Start and End Values in dataset A. I know you would likely do it with an index, but I'm not sure how to write it exactly. Any help would be very much appreciated!

I tried creating different indexes for the Start and End columns and using them to try and specify the values I want in dataset B, using but I was unsuccessful:

`test<-mean(B$M100[which(B$Wavenumber..cm.1.[index2[i] to B$Wavenumber..cm.1.index3[i]])`

where index2 is the Start values in dataset A and index3 is the end values in dataset A, this did not work

DavoWW · February 11, 2023, 4:16am

Hi @mzoelck ,
If I understand your question correctly, this code should get close to what you want. But please check carefully that the output is as you expect!!

suppressPackageStartupMessages(library(tidyverse))

# Modify this data so that the first instance will capture some rows of B.
# Other pairs of Start-End miss all the rows of B. Is this correct?
A <- structure(list(Location = c(3995.0, 3960.82823, 3923.691, 3919.40593, 
3907.97909, 3886.55377), Height = c(0.1111111, 0.163744751, 0.231555472, 
0.232150996, 0.192475738, 0.162966924), Start = c(3998.0, 3963.68494, 
3946.54468, 3920.83429, 3909.40745, 3895.1239), End = c(3994.0, 3953.68645, 
3920.83429, 3909.40745, 3895.1239, 3883.69706)), row.names = c(NA, 
6L), class = "data.frame")

B <- structure(list(Wavenumber..cm.1. = c(3997.96546, 3996.5371, 3995.10875, 
3993.68039, 3992.25204), M100 = c(0.00106, 0.00105, 0.00095, 
0.00075, 0.00053), M101 = c(0.00081, 0.00092, 0.00102, 0.001, 
0.00082), M102 = c(0.00099, 0.00109, 0.00105, 9e-04, 0.00072), 
M103 = c(0.00101, 0.00111, 0.0012, 0.00129, 0.00133), M104 = c(0.00081, 
0.00083, 0.00084, 0.00086, 0.00089), M105 = c(0.00139, 0.00113, 
0.00092, 0.00089, 0.00102), M106 = c(0.00095, 0.00103, 0.00095, 
0.00074, 0.00058), M107 = c(0.00054, 0.00058, 0.00059, 0.00049, 
0.00032), M108 = c(0.00042, 5e-04, 5e-04, 0.00034, 0.00011
), M109 = c(0.00069, 0.00051, 0.00043, 0.00051, 0.00065), 
M110 = c(0.00113, 0.00121, 0.00124, 0.00116, 0.00099), M111 = c(0.00039, 
0.00056, 0.00068, 0.00068, 0.00056), M112 = c(0.0011, 0.00112, 
0.00112, 0.00108, 0.00099), M113 = c(3e-04, 3e-04, 3e-04, 
0.00027, 0.00019), M114 = c(0.00029, 6e-05, -2e-05, 9e-05, 
0.00028), M115 = c(0.00091, 0.00079, 0.00061, 0.00038, 2e-04
), M116 = c(0.00117, 0.00105, 0.00096, 0.00092, 0.00092), 
M117 = c(0.00039, 2e-04, 6e-05, 6e-05, 0.00018), M118 = c(0.00096, 
0.00073, 0.00055, 0.00047, 0.00049), M119 = c(0.00037, 0.00031, 
0.00024, 0.00018, 0.00018), M120 = c(0.00116, 0.00098, 0.00084, 
0.00076, 0.00067), M121 = c(0.00039, 0.00024, 0.00011, 7e-05, 
0.00011), M122 = c(0.00032, 0.00038, 0.00045, 0.00044, 0.00035
), M123 = c(9e-04, 0.00097, 0.00108, 0.0012, 0.00128), M124 = c(-0.00082, 
-0.00065, -0.00049, -0.00037, -0.00036), M125 = c(0.00053, 
0.00054, 0.00055, 6e-04, 0.00071), M126 = c(7e-05, 0.00022, 
0.00022, 0.00011, 2e-05), M127 = c(0.00086, 9e-04, 0.00086, 
0.00073, 0.00058), M128 = c(0.00089, 0.00078, 0.00069, 0.00057, 
0.00043), M129 = c(0.00094, 0.00097, 0.00106, 0.00114, 0.00105
), M130 = c(0.0013, 0.00118, 0.00115, 0.00116, 0.00111), 
M131 = c(0.00029, 0.00033, 0.00033, 3e-04, 0.00022), M132 = c(0, 
0.00026, 0.00048, 6e-04, 0.00063), M133 = c(3e-05, -6e-05, 
-6e-05, 5e-05, 0.00019), M134 = c(0.00056, 0.00054, 0.00052, 
0.00054, 0.00057), M135 = c(2e-05, -4e-05, 6e-05, 0.00031, 
0.00057), M136 = c(0.00083, 0.00075, 0.00068, 0.00068, 0.00073
), M137 = c(0.00064, 0.00074, 0.00084, 0.00095, 0.00105), 
M139 = c(0.00044, 0.00044, 0.00042, 0.00043, 0.00047), M140 = c(0.00138, 
0.00113, 0.00102, 0.0011, 0.00121), M141 = c(0.00062, 0.00043, 
2e-04, 2e-05, 0), M142 = c(-0.00022, -0.00017, -0.00014, 
-1e-04, 0), M143 = c(0.00109, 0.00108, 0.00103, 0.00093, 
0.00087), M144 = c(0.00104, 0.00116, 0.00117, 0.00105, 0.00085
), M145 = c(7e-04, 0.00096, 0.00109, 0.00098, 0.00069), M146 = c(0.0014, 
0.00158, 0.00165, 0.00154, 0.0013), M147 = c(6e-04, 0.00071, 
0.00075, 0.00072, 0.00065), M148 = c(0.00098, 0.00093, 0.00091, 
9e-04, 0.00088), M149 = c(0.00055, 0.00058, 0.00054, 0.00037, 
0.00017), M150 = c(7e-04, 0.00068, 8e-04, 0.00107, 0.00132
), M151 = c(0.00037, 0.00042, 0.00046, 0.00047, 0.00046), 
M152 = c(0.00047, 0.00042, 0.00043, 0.00045, 0.00045), M153 = c(0.00095, 
0.00088, 0.00083, 8e-04, 0.00072), M154 = c(6e-05, 0.00013, 
0.00032, 0.00054, 0.00062), M155 = c(0.00061, 0.00057, 0.00043, 
0.00022, 4e-05), M156 = c(0.00077, 0.00078, 0.00071, 0.00052, 
0.00025), M157 = c(0.00088, 0.00078, 0.00069, 0.00063, 0.00058
), M158 = c(0.00091, 0.00085, 0.00082, 0.00081, 8e-04), M159 = c(0.00078, 
0.00076, 0.00073, 0.00074, 0.00079), M160 = c(0.00068, 7e-04, 
0.00075, 8e-04, 0.00079), M161 = c(0.00055, 0.00073, 0.00082, 
0.00085, 9e-04), M162 = c(0.00104, 0.00111, 0.0011, 0.00104, 
0.00102), M163 = c(0.00076, 0.00071, 0.00069, 0.00068, 0.00067
), M164 = c(0.0012, 0.00133, 0.00154, 0.00174, 0.00177), 
M165 = c(0.00072, 0.00073, 0.00072, 0.00074, 0.00083), M166 = c(0.00067, 
0.00055, 0.00035, 0.00012, -2e-05), M167 = c(0.00068, 0.00053, 
0.00047, 0.00051, 0.00059), M168 = c(0.00067, 0.00092, 0.001, 
0.00087, 0.00067), M169 = c(0.00124, 0.00107, 0.00101, 0.00108, 
0.00118), M170 = c(0.00054, 0.00064, 0.00069, 0.00066, 0.00053
), M171 = c(0.00029, 3e-04, 3e-04, 0.00031, 3e-04), M172 = c(0.00085, 
0.00091, 0.00082, 0.00063, 0.00052), M173 = c(0.00022, 0.00036, 
0.00053, 0.00061, 0.00056), M174 = c(5e-04, 0.00031, 0.00021, 
0.00023, 0.00031), M175 = c(0.00074, 0.00066, 0.00059, 0.00051, 
0.00043), M176 = c(9e-04, 0.00062, 0.00044, 0.00039, 0.00039
), M177 = c(0.00045, 0.00038, 0.00033, 0.00035, 0.00043), 
M178 = c(0.00075, 0.00092, 0.00097, 0.00086, 0.00067), M179 = c(0.00047, 
0.00033, 0.00026, 3e-04, 0.00037), M180 = c(0.00083, 0.00077, 
0.00074, 0.00074, 7e-04), M181 = c(0.0013, 0.00138, 0.00137, 
0.00127, 0.00109), M182 = c(0.00062, 0.00049, 0.00043, 0.00042, 
0.00038), M183 = c(0.00056, 4e-04, 0.00034, 0.00046, 0.00065
), M184 = c(0.00122, 0.00116, 0.00096, 0.00067, 0.00039), 
M185 = c(0.00045, 0.00026, 0.00012, 1e-04, 0.00024), M187 = c(0.00078, 
0.00038, 8e-05, 0, 0.00014)), row.names = c(NA, 5L), class = "data.frame")

names(B)[1] <- "wavenumber"

# Do the row filtering and mean calculations; save the set, start and end.
filter_set <- function(i) {
    filter(.data=B, (wavenumber <= A$Start[i]) & (wavenumber >= A$End[i])) %>% 
    summarise(across(starts_with("M"), mean, .names="mean_{.col}")) %>% 
    mutate(set = i,
           start=A$Start[i],
           end=A$End[i]) %>% 
    select(set, start, end, everything())
}

out.lst <- vector(mode="list", length=nrow(A))

for(i in 1:nrow(A)){
  out.lst[[i]] <- filter_set(i)
  return(out.lst)
}

# Only one Start-End combination captured rows, all others did not.
(results <- bind_rows(out.lst))
#>   set start  end mean_M100    mean_M101   mean_M102   mean_M103    mean_M104
#> 1   1  3998 3994   0.00102 0.0009166667 0.001043333 0.001106667 0.0008266667
#>     mean_M105    mean_M106 mean_M107    mean_M108    mean_M109   mean_M110
#> 1 0.001146667 0.0009766667   0.00057 0.0004733333 0.0005433333 0.001193333
#>      mean_M111   mean_M112 mean_M113 mean_M114 mean_M115 mean_M116    mean_M117
#> 1 0.0005433333 0.001113333     3e-04   0.00011   0.00077   0.00106 0.0002166667
#>      mean_M118    mean_M119    mean_M120    mean_M121    mean_M122    mean_M123
#> 1 0.0007466667 0.0003066667 0.0009933333 0.0002466667 0.0003833333 0.0009833333
#>       mean_M124 mean_M125 mean_M126    mean_M127    mean_M128 mean_M129
#> 1 -0.0006533333   0.00054   0.00017 0.0008733333 0.0007866667   0.00099
#>   mean_M130    mean_M131    mean_M132 mean_M133 mean_M134    mean_M135
#> 1   0.00121 0.0003166667 0.0002466667    -3e-05   0.00054 1.333333e-05
#>      mean_M136 mean_M137    mean_M139   mean_M140    mean_M141     mean_M142
#> 1 0.0007533333   0.00074 0.0004333333 0.001176667 0.0004166667 -0.0001766667
#>     mean_M143   mean_M144    mean_M145   mean_M146    mean_M147 mean_M148
#> 1 0.001066667 0.001123333 0.0009166667 0.001543333 0.0006866667   0.00094
#>      mean_M149    mean_M150    mean_M151 mean_M152    mean_M153 mean_M154
#> 1 0.0005566667 0.0007266667 0.0004166667   0.00044 0.0008866667   0.00017
#>      mean_M155    mean_M156    mean_M157 mean_M158    mean_M159 mean_M160
#> 1 0.0005366667 0.0007533333 0.0007833333   0.00086 0.0007566667   0.00071
#>   mean_M161   mean_M162 mean_M163   mean_M164    mean_M165    mean_M166
#> 1     7e-04 0.001083333   0.00072 0.001356667 0.0007233333 0.0005233333
#>   mean_M167    mean_M168   mean_M169    mean_M170    mean_M171 mean_M172
#> 1   0.00056 0.0008633333 0.001106667 0.0006233333 0.0002966667   0.00086
#>   mean_M173 mean_M174    mean_M175    mean_M176    mean_M177 mean_M178
#> 1   0.00037   0.00034 0.0006633333 0.0006533333 0.0003866667   0.00088
#>      mean_M179 mean_M180 mean_M181    mean_M182    mean_M183   mean_M184
#> 1 0.0003533333   0.00078   0.00135 0.0005133333 0.0004333333 0.001113333
#>      mean_M185    mean_M187
#> 1 0.0002766667 0.0004133333

^{Created on 2023-02-11 with reprex v2.0.2}

Would be good if some other Forum participant can produce a dplyr-only version (without the for() loop).

nirgrahamuk · February 11, 2023, 10:00am

loop free ; use new join_by() feature

(A_ <- mutate(A,
  rn = row_number(),
  real_start = pmin(Start, End),
  real_end = pmax(Start, End)
))


(jn_AB <- left_join(A_, B,
  by = join_by(between(y$wavenumber, 
                       x$real_start,
                       x$real_end))
))

(results_df <- group_by(jn_AB, !!!syms(names(A_))) |> 
    summarise(across(starts_with("M"), mean)) |>
    ungroup() |> 
    arrange(rn))

DavoWW · February 12, 2023, 7:12am

Simply more R/dplyr magic from a certified wizard!

mzoelck · February 13, 2023, 7:08pm

Hi @nirgrahamuk,

thank you so much for your response. I have a question about your solution code:

Would this not calculate the pmin and pmax of the Start and End columns, rather than take each Start and End value as is? I need the Mean of each M column to be calculated for each set of Start and End values (so that each row of start and end values in A has a corresponding mean value for each M Column associated with it.

mzoelck · February 13, 2023, 7:14pm

Hi @DavoWW ,

To clarify, data set B has the complete set of wavenumbers (it comes from a spectrometer), while data set A will not necessarily have all of them, as the Location, Start and End points may be different depending on the criteria we are looking at i.e. we have spectrographic data for a set of characteristics, and the Locations (peaks ) with their Start and End values is likely to be different for each one, but it may also be similar (some characteristics may overlap in their peaks. Does that make sense?

nirgrahamuk · February 13, 2023, 8:50pm

Its pmin and pmax, which works parallel on the pairs. You can see that the numbers are different down the rows.

Yhe reason this is needed is that you have starts that are after the ends, so i had to sort them out to be in an order

mzoelck · February 13, 2023, 9:02pm

Hi David,

so the function works fine, for some reason the outlist function loop:

Does not seem to work, out.list is not modified, which is odd...i remains at i 1L, when it should be 38...

DavoWW · February 14, 2023, 1:17am

As @nirgrahamuk has pointed out, his code ensures that the "start" value is smaller than the "end" value before doing the filtering (which my function does not do and will cause it to give NA results where start>=end).

system · March 28, 2023, 1:17am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.