Merge Dataframes to fill missing values

At first I had one dataframe with many rows. Some rows represent the same unique identifier, and those rows need to be combined to represent the full suite of data for that identifier.

I could not figure out how to get this done, and came across some examples using rows_update in dplyr with two dataframes. I split my dataframe into two separate dataframes, thinking I could get rows_update to work. But I get an error when I try to use a vector to update multiple columns for a given row.

In this example, any value in df1 should be replaced by the value found in df2 for the particular row (record, unique ID; here the unique ID is column "x"). If df2 has an NA, it is ignored and df1 is not updated. But I get an error when trying to use "ind", a vector naming the columns I want to actually update...

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ind = c("y", "w", "z")

df1 <- data.frame(
  stringsAsFactors = FALSE,
  x = c("A1", "A2", "A3", "A4"),
  y = c("blue", NA, "yellow", "orange"),
  w = c(NA, "green", "green", "green"),
  z = c(NA, "black", NA, "green")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
  x = c("A1", "A2", "A3", "A4"),
  y = c(NA, "red", NA, "black"),
  w = c("green", "black", "black", "black"),
  z = c(NA, NA, "pink", "pink")
)

merged <- df1 |> 
  rows_update(df2 |> filter(!is.na(ind)), by = "x")
#> Error in `filter()`:
#> ℹ In argument: `!is.na(ind)`.
#> Caused by error:
#> ! `..1` must be of size 4 or 1, not size 3.

Created on 2024-02-05 with reprex v2.1.0

Hi Chris,

Your original context sounds more tractable to me — would you be able to create a toy table that mimics the structure of yours, along with a table that shows what your desired output would be?

  • Which programming language and libraries are you using (e.g., Python with Pandas, R with dplyr, SQL)? The syntax and available methods vary across tools.

Apologies that my programming approach was not clear. When I pasted the reproducible example I thought it would be clear I was using R and dplyr.

I am using the latest versions of R and RStudio, and specifically the tidyverse approach (dplyr) for wrangling my data and making it "tidy".

I'll gin something up early tomorrow and post the single data frame I was first working with. Thanks for taking an interest!

If I understand this correctly, you want to update a value in df1 with the matching value in df2 when that value is not NA. I don't think you can do this using rows_update (I might be wrong). It would be a nice feature to have.

The problem with your approach is that in df2 |> filter(!is.na(ind)) the condition !is.na(ind) has 3 elements (the length of ind), while filter() needs a logical vector with either one element per row of df2 (4, here) or a single element. In any case (if I understood what you want to achieve), filtering whole rows won't lead you to your final goal, since you want to skip individual NA cells, not drop rows.
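That said, dplyr's rows_patch() might get you close: it replaces only the NA values in its first argument, so patching df2's NAs from df1 should keep every non-NA df2 value and fall back to df1 otherwise. A sketch, assuming the df1 and df2 from your post:

```r
library(dplyr)

# rows_patch() fills only NA cells in its first argument from the second,
# so df2's non-NA values win and df1 supplies the rest
merged <- rows_patch(df2, df1, by = "x")
merged
#>    x      y     w     z
#> 1 A1   blue green  <NA>
#> 2 A2    red black black
#> 3 A3 yellow black  pink
#> 4 A4  black black  pink
```

Note the result keeps df2's rows and column order, so this only works cleanly when both tables share exactly the same keys.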

You can obtain your result using left_join() on x and then "manually" updating each df1 column with the matching df2 column, where the df2 value is not missing. Finally, since the duplicated column names get suffixes after the join, you can use setNames() or purrr::set_names() to restore the old names. Below is my (not elegant) code:

df1 |>
    left_join(df2, by = join_by(x)) |> 
    mutate(y.x = if_else(is.na(y.y), y.x, y.y),
           w.x = if_else(is.na(w.y), w.x, w.y),
           z.x = if_else(is.na(z.y), z.x, z.y)) |> 
    select(x, y.x, w.x, z.x) |> 
    setNames(names(df1))

This is the result

   x      y     w     z
1 A1   blue green  <NA>
2 A2    red black black
3 A3 yellow black  pink
4 A4  black black  pink
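With many columns you would not want to type those mutate() lines by hand. A plainer but scalable variant of the same idea (a sketch, assuming a character vector ind naming the columns to update, as in your original post):

```r
library(dplyr)

ind <- c("y", "w", "z")

# join, keeping df1's names and suffixing df2's duplicates with ".new"
joined <- left_join(df1, df2, by = "x", suffix = c("", ".new"))

# for each listed column, take the df2 (".new") value where it is not NA,
# otherwise keep the df1 value
for (col in ind) {
  joined[[col]] <- coalesce(joined[[paste0(col, ".new")]], joined[[col]])
}

merged <- select(joined, -ends_with(".new"))
```

The for loop is not very tidyverse-ish, but it sidesteps the duplicated-name bookkeeping entirely.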

Thank you for the reply! I was able to do this manually with a left join, but I have 100 columns in my actual dataset, so I was looking for a more elegant way to get this done.

I suppose I could make ind a vector whose number of elements matches the data.frame's columns (in the provided example I could simply add the x column as an element, and that should not be a problem, as x is the matching identifier anyway). But I'd rather not have to "update" the unique identifier.

Even if you do it, how would you use it to solve your problem? I could not figure it out.

Here is an example of the data.frame where I no longer have two separate dataframes; all the data is in one dataframe.

data.frame(
               stringsAsFactors = FALSE,
                   EvaluationID = c("CY-TR-10058_2023-07-21",
                                    "CY-TR-20319_2023-07-22","CY-TR-78315_2023-07-21",
                                    "CY-TR-28952_2023-07-20","CY-TR-15126_2023-07-18",
                                    "CY-TR-32532_2023-07-21","CY-TR-38511_2023-07-19",
                                    "CY-TR-22299_2023-07-21",
                                    "CY-TR-1467_2023-07-21","CY-TR-34500_2023-07-20",
                                    "CY-TR-12968_2023-07-19","CY-TR-69008_2023-07-18",
                                    "CY-TR-17347_2023-07-20",
                                    "CY-TR-3831_2023-07-19","CY-TR-34930_2023-07-20",
                                    "CY-TR-22299_2022-09-02","CY-TR-3831_2021-08-27",
                                    "CY-TR-28952_2021-08-28","CY-TR-20319_2021-08-28",
                                    "CY-TR-1467_2021-08-29"),
                        PointID = c("CY-TR-10058","CY-TR-20319","CY-TR-78315",
                                    "CY-TR-28952","CY-TR-15126","CY-TR-32532",
                                    "CY-TR-38511","CY-TR-22299","CY-TR-1467",
                                    "CY-TR-34500","CY-TR-12968","CY-TR-69008",
                                    "CY-TR-17347","CY-TR-3831","CY-TR-34930",
                                    "CY-TR-22299","CY-TR-3831","CY-TR-28952",
                                    "CY-TR-20319","CY-TR-1467"),
                     StreamName = c("Gold Creek","Gold Creek",NA,"Gold Creek",
                                    "Gold Creek","Gold Creek","18 pup",
                                    "Gold Creek","Gold Creek","Gold Creek",NA,NA,
                                    "Gold Creek","Gold Creek","Gold Creek",
                                    "Gold Creek","Gold Creek","Gold Creek",
                                    "Gold Creek","Gold Creek"),
                  FieldEvalDate = c("7/21/2023","8/22/2023","7/21/2023",
                                    "7/20/2023","7/18/2023","7/21/2023","7/19/2023",
                                    "7/21/2023","7/21/2023","7/20/2023",
                                    "7/19/2023","7/18/2023","7/20/2023","7/19/2023",
                                    "7/20/2023","9/2/2022","8/27/2021",
                                    "8/28/2021","8/28/2021","8/29/2021"),
             PointSelectionType = c("Targeted","Targeted","Targeted","Targeted",
                                    "Targeted","Targeted","Targeted",
                                    "Targeted","Targeted","Targeted","Targeted",
                                    "Targeted","Targeted","Targeted","Targeted",
                                    "Targeted","Targeted","Targeted",
                                    "Targeted","Targeted"),
               PctOverheadCover = c(31.8,
                                    2.4,66.4,12.6,9.7,3.6,39.8,3.7,1.7,
                                    2.5,66.7,55.9,0.1,0,1.5,2.1,1.3,10.6,
                                    0,2.3),
           PctBankOverheadCover = c(81.8,
                                    14.3,77,63.9,54.5,32.9,75.9,25.4,
                                    26.5,23,77.5,67.9,2.1,12.8,25,32.6,19,
                                    41.4,4.1,25.4),
                  VegComplexity = c(1,NA,
                                    1.53,NA,1.14,0.68,1.34,NA,NA,0.91,
                                    1.31,1.32,0.24,NA,1.19,0.48,0.96,0.75,
                                    0.45,0.68),
             VegComplexityWoody = c(0.76,
                                    NA,0.71,NA,0.96,0.45,0.81,NA,NA,
                                    0.68,0.67,0.66,0.14,NA,0.69,0.36,0.44,
                                    0.32,0.27,0.3),
  VegComplexityUnderstoryGround = c(0.59,
                                    NA,0.43,NA,0.86,0.43,0.79,NA,NA,
                                    0.58,0.67,0.63,0.14,NA,0.68,0.34,0.43,
                                    0.3,0.27,0.29),
            SpecificConductance = c(723,
                                    NA,1522,NA,663,861,299.1,NA,NA,829,
                                    402,547,670,NA,755,664.6,670.1,805.2,
                                    774.9,832.4),
                             pH = c(8.29,
                                    NA,8.32,NA,7.72,8.16,7.6,NA,NA,7.99,
                                    8.3,8.25,8.22,NA,7.94,8.24,8.05,
                                    8.11,8.28,8.37),
                    InstantTemp = c(9,NA,
                                    8.7,NA,6.33,7.46,5.8,NA,NA,11.8,
                                    2.5,6.6,7.8,NA,5.86,6.9,2.4,0.4,3.5,
                                    3.9),
                   TurbidityAvg = c(6.54,
                                    NA,3.46,NA,37.2,18.53,22.6,NA,NA,
                                    19.82,22.9,579,43.47,NA,9.65,28.64,5.75,
                                    15.06,25.94,13.77),
                      PoolCount = c(3,NA,
                                    15,NA,2,11,12,NA,NA,8,33,9,1,NA,
                                    2,1,1,2,1,0),
                       PctPools = c(5.57,
                                    NA,11.6,NA,5.87,11.62,26.67,NA,NA,
                                    18.24,26.25,9.37,1.73,NA,5.21,3.42,3,
                                    2.67,2.68,0)
)
#>              EvaluationID     PointID StreamName FieldEvalDate
#> 1  CY-TR-10058_2023-07-21 CY-TR-10058 Gold Creek     7/21/2023
#> 2  CY-TR-20319_2023-07-22 CY-TR-20319 Gold Creek     8/22/2023
#> 3  CY-TR-78315_2023-07-21 CY-TR-78315       <NA>     7/21/2023
#> 4  CY-TR-28952_2023-07-20 CY-TR-28952 Gold Creek     7/20/2023
#> 5  CY-TR-15126_2023-07-18 CY-TR-15126 Gold Creek     7/18/2023
#> 6  CY-TR-32532_2023-07-21 CY-TR-32532 Gold Creek     7/21/2023
#> 7  CY-TR-38511_2023-07-19 CY-TR-38511     18 pup     7/19/2023
#> 8  CY-TR-22299_2023-07-21 CY-TR-22299 Gold Creek     7/21/2023
#> 9   CY-TR-1467_2023-07-21  CY-TR-1467 Gold Creek     7/21/2023
#> 10 CY-TR-34500_2023-07-20 CY-TR-34500 Gold Creek     7/20/2023
#> 11 CY-TR-12968_2023-07-19 CY-TR-12968       <NA>     7/19/2023
#> 12 CY-TR-69008_2023-07-18 CY-TR-69008       <NA>     7/18/2023
#> 13 CY-TR-17347_2023-07-20 CY-TR-17347 Gold Creek     7/20/2023
#> 14  CY-TR-3831_2023-07-19  CY-TR-3831 Gold Creek     7/19/2023
#> 15 CY-TR-34930_2023-07-20 CY-TR-34930 Gold Creek     7/20/2023
#> 16 CY-TR-22299_2022-09-02 CY-TR-22299 Gold Creek      9/2/2022
#> 17  CY-TR-3831_2021-08-27  CY-TR-3831 Gold Creek     8/27/2021
#> 18 CY-TR-28952_2021-08-28 CY-TR-28952 Gold Creek     8/28/2021
#> 19 CY-TR-20319_2021-08-28 CY-TR-20319 Gold Creek     8/28/2021
#> 20  CY-TR-1467_2021-08-29  CY-TR-1467 Gold Creek     8/29/2021
#>    PointSelectionType PctOverheadCover PctBankOverheadCover VegComplexity
#> 1            Targeted             31.8                 81.8          1.00
#> 2            Targeted              2.4                 14.3            NA
#> 3            Targeted             66.4                 77.0          1.53
#> 4            Targeted             12.6                 63.9            NA
#> 5            Targeted              9.7                 54.5          1.14
#> 6            Targeted              3.6                 32.9          0.68
#> 7            Targeted             39.8                 75.9          1.34
#> 8            Targeted              3.7                 25.4            NA
#> 9            Targeted              1.7                 26.5            NA
#> 10           Targeted              2.5                 23.0          0.91
#> 11           Targeted             66.7                 77.5          1.31
#> 12           Targeted             55.9                 67.9          1.32
#> 13           Targeted              0.1                  2.1          0.24
#> 14           Targeted              0.0                 12.8            NA
#> 15           Targeted              1.5                 25.0          1.19
#> 16           Targeted              2.1                 32.6          0.48
#> 17           Targeted              1.3                 19.0          0.96
#> 18           Targeted             10.6                 41.4          0.75
#> 19           Targeted              0.0                  4.1          0.45
#> 20           Targeted              2.3                 25.4          0.68
#>    VegComplexityWoody VegComplexityUnderstoryGround SpecificConductance   pH
#> 1                0.76                          0.59               723.0 8.29
#> 2                  NA                            NA                  NA   NA
#> 3                0.71                          0.43              1522.0 8.32
#> 4                  NA                            NA                  NA   NA
#> 5                0.96                          0.86               663.0 7.72
#> 6                0.45                          0.43               861.0 8.16
#> 7                0.81                          0.79               299.1 7.60
#> 8                  NA                            NA                  NA   NA
#> 9                  NA                            NA                  NA   NA
#> 10               0.68                          0.58               829.0 7.99
#> 11               0.67                          0.67               402.0 8.30
#> 12               0.66                          0.63               547.0 8.25
#> 13               0.14                          0.14               670.0 8.22
#> 14                 NA                            NA                  NA   NA
#> 15               0.69                          0.68               755.0 7.94
#> 16               0.36                          0.34               664.6 8.24
#> 17               0.44                          0.43               670.1 8.05
#> 18               0.32                          0.30               805.2 8.11
#> 19               0.27                          0.27               774.9 8.28
#> 20               0.30                          0.29               832.4 8.37
#>    InstantTemp TurbidityAvg PoolCount PctPools
#> 1         9.00         6.54         3     5.57
#> 2           NA           NA        NA       NA
#> 3         8.70         3.46        15    11.60
#> 4           NA           NA        NA       NA
#> 5         6.33        37.20         2     5.87
#> 6         7.46        18.53        11    11.62
#> 7         5.80        22.60        12    26.67
#> 8           NA           NA        NA       NA
#> 9           NA           NA        NA       NA
#> 10       11.80        19.82         8    18.24
#> 11        2.50        22.90        33    26.25
#> 12        6.60       579.00         9     9.37
#> 13        7.80        43.47         1     1.73
#> 14          NA           NA        NA       NA
#> 15        5.86         9.65         2     5.21
#> 16        6.90        28.64         1     3.42
#> 17        2.40         5.75         1     3.00
#> 18        0.40        15.06         2     2.67
#> 19        3.50        25.94         1     2.68
#> 20        3.90        13.77         0     0.00

Created on 2024-02-06 with reprex v2.1.0

As an example of what I would like to happen: look at rows 14 and 17 ($PointID for both rows is CY-TR-3831). You will notice the $PointID values are the same (this is the unique identifier). Then look at $FieldEvalDate: row 17 is 8/27/2021 and row 14 is 7/19/2023. I would like to overwrite row 17's column values with those of row 14 (thus "updating" row 17 with data from row 14). For example, this should result in row 17's $PctBankOverheadCover changing from 19 to 12.8, but the $VegComplexity value in row 17 is unchanged because the value in row 14 is NA.

I do not want to update ALL of the columns, so I tried to create a vector whose elements are column names, thinking I could feed it into my filter; that is what I tried in the original post with two data.frames.

I hope this provides some clarity on what I am trying to accomplish. This is the thread where I got the idea to make two data.frames:
Merge Dataframes to Fill Missing Data in RStudio - #5 by andresrcs

Thanks, this clarifies the problem. Question: how do you establish which row will be updated?

I would like to over-write row 17 column values with that of row 14 (thus "updating" the row 17 with data from 14)

Is this because row 14 is before row 17?

Thanks, Chris, and a follow-up question: would you want to keep all the original rows, or one row per unique ID, with the most recent values for that ID in each column?

Hi again, @clarktar, I just noticed the table you posted code for contains exactly one row per unique ID — is that the desired outcome, then?

In this case, if the value in $FieldEvalDate contains "2023", we want to use the data in that record (row) to update values for the same $PointID where $FieldEvalDate does not contain "2023" (most of the older records are from 2021 or 2022).

There is only NEW data from 2023, and there is never more than one duplicate for a given $PointID.

One row per unique ID ($PointID) is the goal.

Apologies @dromano, that previous code for the data.frame was only the head(). I made an edit to provide code for a larger data.frame that now includes records with duplicate $PointID to illustrate the data better.

But, yes, in the end the desired outcome will be a data.frame where there is only one record (row) per $PointID.

In that case, could you post the code for the original table? Here are the steps to do that, assuming the table is called original_table:

  1. First, execute the command:
sink("original.txt") # directs output to a newly created file by that name
  2. then the command:
dput(original_table) # copies the contents of the table into the file
  3. and finally the command:
sink() # restores output back to the console

After that, you can copy all the contents of the file original.txt and paste them here, between a pair of triple backticks, like this:

```
<--- {paste here}
```

(This makes your pasted code automatically copy-able.)

Thanks, Chris — I see we were writing at the same time! I'll take a look at the new table you posted.

Run this code first to recreate the original table (converted from data frame to tibble):
original_table <- 
structure(list(EvaluationID = c("CY-TR-10058_2023-07-21", "CY-TR-20319_2023-07-22", 
"CY-TR-78315_2023-07-21", "CY-TR-28952_2023-07-20", "CY-TR-15126_2023-07-18", 
"CY-TR-32532_2023-07-21", "CY-TR-38511_2023-07-19", "CY-TR-22299_2023-07-21", 
"CY-TR-1467_2023-07-21", "CY-TR-34500_2023-07-20", "CY-TR-12968_2023-07-19", 
"CY-TR-69008_2023-07-18", "CY-TR-17347_2023-07-20", "CY-TR-3831_2023-07-19", 
"CY-TR-34930_2023-07-20", "CY-TR-22299_2022-09-02", "CY-TR-3831_2021-08-27", 
"CY-TR-28952_2021-08-28", "CY-TR-20319_2021-08-28", "CY-TR-1467_2021-08-29"
), PointID = c("CY-TR-10058", "CY-TR-20319", "CY-TR-78315", "CY-TR-28952", 
"CY-TR-15126", "CY-TR-32532", "CY-TR-38511", "CY-TR-22299", "CY-TR-1467", 
"CY-TR-34500", "CY-TR-12968", "CY-TR-69008", "CY-TR-17347", "CY-TR-3831", 
"CY-TR-34930", "CY-TR-22299", "CY-TR-3831", "CY-TR-28952", "CY-TR-20319", 
"CY-TR-1467"), StreamName = c("Gold Creek", "Gold Creek", NA, 
"Gold Creek", "Gold Creek", "Gold Creek", "18 pup", "Gold Creek", 
"Gold Creek", "Gold Creek", NA, NA, "Gold Creek", "Gold Creek", 
"Gold Creek", "Gold Creek", "Gold Creek", "Gold Creek", "Gold Creek", 
"Gold Creek"), FieldEvalDate = c("7/21/2023", "8/22/2023", "7/21/2023", 
"7/20/2023", "7/18/2023", "7/21/2023", "7/19/2023", "7/21/2023", 
"7/21/2023", "7/20/2023", "7/19/2023", "7/18/2023", "7/20/2023", 
"7/19/2023", "7/20/2023", "9/2/2022", "8/27/2021", "8/28/2021", 
"8/28/2021", "8/29/2021"), PointSelectionType = c("Targeted", 
"Targeted", "Targeted", "Targeted", "Targeted", "Targeted", "Targeted", 
"Targeted", "Targeted", "Targeted", "Targeted", "Targeted", "Targeted", 
"Targeted", "Targeted", "Targeted", "Targeted", "Targeted", "Targeted", 
"Targeted"), PctOverheadCover = c(31.8, 2.4, 66.4, 12.6, 9.7, 
3.6, 39.8, 3.7, 1.7, 2.5, 66.7, 55.9, 0.1, 0, 1.5, 2.1, 1.3, 
10.6, 0, 2.3), PctBankOverheadCover = c(81.8, 14.3, 77, 63.9, 
54.5, 32.9, 75.9, 25.4, 26.5, 23, 77.5, 67.9, 2.1, 12.8, 25, 
32.6, 19, 41.4, 4.1, 25.4), VegComplexity = c(1, NA, 1.53, NA, 
1.14, 0.68, 1.34, NA, NA, 0.91, 1.31, 1.32, 0.24, NA, 1.19, 0.48, 
0.96, 0.75, 0.45, 0.68), VegComplexityWoody = c(0.76, NA, 0.71, 
NA, 0.96, 0.45, 0.81, NA, NA, 0.68, 0.67, 0.66, 0.14, NA, 0.69, 
0.36, 0.44, 0.32, 0.27, 0.3), VegComplexityUnderstoryGround = c(0.59, 
NA, 0.43, NA, 0.86, 0.43, 0.79, NA, NA, 0.58, 0.67, 0.63, 0.14, 
NA, 0.68, 0.34, 0.43, 0.3, 0.27, 0.29), SpecificConductance = c(723, 
NA, 1522, NA, 663, 861, 299.1, NA, NA, 829, 402, 547, 670, NA, 
755, 664.6, 670.1, 805.2, 774.9, 832.4), pH = c(8.29, NA, 8.32, 
NA, 7.72, 8.16, 7.6, NA, NA, 7.99, 8.3, 8.25, 8.22, NA, 7.94, 
8.24, 8.05, 8.11, 8.28, 8.37), InstantTemp = c(9, NA, 8.7, NA, 
6.33, 7.46, 5.8, NA, NA, 11.8, 2.5, 6.6, 7.8, NA, 5.86, 6.9, 
2.4, 0.4, 3.5, 3.9), TurbidityAvg = c(6.54, NA, 3.46, NA, 37.2, 
18.53, 22.6, NA, NA, 19.82, 22.9, 579, 43.47, NA, 9.65, 28.64, 
5.75, 15.06, 25.94, 13.77), PoolCount = c(3, NA, 15, NA, 2, 11, 
12, NA, NA, 8, 33, 9, 1, NA, 2, 1, 1, 2, 1, 0), PctPools = c(5.57, 
NA, 11.6, NA, 5.87, 11.62, 26.67, NA, NA, 18.24, 26.25, 9.37, 
1.73, NA, 5.21, 3.42, 3, 2.67, 2.68, 0)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -20L))

Created on 2024-02-06 with reprex v2.0.2

Is this what you're looking for?

library(tidyverse)

original_table |> 
  # group table by unique ID
  group_by(PointID) |> 
  # arrange rows in reverse-chronological order
  arrange(desc(FieldEvalDate)) |> 
  # fill "up", i.e., change NA to immediately preceding non-NA value
  fill(.direction = "up") |> 
  # extract the first row of each subtable (one per unique ID)
  slice(1) |> 
  # undo grouping by unique ID
  ungroup() 
#> # A tibble: 15 × 16
#>    EvaluationID           PointID    StreamName FieldEvalDate PointSelectionType
#>    <chr>                  <chr>      <chr>      <chr>         <chr>             
#>  1 CY-TR-10058_2023-07-21 CY-TR-100… Gold Creek 7/21/2023     Targeted          
#>  2 CY-TR-12968_2023-07-19 CY-TR-129… <NA>       7/19/2023     Targeted          
#>  3 CY-TR-1467_2021-08-29  CY-TR-1467 Gold Creek 8/29/2021     Targeted          
#>  4 CY-TR-15126_2023-07-18 CY-TR-151… Gold Creek 7/18/2023     Targeted          
#>  5 CY-TR-17347_2023-07-20 CY-TR-173… Gold Creek 7/20/2023     Targeted          
#>  6 CY-TR-20319_2021-08-28 CY-TR-203… Gold Creek 8/28/2021     Targeted          
#>  7 CY-TR-22299_2022-09-02 CY-TR-222… Gold Creek 9/2/2022      Targeted          
#>  8 CY-TR-28952_2021-08-28 CY-TR-289… Gold Creek 8/28/2021     Targeted          
#>  9 CY-TR-32532_2023-07-21 CY-TR-325… Gold Creek 7/21/2023     Targeted          
#> 10 CY-TR-34500_2023-07-20 CY-TR-345… Gold Creek 7/20/2023     Targeted          
#> 11 CY-TR-34930_2023-07-20 CY-TR-349… Gold Creek 7/20/2023     Targeted          
#> 12 CY-TR-3831_2021-08-27  CY-TR-3831 Gold Creek 8/27/2021     Targeted          
#> 13 CY-TR-38511_2023-07-19 CY-TR-385… 18 pup     7/19/2023     Targeted          
#> 14 CY-TR-69008_2023-07-18 CY-TR-690… <NA>       7/18/2023     Targeted          
#> 15 CY-TR-78315_2023-07-21 CY-TR-783… <NA>       7/21/2023     Targeted          
#> # ℹ 11 more variables: PctOverheadCover <dbl>, PctBankOverheadCover <dbl>,
#> #   VegComplexity <dbl>, VegComplexityWoody <dbl>,
#> #   VegComplexityUnderstoryGround <dbl>, SpecificConductance <dbl>, pH <dbl>,
#> #   InstantTemp <dbl>, TurbidityAvg <dbl>, PoolCount <dbl>, PctPools <dbl>

Created on 2024-02-06 with reprex v2.0.2

I will run that code and see what the outcome is. Another way to think about this: anytime there is a duplicate $PointID (two records for a unique $PointID) and there are NA values in columns for the row where $FieldEvalDate contains "2023", those NAs can be "updated" or "overwritten" with the values from the row where $FieldEvalDate does NOT contain "2023".

data.frame(
               stringsAsFactors = FALSE,
                                     EvaluationID = c("CY-TR-3831_2023-07-19",
                                                      "CY-TR-3831_2021-08-27"),
                                          PointID = c("CY-TR-3831",
                                                      "CY-TR-3831"),
                                       StreamName = c("Gold Creek",
                                                      "Gold Creek"),
                  FieldEvalDate = c("7/19/2023", "8/27/2021"),
             PointSelectionType = c("Targeted", "Targeted"),
               PctOverheadCover = c(0, 1.3),
           PctBankOverheadCover = c(12.8, 19),
                  VegComplexity = c(NA, 0.96),
             VegComplexityWoody = c(NA, 0.44),
  VegComplexityUnderstoryGround = c(NA, 0.43),
            SpecificConductance = c(NA, 670.1),
                             pH = c(NA, 8.05),
                    InstantTemp = c(NA, 2.4),
                   TurbidityAvg = c(NA, 5.75),
                      PoolCount = c(NA, 1),
                       PctPools = c(NA, 3)
                  )
#>            EvaluationID    PointID StreamName FieldEvalDate PointSelectionType
#> 1 CY-TR-3831_2023-07-19 CY-TR-3831 Gold Creek     7/19/2023           Targeted
#> 2 CY-TR-3831_2021-08-27 CY-TR-3831 Gold Creek     8/27/2021           Targeted
#>   PctOverheadCover PctBankOverheadCover VegComplexity VegComplexityWoody
#> 1              0.0                 12.8            NA                 NA
#> 2              1.3                 19.0          0.96               0.44
#>   VegComplexityUnderstoryGround SpecificConductance   pH InstantTemp
#> 1                            NA                  NA   NA          NA
#> 2                          0.43               670.1 8.05         2.4
#>   TurbidityAvg PoolCount PctPools
#> 1           NA        NA       NA
#> 2         5.75         1        3

Created on 2024-02-06 with reprex v2.1.0

So with the above dataframe we would end up with this result:

data.frame(
  stringsAsFactors = FALSE,
  EvaluationID = c("CY-TR-3831_2023-07-19",
                   "CY-TR-3831_2021-08-27"),
  PointID = c("CY-TR-3831",
              "CY-TR-3831"),
  StreamName = c("Gold Creek",
                 "Gold Creek"),
  FieldEvalDate = c("7/19/2023", "8/27/2021"),
  PointSelectionType = c("Targeted", "Targeted"),
  PctOverheadCover = c(0, 1.3),
  PctBankOverheadCover = c(12.8, 19),
  VegComplexity = c(0.96, 0.96),
  VegComplexityWoody = c(0.44, 0.44),
  VegComplexityUnderstoryGround = c(0.43, 0.43),
  SpecificConductance = c(670.1, 670.1),
  pH = c(8.05, 8.05),
  InstantTemp = c(2.4, 2.4),
  TurbidityAvg = c(5.75, 5.75),
  PoolCount = c(1, 1),
  PctPools = c(3, 3)
)
#>            EvaluationID    PointID StreamName FieldEvalDate PointSelectionType
#> 1 CY-TR-3831_2023-07-19 CY-TR-3831 Gold Creek     7/19/2023           Targeted
#> 2 CY-TR-3831_2021-08-27 CY-TR-3831 Gold Creek     8/27/2021           Targeted
#>   PctOverheadCover PctBankOverheadCover VegComplexity VegComplexityWoody
#> 1              0.0                 12.8          0.96               0.44
#> 2              1.3                 19.0          0.96               0.44
#>   VegComplexityUnderstoryGround SpecificConductance   pH InstantTemp
#> 1                          0.43               670.1 8.05         2.4
#> 2                          0.43               670.1 8.05         2.4
#>   TurbidityAvg PoolCount PctPools
#> 1         5.75         1        3
#> 2         5.75         1        3

Created on 2024-02-06 with reprex v2.1.0


Then I would only need to remove the row where $FieldEvalDate does not contain "2023".
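Those two steps, filling the NAs within each $PointID group and then dropping the non-2023 rows, might look like this (a sketch, assuming tidyr is loaded and, as above, that every duplicated $PointID has exactly one 2023 record):

```r
library(dplyr)
library(tidyr)

result <- original_table |>
  group_by(PointID) |>
  # fill NAs both up and down within each PointID group, so the
  # 2023 row inherits any values only present in the older record
  fill(everything(), .direction = "downup") |>
  ungroup() |>
  # then keep only the 2023 record for each point
  filter(grepl("2023", FieldEvalDate))
```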

Unfortunately, that does not produce the correct result. If you look at row 12 of the tibble result (PointID CY-TR-3831), it kept all the values from the row where $FieldEvalDate = 8/27/2021. Hopefully my post preceding this one clarifies what I am after.
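One possible cause: $FieldEvalDate is a character column, so arrange(desc(FieldEvalDate)) sorts the strings alphabetically, and "8/27/2021" sorts after "7/19/2023". Parsing the dates first should make the same pipeline pick the 2023 row (a sketch, assuming lubridate is available):

```r
library(dplyr)
library(tidyr)
library(lubridate)

result <- original_table |>
  group_by(PointID) |>
  # mdy() turns "7/19/2023"-style strings into real dates, so
  # descending order is genuinely newest-first within each group
  arrange(desc(mdy(FieldEvalDate)), .by_group = TRUE) |>
  # fill "up": NAs in the newest row are taken from the older row below
  fill(everything(), .direction = "up") |>
  slice(1) |>
  ungroup()
```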