Supplement missing values in DF with values from other rows based on conditions

Hello,

I have a data frame containing geocoded latitude and longitude columns. These values are missing in some rows and I want to supplement them with corresponding values from other rows containing the same address.
What I want to do is basically this: if lat & long are missing, supplement it with values from other nearest row having the same id and var1 columns.

To be concrete: in my dummy DF below, I want to supplement lat and long in rows 1 and 7 with values from row 4; and missing values in row 5 with values from row 2.
Lat and long in row 6 remains empty because we have no corresponding rows with lat and long columns.

How to do this in tidyverse way please?

Many thanks in advance

 df <- tribble(
    ~id, ~var1, ~lat, ~long,
    100, "G", "", "",
    250, "O", "24.56", "21.06",
    300, "O", "31.55", "25.64",
    100, "G", "24.50", "26.88",
    250, "O", "", "",
    800, "K", "", "",
    100, "G", "", ""
  )


# A tibble: 7 × 4
     id var1  lat     long   
  <dbl> <chr> <chr>   <chr>  
1   100 G     ""      ""     
2   250 O     "24.56" "21.06"
3   300 O     "31.55" "25.64"
4   100 G     "24.50" "26.88"
5   250 O     ""      ""     
6   800 K     ""      ""     
7   100 G     ""      ""  

I used the conversion of lat and long to numbers to produce NA values and then used na.omit() to eliminate those rows. If your real data are more complex, you may have to use another strategy to filter for the rows that do not have missing values in lat and long.

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.3.3
df <- tribble(
  ~id, ~var1, ~lat, ~long,
  100, "G", "", "",
  250, "O", "24.56", "21.06",
  300, "O", "31.55", "25.64",
  100, "G", "24.50", "26.88",
  250, "O", "", "",
  800, "K", "", "",
  100, "G", "", ""
)
df <- df |> mutate(across(c(lat,long), as.numeric))
df2 <- na.omit(df)
df3 <- left_join(df, df2, by = c("id", "var1"),suffix = c(".x", ""))
df3
#> # A tibble: 7 × 6
#>      id var1  lat.x long.x   lat  long
#>   <dbl> <chr> <dbl>  <dbl> <dbl> <dbl>
#> 1   100 G      NA     NA    24.5  26.9
#> 2   250 O      24.6   21.1  24.6  21.1
#> 3   300 O      31.6   25.6  31.6  25.6
#> 4   100 G      24.5   26.9  24.5  26.9
#> 5   250 O      NA     NA    24.6  21.1
#> 6   800 K      NA     NA    NA    NA  
#> 7   100 G      NA     NA    24.5  26.9
df3 <- df3 |> select(-c("lat.x","long.x"))
df3
#> # A tibble: 7 × 4
#>      id var1    lat  long
#>   <dbl> <chr> <dbl> <dbl>
#> 1   100 G      24.5  26.9
#> 2   250 O      24.6  21.1
#> 3   300 O      31.6  25.6
#> 4   100 G      24.5  26.9
#> 5   250 O      24.6  21.1
#> 6   800 K      NA    NA  
#> 7   100 G      24.5  26.9

Created on 2024-04-26 with reprex v2.0.2

Hello @FJCC,

Thank you for your response.
I have a relatively large dataset with over 50 columns. Your solution would require to create always a copy of the dataset, perform the left join operation and then clean it. Is that assumption correct please?

If it is like above, I think that your solution would work, but still; do you see there also other ways of doing the job? Maybe some alternative based on mutate()? Or, I was thinking about doing a nest() on conditions and then fill the nested rows...? What do you thin please?
Thank you...

I don't think having 50 columns prevents you from using my method. When you make the copy of the data, keep only the columns that identify the data (id and var1 in your example) and lat and long. You can do that with the select() function. Run na.omit on that and join the result to the full data set.

Hello,

I tested it on my data frame and it works well.

Thank you for your help.

Cheers

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.