Cleaning a long and intricate dataframe

jynusmac · February 14, 2022, 11:46am

I have a long dataset with this structure:

df <- data.frame(
  stringsAsFactors = FALSE,
              Code = c("05-001-01-001-U",
                       "05-001-01-001-U","05-001-01-001-U","05-002-01-001-A",
                       "05-002-01-001-A","05-002-01-001-A","05-002-01-001-B",
                       "05-002-01-001-B","05-002-01-001-B","05-002-01-002-A",
                       "05-002-01-002-A","05-002-01-002-A","05-002-01-002-B",
                       "05-002-01-002-B","05-002-01-002-B"),
             Party = c("Party1","Party2","Party3",
                       "Party1","Party2","Party3","Party1","Party2","Party3",
                       "Party1","Party2","Party3","Party1","Party2",
                       "Party3"),
             Value = c(1L,41L,2L,7L,99L,7L,3L,
                       82L,4L,1L,76L,1L,2L,76L,2L)

I want to achieve two things:
In the column code, join all the rows with the same data: if the code is 05-001-01-001-U and there is two rows, obtain a single row with this Code and without the 'U' and without the '-' symbol.

In the rows like "05-002-01-001-A" with A or B also merge all data (six rows in this case) in one single row: 0500201001.
Then, with the party and value colum, sum each party value for his unique code. Something like this:

data.frame(
  stringsAsFactors = FALSE,
              Code = c("0500201001"),
            Party1 = c(10L),
            Party2 = c(181L),
            Party3 = c(11L)
)

I don't know if I have explained well what I want to achieve. Don't see a solution, I am grateful for any hints on how to do it.

nirgrahamuk · February 14, 2022, 1:18pm

(startdf <- data.frame(
  stringsAsFactors = FALSE,
  Code = c(
    "05-001-01-001-U",
    "05-001-01-001-U", "05-001-01-001-U", "05-002-01-001-A",
    "05-002-01-001-A", "05-002-01-001-A", "05-002-01-001-B",
    "05-002-01-001-B", "05-002-01-001-B", "05-002-01-002-A",
    "05-002-01-002-A", "05-002-01-002-A", "05-002-01-002-B",
    "05-002-01-002-B", "05-002-01-002-B"
  ),
  Party = c(
    "Party1", "Party2", "Party3",
    "Party1", "Party2", "Party3", "Party1", "Party2", "Party3",
    "Party1", "Party2", "Party3", "Party1", "Party2",
    "Party3"
  ),
  Value = c(
    1L, 41L, 2L, 7L, 99L, 7L, 3L,
    82L, 4L, 1L, 76L, 1L, 2L, 76L, 2L
  )
))

library(tidyverse)

pivot_wider(startdf,
  id_cols = "Code",
  names_from = "Party",
  values_from = "Value"
) %>%
  mutate(
    code_start = substr(
      x = Code,
      start = 1,
      stop = nchar(Code) - 2
    ),
    code_end = substr(
      x = Code,
      start = nchar(Code) - 1,
      stop = nchar(Code)
    ),
    u_or_not = code_end == "-U"
  ) %>%
  group_by(code_start, u_or_not) %>%
  summarise(across(where(is.numeric), sum),
    Code = min(code_start)) %>%
  ungroup() %>%
  relocate(Code) %>%
  select(-code_start,-u_or_not)

system · February 21, 2022, 1:18pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.