Can't fix this error, transforming long dataframe

GeraldineK · May 29, 2024, 10:29pm

jose_perez:

stack_data = function(i) {
        d = select(df, Municipio, contains(i), -Cod)
        names(d) = str_replace(names(d), i, 'col')
        d
}

df <- map_df(c('A', 'B','C', 'D', 'E', 'F', 'G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','AA','AB','AC','AD','AE','AF','AG',
               'AH','AI','AJ','AK','AL','AM'), stack_data) |>
        filter(!is.na(col1)) |>
        pivot_longer(c(-'Municipio', -'col1')) |>
        pivot_wider(names_from = c(col1, name), values_from = value) |>
        select(Municipio, contains('PSOE'), contains('PP'), contains('IZQESP'), contains('CEUS'), contains('EL PI'), contains('CCa'), contains('GBAI'),
               contains('EAJ-PNV'), contains('FO'), contains('AHORA REPÚBLICAS (ERC-EH BILDU-BNG-ARA MÉS)'), contains('JUNTS-UE'), contains('ANDALUCISTAS'), 
               contains('IE'), contains('VOX'), contains('ESCAÑOS EN BLANCO'), contains('CRT'), contains('PODEMOS'), contains('PFAC'), contains('PCTE'), contains('PCTC'), 
               contains('PCPE/PCPC'), contains('SALAMANCA-ZAMORA-LEÓN PREPAL'), contains('P-LIB'),  contains('SUMAR'),contains('PACMA'),  contains('CREE EN EUROPA'),
               contains('PMR'), contains('S.A.L.F'), contains('F'), contains('EXISTE'), contains('Cs'), contains('FE de las JONS'), contains('SAE'), contains('J.U.E.X.'), 
               contains('PH'), contains('VOLT'), contains('RECORTES CERO'), contains('PAZ SOBERANÍA REPÚBLICA'),  contains('FIEL'),contains('IZQP - UNIDOS - DEf'),  
               contains('EXTREMEÑOS, Y SUBSIDIARIAMENTE EX PREX CREX'), contains('glg'), contains('PIRATES.CAT / REBELDES'), contains('NEX')) |>
        mutate_all(~replace_na(., 0))

First off, just as an FYI, the line filter(!is.na(col1)) does not do anything in this code, the tibble stays 195x47.
Second, the reason why the pivot_longer() does not work, is because columns in a dataframe need to be the same data type. With this code you are putting the original column names in a new column - that's fine. But then you are also putting the contents of those original columns all in one column: but some are numeric (e.g. col2, col3, etc) and some are characters (e.g. colA1, colA2, etc). The function has an optional argument values_transform for which the help file says: " If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.". And in your case, there is no common data type, there are different data types and pivot_longer() does not want to coerce data types from one to the other.
If you run the code without the part that transforms the numbers from character to numeric, the pivot_longer() works. However, you then get a warning for the pivot_wider(): Warning message: Values from value are not uniquely identified; output will contain list-cols. Which I am sure you don't want (every column is now a list, and you end up with 1756 columns).

Okay - so after trying to figure out what you want exactly, I think the function is maybe also not creating what you think it is creating? If I run your code as-is, up to the map_df() line only, it produces:

# A tibble: 195 × 47
   Municipio col1   col2  col3 colA1   colA2 colA3 colB1 colB2 colB3 colC1 colC2 colC3 colD1 colD2 colD3
   <chr>     <chr> <dbl> <dbl> <chr>   <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
 1 Huelmo    PSOE    248  5030 S. A. …     0     0 F         0     0 EXIS…     0     0 FE d…     0     0
 2 Busdongo  PP       44  3928 FE de …     0     0 SAE       0     0 J. U…     0     0 PH        0     0
 3 Duero     PSOE     89  8240 S. A. …     0     0 F         0     0 EXIS…     0     0 SAE       0     0
 4 Coin      PSOE   1637  3968 P-LIB       0     0 SUMAR     0     0 CREE…     0     0 PMR       0     0
 5 Anduva    PSOE    148  5016 EXISTE      0     0 SAE       0     0 J. U…     0     0 PH        0     0
 6 Huelmo    PP       72  1460 NA         NA    NA NA       NA    NA NA       NA    NA NA       NA    NA
 7 Busdongo  PSOE     33  2946 NA         NA    NA NA       NA    NA NA       NA    NA NA       NA    NA
 8 Duero     VOX       2   185 NA         NA    NA NA       NA    NA NA       NA    NA NA       NA    NA
 9 Coin      PP      924  2240 NA         NA    NA NA       NA    NA NA       NA    NA NA       NA    NA
10 Anduva    PP       77  2610 NA         NA    NA NA       NA    NA NA       NA    NA NA       NA    NA
# ℹ 185 more rows
# ℹ 31 more variables: colE1 <chr>, colE2 <dbl>, colE3 <dbl>, colF1 <chr>, colF2 <dbl>, colF3 <dbl>,
#   colG1 <chr>, colG2 <dbl>, colG3 <dbl>, colH1 <chr>, colH2 <dbl>, colH3 <dbl>, colI1 <chr>,
#   colI2 <dbl>, colI3 <dbl>, colJ1 <chr>, colJ2 <dbl>, colJ3 <dbl>, colK1 <chr>, colK2 <dbl>,
#   colK3 <dbl>, colL1 <chr>, colL2 <dbl>, colL3 <dbl>, colM1 <chr>, colM2 <dbl>, colM3 <dbl>,
#   Acol1 <chr>, Acol2 <dbl>, Acol3 <dbl>, colunicipio <chr>
# ℹ Use `print(n = ...)` to see more rows

So the characters in A1, B1, C1 become col1, and all the values in A2, B2, C2, etc go in col2, and A3, B3, C3 etc in col3? But then you also still have colA1, colA2, colA3, etc, all populated with either characters, 0s or NAs...? From the example you are showing, it looks like you don't need/want all this? Why not get rid of those? This code seems to work (see my comments with the piped code part):

# Use same dataset - not shown here, large
col_voto <- c("Cod","A2","A3","B2","B3","C2","C3","D2","D3","E2","E3","F2","F3","G2","G3","H2","H3","I2","I3","J2","J3","K2","K3","L2","L3","M2","M3","N2","N3","O2","O3",        "P2","P3","Q2","Q3","R2","R3","S2","S3","T2","T3","U2","U3","V2","V3","W2","W3","X2","X3","Y2","Y3","Z2","Z3","AA2","AA3","AB2","AB3","AC2","AC3","AD2","AD3",
              "AE2","AE3","AF2","AF3","AG2","AG3","AH2","AH3","AI2","AI3","AJ2","AJ3","AK2","AK3","AL2","AL3","AM2","AM3")

df <- df %>%
  mutate(across(all_of(col_voto), as.numeric)) 

stack_data = function(i) {
  d = select(df, Municipio, starts_with(i), -Cod)
  names(d) = str_replace(names(d), i, 'col')
  d
}

df3 <- map_df(c('A', 'B','C', 'D', 'E', 'F', 'G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','AA','AB','AC','AD','AE','AF','AG',
               'AH','AI','AJ','AK','AL','AM'), stack_data) |>
  select(Municipio, col1, col2, col3) |> # only select the columns you need
  filter(!is.na(col1)) |> # FYI does nothing, tibble stays 195x47
  filter(!is.na(Municipio)) |> # Is this maybe what you want? Somehow there are 5 rows with NA for Municipio
  pivot_longer(c(-'Municipio', -'col1')) |> # This now works since col2 and col3 are both numeric
  pivot_wider(names_from = c(col1, name), values_from = value) |>
# You now also don't need the select code below anymore
# select(Municipio, contains('PSOE'), contains('PP'), contains('IZQESP'), contains('CEUS'), contains('EL PI'), contains('CCa'), contains('GBAI'),
  #        contains('EAJ-PNV'), contains('FO'), contains('AHORA REPÚBLICAS (ERC-EH BILDU-BNG-ARA MÉS)'), contains('JUNTS-UE'), contains('ANDALUCISTAS'),
  #        contains('IE'), contains('VOX'), contains('ESCAÑOS EN BLANCO'), contains('CRT'), contains('PODEMOS'), contains('PFAC'), contains('PCTE'), contains('PCTC'),
  #        contains('PCPE/PCPC'), contains('SALAMANCA-ZAMORA-LEÓN PREPAL'), contains('P-LIB'),  contains('SUMAR'),contains('PACMA'),  contains('CREE EN EUROPA'),
  #        contains('PMR'), contains('S.A.L.F'), contains('F'), contains('EXISTE'), contains('Cs'), contains('FE de las JONS'), contains('SAE'), contains('J.U.E.X.'),
  #        contains('PH'), contains('VOLT'), contains('RECORTES CERO'), contains('PAZ SOBERANÍA REPÚBLICA'),  contains('FIEL'),contains('IZQP - UNIDOS - DEf'),
  #        contains('EXTREMEÑOS, Y SUBSIDIARIAMENTE EX PREX CREX'), contains('glg'), contains('PIRATES.CAT / REBELDES'), contains('NEX')) |>
  mutate_all(~replace_na(., 0))

The final dataframe (I called it df3 to compare with the original) is not exactly the same as the example, but it still contains all the column names that you specified in the select() part, and a few extra (your code did not have PODEMOS in it, but your example did).
Not knowing anything whatsoever about your data, this seems like what you need? If not, please give some more detail.

Edited to that this is the result:

# A tibble: 5 × 79
  Municipio PSOE_col2 PSOE_col3 PP_col2 PP_col3 VOX_col2 VOX_col3 PODEMOS_col2 PODEMOS_col3 Cs_col2
  <chr>         <dbl>     <dbl>   <dbl>   <dbl>    <dbl>    <dbl>        <dbl>        <dbl>   <dbl>
1 Huelmo          248      5030      72    1460       41      831           66         1338      21
2 Busdongo         33      2946      44    3928        4      357           20         1785       3
3 Duero            89      8240       2     185        2      185            0            0       2
4 Coin           1637      3968     924    2240      107      259          296          717     386
5 Anduva          148      5016      77    2610       15      508           15          508      18
# ℹ 69 more variables: Cs_col3 <dbl>, CEUS_col2 <dbl>, CEUS_col3 <dbl>, PH_col2 <dbl>, PH_col3 <dbl>,
#   `AHORA REPÚBLICAS (ERC-EH BILDU-BNG-ARA MÉS)_col2` <dbl>,
#   `AHORA REPÚBLICAS (ERC-EH BILDU-BNG-ARA MÉS)_col3` <dbl>, `FE de las JONS_col2` <dbl>,
#   `FE de las JONS_col3` <dbl>, `RECORTES CERO_col2` <dbl>, `RECORTES CERO_col3` <dbl>,
#   `PCPE/PCPC_col2` <dbl>, `PCPE/PCPC_col3` <dbl>, IZQESP_col2 <dbl>, IZQESP_col3 <dbl>,
#   `JUNTS-UE_col2` <dbl>, `JUNTS-UE_col3` <dbl>, PCTE_col2 <dbl>, PCTE_col3 <dbl>, PACMA_col2 <dbl>,
#   PACMA_col3 <dbl>, ANDALUCISTAS_col2 <dbl>, ANDALUCISTAS_col3 <dbl>, FO_col2 <dbl>, FO_col3 <dbl>, …