Concatenate column names

aornugent · March 20, 2020, 7:38am

Hia,

I've several spreadsheets with a repeated RSE column. .name_repair=universal fixes that, but I'd like to concatenate the previous column name so that any associations are named, not inferred by position.

I've a working example in this reprex below, but don't find it particularly satisfying to have to do in two steps. Is there a tidier approach?

Thanks,
A

library(tidyverse)
x <- tribble(~ region, ~ year,
        ~ beef_bulls_at_30_june_no, ~ rse_4,
        ~ beef_calves_at_30_june_no, ~ rse_6, 
        ~ beef_cows_at_30_june_no, ~ rse_8)


colnames(x)          
#> [1] "region"                    "year"                     
#> [3] "beef_bulls_at_30_june_no"  "rse_4"                    
#> [5] "beef_calves_at_30_june_no" "rse_6"                    
#> [7] "beef_cows_at_30_june_no"   "rse_8"
      
y <- rename_at(x, vars(contains("rse")), 
               ~ paste0(colnames(x)[parse_number(.) -1], "_rse"))

colnames(y)
#> [1] "region"                        "year"                         
#> [3] "beef_bulls_at_30_june_no"      "beef_bulls_at_30_june_no_rse" 
#> [5] "beef_calves_at_30_june_no"     "beef_calves_at_30_june_no_rse"
#> [7] "beef_cows_at_30_june_no"       "beef_cows_at_30_june_no_rse"

nirgrahamuk · March 20, 2020, 9:47am

Hi, don't be too harsh on yourself, your solution seems elegant to me !
I see a single call to rename_at that handles nicely your naming issue.
Unless, your reprex isn't fully capturing your challenge, I'd say you already have a great solution to use.
Particularly if rse is a common them for you.
The one other thing you could do would be to turn your rename_at into a function, so that you could call it and pass a different prefix, but that would be makework if you are always processing 'rse'.

dromano · March 20, 2020, 1:30pm

This is a tangent, but I was curious about the kind of data that's contained in those columns. I agree with @nirgrahamuk that your solution is very tidy (and I can't even see which is the second step you refer to!), but the names of your columns suggest that your data may be a candidate for tidying, too.

For example, they suggest you're keeping track of various types of animals at various dates, and for future analysis and visualization, it might be helpful to make these into explicit variables, maybe like this:

tribble(
  ~region, ~year, ~day, ~animal, ~type, ~value, ~rse
)

where some values might be:

tribble(
  ~animal, ~type,
  'cattle', 'bulls'
  'cattle', 'cows'
  'hogs', 'boars',
  'hogs', 'sows',
)

Would that be useful in your case?

aornugent · March 21, 2020, 12:04am

Thank you kindly- but:

I think you're right, I might've oversimplified. x is read in as data, in the code above I have to store it so that I can reference it in the colnames(x) call.

I was having trouble with the nested lambda functions ~ all using . (dot) to reference the input data, but discovered I can write out the function fully to give it another name.

The full challenge is looping over multiple datasets. In this (non-reproducible) example using map naming the df argument lets me reference both colnames(df) and the column name in parse_number(.).

path <- fs::dir_ls("~/Downloads/", glob = "*.asp")

x <- map(path, ~ rio::import(., "html") %>%
         as_tibble(.name_repair = "universal") %>%
           rename_all(snakecase::to_any_case)) %>%
  map(., function(df) {rename_at(df, vars(contains("rse")), 
            ~ paste0(colnames(df)[parse_number(.) - 1], "_rse")) %>%
    mutate_at(vars(-region), as.numeric)}) %>%
  reduce(left_join, by = c("region", "year"))

> colnames(x)
 [1] "region"                          "year"                            "off_farm_contracts"             
 [4] "off_farm_contracts_rse"          "canola_receipts"                 "canola_receipts_rse"            
 [7] "field_peas_receipts"             "field_peas_receipts_rse"         "lupins_receipts"                
[10] "lupins_receipts_rse"             "cotton_receipts"                 "cotton_receipts_rse"            
[13] "barley_receipts"                 "barley_receipts_rse"             "grain_legumes_receipts"         
[16] "grain_legumes_receipts_rse"      "oats_receipts"                   "oats_receipts_rse"              
[19] "off_farm_sharefarming"           "off_farm_sharefarming_rse"       "oilseeds_receipts"              
[22] "oilseeds_receipts_rse"           "rice_receipts"                   "rice_receipts_rse"              
[25] "sorghum_receipts"                "sorghum_receipts_rse"            "total_crop_gross_receipts"      
[28] "total_crop_gross_receipts_rse"   "wheat_receipts"                  "wheat_receipts_rse"             
[31] "total_cash_receipts"             "total_cash_receipts_rse"         "beef_cattle_sold"               
[34] "beef_cattle_sold_rse"            "sheep_sold"                      "sheep_sold_rse"                 
[37] "livestock_transfers_outward"     "livestock_transfers_outward_rse" "other_farm_income"              
[40] "other_farm_income_rse"           "other_livestock_sold"            "other_livestock_sold_rse"       
[43] "total_wool_gross_receipts"       "total_wool_gross_receipts_rse"   "sheep_and_lambs_shorn_no"       
[46] "sheep_and_lambs_shorn_no_rse"    "sheep_flock_at_30_june_no"       "sheep_flock_at_30_june_no_rse"  
[49] "sheep_purchased_no"              "sheep_purchased_no_rse"          "sheep_sold_no"                  
[52] "sheep_sold_no_rse"               "ewes_at_30_june_no"              "ewes_at_30_june_no_rse"         
[55] "lambs_at_30_june_no"             "lambs_at_30_june_no_rse"         "rams_at_30_june_no"             
[58] "rams_at_30_june_no_rse"          "wethers_at_30_june_no"           "wethers_at_30_june_no_rse"      
[61] "total_wool_sold_kg"              "total_wool_sold_kg_rse"          "total_wool_produced_kg"         
[64] "total_wool_produced_kg_rse"      "wool_cut_per_head_kg"            "wool_cut_per_head_kg_rse"

Talk about impenetrable code though. Any suggestions that improve clarity are welcome.

@dromano Appreciate the suggestion, but the 30_june suffix is really just to indicate that the data references financial years. Will definitely investigate parsing the column name when pivoting to a long format.

One last tangent- .name_repair = "universal" strips out some units ($ and %) from the column name. I can't pre-process them because all the rse columns are not unique.

Is there a way to alter this behaviour? Once the symbols are gone, there's no way to infer what unit the data are in.

dromano · March 21, 2020, 12:27am

It's a little hard understand what you're facing without seeing a little more of the actual structure of your tables -- would you be able to post, say, a representative 10 x 10 section of the output of this?

path[[1]] %>% rio::import('html')

aornugent · March 22, 2020, 9:49pm

No problem - thanks for your attention:

> x <- path[1] %>% rio::import('html') %>% head(10)
Warning message:
In rbind(tr = c(td = "Australia", td = "1990", td = "5628", td = "19",  :
  number of columns of result is not a multiple of vector length (arg 871)

> print(x[, 1:10])
      Region Year Off farm contracts ($) RSE Canola receipts ($) RSE.1 Field peas receipts ($) RSE.2 Lupins receipts ($) RSE.3
1  Australia 1990                   5628  19                 415    80                    2513    20                2663    88
2  Australia 1991                   3792  12                 929    23                    1467    13                2275    14
3  Australia 1992                   4387  17                1132    22                    2079    20                3769    15
4  Australia 1993                   6062  16                1299    21                    2086    14                4258    11
5  Australia 1994                   6613  17                3179    14                    2437    13                5928    10
6  Australia 1995                   5051  15                2955    14                    1461    14                4507    12
7  Australia 1996                   6047  14                4849    12                    2202    16                5145    11
8  Australia 1997                   7021  19                5802    16                    1850    15                6074    11
9  Australia 1998                   7180  18                7782    12                    1315    19                5219    11
10 Australia 1999                   5995  15               16647    11                    1152    18                4531    10

> as_tibble(x[, 1:10], .name_repair = "universal") %>%
+   colnames(.)
New names:
* `Off farm contracts ($)` -> Off.farm.contracts....
* `Canola receipts ($)` -> Canola.receipts....
* `Field peas receipts ($)` -> Field.peas.receipts....
* `Lupins receipts ($)` -> Lupins.receipts....
 [1] "Region"                  "Year"                    "Off.farm.contracts...."  "RSE"                     "Canola.receipts...."    
 [6] "RSE.1"                   "Field.peas.receipts...." "RSE.2"                   "Lupins.receipts...."     "RSE.3"

I can't replace the symbols before repairing names because of the repeated RSE columns:

> rename_all(x, str_replace, pattern = "\\$", replacement = "AUD")
Error: `data` must be uniquely named but has duplicate columns
Run `rlang::last_error()` to see where the error occurred.

dromano · March 22, 2020, 10:31pm

A question: Do you need to have syntactic names? Or would using the names as they are ('Off farm contracts ($)') be fine?

aornugent · March 22, 2020, 11:03pm

I'd prefer tidy names a la total_wool_sold_kg and total_wool_sold_kg_rse but I suppose I could make do with 'Total Wool Sold (kg)' and 'Total Wool Sold (kg) RSE'

dromano · March 22, 2020, 11:11pm

The untidy version should be easy, I think: Just remove .name_repair = 'universal' by .name_repair = 'minimal'. Actually, just remove .name_repair = 'universal'.

The tidy version should take a little -- but not much -- regular expression work.

aornugent · March 22, 2020, 11:36pm

Ah ha! Thanks.

.name_repair = "minimal" doesn't do much of anything, but .name_repair = "unique" will enumerate the RSE columns and leave the rest in place.

snakecase should then be able to handle the symbol replacement, but there seems to be a quirk regarding $ as a special regex character. Issue filed here: https://github.com/Tazinho/snakecase/issues/187

Appreciate your help!

dromano · March 22, 2020, 11:39pm

It looked like the enumeration came from rio::import(), so I'd guess you shouldn't need to specify any name repair, and actually, I think rio::import() allows you to output tibbles with setclass.

Does this work?

x <- map(path, ~ rio::import(., "html", setclass =  'tibble') %>%
           rename_all(str_replace_all, '\\s', '_') %>% 
           rename_all(str_to_lower) %>% 
  map(., function(df) {rename_at(df, vars(contains("rse")), 
                                 ~ paste0(colnames(df)[parse_number(.) - 1], "_rse")) %>%
      mutate_at(vars(-region), as.numeric)}) %>%
  reduce(left_join, by = c("region", "year"))

dromano · March 22, 2020, 11:47pm

Oh, and, the reason symbols get stripped out is that names in R are expected to consist only of _, ., letters and numbers (but can't start with _ or numbers); these are called 'syntactic names', and are what .name_repair = 'universal' and snakecase:to_any_case() produce. Non-syntactic names are allowed, but have to be enclosed with backticks, like you can see in this tibble output:

library(tidyverse)
tribble(
~Region, ~Year, ~`Off farm contracts ($)`, ~RSE,
'Australia', 1990,                   5628,  19,
'Australia', 1991,                   3792,   12
) 
#> # A tibble: 2 x 4
#>   Region     Year `Off farm contracts ($)`   RSE
#>   <chr>     <dbl>                    <dbl> <dbl>
#> 1 Australia  1990                     5628    19
#> 2 Australia  1991                     3792    12

^{Created on 2020-03-22 by the reprex package (v0.3.0)}
or in:

`my name` <- 'nonsyntactic'
`my name`
#> [1] "nonsyntactic"

^{Created on 2020-03-22 by the reprex package (v0.3.0)}
(forgot: @aornugent)

aornugent · March 23, 2020, 12:22am

Frustrating! Subsetting x[, 1:10] (helpfully...) enumerates the RSE columns, but I didn't notice when copying the example. Further, this enumeration gives the number of RSE columns whereas .name_repair is the column number itself, which I leverage in colnames(df)[parse_number(.) - 1].

These are the raw column names from rio

> x <- path[1] %>% rio::import(., "html") %>% head(10)

> colnames(x)
 [1] "Region"                            "Year"                              "Off farm contracts ($)"           
 [4] "RSE"                               "Canola receipts ($)"               "RSE"                              
 [7] "Field peas receipts ($)"           "RSE"                               "Lupins receipts ($)"              
[10] "RSE"                               "Cotton receipts ($)"               "RSE"                              
[13] "Barley receipts ($)"               "RSE"                               "Grain legumes receipts ($)"       
[16] "RSE"                               "Oats receipts ($)"                 "RSE"                              
[19] "Off farm sharefarming ($)"         "RSE"                               "Oilseeds receipts ($)"            
[22] "RSE"                               "Rice receipts ($)"                 "RSE"                              
[25] "Sorghum receipts ($)"              "RSE"                               "Total crop gross receipts ($)"    
[28] "RSE"                               "Wheat receipts ($)"                "RSE"                              
[31] "Total cash receipts ($)"           "RSE"                               "Beef cattle sold ($)"             
[34] "RSE"                               "Sheep sold ($)"                    "RSE"                              
[37] "Livestock transfers - outward ($)" "RSE"                               "Other farm income ($)"            
[40] "RSE"                               "Other livestock sold ($)"          "RSE"                              
[43] "Total wool gross receipts ($)"     "RSE"

setclass is a great idea but also fails:

> x <- path[1] %>% rio::import(., "html", setclass = "tibble")
Error: Column names `RSE`, `RSE`, `RSE`, `RSE`, `RSE`, … (and 15 more) must not be duplicated.
Use .name_repair to specify repair.
Run `rlang::last_error()` to see where the error occurred.

name_repair = "unique" gives me enough to work with though. It enumerates the duplicated RSE columns but leaves the descriptive columns in a non-syntactic form. From there, I can parse the special characters with str_replace and then snakecase while #187 is being resolved.

Thanks!

system · April 13, 2020, 12:22am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.