I've several spreadsheets with a repeated RSE column. .name_repair=universal fixes that, but I'd like to concatenate the previous column name so that any associations are named, not inferred by position.
I've a working example in this reprex below, but don't find it particularly satisfying to have to do in two steps. Is there a tidier approach?
Hi, don't be too harsh on yourself, your solution seems elegant to me !
I see a single call to rename_at that handles nicely your naming issue.
Unless, your reprex isn't fully capturing your challenge, I'd say you already have a great solution to use.
Particularly if rse is a common them for you.
The one other thing you could do would be to turn your rename_at into a function, so that you could call it and pass a different prefix, but that would be makework if you are always processing 'rse'.
This is a tangent, but I was curious about the kind of data that's contained in those columns. I agree with @nirgrahamuk that your solution is very tidy (and I can't even see which is the second step you refer to!), but the names of your columns suggest that your data may be a candidate for tidying, too.
For example, they suggest you're keeping track of various types of animals at various dates, and for future analysis and visualization, it might be helpful to make these into explicit variables, maybe like this:
I think you're right, I might've oversimplified. x is read in as data, in the code above I have to store it so that I can reference it in the colnames(x) call.
I was having trouble with the nested lambda functions ~ all using . (dot) to reference the input data, but discovered I can write out the function fully to give it another name.
The full challenge is looping over multiple datasets. In this (non-reproducible) example using map naming the df argument lets me reference both colnames(df) and the column name in parse_number(.).
Talk about impenetrable code though. Any suggestions that improve clarity are welcome.
@dromano Appreciate the suggestion, but the 30_june suffix is really just to indicate that the data references financial years. Will definitely investigate parsing the column name when pivoting to a long format.
One last tangent- .name_repair = "universal" strips out some units ($ and %) from the column name. I can't pre-process them because all the rse columns are not unique.
Is there a way to alter this behaviour? Once the symbols are gone, there's no way to infer what unit the data are in.
It's a little hard understand what you're facing without seeing a little more of the actual structure of your tables -- would you be able to post, say, a representative 10 x 10 section of the output of this?
> x <- path[1] %>% rio::import('html') %>% head(10)
Warning message:
In rbind(tr = c(td = "Australia", td = "1990", td = "5628", td = "19", :
number of columns of result is not a multiple of vector length (arg 871)
> print(x[, 1:10])
Region Year Off farm contracts ($) RSE Canola receipts ($) RSE.1 Field peas receipts ($) RSE.2 Lupins receipts ($) RSE.3
1 Australia 1990 5628 19 415 80 2513 20 2663 88
2 Australia 1991 3792 12 929 23 1467 13 2275 14
3 Australia 1992 4387 17 1132 22 2079 20 3769 15
4 Australia 1993 6062 16 1299 21 2086 14 4258 11
5 Australia 1994 6613 17 3179 14 2437 13 5928 10
6 Australia 1995 5051 15 2955 14 1461 14 4507 12
7 Australia 1996 6047 14 4849 12 2202 16 5145 11
8 Australia 1997 7021 19 5802 16 1850 15 6074 11
9 Australia 1998 7180 18 7782 12 1315 19 5219 11
10 Australia 1999 5995 15 16647 11 1152 18 4531 10
> as_tibble(x[, 1:10], .name_repair = "universal") %>%
+ colnames(.)
New names:
* `Off farm contracts ($)` -> Off.farm.contracts....
* `Canola receipts ($)` -> Canola.receipts....
* `Field peas receipts ($)` -> Field.peas.receipts....
* `Lupins receipts ($)` -> Lupins.receipts....
[1] "Region" "Year" "Off.farm.contracts...." "RSE" "Canola.receipts...."
[6] "RSE.1" "Field.peas.receipts...." "RSE.2" "Lupins.receipts...." "RSE.3"
I can't replace the symbols before repairing names because of the repeated RSE columns:
> rename_all(x, str_replace, pattern = "\\$", replacement = "AUD")
Error: `data` must be uniquely named but has duplicate columns
Run `rlang::last_error()` to see where the error occurred.
I'd prefer tidy names a latotal_wool_sold_kg and total_wool_sold_kg_rse but I suppose I could make do with 'Total Wool Sold (kg)' and 'Total Wool Sold (kg) RSE'
The untidy version should be easy, I think: Just remove .name_repair = 'universal' by .name_repair = 'minimal'. Actually, just remove .name_repair = 'universal'.
The tidy version should take a little -- but not much -- regular expression work.
.name_repair = "minimal" doesn't do much of anything, but .name_repair = "unique" will enumerate the RSE columns and leave the rest in place.
snakecase should then be able to handle the symbol replacement, but there seems to be a quirk regarding $ as a special regex character. Issue filed here: https://github.com/Tazinho/snakecase/issues/187
It looked like the enumeration came from rio::import(), so I'd guess you shouldn't need to specify any name repair, and actually, I think rio::import() allows you to output tibbles with setclass.
Oh, and, the reason symbols get stripped out is that names in R are expected to consist only of _, ., letters and numbers (but can't start with _ or numbers); these are called 'syntactic names', and are what .name_repair = 'universal' and snakecase:to_any_case() produce. Non-syntactic names are allowed, but have to be enclosed with backticks, like you can see in this tibble output:
library(tidyverse)
tribble(
~Region, ~Year, ~`Off farm contracts ($)`, ~RSE,
'Australia', 1990, 5628, 19,
'Australia', 1991, 3792, 12
)
#> # A tibble: 2 x 4
#> Region Year `Off farm contracts ($)` RSE
#> <chr> <dbl> <dbl> <dbl>
#> 1 Australia 1990 5628 19
#> 2 Australia 1991 3792 12
Created on 2020-03-22 by the reprex package (v0.3.0)
or in:
Frustrating! Subsetting x[, 1:10] (helpfully...) enumerates the RSE columns, but I didn't notice when copying the example. Further, this enumeration gives the number of RSE columns whereas .name_repair is the column number itself, which I leverage in colnames(df)[parse_number(.) - 1].
> x <- path[1] %>% rio::import(., "html", setclass = "tibble")
Error: Column names `RSE`, `RSE`, `RSE`, `RSE`, `RSE`, … (and 15 more) must not be duplicated.
Use .name_repair to specify repair.
Run `rlang::last_error()` to see where the error occurred.
name_repair = "unique" gives me enough to work with though. It enumerates the duplicated RSE columns but leaves the descriptive columns in a non-syntactic form. From there, I can parse the special characters with str_replace and then snakecase while #187 is being resolved.