Use SI unit prefix in a readr column

I have a table on disk that uses suffixes like "k" to mean 1e3, "M" to mean 1e6 etc...

library(readr)

df <- data.frame(real = c(1,1000,1000000),
                 commas = c("1", "1,000", "1,000,000"),
                 suffix = c("1", "1k", "1M"))

write_tsv(df, "units.tsv")

While {readr} has no trouble reading in the explicit values as integers, and reading the values with a comma separator using col_number(), I can't think of an easy way to read the suffix column correctly:

read_tsv("units.tsv")
#> Rows: 3 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): suffix
#> dbl (1): real
#> num (1): commas
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 3
#>      real  commas suffix
#>     <dbl>   <dbl> <chr> 
#> 1       1       1 1     
#> 2    1000    1000 1k    
#> 3 1000000 1000000 1M

read_tsv("units.tsv",
         col_types = cols(                                                                                                                        
           real = col_integer(),
           commas = col_number(),
           suffix = col_number()
         ))
#> # A tibble: 3 × 3
#>      real  commas suffix
#>     <int>   <dbl>  <dbl>
#> 1       1       1      1
#> 2    1000    1000      1
#> 3 1000000 1000000      1

Created on 2023-08-03 with reprex v2.0.2

Does anyone know of a way to specify this in {readr}/{vroom}?

these seem bespoke, at least in so far as k is lowercase whilst the M is uppercase, if there are set values for these, I would think to made a lookup table, and then apply it as a multiplication to the values sans prefix.
for example

library(tidyverse)
df <- data.frame(
  real = c(1, 1000, 1000000),
  commas = c("1", "1,000", "1,000,000"),
  suffix = c("1", "1k", "1M")
)

write_tsv(df, "units.tsv")

(lookup <- data.frame(
  suffix_char_part = c("k", "M"),
  mult = c(10^3, 10^6)
))

read_tsv("units.tsv",
  col_types = cols(
    real = col_integer(),
    commas = col_number(),
    suffix = col_character()
  )
) |>
  mutate(
    suffix_num_part = parse_number(suffix),
    suffix_char_part = str_replace(
      suffix,
      as.character(suffix_num_part),
      ""
    )
  ) |>
  left_join(lookup) |>
  mutate(post_mult = suffix_num_part * ifelse(!is.na(mult),
    mult, 1
  ))
# A tibble: 3 × 7
   real commas suffix suffix_num_part suffix_char_part  mult post_mult
  <int>  <dbl> <chr>            <dbl> <chr>            <dbl>     <dbl>
1   1e0    1e0 1                    1 ""                NA           1
2   1e3    1e3 1k                   1 "k"                1e3      1000
3   1e6    1e6 1M                   1 "M"                1e6   1000000

Being standard SI unit prefixes, I don't know if I'd consider them "bespoke". But apparently enough for typical R use.

I guess that's the approach that makes most sense.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.