Hi. I have financial transactions that I read from fixed-width files, as that is the format the source system provides them in. There are 131 columns in total, of which 27 are numerical fields. The problem is that the values have a lot of leading zeros, but the biggest issue is that the negative sign is at the end of the string: "000000016987.8900-". I found this on the web and it works fine, but it is very slow when I run it on millions of transactions across 27 columns.
To help us help you, please provide a reproducible example. Take a look at reprex, an R package made for exactly that.
If you are new to R, I suggest you learn a bit more about the different tools that exist. The tidyverse ecosystem could help you work with data and get results quickly. This online book will give you a great overview of the possibilities.
About your issue: without a reproducible example, we can only see that you are trying to apply the same function over each column of a data frame. There are indeed some tools to help you with that. Not sure how much faster it could be, as your table seems quite big, but there will be other tools for that too. (R is full of such great tools.)
Here are some hints to look into:
You need to transform character strings: stringr will help you (with str_split, str_c, ...)
You need to work with a data frame and iterate through columns: dplyr will help (with functions like mutate, mutate_all, ...)
These two packages are part of the tidyverse. See the website for a starting place to learn. A minimal sketch of the idea is shown below.
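For example, one way to handle the trailing minus with stringr, on a single character vector, might look like this (just a sketch, not tested against your real files):

library(stringr)

# Detect a trailing "-", strip it, convert to numeric, and flip the sign
fix_trailing_minus <- function(x) {
  is_neg <- str_detect(x, "-$")
  num <- as.numeric(str_replace(x, "-$", ""))
  ifelse(is_neg, -num, num)
}

fix_trailing_minus("000000016987.8900-")
#> [1] -16987.89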
This is an example that shows why "for loops are slow" in R, while demonstrating that avoiding for loops via the apply functions doesn't actually solve the underlying issue of a slow inner loop (which is the real reason they are generally slow). The solution is to vectorize the work. In this case, the vast majority of the time is spent in the loop implied by sapply, which runs the inline-defined function once per entry. This causes a lot of overhead, slowing the process considerably.
My option_vectorized() in the code below avoids any non-vectorized code. There are other changes that might also speed it up, but a little bit of testing suggests that the sapply is the slow part of the original code. I'm sure it could be rewritten in a way that doesn't use sapply, but I wasn't willing to work that hard this morning.
suppressPackageStartupMessages(library(tidyverse))
# Create some sample data with as many rows as I'm willing to wait for
numbers <- rnorm(1e5, mean = 0, sd = 1e5)
number_strings <- formatC(abs(numbers),
                          width = "15",
                          format = "f",
                          flag = "0",
                          digits = 4) %>%
  paste0(if_else(numbers < 0, "-", ""))
# The original function in question
option_strsplit <- function(input_strings) {
  sapply(strsplit(input_strings, "(?=-)", perl = TRUE),
         function(x) as.numeric(paste(sort(x), collapse = "")))
}
# A different algorithm that avoids per-item functions
option_vectorized <- function(input_strings) {
  # Find which strings have a negative sign
  neg_strings <- str_detect(input_strings, "-$")
  # Trim the negative sign if needed and convert to numeric
  # Provides the absolute value
  abs_numeric <- as.numeric(if_else(neg_strings,
                                    str_sub(input_strings, end = -2L),
                                    input_strings))
  # Multiply the absolute value by -1 if needed
  abs_numeric * (neg_strings * (-2) + 1)
}
# Benchmark 10 runs
microbenchmark::microbenchmark(
  option_strsplit(number_strings),
  option_vectorized(number_strings),
  times = 10L
)
#> Unit: milliseconds
#>                                expr        min       lq     mean    median        uq       max neval
#>     option_strsplit(number_strings) 2423.61685 2507.386 2693.591 2607.4278 2960.6643 3105.8842    10
#>   option_vectorized(number_strings)   86.20587  116.448  150.889  126.4484  178.7068  288.6861    10
As @cderv mentioned, mutate_all or similar in the dplyr package is a good way to apply this to each column.
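For instance, assuming the file has already been read into a data frame called raw_data and you know the names of the numeric columns (the names below are made up, so substitute your 27 fields), a sketch with mutate_at() might look like this:

library(dplyr)

# Hypothetical column names; replace with your actual numeric fields
numeric_cols <- c("amount", "fee", "tax")

fixed <- raw_data %>%
  mutate_at(numeric_cols, option_vectorized)
# In recent dplyr versions, the equivalent is
# mutate(across(all_of(numeric_cols), option_vectorized))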
In terms of speeding things up, as nick said, vectorising the work to do as much as possible in one go is the best plan.
In the service of that goal, I would ask if there are any hyphens you do want to keep (for instance in the non-numeric columns): do you actually need to check which columns you are changing?
As another thought, with leading zeros and trailing hyphens, do all of the entries have the same number of characters? If so, it may be possible to treat it as a fixed-width text-splitting problem, which would save checking the detail of the characters at all. This is based on the general principle that how you think about your data determines what you can do with it.
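For example, if every value really does occupy the same number of characters, so the last character is always the sign position (a "-" or some filler such as a space), a purely positional split could avoid pattern matching entirely. This is only a sketch under that assumption:

library(stringr)

fix_fixed_width <- function(x) {
  sign_char <- str_sub(x, -1L)        # the last character is the sign slot
  digits    <- str_sub(x, end = -2L)  # everything before the sign slot
  as.numeric(digits) * ifelse(sign_char == "-", -1, 1)
}

fix_fixed_width(c("000000016987.8900-", "000000016987.8900 "))
#> [1] -16987.89  16987.89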
Considering the amount of data, I'd use data.table:
library(data.table)
# Convert to data.table, unless file has been imported with fread.
setDT(idxtrades)
# move dash from last to first position
fix_dash <- function(x)
  as.numeric(
    ifelse(
      substring(x, nchar(x)) == '-',
      paste0('-', substr(x, 1, nchar(x) - 1)),
      x
    )
  )
# list all numeric columns to recode
cols2update <- c('exchlocaltobase', 'exchclienttobase', 'exchbasetosett', 'iocexec')
# update all columns at once
idxtrades[, (cols2update) := lapply(.SD, fix_dash), .SDcols = cols2update ]
Hi @nick. Thanks for this. The option_vectorized works very well, but I'm still very new to R, so I'm trying to figure out how to apply this to only certain columns.
Hi @thoughtfulnz, I would prefer to run it only on the known numeric columns. I suspect that all the entries have the same number of characters, but I will have to verify.
Hi @datamaps. Thanks for this solution. It works very well seeing that I can specify the columns; however, it is a fair bit slower than @nick's solution if I only run it on one column.
Just thinking a little bit more: treating it as a text-split problem with tidyr, they don't even need to be fixed-width columns, as you can "separate" the text in the column based on the hyphen and ignore the empty piece after the split.
Whoops, I missed that: if the sign might or might not be present for a given transaction, it goes back to being a fixed-number-of-numerals problem when separating, as you want to preserve the -.
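A rough sketch of that separate() idea, assuming a tibble with a hypothetical column named amount; fill = "right" is one way to cope with values that have no trailing "-" (the sign piece just comes back NA):

library(tidyr)
library(dplyr)

df <- tibble(amount = c("000000016987.8900-", "000000016987.8900"))

df %>%
  # split just before the hyphen, if there is one
  separate(amount, into = c("value", "sign"), sep = "(?=-)", fill = "right") %>%
  # rebuild a numeric column with the correct sign
  mutate(amount = as.numeric(value) * if_else(is.na(sign), 1, -1)) %>%
  select(amount)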