Mixed Data Type Columns and Excel

The Problem

Because R's atomic vectors require a single data type per column, openxlsx, openxlsx2, and others like writexl all run into the same problem when attempting to write mixed data types to Excel:
[Screenshot: Pasted image 20241226151936]

It can often be helpful for data presentation / delivery to have markers such as these rather than having 0, blank, or NA cells. It's also not desirable to have the end user manually convert all of these.
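As a concrete (made-up) example of the kind of data I mean: a column that mixes numbers with a text marker has to live in R as a character vector, so every cell - numbers included - ends up written to Excel as text.

dat <- data.frame(
    sample = c("A", "B", "C"),
    result = c("1.2", "<MEAN", "3.4")  # the marker forces the whole column to character
)

str(dat$result)
#>  chr [1:3] "1.2" "<MEAN" "3.4"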

Potential Solutions

List-Cols

While Excel does allow mixed-type columns (and R does allow list-columns), both openxlsx and writexl coerce all values in a given column to a single uniform type - even though, judging from the source code, the resulting Excel workbooks are written cell-by-cell, which would allow mixed-type columns with no issue. So we can write off list-columns as a potential solution (everything was coerced to character in my testing).

Excel Formula

One option was to try something like this using openxlsx, where we use Excel's VALUE function to evaluate each cell (roughly the Excel version of as.numeric):

data |> 
    mutate(
        across(where(is.character), \(x) {
            # If VALUE(x) fails, then replace it with the text version
            y <- paste0("=IFERROR(VALUE(\"", x, "\"), \"", x, "\")")
            class(y) <- "formula"
            y
        })
    )

However, this resulted in a similar-looking warning in Excel, with the same green flags for "inconsistent formula". Again, for non-technical users this might raise some concern about the data integrity, so we can write this one off as well.

Using na.string (openxlsx2)

One option I tested was setting values to NA and then using the na.strings argument to set the marker text. This actually does work well, but it has the downside of requiring you to use the same blank marker for every column in the dataframe.

One way to get around this would be to just write the data col-wise and supply a different na.strings argument for each numeric column.

This solution correctly sets the type of numeric values in Excel while retaining the text marker:
[Screenshot: Pasted image 20241226153339]

I didn't use this solution for two reasons:

  1. I just thought of it while writing this post
  2. It is in theory less extensible since you'd need to use the same marker for all columns of the data (a simple fix is to write your data col-wise, as sketched just below, which is also probably easier than what I ended up doing lol).
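For reference, here is a minimal sketch of that column-wise variant using openxlsx2. The column names and markers are hypothetical, and it assumes the marker cells in numeric columns have already been converted to NA (as in step 1 of the approach below):

library(openxlsx2)
library(purrr)

# hypothetical per-column markers; columns not listed are written as-is
markers <- c(conc = "<LOD", result = "<MEAN")

wb <- wb_workbook() |> wb_add_worksheet("data")

wb <- reduce(seq_along(dat), .init = wb, \(wb, i) {
    nm <- names(dat)[i]
    wb |> wb_add_data(
        x = dat[nm],  # one-column data frame keeps the header
        start_col = i,
        na.strings = if (nm %in% names(markers)) markers[[nm]] else ""
    )
})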

My solution (two-pass writing)

The final, most extensible, and most involved solution is to write to the worksheet twice. This could be done in two different orders (character or numeric first), and the order would depend on the nature of the data; in my case it makes more sense to write the numeric values first, since they differ cell to cell while the text filler is a constant within each column.

The steps for this are as follows:

  1. Create a modified dataframe where NA replaces all of the text marker cells (see the snippet just below this list)
  2. Write the modified dataframe leaving NA cells blank.
  3. Fill in the NA gaps with the text marker, in this example "<MEAN"
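Step 1 is just a mutate + across call - the same code that appears inside the full function later on, with tag being the text marker (e.g. "<MEAN"):

mod_dat <- dat |> 
    mutate(
        # any column containing the marker becomes numeric; marker cells become NA
        across(where(\(x) any(x == tag, na.rm = TRUE)), as.numeric)
    )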

Steps 1 and 2 were simple enough, but writing in all of the NA cells took some organizing. Here was my approach:

1. Get list of NA dimensions to write:

na_rows <- mod_dat |> 
    # for each numeric column, get the row indices that are NA (the former marker cells)
    map(\(x) which(is.na(x) & is.numeric(x))) |> 
    discard(is_empty) |> 
    map(compress)

This compress function is doing a lot of work here. Writing with any Excel writer package requires contiguous data - i.e., you can't skip around between rows in a single write step. This means we can't do all of our NA indices in one shot.

To minimize the number of writes required to fill each target cell with the text marker, we use a "compress" function to identify contiguous runs inside the NA indices. For example, c(1, 2, 3, 7, 10, 11) would return list(c(1, 2, 3), c(7), c(10, 11)). This greatly reduces the number of write steps we need to perform. The compress function is below:

compress <- function(vec) {
    # unlikely since we are selecting cols ourselves
    if (length(vec) == 0) return(list()) 
    
    mainvec <- list()  
    subvec <- c(vec[1])  # start the first group
    
    for (i in seq_along(vec)[-1]) {
        if (vec[i] == vec[i - 1] + 1) {
            # If consecutive, add to the current group
            subvec <- c(subvec, vec[i])
        } else {
            # If not, finalize the current group
            # and start the next one
            mainvec <- c(mainvec, list(subvec))
            subvec <- c(vec[i])  
        }
    }
    
    # append the last group, which the loop never finalizes itself
    c(mainvec, list(subvec))
}
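A quick sanity check on the example from above:

compress(c(1, 2, 3, 7, 10, 11))
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] 7
#> 
#> [[3]]
#> [1] 10 11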

Now we have, for each column, a list of runs that we need to write to the workbook. To do this I used map2 and called wb_dims to get the Excel notation for each run. wb_dims is something you could probably write yourself, but it's nice that it's included as a helper. It also turns each run into a range - c(1, 2, 3, 4) becomes "A2:A5", correctly offsetting for the column headers. The result is a list of entries with n (for recycling our tag) and dims (for targeting which cells).

map2(na_rows, names(na_rows), \(rowvecs, nom) {
    rowvecs |> 
        map(\(x) {
            list(
                dims = wb_dims(x = mod_dat, cols = nom, rows = x),
                n = length(x)
            )
        }) 
}) |> unlist(recursive = FALSE) # don't need col names anymore, we have dims!
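The flattened result is just a list of write instructions, each with a dims range and an n, roughly like this (the values here are made up for illustration):

# illustrative shape of the flattened output
list(
    list(dims = "B5:B7", n = 3),
    list(dims = "B12", n = 1),
    list(dims = "D3:D4", n = 2)
)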

No, wb_dims is not vectorized unfortunately:

wb_dims(x = starwars, rows = c(1:10, 50:55))
# [1] "A2:N2,A3:N3,A4:N4,A5:N5,A6:N6,A7:N7,A8:N8,A9:N9,A10:N10,A11:N11,A51:N51,A52:N52,A53:N53,A54:N54,A55:N55,A56:N56"

2. Writing at specified dimensions to overwrite with the marker

So now we have a list of dims to which we can apply our writer function and fill in the gaps:

process_sheet_wb <- function(wb, sheet, dat, tag) {
    # convert marker columns to numeric; marker cells become NA
    mod_dat <- dat |> 
        mutate(
            across(where(\(x) any(x == tag, na.rm = TRUE)), as.numeric)
        )
    
    wb_edit <- wb |> 
        wb_add_worksheet(sheet = sheet) |> 
        wb_add_data_table(x = mod_dat, na.strings = "")
    
    na_dims <- create_na_dims_wb(mod_dat)
    
    na_dims |> 
        reduce(.init = wb_edit, \(init, dim) {
            init |> 
                # fill each contiguous NA run with the marker text
                wb_add_data(
                    x = rep(tag, dim$n),
                    dims = dim$dims
                )
        })
}
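Hypothetical usage of the function above - results_dat, the sheet name, and the output file are made up for illustration:

wb_workbook() |>
    process_sheet_wb(sheet = "results", dat = results_dat, tag = "<MEAN") |>
    wb_save("results.xlsx")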

3. Fixing Performance Issues

After testing this function with a 10-sheet workbook, though, I found it took over a minute! The largest dataframe was around 9,000 rows by perhaps 6 columns, so not nearly enough to choke any actual processing.

I initially suspected my compress function and wondered if, for long sequences, it could be a net negative on write performance - however, I tested it, and even rewriting it in C++ made no difference, so it was definitely not the issue.

I have had issues in the past with writing many individual blocks of formatting using openxlsx, so I was immediately suspicious of my many write steps - however, I had already minimized those as much as possible, so the next place to look was parallelization.

I tried two different schemes: furrr and foreach + doParallel.

Both solutions were easy enough to write up, but foreach really makes sure you know you are spawning new R processes by requiring you to stipulate the libraries and objects (read: functions) you need to expose to each process:

furrr
parallel_process_sheets <- function(dat_list) {
    # tried multicore vs. multisession, no change
    future::plan(multicore)
    
    wb <- wb_workbook()
    
    formatted_books <- furrr::future_map(dat_list, \(dat) {
        if (is.na(dat$tag)) {
            wb_workbook() |>
                wb_add_worksheet(sheet = dat$name) |>
                wb_add_data_table(sheet = dat$name, x = dat$data, na.strings = "")
        } else {
            wb_workbook() |> process_sheet(dat$name, dat$data, dat$tag)
        }
    })
    
    future::plan(sequential) # setting it back since plan is global
    
    formatted_books # to be joined using reduce + wb_clone_worksheet
}
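As the comment above notes, the per-sheet workbooks then need to be merged back into one. A rough sketch of that merge step, assuming a recent openxlsx2 where wb_clone_worksheet() can copy a sheet from another workbook via its from argument:

books <- parallel_process_sheets(dat_list)

# fold the single-sheet workbooks into the first one
combined <- reduce(books[-1], .init = books[[1]], \(acc, piece) {
    nm <- wb_get_sheet_names(piece)  # each piece has exactly one sheet
    wb_clone_worksheet(acc, old = nm, new = nm, from = piece)
})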
foreach
foreach_process_sheets <- function(dat_list) {

    cl <- makeCluster(4)
    registerDoParallel(cl)

    formatted_books <- foreach(
        dat = dat_list,
        .packages = c("openxlsx2", "dplyr", "purrr"), # may need more
        .export = c("process_sheet", "create_na_dims", "compress")
    ) %dopar% {
        if (is.na(dat$tag)) {
            wb_workbook() |>
                wb_add_worksheet(sheet = dat$name) |>
                wb_add_data_table(sheet = dat$name, x = dat$data, na.strings = "")
        } else {
            wb_workbook() |> process_sheet(dat$name, dat$data, dat$tag)
        }
    }

    stopCluster(cl)

    formatted_books
}
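For reference, both functions expect dat_list entries shaped roughly like this - the names, data, and tags are made up for illustration:

dat_list <- list(
    list(name = "summary", data = summary_dat, tag = NA),       # no marker, plain write
    list(name = "results", data = results_dat, tag = "<MEAN")   # marker column(s) present
)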

These are in seconds, mind you - no improvement at all for parallelization!

expr                               min    lq     mean   median  uq     max    neval
process_sheets(results)           74.78  75.40  77.56  77.62   78.41  81.58      5
parallel_process_sheets(results)  74.49  75.27  76.69  75.75   77.05  80.87      5

4. Profiling

So after trying a lot of different layouts, I decided to use profvis to see where the problem actually was, since my guess was clearly very wrong:

The highlighted row shows that, of all things, our wb_dims call is the main problem! While I could go in and see how wb_dims itself might be improved, I decided to fix my usage of it instead. (Additionally, it looks like the assert_class checks account for almost the entirety of the slowdown, which may point towards a bug? Unsure.)

The problem was that I was using it while I was mapping over my list of contiguous cells for each column, so it was being called a lot (likely tens of thousands of times in total). This is the map extracted from the code in step 1:
Original:

map2(na_rows, names(na_rows), \(rowvecs, nom) {
    rowvecs |> 
        map(\(x) {
            list(
                # here is the problem!
                dims = wb_dims(x = mod_dat, cols = nom, rows = x),
                n = length(x)
            )
        }) 
}) |> unlist(recursive = FALSE) # don't need col names anymore, we have dims!

Revised:

map2(na_rows, names(na_rows), \(rowvecs, nom) {
    
    # now only once per col rather than once per seq.
    col_letter <- wb_dims(x = mod_dat, cols = nom, rows = 1) |> 
        str_extract("(?i)[a-z]+")
    
    rowvecs |> 
        map(\(x) {
            list(
                # making dims by hand to avoid wb_dims for each run
                dims = col_letter |> 
                    # + 1 to skip the header row
                    paste0(c(min(x + 1), max(x + 1))) |> 
                    unique() |> 
                    paste0(collapse = ":"),
                n = length(x)
            )
        }) 
}) |> unlist(recursive = FALSE) # flatten to one list of dims, as before
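To make the hand-built dims concrete, here is what that paste0 pipeline produces for a hypothetical run covering data rows 4 through 11 of column B:

paste0("B", c(min(4:11 + 1), max(4:11 + 1))) |> 
    unique() |> 
    paste0(collapse = ":")
#> [1] "B5:B12"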

With the revision, we call wb_dims only once for each column rather than for each subvec of each column. Now we can ditch the parallel approach and compare things with our new scheme - again, these are in seconds. We see a >10X improvement!!

expr                              min    lq     mean   median  uq     max    neval
process_sheets(results)          74.69  75.06  76.87  77.61   77.90  79.11      5
process_sheets_revised(results)   5.34   5.34   5.49   5.36    5.54   5.86      5

At this point we don't even need parallel writing - 5 s is completely manageable.

Conclusions

I think the biggest lesson learned here was to always profile first before trying to handle things you think are the problem - if I hadn't tried profvis I would have never guessed that wb_dims was the main contributor to the slowdown, and I probably would have assumed I was doing the parallelization incorrectly.

As a general rule, I have also learned that calling package helper functions inside very long or nested loops is a bad idea - however convenient they are, they often repeat validation work on every call and are likely to be slower than doing it yourself in base R. All I needed to do was take an extra 5 minutes to format the dims for openxlsx2 myself with two paste0 calls.

Recommendations

In the long term, I think we desperately need a mixed-type Excel writer. In Python, pandas uses the object dtype to represent mixed-type dataframe columns, and its built-in Excel writer handles them completely fine. It does not really make sense that R supports mixed-type columns via list-columns but no Excel writer package correctly handles them when actually writing the Excel workbook.

Since it appears that at least openxlsx writes cell-by-cell, there would likely be no large performance hit from including a mixed-type option that lets Excel infer data types rather than coercing each column to a uniform type prior to writing.

Thanks for reading, and if you know anything deeper about writing excel files I would love to hear ideas for making mixed-types and/or Excel type-inference deference (lol) possible!
