Preserving underlying values when converting to factor with `haven::as_factor`

YdeB · June 20, 2024, 9:41am

I originally wrote the below as a feature request for the haven github, but read the "Getting help with haven" document before posting and was sent here. I haven't been able to implement a solution that accomplishes the below, neither with haven nor with sjlabelled. Thanks in advance for any suggestions.

"The default behavior of base::factor() is to create a factor with consecutive integers starting from 1, assigning levels to each integer. as_factor ultimately uses base::factor() to create its output. This means that the information in the values of the object of class labelled is lost. This means that it is not possible to import data, convert to factors, manipulate the data and export it again, without losing this information. This can create problems when collaborating with non-R-users or interacting with databases, and create incongruencies with e.g. separately generated codebooks.


library(haven)

x <- labelled(c(1:2, 4), c(level1 = 1, level2 = 2, level3 = 4))

as.integer(x)
#> [1] 1 2 4

as.integer(as_factor(x))
#> [1] 1 2 3

I understand that it is the expected behavior to generate values consisting of consecutive integers starting from 1 when coercing to a factor. Therefore, I request an argument to as_factor that keeps the underlying values of the input vector, assigning the labels to the level corresponding to each value. This argument could have as default the option to keep the current behavior.

Thank you for a very useful package. I apologize if I missed a similar feature request, discussion or existing solution."

nirgrahamuk · June 20, 2024, 10:10am

If you want to work with the original values as factors and retain the ability to get them out as integers you can work like this:


library(haven)

x <- labelled(c(1:2, 4), c(level1 = 1, level2 = 2, level3 = 4))

as.integer(x)
#> [1] 1 2 4

as_factor(x, levels = "values")
#> [1] 1 2 4
#> Levels: 1 2 4
as.integer(as.character(as_factor(x, levels = "values")))
#> [1] 1 2 4

YdeB · June 20, 2024, 10:40am

Thank you for your suggestion. I am aware of the "levels" argument to as_factor; however, using the "values" option simply swaps the loss of one type of information for the loss of another. That is, instead of losing the information contained in the underlying values, you lose the information in the labels. The "both" option preserves both types of information, but not in a way that makes the values available for manipulation.
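For readers following along, the trade-off described here can be seen by comparing the `levels` options side by side. This is a minimal sketch on the toy vector from the first post; the object names `by_label`, `by_value` and `by_both` are made up for illustration.

```r
library(haven)

x <- labelled(c(1, 2, 4), c(level1 = 1, level2 = 2, level3 = 4))

# One factor per `levels` option; each keeps a different part of the metadata
by_label <- as_factor(x, levels = "labels")  # labels only
by_value <- as_factor(x, levels = "values")  # values only, stored as text
by_both  <- as_factor(x, levels = "both")    # "[value] label" strings

levels(by_label)
levels(by_value)
#> [1] "1" "2" "4"
levels(by_both)
#> [1] "[1] level1" "[2] level2" "[4] level3"

# Whatever the option, the integer codes are just positions in levels()
as.integer(by_label)
#> [1] 1 2 3
```

In every case the original values survive only as text inside the level names, if at all.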

nirgrahamuk · June 20, 2024, 11:14am

As both preserves the integer info you want; I wrote you a function you can use to conveniently convert to that representation when you need.


library(haven)

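# Recover the original values from a factor made with levels = "both",
# whose levels look like "[4] level3".
recover_ints <- function(myfactor_with_embedded_ints) {
  # Build one row per level: full level string, label, embedded value
  metadata <- purrr::map_dfr(
    levels(myfactor_with_embedded_ints),
    \(x) {
      xs <- strsplit(x, " ")[[1]]
      data.frame(
        lvl = x,
        nicename = xs[2],
        intval = readr::parse_number(xs[1])
      )
    }
  )

  # Named lookup vector: level string -> value (deframe() is from tibble)
  intlookup <- tibble::deframe(dplyr::select(metadata, lvl, intval))

  # Index by level name rather than by the factor's integer codes
  unname(intlookup[as.character(myfactor_with_embedded_ints)])
}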

(x <- labelled(c(1:2, 4, 2:1), c(level1 = 1, level2 = 2, level3 = 4)))
# <labelled<double>[5]>
# [1] 1 2 4 2 1
#
# Labels:
#  value  label
#      1 level1
#      2 level2
#      4 level3

(xfac <- as_factor(x, levels = "both"))
# [1] [1] level1 [2] level2 [4] level3 [2] level2 [1] level1
# Levels: [1] level1 [2] level2 [4] level3

(x_ints <- recover_ints(xfac))
# [1] 1 2 4 2 1

YdeB · June 20, 2024, 1:17pm

Thank you for providing this function. However, it does not quite give the functionality I am after. I attempted to build on it to modify the underlying values of the factor, but it seems that accessing and modifying the values and value-level pairings is no easy task.

nirgrahamuk · June 20, 2024, 1:56pm

The underlying values of a factor have to be integers, running from 1 up to the number of levels of the factor. This is a fundamental rule of R's data types; if you need something else, it may be a custom data type, but it won't be a factor. From the documentation:

factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character and unique (!anyDuplicated(.) ) entries
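A minimal base R sketch of why the codes behave this way: a factor's integer codes are positions in its `levels` vector, so with three levels a code of 4 would point past the last level.

```r
f <- factor(c("level1", "level2", "level3", "level2"))

unclass(f)                # integer codes plus the "levels" attribute
as.integer(f)             # the codes alone
#> [1] 1 2 3 2
levels(f)[as.integer(f)]  # codes index into levels(), reproducing the labels
#> [1] "level1" "level2" "level3" "level2"
```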

dromano · June 20, 2024, 4:31pm

Hi @YdeB ,

Could you say a little more about the behavior you would like to see, independent of the tools involved? For example, how and why does the object x itself fall short of what you want? Why would you need to pass it through factor() or as_factor()?

YdeB · June 21, 2024, 7:39am

Hi Dromano

Thank you for taking the time to reply.

Fundamentally, what I want to be able to do is receive a .sas7bdat file and a corresponding .sas7bcat file from a colleague, import them into R, work with the data in a way that uses the information in the format catalog contained in the .sas7bcat file and applied to the data in the .sas7bdat file, and finally export the data to SAS again (exporting is a separate issue due to the proprietary nature of the SAS data format, but I believe I have a workaround by going through an SPSS file).

The haven package accomplishes all this nicely by importing the data as an intermediate data structure where variables with associated labels are imported as a vector of class haven_labelled. These are then converted to factors by as_factor, such that the labels are taken as the levels of the factor. This is useful because it allows for nice behavior in the RStudio viewer and when printing tables and figures. The problem is that illustrated in my reprex: the underlying integer of the factor is coerced to consist of consecutive integers starting from one. Thus, when you export data again, the underlying integers will be different to those you imported.

As I see it, the simplest way to fix this would be to find a way to directly access and manipulate the integer vector component of the factor, changing it to an arbitrary vector of integers, along with the component of the factor that maps integer values to levels. However, as @nirgrahamuk says, that may just not be how factors work, though the documentation he quotes does not actually specify that the integers have to be consecutive starting from one. It may still be the case though.

Alternatively, as the "Introduction to labelled" vignette of the labelled package suggests, the simplest workaround may simply be their approach B: data cleaning and recoding before re-exporting, and only then converting to factors for analysis. However, the haven_labelled class is, as I understand it, somewhat fragile and may be lost in certain operations.

I hope this answers your question.

dromano · June 21, 2024, 11:34am

Thank you, @YdeB; this does answer my question. Part of what I was wondering is whether there were constraints on what you were hoping to achieve that might make the problem less tractable. However, the goal of importing from SAS, doing some work, and then exporting the result so another SAS (or SPSS) user could receive the data as you intended seems like one the crowd could help with.

Would you be able to share an example of a table you've created in SAS that, once imported and worked on, you are not able to get into a form you would want to export to SAS to share with a colleague? The reprex you shared is a specific example of a behavior you are suggesting may be a stumbling block to achieving your goal, but to get the most out of the community of users here, it would be best to share an offending table along with the code you've used to try to get it into shape for export, as well as information about how the result falls short of what you need.

YdeB · June 21, 2024, 11:56am

Unfortunately, I will not be able to share a concrete example from my work, as the data I work with is confidential. However, the reprex I shared above shows the only problem I have when exporting to SAS. I export with `haven::write_sav()` and import the resulting file into SAS Enterprise Guide using the task designed to do so. The problem is that the underlying values connected to each label have changed.
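As a point of comparison, here is a hedged sketch of the same export step applied to a column that is still of class haven_labelled rather than a factor. The column name `q`, the labels and the temporary file are made up for illustration, and the SAS import step is not covered. Read back with `read_sav()`, the values should be unchanged.

```r
library(haven)

# Toy data: a labelled column with a gap in its values (hypothetical labels)
df <- tibble::tibble(
  q = labelled(
    c(1, 2, 88, 99),
    c(Yes = 1, No = 2, `Don't know` = 88, `Prefer not to answer` = 99)
  )
)

# Write to a temporary SPSS file and read it back
path <- tempfile(fileext = ".sav")
write_sav(df, path)
back <- read_sav(path)

as.integer(back$q)      # values as stored in the file
attr(back$q, "labels")  # value-label pairs as stored in the file
```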

dromano · June 21, 2024, 12:17pm

Using the actual data is not necessary, just data that illustrates the issue — could you create a simple toy table in SAS?

YdeB · June 24, 2024, 7:49am

I have attached code below for creating a toy table. I am not very familiar with SAS, so it may not be the prettiest code:

/* Create the format */
proc format library=work;
    value levels
        1 = 'level1'
        2 = 'level2'
        4 = 'level3';
run;

/* Create the dataset */
data toydata;
    input value;
    datalines;
1
2
4
;
run;

/* Attach the format to the dataset */
data toydata;
    set toydata;
    format value levels.;
run;

/* Export the dataset as a sas7bdat file */
libname mylib 'C:\YourPath'; /* Change this to your desired directory */
data mylib.toydata;
    set toydata;
run;

/* Save the format catalog as a sas7bcat file */
proc catalog catalog=work.formats;
    copy out=mylib.toyformats;
run;

This is what happens when you load it into R:

library(haven)

x <- read_sas(data_file = "C:/toydata.sas7bdat",
              catalog_file = "C:/toyformats.sas7bcat")

as.integer(x[[1]])
#> [1] 1 2 4
as.integer(as_factor(x[[1]]))
#> [1] 1 2 3

As you can see, the result is the same as in the above reprex as the vector created is the same, just within a tibble.

YdeB · June 24, 2024, 9:00am

Oh, and perhaps more realistically I should say that rather than a jump from 2 to 4, the issue I am facing in my data is consecutive integers denoting meaningful answers and then a jump to e.g. 88 and 99 for "Don't know" and "Prefer not to answer".

dromano · June 24, 2024, 9:54am

Thank you, @YdeB. The last step is to share the table you imported into R by running dput(x) immediately after running:

x <- read_sas(data_file = "C:/toydata.sas7bdat",
              catalog_file = "C:/toyformats.sas7bcat")

and then copying and pasting the output here. Could you do that?

YdeB · June 24, 2024, 10:51am

Here you go:

library(haven)

x <- read_sas(data_file = "C:/toydata.sas7bdat",
              catalog_file = "C:/toyformats.sas7bcat")

dput(x)
#> structure(list(value = structure(c(1, 2, 4), format.sas = "LEVELS", class = c("haven_labelled", 
#> "vctrs_vctr", "double"), labels = c(level1 = 1, level2 = 2, level3 = 4
#> ))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
#> -3L))

dromano · June 24, 2024, 12:28pm

Thanks, @YdeB. Just to clarify: so that folks here can easily copy and paste your dput() output, it's most helpful to paste the output of the dput() function itself, like this:

structure(list(value = structure(c(1, 2, 4), format.sas = "LEVELS", class = c("haven_labelled", 
"vctrs_vctr", "double"), labels = c(level1 = 1, level2 = 2, level3 = 4
))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))

(I'm not sure if you ran the code through the reprex() function, but it includes the #> at the beginning of each line of the dput() output.)

YdeB · June 24, 2024, 12:33pm

That makes sense. I actually had some trouble with the reprex() function, but formatted it in a similar way so as to signal that it was output and not code.

dromano · June 24, 2024, 12:59pm

Here's some of what can be extracted from your toy data:

structure(list(value = structure(c(1, 2, 4), format.sas = "LEVELS", class = c("haven_labelled", 
"vctrs_vctr", "double"), labels = c(level1 = 1, level2 = 2, level3 = 4
))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L)) -> toy_sas

# printing shows that the value column contains a labelled vector
library(tidyverse) # contains `pull()` function for extracting column vector
library(labelled)  # allows `print()` to recognize labelled vector
toy_sas |> 
  pull(value)
#> <labelled<double>[3]>
#> [1] 1 2 4
#> 
#> Labels:
#>  value  label
#>      1 level1
#>      2 level2
#>      4 level3

# label-value pairing (levels) can be extracted as a named vector
toy_sas |> 
  pull(value) |> 
  val_labels()
#> level1 level2 level3 
#>      1      2      4

Created on 2024-06-24 with reprex v2.0.2

So now the question is, what's an example of how you'd like to manipulate this data before sharing with a colleague? That way, we can explore whether the original label-value pairings are necessarily lost in the process.

YdeB · June 24, 2024, 1:37pm

Thank you. I am aware of the structure and contents of objects of class haven_labelled.

I want to convert categorical variables to factors, with the labels becoming the levels of the factor and the attached integer values of the factor remaining the same as in the imported data. The first part is achievable with haven::as_factor, the second is seemingly not.

After this conversion to a factor I will check the data for logical inconsistencies, drop certain observations, possibly impute values to missing values, remove certain variables, create new variables from existing ones, and change the values of some variables to contain less information so as not to divulge sensitive information.

Following these manipulations I will re-export the data to SAS-readable files.
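One possible shape for this workflow, offered as a sketch rather than a haven feature: set the value-label map aside before converting, work with the factor, and rebuild a labelled vector by name lookup just before export. The helper `to_labelled()` and the example labels are hypothetical. It assumes every value has a label and that labels are unique; any level created during analysis that is not in the original map would come back as NA.

```r
library(haven)

x <- labelled(
  c(1, 2, 99, 88, 2),
  c(Yes = 1, No = 2, `Don't know` = 88, `Prefer not to answer` = 99)
)

labs <- attr(x, "labels")  # named vector: label -> original value
f <- as_factor(x)          # factor whose levels are the labels

# ... checks, filtering, tables and plots on f ...

# Hypothetical helper: map each level name back to its original value
to_labelled <- function(f, labs) {
  labelled(unname(labs[as.character(f)]), labels = labs)
}

restored <- to_labelled(f, labs)
as.integer(restored)
#> [1]  1  2 99 88  2
```

Because the lookup goes through the level names, dropping rows or reordering the factor does not disturb the mapping.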

YdeB · June 24, 2024, 1:53pm

And I should add: Simply evaluating whether the label-value pairings are stripped as a result of any one operation is not of particular interest to me. It is not the intended use of the haven package and thus the stability of the behavior is not a concern of the package developers. This makes any use of the package relying on such behaviors fragile.

Compare "Conversion semantics" in the haven documentation:

x1 <- labelled(
  sample(1:5), 
  c(Good = 1, Bad = 5)
)
x1
#> <labelled<integer>[5]>
#> [1] 4 3 2 5 1
#> 
#> Labels:
#>  value label
#>      1  Good
#>      5   Bad
x2 <- labelled(
  c("M", "F", "F", "F", "M"), 
  c(Male = "M", Female = "F")
)
x2
#> <labelled<character>[5]>
#> [1] M F F F M
#> 
#> Labels:
#>  value  label
#>      M   Male
#>      F Female
[...]

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:
as_factor(x1)
#> [1] 4    3    2    Bad  Good
#> Levels: Good 2 3 4 Bad
zap_labels(x1)
#> [1] 4 3 2 5 1

as_factor(x2)
#> [1] Male   Female Female Female Male  
#> Levels: Female Male
zap_labels(x2)
#> [1] "M" "F" "F" "F" "M"