Preserving underlying values when converting to factor with `haven::as_factor`

Part of what has been motivating my questions is: Why convert to factors in the first place if you may be able to achieve what you intend with labelled vectors instead? Or put another way, how does not converting to factors get in the way of what you want to achieve? If the goal is to share a file with a non-R user, how does not using factors affect them?

As I mentioned previously:

[Conversion to factors with haven::as_factor] is useful because it allows for nice behavior in the RStudio viewer and when printing tables and figures.

This nice behavior consists of displaying the levels in the RStudio viewer and using levels in tables and figures.

Furthermore, and as mentioned in my most recent post, haven_labelled vectors are not intended to be used when performing data manipulation or analysis. They are an intermediate data structure:

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels.

Additionally, the data that I want to convert are categorical. Categorical variables are most accurately represented by factors in R. Factors are therefore the expected form for categorical variables, and passing them to other packages yields the expected results.

Now, the central question as I see it is the following: is it possible at all in R for a factor to have non-consecutive integers as its underlying values? If it is, then it should be possible in some way to convert to factors and back again without loss of information. If it is not, then I will have to work around this restriction, likely using the labelled or sjlabelled packages, which build upon haven's haven_labelled class and try to make haven_labelled objects useful in their own right. This would involve using approach B as described in "Introduction to labelled".

Now, I would prefer being able to convert to factor, as that is the native form for categorical variables in R. I recognize that it may simply not be possible to use non-consecutive integers as the underlying values of a factor in R. I opened this topic to try to figure out if it is possible. If it is not, I will work around this issue using the alternative I outlined above.

No, it's not.

Indeed, it's not possible.
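To make the restriction concrete, here is a minimal base R sketch. The survey codes and labels (1 = "Yes", 2 = "No", 9 = "Refused") are made up for illustration and are not taken from the original data. It shows that a factor's underlying integers are positions in its levels, so they always run from 1 to the number of levels, and that the original values survive only if they are kept as the level names themselves.

```r
# Hypothetical survey codes (assumption): 1 = "Yes", 2 = "No", 9 = "Refused"
codes <- c(1, 9, 2, 9, 1)

f <- factor(codes, levels = c(1, 2, 9), labels = c("Yes", "No", "Refused"))

# The underlying integers index into levels(f); the original 9 is stored as 3
as.integer(f)
# 1 3 2 3 1

# If the values themselves are used as levels, they can be recovered,
# but then the descriptive labels are not part of the factor
g <- factor(codes)
as.numeric(levels(g))[g]
# 1 9 2 9 1
```

In other words, a factor can carry either the labels or the original values as its levels, but not both at once.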

If the same nice behavior can be achieved by using labelled vectors in combination with factors, I don't see why either point matters in practice, and I also don't see why using labelled vectors in combination with factors should lead necessarily to the loss of any information.
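As a rough sketch of what combining the two could look like, the example below keeps the labelled vector as the source of truth, derives a factor from it for display, and uses the stored value labels to map the factor back. The data are hypothetical, and the round trip assumes every value has a label and that no two values share a label; it is one possible approach, not a recommendation from the haven authors.

```r
library(haven)

# Hypothetical labelled vector (assumption): values 1, 2, 9 with labels
x <- labelled(c(1, 9, 2, 9, 1), labels = c(Yes = 1, No = 2, Refused = 9))

# Factor version for the viewer, tables and figures
f <- as_factor(x)

# The value labels are a named vector: names are labels, elements are values
labs <- attr(x, "labels")

# Map each factor level name back to its original value, then re-attach labels
restored <- labelled(unname(labs[as.character(f)]), labels = labs)

# Compare the bare values of the original and the restored vector
all(zap_labels(restored) == zap_labels(x))
```

The mapping lives in `labs`, not in the factor, so it has to be kept alongside the factor (for example by keeping the labelled column in the data frame) for the conversion back to work.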

I'm afraid I can't agree with either the characterization that forms the premise of your conclusion, or that your conclusion follows from your premise. My own perspective is that factors are simply a useful tool that aids in the treatment of categorical variables, and I see no danger of arriving at unexpected results despite abridging their use.

It is of course your prerogative to circumscribe the tools and approaches you prefer to use; as one of the folks here who try to devise solutions to the problems users encounter, I am interested in fleshing out where the gaps between tools are, so to speak, and how to address them, and the treatment of labelled data is an area of particular interest.
