Importing SAS value labels into R

I am struggling to import SAS data labels into R.

Are there tips and tricks for importing multi- gigabyte Sas files with hundreds of categorical variables into R in a way that preserves Variable value labels? The variable name labels always import just fine, but I am really struggling to get the value labels in.

The value labels themselves can be very long strings of texts with frequently appearing special characters (to list a few examples , < > * ( : " `[). On top of that the value labels for each variable can have over 100 labels.

The main command I use is:
DF<-haven::read_sas(sasfilepath.sas7bdat, catalog_file=sascatalogfilepath.bcat)
followed by changing the variables I want into factors with labelled::as_factor

However, 0 to <5% of the variables that should have value labels are in fact labeled when I attempt to use the catalog_file command in read_sas.

Have you run into similar situation and if so what libraries/techniques have you used to solve the issue?

Hi @RLearner96,

Sometimes labels are there, but hidden. Would you be able to share a portion or toy version of your data so folks here could help troubleshoot?

I think you're referring to SAS formats. Value labels is a Stata term. I don't know how to include them but knowing the correct terminology might help you out with further searching.

Good Point, Yes I mean the SAS formats, in particular the text explanation of various codings that are defined in a sas bcat file that in turn was originally generated from a PROC FORMAT block in SAS.

@dromano Thank you for the offer. Unfortunately, I can't share the data and when I went to create a toy dataset I realized haven::read_sas("datafile.sas7bdat", catalog_file="formats.sas7bcat") is not correctly labeling the few variables I thought it was labeling correctly.

I fear the answer is that I need to create a R script to parse out all the formatting in a chunk of SAS code and turn that into factor statements in R.

However, since the SAS formatting statements are pretty standard it seems like this is a common issue that someone else has already solved.

a tiny toy example of exactly that might go a long way.

That would be actually be very useful to see, too — could you share a toy data set along with the bdat and bcat files, as well as a screenshots of what the labeling should actually look like?

I think the problem is that formats are stored separate from data. When I read in datasets from SAS to R, you can see what format is used but typically formats are established with a statement in SAS. For example, a variable Region might be coded as integers 1, 2, 3, 4 and relate to Northeast, Midwest, South, and West. A format would be declared as follows:

proc format;
value regionf 1="Northeast" 2="Midwest" 3="South" 4="West";
run;

data dat_with_format;
   set dat;
   format Region Regionf.; *this applies the format RegionF to the variable Region;
run;

You could then save this dataset in SAS and when reading into haven you'd be able to see the attribute that the format "regionf" is used but it would not actually include the definition of the coding.

@StatSteph - Yes that is a good summary of what is happening. I thought the bcat file was supposed to fix that but it only seems to fix a few variables before stopping.

Below is my best attempt at a toy example. In reality the dataset has 300+ variables and over 1 million rows.
Of the 300 variables 6 had their formatted value labels assigned. However the assignment appears incorrect as shown in the attr and str results below.

I can't look into the bcat file - neither sas or R can open it and I cannot export it to view in a different editor. I also don't have access to the original bdat file.

In a SAS program the formatting is set as
proc format;
VALUE $fgender (notsorted)
'1' = 'Male'
'2' = 'Female' ;

VALUE $fVARIABLE2
'1'= "ABC"
'2'= "XYZ"
'3'="123"
'4'="WOWZA" ;

RUN;

psuedocode
let
data be in file data.sas7bdat
data formats be in formatdata.sas7bcat

file data contains variables GENDER, VARIABLE2

In R read in
df<-haven::read_sas(data.sas7bdat, catalog_file = formatdata.sas7bcat)

str(df$GENDER)
chr+lbl [1:100] 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,1, 1, 1, ......
@ label : chr "Gender"
@ format.sas: chr "$FGENDER"
@ labels : Named chr [1:2] "" "\001"
..- attr(*, "names")= chr [1:2] "Male" "Female"

str(df$VARIABLE2)
chr [1:100] "1" "1" "3" "2" "1" "2" "2" "1" "1" "1" "4" "1" "2" .....

  • attr(*, "label")= chr "Variable 2"
  • attr(*, "format.sas")= chr "$VARIABLE2"

Convert to factor
df$GENDER<-haven::as_factor(df$GENDER)

df$VARIABLE2<-haven::as.factor(df$VARIABLE2)

Look at factorization
table(df$GENDER)
Male Female 1 2
0 0 50 50
attributes(df$GENDER)
$levels
[1] "Male" "Female" "1" "2"

$class
[1] "factor"

$label
[1] "GENDER"

attributes(df$VARIABLE2)
$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$label
[1] "Variable 1"

Could you share this table by running:

dput(head(df, 100))

and pasting the output here, between a pair of triple backticks, like this?

```
paste output here
```

I selected the variables that I used as examples and the variable, called VARIABLE below, that was just before the labels went awry. Some editing has been done on contents but all special characters were preserved

structure(list(SEX = structure(c("2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2"), label = "Sex", format.sas = "$FSEX", labels = c(Male = "", 
Female = "\001"), class = c("haven_labelled", "vctrs_vctr", "character"
)), VARIABLE = structure(c("", "4", "3", "4", "4", "4", 
"4", "2", "2", "3", "3", "", "4", "1", "4", "2", "", "2", "4", 
"", "2", "4", "3", "3", "2", "1", "4", "1", "4", "4", "3", "", 
"", "2", "4", "", "", "", "4", "4", "4", "", "4", "4", "4", "4", 
"4", "3", "", "4", "3", "4", "2", "3", "3", "1", "1", "1", "4", 
"2", "3", "3", "2", "4", "", "4", "4", "", "4", "", "4", "", 
"", "", "4", "3", "4", "4", "4", "", "", "4", "4", "3", "3", 
"", "3", "2", "2", "1", "4", "3", "4", "4", "4", "2", "4", "2", 
"3", "4"), label = "Percent No 2000", format.sas = "$FVARIABLE", labels = c(`>= 26%` = "", 
`26-28.7%` = "\001", `12-14.8%` = "\002", `< 12%` = "\003", `Not Available` = "\004"
), class = c("haven_labelled", "vctrs_vctr", "character")), VARABLE1 = structure(c("1", 
"1", "1", "2", "1", "2", "2", "1", "1", "1", "1", "1", "2", "6", 
"1", "1", "6", "1", "2", "2", "2", "1", "1", "6", "2", "1", "1", 
"1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "2", "1", "1", "1", "1", "1", "1", "1", "5", "2", "2", "2", 
"2", "4", "4", "2", "4", "2", "2", "2", "2", "2", "1", "2", "1", 
"2", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "2", "1", 
"1", "1", "2", "1", "1", "2", "2", "2", "2", "6", "1", "", "6", 
"3", "3", "2", "3", "2", "6", "6", "3"), label = "Variable 1", format.sas = "$FVARIABLE1")), row.names = c(NA, 
-100L), class = c("tbl_df", "tbl", "data.frame"))

I did notice that in your proc format code : there is $fgender which corresponds with your haven produced attribute : @ format.sas: chr "$FGENDER"
but otherwise the proc format code had $fVARIABLE2 and I dont think this would match $VARIABLE2 because of the lack of f between $ and V of Variable2

Yes - that is the variable where the labeling stops working. I don't know why.

Looking at the dput output it's where the variables transition from haven_labeled to tbl_df

If your catalog is corrupt, you said you cant even open it in SAS, then perhaps its corrupted, and you can build a new one; from your proc format codes ?

We just rebuilt the bcat file on Monday. Can you usually open a bcat file in SAS?

This is the same file that SAS uses to bring in formats too, correct? It appears to work in SAS to correctly label data even if I cannot open the bcat file to see it.

Apologies, I have never coded in SAS before a few days ago and it is very different than languages I have experience with!

I guess you wouldnt strictly speaking 'open it' but you can use proc catalog to investigate its contents.

SAS Help Center: Displaying Contents, Changing Names, and Changing a Description

Thank you, but I'm still not sure I understand what is working as expected and what is not — are you able to do the following?

  1. Select a table in SAS that is small enough to illustrate the desired labeling, and provide a screenshot that shows what the desired labeling is. (Important so that folks here can see what the correct output should look like.)
  2. Use read_sas() to import that table, then use dput() on the imported table and share the output here.

It's important that the only place you do anything manually is in SAS to create a toy table that shows the correct labels in SAS, but not once imported, and that no further manipulation happens after you use read_sas() to import the table. That way folks here can see the effect of read_sas() itself.