Debug problem in haven::read_dta

jimvine · December 8, 2020, 12:40pm

Hello. I have a problem with using haven::read_dta but cannot produce a reproducible example as it occurs when I am reading a data file that I cannot share.

So, instead, I am seeking advice on how I might go about tracking down the cause of the problem.

I am reading a fairly large Stata .dta file (over 10,000 observations on over 1,000 variables). It mostly seems to work: most of the data is present, and I do not see any warnings or error messages on loading it.

However, at least 2 of my columns are not loading correctly.

In each of the 2 problematic columns, about half of the values should be "-8" for inapplicable, and the rest should be strings with about 8 to 30 characters each, mostly letters, numbers and spaces, and the odd other character like brackets ( ) and pipes |. Many of these text values will occur more than once, but some will be unique.

All of the "-8" values load correctly. However, most of the other values appear as NA. Oddly, not all of them do: in one column, one entry appears 5 times. Looking at the data when loaded using readstata13, this seems to be the correct number of times that entry occurs. It also happens to be the entry in the first row of the file. So, I have 5 occurrences of one correct entry, several thousand correct "-8" entries, and several thousand incorrect NA entries, which should contain a variety of text values.
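To make the breakdown above concrete, here is a minimal sketch of how one such column could be tallied in R. The helper name `summarise_column` and the toy values are made up for illustration; they are not from the real file.

```r
# Hypothetical helper: count how a character column splits into
# NA, the "-8" inapplicable code, and any other text.
summarise_column <- function(x) {
  x <- as.character(x)
  is_other <- !is.na(x) & x != "-8"
  c(
    n_na             = sum(is.na(x)),
    n_minus8         = sum(!is.na(x) & x == "-8"),
    n_other          = sum(is_other),
    n_distinct_other = length(unique(x[is_other]))
  )
}

# Toy data standing in for one problematic column (values are invented).
toy <- c("first entry (a)", "-8", NA, "-8", "first entry (a)", NA)
print(summarise_column(toy))
```

Running the same helper on the column as read by each package should show whether the "-8" counts agree and how many text values have turned into NA.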

In the other column I have something similar: the "-8" values are all present and correct; most of the remainder are incorrect; but there are two values that appear (one about 50 times, one about 2 times).

The .dta file loads correctly into Stata. If I use readstata13::read.dta13 to read it into R, both of these columns also look OK.
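Since readstata13 gives a reading I trust, one way to narrow things down might be to compare the two versions of a column row by row and look at what the failing values have in common. The sketch below is only an illustration: `find_mismatches` is a hypothetical helper, and the two vectors are invented stand-ins for the same column as read by haven and by readstata13.

```r
# Hypothetical helper: list the rows where the haven reading differs
# from a reference reading of the same column.
find_mismatches <- function(haven_col, ref_col) {
  a <- as.character(haven_col)
  b <- as.character(ref_col)
  stopifnot(length(a) == length(b))
  # A row differs if exactly one side is NA, or both are present but unequal.
  differs <- xor(is.na(a), is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
  data.frame(
    row       = which(differs),
    haven     = a[differs],
    reference = b[differs],
    # Length of the reference value, in case the failures share a length pattern.
    ref_nchar = nchar(b[differs]),
    stringsAsFactors = FALSE
  )
}

# Invented stand-ins for one column read by the two packages.
from_haven       <- c("first entry (a)", "-8", NA, "-8", "first entry (a)", NA)
from_readstata13 <- c("first entry (a)", "-8", "second | entry", "-8",
                      "first entry (a)", "third entry 123")
print(find_mismatches(from_haven, from_readstata13))
```

With the real file, the two inputs would be the same column taken from the data frames returned by haven::read_dta and readstata13::read.dta13. Reading only the affected columns (I believe read_dta has a `col_select` argument and read.dta13 a `select.cols` argument for this) should keep the comparison quick.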

Thanks in advance for any pointers about how I might investigate this issue, and apologies for not being able to share a reproducible example.
