I am using sample code from the National Health and Nutrition Examination Survey (NHANES) to see if I can generate the same results with their publicly available data. Specifically, I am trying to get the average number of calories from sugar sweetened beverages consumed by individuals in the dataset (years 2011-2014).
The code uses demographic data (age, RIDAGEYR
) as well as dietary recall data (DR1IFDCD
, are food codes that indicate a sugar sweetened beverage, and DR1DRSTZ
is a status variable which indicates a valid dietary recall). Finally, each individual has a unique identified, SEQN
. Finally, there are also survey design variables (SDMVSTRA
,SDMVPSU
, and WTDRD1
).
I am using R, but their sample code is written in SUDAAN. I am ending up with a mean of 294 kcals, and in their publication they report 179.
Here is my version of their code for the relevant sections only:
# load packages
library(tidyverse)
library(survey)
# download publicly available data
demo_g <-download.file("https://wwwn.cdc.gov/nchs/nhanes/2011-2012/DEMO_G.XPT", tf <- tempfile(), mode="wb")
DEMO_G <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU")]
demo_h <-download.file("https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.XPT", tf <- tempfile(), mode="wb")
DEMO_H <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU")]
#bind demo data into one df
demo <- bind_rows(DEMO_G,DEMO_H)
rm(DEMO_G,DEMO_H)
download.file("https://wwwn.cdc.gov/Nchs/Nhanes/2011-2012/DR1IFF_G.XPT", tf <- tempfile(), mode="wb")
DIET_G <- foreign::read.xport(tf)[,c("SEQN","WTDRD1","DR1IFDCD","DR1IKCAL","DR1DRSTZ")]
download.file("https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DR1IFF_H.XPT", tf <- tempfile(), mode="wb")
DIET_H <- foreign::read.xport(tf)[,c("SEQN","WTDRD1","DR1IFDCD","DR1IKCAL","DR1DRSTZ")]
#bind diet data into one df
diet <- bind_rows(DIET_G,DIET_H)
rm(DIET_G,DIET_H)
# merge diet and demo data
diet_demo <- merge(diet, demo, by= "SEQN")
# create a list of food codes that are considered sugar sweetened beverages (SSBs)
sampcode <- c(42401010,42403010,42404010,64200100,64201010,64201500,64202010,64203020,64204010,64205010,64210010,64213010,64215010,64221010,92101600,92101800,92101920,92101930,92101950,92101960,92121000,92121010,92121020,92130000,92130001,92130020,92301060,92301130,92301160,92301190,92302200,92302400,92302600,92302800,92305000,92305040,92305050,92305800,92306020,92306040,92306090,92306100,92400000,92410110,92410310,92410315,92410330,92410340,92410360,92410390,92410410,92410510,92410550,92410610,92410710,92410810,92411510,92417010,92431000,92432000,92433000,92510610,92510650,92510720,92510730,92511010,92511250,92512050,92512090,92512110,92530410,92530510,92530610,92530950,92531030,92541010,92542000,92550030,92550110,92550350,92552030,92560000,92582100,92582110,92582120,92610010,92610110,92611010,92611100,92611510,92611600,92612010,92613010,92803000,92804000,94100300,94210200,11561010,92101650,92101670,92101925,92101935,92161005,92162005,92204000,92306200,92306610,92307500,92411520,92513000,92550610,92613510,95310200,95310400,95310500,95310550,95310555,95310560,95310600,95310700,95310750,95310800,95311000,95320200,95320500,95321000,95342000,74302000,92101820,92101921,92101923,92101926,92101928,92101931,92101933,92101936,92101938,92101955,92101965,92101970,92101975,92102000,92102010,92102020,92102030,92102040,92102050,92102060,92102070,92102080,92102090,92102100,92102110,92102450,92102600,92102601,92102602,92102610,92102611,92102612,92121001,92121050,92130001,92305910,92307400,92307510,92308000,92308030,92308500,92308530,92309000,92309010,92309500,92510955,92510960,92511015,92513010,92550200,92550360, 92550370,92550380,92610020,92610030)
# get a total number of calories from sugar sweetened beverages by participant
repro <- diet_demo %>%
filter((DR1IFDCD %in% sampcode) & (RIDAGEYR >=20) & (DR1DRSTZ == 1)) %>%
group_by(SEQN) %>% summarize(ssbkcals = sum(DR1IKCAL))
# merge the new ssbkcals variable back into the main data frame then slice it to one row per participant. Finally, use the existing survey weight variable `WTDRD1` to create a new survey weight (analytic guidelines stipulate that you divide by the number of cycles used, here there are two cycles, i.e. 2011-12 and 2013-14).
repro_kcals <- merge(repro, diet_demo, by= "SEQN") %>% group_by(SEQN) %>% slice(1) %>% ungroup() %>% mutate(WT4YR = WTDRD1*1/2)
# set up the complex survey design, with PSUs nested within Strata and the newly derived survey weight
design <- svydesign(data = repro_kcals, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WT4YR, nest=TRUE)
#get mean
svymean(~ssbkcals, design)
Any guidance would be much appreciated.