Adding Survey Weights and Bootstrapping

Hi!
My data file has a column of person-level weights (already calculated for each respondent), and I was wondering how one goes about linking or "attaching" the person-level weights to the variables of interest in RStudio. I also have a bootstrap file, which I assume I would link afterwards...

Any help would be greatly appreciated, as this is my first time using RStudio and adding weights/bootstrapping.

@StatSteph your replies to some of the other posts that I have been reading have been super helpful and I was wondering if you might have some advice for me regarding my question posted?

I'm not sure what you mean by attaching. Do you mean merging? Do you have two data.frames with weights on one and responses on another? It would be very helpful to give more information and provide a reproducible example.

See FAQ: What's a reproducible example (`reprex`) and how do I create one? and FAQ: How to do a minimal reproducible example (reprex) for beginners

I have one imported file which contains the data. In that file, one of the columns holds the person-level weights (PUMFWGHT) for each unique ID. I have tried using svrepdesign to merge the person-level weights to the variables of interest but have not had much luck. Not sure if this makes sense... All my variables are categorical (snapshot below):

ID, PUMFWGHT, Age, Sex, Ethnic Identity
123, 25, 1, 2, 1
124, 10, 3, 1, 3
125, 14, 4, 1, 2
126, 23, 3, 2, 1

This is still not a reproducible example. You mentioned having a file with the bootstrap weights on it too. What does that look like? Please refer to the documentation on how to share your data (maybe using dput, for example) rather than how you pasted it here.

A very simple and very effective way to supply some data is to use the dput() command.

dput(mydata)

and then simply copy the output and paste it here. If you have a very large data set then a sample should be fine. To supply us with 100 rows of your data set do

dput(head(mydata , 100))

where mydata is the name of your dataframe or tibble.
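
As a rough illustration of the round trip (a minimal sketch; the toy data frame below is made up and only stands in for mydata), the text printed by dput() can be evaluated in a fresh session to rebuild the same object:

```r
# Hypothetical stand-in for your own data frame
mydata <- data.frame(
  PUMFID   = c(30000L, 30001L, 30003L),
  PUMFWGHT = c(25.5, 10.25, 23.125)
)

# Print a copy-pasteable description of the first two rows
dput(head(mydata, 2))

# Capture the dput() text and evaluate it, as a reader of the post would
txt <- capture.output(dput(mydata))
rebuilt <- eval(parse(text = paste(txt, collapse = "\n")))

# Compare the rebuilt object with the original
all.equal(mydata, rebuilt)
```

This is why dput() output is preferred over a pasted table: column types and names survive the copy and paste.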

Thank you both!
Is this better?

dput(APS[1:20, c(1:7)])
structure(list(PUMFID = c(30000L, 30001L, 30003L, 30005L, 30006L,
30007L, 30009L, 30011L, 30012L, 30013L, 30014L, 30015L, 30016L,
30017L, 30019L, 30020L, 30021L, 30022L, 30023L, 30026L), PUMFWGHT = c(25.3256,
10.257, 23.1295, 10.759, 42.7018, 10.9332, 25.9796, 25.5107,
43.9171, 21.1154, 60.9822, 19.3566, 48.5176, 51.538, 23.9373,
64.5496, 169.0191, 11.9627, 33.3693, 84.0422), GEO_PC = c(3L,
2L, 1L, 4L, 1L, 4L, 1L, 1L, 1L, 2L, 3L, 2L, 1L, 3L, 3L, 3L, 1L,
2L, 1L, 3L), GEO_INU = c(3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), AGE_YRSG = c(2L,
2L, 2L, 2L, 6L, 3L, 1L, 2L, 6L, 3L, 6L, 2L, 2L, 5L, 2L, 5L, 1L,
4L, 6L, 1L), SEX = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L), PROXY = c(2L, 2L, 2L,
1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
1L)), row.names = c(NA, 20L), class = "data.frame")

This is the bootstrap file:
dput(aps_2017_pumf_bsw_eng[1:10, c(1:5)])
structure(list(PUMFID = structure(c(30000, 30001, 30003, 30005,
30006, 30007, 30009, 30011, 30012, 30013), label = "Public Use Microdata file identification number"),
PUMFWGHT = structure(c(25.3256, 10.257, 23.1295, 10.759,
42.7018, 10.9332, 25.9796, 25.5107, 43.9171, 21.1154), label = "Survey weight of a person, for Public Use Microdata file"),
WRPP0001 = c(22.7792, 10.5725, 21.5607, 10.2069, 37.9878,
10.7163, 22.7959, 24.6361, 51.068, 24.7117), WRPP0002 = c(16.9974,
10.374, 16.6935, 9.8463, 56.3002, 13.0115, 35.7703, 37.7712,
59.1396, 13.2218), WRPP0003 = c(22.6349, 9.9881, 19.8387,
10.7217, 47.0602, 9.9361, 28.2887, 24.6291, 46.5755, 19.2514
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
), label = "APS_2017_PUMF_BSW_ENG")

Given

dta <- data.frame(
  PUMFID = c(
    30000, 30001, 30003, 30005, 30006, 30007, 30009, 30011, 30012, 30013,
    30014, 30015, 30016,
    30017, 30019, 30020, 30021, 30022, 30023, 30026
  ),
  PUMFWGHT = c(
    25.3256, 10.257, 23.1295, 10.759, 42.7018, 10.9332, 25.9796,
    25.5107, 43.9171, 21.1154, 60.9822, 19.3566, 48.5176, 51.538,
    23.9373, 64.5496, 169.0191, 11.9627, 33.3693, 84.0422
  ),
  GEO_PC = c(3, 2, 1, 4, 1, 4, 1, 1, 1, 2, 3, 2, 1, 3, 3, 3, 1, 2, 1, 3),
  GEO_INU = c(3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
  AGE_YRSG = c(2, 2, 2, 2, 6, 3, 1, 2, 6, 3, 6, 2, 2, 5, 2, 5, 1, 4, 6, 1),
  SEX = c(1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1),
  PROXY = c(2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1)
)

booted <- data.frame(
  PUMFID = c(
    30000, 30001, 30003, 30005, 30006, 30007, 30009, 30011, 30012, 30013
  ),
  PUMFWGHT = c(
    25.3256, 10.257, 23.1295, 10.759, 42.7018, 10.9332, 25.9796, 25.5107, 
    43.9171, 21.1154
  ),
  WRPP0001 = c(
    22.7792, 10.5725, 21.5607, 10.2069, 37.9878, 10.7163, 22.7959, 
    24.6361, 51.068, 24.7117
  ),
  WRPP0002 = c(
    16.9974, 10.374, 16.6935, 9.8463, 56.3002, 13.0115, 35.7703, 37.7712,
    59.1396, 13.2218
  ),
  WRPP0003 = c(
    22.6349, 9.9881, 19.8387, 10.7217, 47.0602, 9.9361, 28.2887, 24.6291, 
    46.5755, 19.2514
  )
)

dta
#>    PUMFID PUMFWGHT GEO_PC GEO_INU AGE_YRSG SEX PROXY
#> 1   30000  25.3256      3       3        2   1     2
#> 2   30001  10.2570      2       3        2   1     2
#> 3   30003  23.1295      1       3        2   2     2
#> 4   30005  10.7590      4       1        2   2     1
#> 5   30006  42.7018      1       3        6   1     2
#> 6   30007  10.9332      4       2        3   1     2
#> 7   30009  25.9796      1       3        1   1     1
#> 8   30011  25.5107      1       3        2   2     2
#> 9   30012  43.9171      1       3        6   2     2
#> 10  30013  21.1154      2       3        3   1     2
#> 11  30014  60.9822      3       3        6   1     2
#> 12  30015  19.3566      2       3        2   1     2
#> 13  30016  48.5176      1       3        2   2     2
#> 14  30017  51.5380      3       3        5   2     2
#> 15  30019  23.9373      3       3        2   2     2
#> 16  30020  64.5496      3       3        5   1     2
#> 17  30021 169.0191      1       3        1   2     1
#> 18  30022  11.9627      2       3        4   1     2
#> 19  30023  33.3693      1       3        6   2     2
#> 20  30026  84.0422      3       3        1   1     1
booted
#>    PUMFID PUMFWGHT WRPP0001 WRPP0002 WRPP0003
#> 1   30000  25.3256  22.7792  16.9974  22.6349
#> 2   30001  10.2570  10.5725  10.3740   9.9881
#> 3   30003  23.1295  21.5607  16.6935  19.8387
#> 4   30005  10.7590  10.2069   9.8463  10.7217
#> 5   30006  42.7018  37.9878  56.3002  47.0602
#> 6   30007  10.9332  10.7163  13.0115   9.9361
#> 7   30009  25.9796  22.7959  35.7703  28.2887
#> 8   30011  25.5107  24.6361  37.7712  24.6291
#> 9   30012  43.9171  51.0680  59.1396  46.5755
#> 10  30013  21.1154  24.7117  13.2218  19.2514

the next step is to put a little formality on the question.

Every R problem can be thought of, with advantage, as the interaction of three objects: an existing object, x; a desired object, y; and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.

In this case, we have two objects, dta and booted. What remains to be found is an f that will

merge the person-level weights to the variables of interest

What remains unclear is what counts as the person-level weights and what counts as the variables of interest.

Using the f(x) = y paradigm, what is x and y? From that f can be suggested.
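
To make the paradigm concrete, here is one possible reading, sketched in base R. It assumes x is the pair of data frames, y is a single data frame with one row per respondent carrying both the responses and the replicate weights, and PUMFID is the shared key. The function name f and the three-row data frames are only illustrative.

```r
# x: two small stand-ins for dta and booted (first three records only)
dta <- data.frame(
  PUMFID   = c(30000, 30001, 30003),
  PUMFWGHT = c(25.3256, 10.257, 23.1295),
  SEX      = c(1, 1, 2)
)
booted <- data.frame(
  PUMFID   = c(30000, 30001, 30003),
  PUMFWGHT = c(25.3256, 10.257, 23.1295),
  WRPP0001 = c(22.7792, 10.5725, 21.5607),
  WRPP0002 = c(16.9974, 10.374, 16.6935)
)

# f: join the replicate weights onto the responses by the key,
# dropping columns (other than the key) that both inputs share
f <- function(responses, replicates, key = "PUMFID") {
  shared <- setdiff(names(responses), key)
  rep_cols <- setdiff(names(replicates), shared)
  merge(responses, replicates[rep_cols], by = key)
}

# y: one row per respondent, responses plus replicate weights
y <- f(dta, booted)
y
```

If that is not the y you have in mind, describing the desired columns and rows usually makes the right f obvious.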

Here's how I would set this up. Note that you only included 10 records from the bootstrap data, so I first cut the main data down to 10 rows. You won't need that step with the full data.

library(tidyverse)
library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
library(srvyr)
#> 
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter

APS <- data.frame(
   PUMFID = c(
      30000, 30001, 30003, 30005, 30006, 30007, 30009, 30011, 30012, 30013,
      30014, 30015, 30016,
      30017, 30019, 30020, 30021, 30022, 30023, 30026
   ),
   PUMFWGHT = c(
      25.3256, 10.257, 23.1295, 10.759, 42.7018, 10.9332, 25.9796,
      25.5107, 43.9171, 21.1154, 60.9822, 19.3566, 48.5176, 51.538,
      23.9373, 64.5496, 169.0191, 11.9627, 33.3693, 84.0422
   ),
   GEO_PC = c(3, 2, 1, 4, 1, 4, 1, 1, 1, 2, 3, 2, 1, 3, 3, 3, 1, 2, 1, 3),
   GEO_INU = c(3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
   AGE_YRSG = c(2, 2, 2, 2, 6, 3, 1, 2, 6, 3, 6, 2, 2, 5, 2, 5, 1, 4, 6, 1),
   SEX = c(1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1),
   PROXY = c(2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1)
)

aps_2017_pumf_bsw_eng <- data.frame(
   PUMFID = c(
      30000, 30001, 30003, 30005, 30006, 30007, 30009, 30011, 30012, 30013
   ),
   PUMFWGHT = c(
      25.3256, 10.257, 23.1295, 10.759, 42.7018, 10.9332, 25.9796, 25.5107, 
      43.9171, 21.1154
   ),
   WRPP0001 = c(
      22.7792, 10.5725, 21.5607, 10.2069, 37.9878, 10.7163, 22.7959, 
      24.6361, 51.068, 24.7117
   ),
   WRPP0002 = c(
      16.9974, 10.374, 16.6935, 9.8463, 56.3002, 13.0115, 35.7703, 37.7712,
      59.1396, 13.2218
   ),
   WRPP0003 = c(
      22.6349, 9.9881, 19.8387, 10.7217, 47.0602, 9.9361, 28.2887, 24.6291, 
      46.5755, 19.2514
   )
)

# merge the data with the replicate weights
dat_wrep <- APS %>%
   slice(1:10) %>% # for your example, you only included 10 rows of bootstrap weights so need this to be 1:1 merge
   left_join(select(aps_2017_pumf_bsw_eng, -PUMFWGHT), by="PUMFID")

my_design <- dat_wrep %>%
   as_survey_rep(weights=PUMFWGHT, type="bootstrap", repweights=starts_with("WRPP"))

# Example analysis
# See https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html for more
my_design %>%
   survey_count(SEX)
#> # A tibble: 2 x 3
#>     SEX     n  n_se
#>   <dbl> <dbl> <dbl>
#> 1     1  136.  8.06
#> 2     2  103. 11.2

Created on 2021-03-05 by the reprex package (v1.0.0)
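
Since svrepdesign from the survey package was mentioned earlier, here is a rough equivalent without srvyr. Treat it as a sketch under assumptions: the regular expression "^WRPP" is assumed to match only the replicate-weight columns, and no scale argument is set, so check the survey's user guide for any adjustment the bootstrap variance needs. Only PUMFID, PUMFWGHT, SEX and the three WRPP columns from the example are used.

```r
library(survey)

# First 10 records of the example data (responses)
APS <- data.frame(
  PUMFID   = c(30000, 30001, 30003, 30005, 30006,
               30007, 30009, 30011, 30012, 30013),
  PUMFWGHT = c(25.3256, 10.257, 23.1295, 10.759, 42.7018,
               10.9332, 25.9796, 25.5107, 43.9171, 21.1154),
  SEX      = c(1, 1, 2, 2, 1, 1, 1, 2, 2, 1)
)

# Matching bootstrap replicate weights
bsw <- data.frame(
  PUMFID   = APS$PUMFID,
  WRPP0001 = c(22.7792, 10.5725, 21.5607, 10.2069, 37.9878,
               10.7163, 22.7959, 24.6361, 51.068, 24.7117),
  WRPP0002 = c(16.9974, 10.374, 16.6935, 9.8463, 56.3002,
               13.0115, 35.7703, 37.7712, 59.1396, 13.2218),
  WRPP0003 = c(22.6349, 9.9881, 19.8387, 10.7217, 47.0602,
               9.9361, 28.2887, 24.6291, 46.5755, 19.2514)
)

# One row per respondent: responses plus replicate weights
dat_wrep <- merge(APS, bsw, by = "PUMFID")

# Replicate-weight design; repweights is a regular expression
# selecting the WRPP columns of dat_wrep
des <- svrepdesign(
  data = dat_wrep,
  weights = ~PUMFWGHT,
  repweights = "^WRPP",
  type = "bootstrap",
  combined.weights = TRUE
)

# Weighted totals by SEX, with bootstrap standard errors
tot <- svytotal(~factor(SEX), des)
tot
```

The point estimates are just the sums of PUMFWGHT within each SEX group; the replicate weights only affect the standard errors.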

Thank you! This makes sense and I tried it, but I received the following error (I also uploaded a picture of my environment, not sure if this helps in any way):

booted <- aps_2017_pumf_bsw_eng
dat_wwghts <- APS %>%
slice(1:1) %>%
left_join(select(booted, -PUMFWGHT), by="PUMFID")
APSwboot <- dat_wwghts
as_survey_rep(weights=PUMFWGHT, type="bootstrap", repweights= starts_with("WRPP"))
Error in as_survey_rep(weights = PUMFWGHT, type = "bootstrap", repweights = starts_with("WRPP")) :
  object 'PUMFWGHT' not found

This error indicates that the variable PUMFWGHT isn't in your dataset. It was there in the example, so I can't really help.

Also, don't use slice(1:1), as that selects only the first row of your data. Remove the line with the slice function; it was only needed for the example.
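
Another thing that may be worth checking, judging only from the code as posted: as_survey_rep() sits on its own line after APSwboot <- dat_wwghts, so no data frame is passed to it, and R may then look for PUMFWGHT as a free-standing object. This is a guess, not a confirmed diagnosis. A minimal sketch using the first three records of the example data, with the merged data piped into as_survey_rep():

```r
library(dplyr)
library(srvyr)

# First three records of the example data
APS <- data.frame(
  PUMFID   = c(30000, 30001, 30003),
  PUMFWGHT = c(25.3256, 10.257, 23.1295),
  SEX      = c(1, 1, 2)
)
booted <- data.frame(
  PUMFID   = c(30000, 30001, 30003),
  PUMFWGHT = c(25.3256, 10.257, 23.1295),
  WRPP0001 = c(22.7792, 10.5725, 21.5607),
  WRPP0002 = c(16.9974, 10.374, 16.6935),
  WRPP0003 = c(22.6349, 9.9881, 19.8387)
)

# Merge without slice(), so every row is kept
dat_wwghts <- APS %>%
  left_join(select(booted, -PUMFWGHT), by = "PUMFID")

# Calling as_survey_rep() with no data supplied; the message is
# captured rather than stopping the script
msg <- tryCatch(
  as_survey_rep(weights = PUMFWGHT, type = "bootstrap",
                repweights = starts_with("WRPP")),
  error = function(e) conditionMessage(e)
)
print(msg)

# Piping the merged data in, so PUMFWGHT is looked up inside it
APSwboot <- dat_wwghts %>%
  as_survey_rep(weights = PUMFWGHT, type = "bootstrap",
                repweights = starts_with("WRPP"))
```

If PUMFWGHT really is missing from the merged data, names(dat_wwghts) will show that directly.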

@StatSteph thanks for all your help! I will keep looking into this to see if I can fix it.
