Navigating Complex Survey Designs in Predictive Modeling with Tidymodels: Seeking Guidance

As a PhD student starting a research project with the Youth Risk Behavior Survey (YRBS) dataset, I have found myself grappling with a fundamental question: how to appropriately handle a complex survey design within the tidymodels framework.

My goal is to develop predictive algorithms that identify factors associated with suicidality among adolescents in the United States, and I have been using the tidymodels ecosystem to do so. After working through Dr. Kuhn's tutorial on incorporating case weights using the hardhat package, I began to appreciate that obtaining accurate inferential results also requires accounting for design features beyond the case weights, such as primary sampling units (PSUs) and strata.

However, my emphasis is on predictive performance rather than inferential statistics. This leaves me with a dilemma: should I model the data while ignoring the survey design, or should I try to incorporate the design variables and weights despite the potential computational challenges? I ask in part because I am not sure how to incorporate them using tidymodels.

Any thoughts or guidance?

Thank you!!
