Bayesian Profile Regression - Syntax and Data Formatting

Hi All,

I've been attempting to use the PReMiuM package, but I am struggling with some of the basics. I would really appreciate any feedback.

Mainly, I am struggling with some of the arguments to the regression function, and with the format of the data it requires.

Overall, my data is a standard data frame with a continuous dependent variable constrained to the unit interval, and categorical covariates (some ordinal, some binary).

If I were to run something like a random forest in R, I would do:

rf <- randomForest(y ~, var1 + var2 +...+ varN, data = data, mtry = 3, n.trees = 500).

Here, I pass the y and x variables directly to randomForest. However, it seems that for the PReMiuM regression the data must be pre-formatted?

For example,


inputs <- generateSampleDataFile(clusSummaryVarSelectBernoulliDiscrete())

runInfoObj <- profRegr(yModel = inputs$yModel, xModel = inputs$xModel,
                       nSweeps = 10000, nBurn = 20000, seed = seed,
                       data = inputs$inputData, output = "output",
                       covNames = inputs$covNames, nClusInit = 20,
                       run = TRUE)

Here, it seems that 'inputs' is already encoded:

$covNames
 [1] "Variable1"  "Variable2"  "Variable3"  "Variable4"  "Variable5"  "Variable6"  "Variable7"  "Variable8"
 [9] "Variable9"  "Variable10"

$xModel
[1] "Discrete"

$yModel
[1] "Bernoulli"

$nCovariates
[1] 10

Is it possible to get my own input data into such a format, where it is split into inputs$xModel, inputs$yModel, inputs$nCovariates, etc.?

All feedback would be appreciated.

I don't know the PReMiuM package, but the randomForest syntax you show is wrong...

  1. Listing the independent variables separated by commas is incorrect; if you list them explicitly as part of the formula syntax, you chain them together with the + symbol.
  2. You should not repeat the name of the data source for each variable if you are already passing the data.frame as a parameter.

Therefore the preferred syntax for RF would be

randomForest(y ~ var1 + var2 + var3 + var4,
             data = data, mtry = 3, ntree = 500)

Yes @nirgrahamuk, of course you're right on that. My point was more that I pass y and the x's directly to the regression, rather than having to pre-encode my data frame.

I think overall, the format is as follows:

runInfoObj <- profRegr(yModel = "Normal", xModel = "Discrete", 
                       nSweeps = 10, nBurn = 20000, seed = 1234,
                       data = trial, output = "output", outcome = "y",
                       covNames = c("x's"), nClusInit = 20, 
                       run = TRUE)

Therefore, I specify the distributions (yModel and xModel) and pass the outcome and covariate column names as strings, rather than passing the variables themselves.
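
For anyone landing here later, a minimal sketch of that workflow on a made-up data frame might look like the following. The data, the column names (y, var1, var2) and the helper variable cov_names are all hypothetical. The recoding of each covariate to integer labels 0, 1, 2, ... reflects my understanding of what xModel = "Discrete" expects, so treat it as an assumption and check the PReMiuM documentation. The profRegr call only runs if PReMiuM is installed, and it writes its output files to the working directory.

```r
# Toy data standing in for the poster's data frame (all names hypothetical).
set.seed(1234)
n <- 200
trial <- data.frame(
  y    = runif(n),                                   # continuous outcome on (0, 1)
  var1 = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                levels = c("low", "mid", "high")),   # ordinal covariate
  var2 = sample(c("no", "yes"), n, replace = TRUE)   # binary covariate
)

cov_names <- c("var1", "var2")

# Recode every covariate as integer category labels 0, 1, 2, ...
# (assumption: this is the coding that xModel = "Discrete" works with).
trial[cov_names] <- lapply(trial[cov_names],
                           function(x) as.integer(factor(x)) - 1L)

str(trial)

# Only attempt the fit if PReMiuM is available.
if (requireNamespace("PReMiuM", quietly = TRUE)) {
  runInfoObj <- PReMiuM::profRegr(
    yModel = "Normal", xModel = "Discrete",
    nSweeps = 100, nBurn = 100, seed = 1234,   # small values, for a quick check only
    data = trial, output = "output", outcome = "y",
    covNames = cov_names, nClusInit = 20, run = TRUE
  )
}
```

The point is that there is no formula interface: the data frame is passed as is, and outcome and covNames tell profRegr which columns to use.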
