Find p value to compare two groups

Hello everyone,

I think this is an easy topic, but I have only been working with RStudio for a few days and cannot find a solution.

I have a table (table "study") with information about smokers and non-smokers (column "smokers", in which "0" is for non-smokers and "1" is for smokers) that I want to compare. I also have a score (column "score") from 1 to 6 (integers).

My supervisor told me that I have to compare smokers and non-smokers for each score value (score = 1, score = 2, etc.) and find the p-value.

Could someone explain in simple terms how to do that in RStudio?

Thank you very much


See the FAQ: How to do a minimal reproducible example (reprex) for beginners. It would look something like the snippet below. The values were generated with the sample function, so they are just random, and comparing each score group based on smoking will not show anything meaningful with them.

A p-value is the probability, computed on the assumption that the null hypothesis is true, of getting a test statistic at least as extreme as the one actually observed. A "small" p-value is used to evaluate what is termed the "null" hypothesis at a given level of the unfortunately named "significance." Do not confuse statistically significant with meaningful. Honesty to oneself requires choosing a level of significance, called \alpha, which is conventionally set at 0.05, before running the test.

A result that is significant at an \alpha of 0.05 is some evidence and, given the nature of the particular data, 0.05 may be as low as the associations permit. I call it passing the laugh test: OK, maybe there's something here. But reflect. Take four 5-shot revolvers, put a bullet in one of them, and place them on a table. Have someone rearrange them out of your sight. Pick one up and put it to your head. Would you pull the trigger knowing that there is only a single chance in 20 that you won't live to tell the tale?

OK, so you run a test and get a test statistic with a p-value of 0.02, for example. With \alpha set at 0.05, that is below the threshold, so one is said to reject the null hypothesis in favor of the alternative hypothesis (i.e., the opposite of the null hypothesis). (We'll get to that.) But if the result is 0.08, for example, you don't get to reject the null hypothesis. That's called failing to reject the null hypothesis.
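In code, the decision rule is just a comparison of the p-value against \alpha: a p-value below \alpha leads to rejecting the null hypothesis, and anything else is a failure to reject it. A minimal sketch using the two example p-values from the paragraph above (the variable names are only for illustration):

```r
alpha <- 0.05              # significance level, chosen before running any test
p_values <- c(0.02, 0.08)  # the two example p-values from the text

# Compare each p-value with alpha and record the resulting decision
decision <- ifelse(p_values < alpha,
                   "reject the null hypothesis",
                   "fail to reject the null hypothesis")

data.frame(p_values, decision)
```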

OK, so different statistical tests use different test statistics and null hypotheses. A simple one is t.test, shown below. The null hypothesis in the example is that there is no difference between the proportion of smokers (the mean of the 0/1 smoked column) in the scored == 1 group and the scored == 2 group. How do you interpret the output?

Here's a recent article on selecting statistical tests

my_data <- data.frame(
  smoked =
    c(1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1),
  scored =
    c(4, 1, 3, 3, 4, 5, 3, 1, 3, 1, 5, 6, 2, 6, 3, 6, 6, 4, 1, 1, 3, 3, 5, 5, 4, 4, 6, 3, 3, 5, 2, 4, 1, 2, 3, 3, 1, 1, 2, 6, 1, 3, 6, 1, 1, 6, 1, 4, 4, 2, 1, 1, 5, 5, 2, 2, 3, 5, 1, 2, 5, 3, 6, 3, 2, 4, 4, 3, 1, 1, 1, 6, 5, 4, 6, 3, 1, 2, 1, 3, 5, 1, 3, 5, 1, 3, 2, 6, 6, 3, 1, 3, 2, 1, 3, 1, 4, 4, 3, 6))

head(my_data)
#>   smoked scored
#> 1      1      4
#> 2      0      1
#> 3      0      3
#> 4      1      3
#> 5      1      4
#> 6      0      5

with(my_data, t.test(smoked[scored == 1], smoked[scored == 2]))
#> 
#>  Welch Two Sample t-test
#> 
#> data:  smoked[scored == 1] and smoked[scored == 2]
#> t = -0.77865, df = 20.573, p-value = 0.445
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.5143832  0.2343832
#> sample estimates:
#> mean of x mean of y 
#>      0.36      0.50
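
One way to read the output above, as a sketch: pull the p-value out of the object that t.test returns and compare it with the \alpha chosen beforehand. To keep the snippet short, the two groups are rebuilt from their counts in my_data (25 rows with a score of 1, of which 9 smoke; 12 rows with a score of 2, of which 6 smoke) instead of retyping the whole data frame. The names score1, score2 and res are only for illustration.

```r
# Rebuild the two groups from their counts in my_data:
# score 1 has 25 rows (9 smokers), score 2 has 12 rows (6 smokers)
score1 <- rep(c(1, 0), times = c(9, 16))
score2 <- rep(c(1, 0), times = c(6, 6))

# Welch two-sample t-test (the default in t.test), as in the output above
res <- t.test(score1, score2)

alpha <- 0.05        # significance level chosen before the test
res$p.value          # the p-value reported in the printed output
res$p.value < alpha  # FALSE here, so we fail to reject the null hypothesis
```

Since the p-value is well above 0.05, this test gives no grounds to reject the null hypothesis of equal smoking proportions in the two score groups, which is what you would expect from randomly generated data.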

If I understand correctly, you are comparing those with a score of 1 to those with a score of 2.

But I want to make 6 different comparisons:
-Those with score 1: smokers vs non smokers
-Those with score 2: smokers vs non smokers
-etc.

So, let's subset the data to select only the smoking variable within the score 1 group:

my_data <- data.frame(
  smoked =
    c(1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1),
  scored =
    c(4, 1, 3, 3, 4, 5, 3, 1, 3, 1, 5, 6, 2, 6, 3, 6, 6, 4, 1, 1, 3, 3, 5, 5, 4, 4, 6, 3, 3, 5, 2, 4, 1, 2, 3, 3, 1, 1, 2, 6, 1, 3, 6, 1, 1, 6, 1, 4, 4, 2, 1, 1, 5, 5, 2, 2, 3, 5, 1, 2, 5, 3, 6, 3, 2, 4, 4, 3, 1, 1, 1, 6, 5, 4, 6, 3, 1, 2, 1, 3, 5, 1, 3, 5, 1, 3, 2, 6, 6, 3, 1, 3, 2, 1, 3, 1, 4, 4, 3, 6))

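# Smoking indicator (0/1) for the rows with a score of 1
group1 <- my_data$smoked[my_data$scored == 1]
# Summing the 0/1 values counts the smokers; the rest are non-smokers
smokers <- sum(group1)
nonsmokers <- length(group1) - smokers
smokers
#> [1] 9
nonsmokers
#> [1] 16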

Unless there is some other data, such as age perhaps, the only thing to be wrung from the data is that the score 1 group splits 9/16 smoker/nonsmoker.
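
The same count can be produced for all six score groups at once with table. The sketch below also shows one possible reading of "compare smokers and non-smokers for each score value": for each score k, a Fisher exact test of whether the share of rows with that score differs between smokers and non-smokers. That reading is an assumption, not something stated in the thread, and the names counts and p_by_score are only for illustration. Because the data are random, the resulting p-values carry no meaning beyond showing the mechanics.

```r
my_data <- data.frame(
  smoked =
    c(1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1),
  scored =
    c(4, 1, 3, 3, 4, 5, 3, 1, 3, 1, 5, 6, 2, 6, 3, 6, 6, 4, 1, 1, 3, 3, 5, 5, 4, 4, 6, 3, 3, 5, 2, 4, 1, 2, 3, 3, 1, 1, 2, 6, 1, 3, 6, 1, 1, 6, 1, 4, 4, 2, 1, 1, 5, 5, 2, 2, 3, 5, 1, 2, 5, 3, 6, 3, 2, 4, 4, 3, 1, 1, 1, 6, 5, 4, 6, 3, 1, 2, 1, 3, 5, 1, 3, 5, 1, 3, 2, 6, 6, 3, 1, 3, 2, 1, 3, 1, 4, 4, 3, 6))

# Smoker / non-smoker counts for every score value (rows: smoked, columns: scored)
counts <- table(smoked = my_data$smoked, scored = my_data$scored)
counts

# Assumed interpretation: for each score k, build the 2x2 table
# (smoked yes/no) x (score == k yes/no) and run a Fisher exact test on it
p_by_score <- sapply(1:6, function(k) {
  fisher.test(table(my_data$smoked, my_data$scored == k))$p.value
})
names(p_by_score) <- paste0("score_", 1:6)
round(p_by_score, 3)

# Six tests are run, so a multiple-comparison adjustment is worth considering
p.adjust(p_by_score, method = "holm")
```

If the supervisor instead means a single overall comparison of the score distribution between the two groups, a test on the full 2 x 6 table may be closer to the request; it is worth asking which is intended before choosing.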
