This is probably a very basic statistics question. I am currently checking whether a feature in my dataset would make a good attribute for predicting a binary outcome. I have read about correspondence analysis, which seems useful when you have lots of factors, but for now I am looking at a chi-squared analysis using permutation.
I want to check whether there is an association between my variable and the binary outcome I am trying to predict. To this end I have set up the example below, where I try to see if there is an association between the sex of a student and the school they attend. This is obviously a nonsense example.
My understanding of permutation analysis is as follows (with respect to the chi-squared test); a rough base-R sketch of these steps follows the list:
Generate the chi-squared statistic from my actual data
Generate permutations by shuffling one of the columns so that there is only a random association between the two columns
After each permutation, compute the test statistic again
Visualize to inspect the results
A p-value can be obtained as the proportion of statistics from the permuted data that were greater than my observed test statistic
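To make sure I have the idea right, here is a rough base-R sketch of those steps (illustrative only, and it assumes the mydf data frame built in the code further down; my actual attempt uses the infer package):
set.seed(123)
# Step 1: observed chi-squared statistic (correct = FALSE to skip the Yates correction)
obs_stat <- chisq.test(table(mydf$school, mydf$sex), correct = FALSE)$statistic
# Steps 2-3: shuffle one column and recompute the statistic each time
perm_stats <- replicate(5000, {
  shuffled_sex <- sample(mydf$sex)
  chisq.test(table(mydf$school, shuffled_sex), correct = FALSE)$statistic
})
# Step 4: visualise the permutation distribution and the observed value
hist(perm_stats)
abline(v = obs_stat, col = "red")
# Step 5: permutation p-value
mean(perm_stats > obs_stat)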
Using the code below as an example, I have a couple of questions:
Why does my p-value from the mathematical (theoretical) method differ so greatly from the computational (permutation) method?
How do I interpret the probability generated by the last line?
library(lavaan)
library(infer)
library(tidyverse)
library(janitor)

mydf <- HolzingerSwineford1939 %>%
  mutate(sex = ifelse(sex == 1, 'M', 'F'))

# Step 1 is to calculate your test statistic
# Use the chi-squared test to check whether the proportions are the same,
# i.e. whether there is no relationship between the rows and the columns
tabyl(mydf, school, sex)
actual_score <- mydf %>% chisq_test(school ~ sex)
# Null hypothesis: the categories are independent
actual_score

# Step 2: permute 5000 chi-squared statistics
chisq_null <- mydf %>%
  specify(school ~ sex, success = 'Pasteur') %>% # alt: response = school, explanatory = sex
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "Chisq", order = c("M", "F"))

# Visualize - this is very slow and takes a while
visualize(chisq_null, method = "both", obs_stat = actual_score$statistic, direction = "greater")

# Attempt to get the p-value from the permutations
chisq_null %>% summarise(p_val = mean(stat > actual_score$statistic))
The theoretical p-value I get is 0.788 versus 0.651 via permutation. Those are not that different. The difference comes down to how each test works: the chi-squared distribution is only valid if the distributional assumptions about the data hold. The permutation test uses the empirical distribution and is insensitive to whether that assumption is true or not.
It is a measure of how consistent the data are with the distribution under the null hypothesis. It is basically a tail probability. If the data are consistent with the null hypothesis, the tail probability is usually large. If the data are indicative of a relationship, the observed value is inconsistent with the null hypothesis and should be in the tail of the distribution. For that reason, the tail probability should be small.
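To make that concrete, both p-values are tail areas, just taken from different distributions; a quick sketch (assuming the statistic and the degrees of freedom come back in the statistic and chisq_df columns of the chisq_test() result above, as in recent versions of infer):
# Theoretical tail probability: area under the chi-squared distribution
# (with the test's degrees of freedom) to the right of the observed statistic
pchisq(actual_score$statistic, df = actual_score$chisq_df, lower.tail = FALSE)
# Permutation tail probability: proportion of permuted statistics at least
# as large as the observed one
mean(chisq_null$stat > actual_score$statistic)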
Looking at the results of visualize, the observed statistic is on the left of that distribution (hence the large tail probability).
If a relationship did exist, the p-value generated by the line of code chisq_null %>% summarise(p_val = mean(stat > actual_score$statistic)) would be small (around 0.05 or lower), which would indicate that, under the permuted (null) distribution, a statistic as large as my observed one would occur less than 5% of the time.
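As an aside, infer seems to have a get_p_value() helper that computes this same kind of tail proportion; a quick sketch (note it counts permuted statistics greater than or equal to the observed one, so it may differ slightly from my summarise() line):
chisq_null %>%
  get_p_value(obs_stat = actual_score$statistic, direction = "greater")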
The visualize function is showing me the distribution of my permuted statistics (the gray histogram), and the red line is my observed chi-squared statistic. In an ideal world, if there is a relationship and assuming I set direction = "greater" in the visualize function, I would want the red line to be far to the right of the bulk of the permutation distribution (the gray bars). This is quantified by the probability in the previous point.
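For reference, I believe the same picture can also be drawn with infer's shade_p_value() layer, which newer versions of the package recommend instead of passing obs_stat/direction to visualize(); a sketch:
visualize(chisq_null) +
  shade_p_value(obs_stat = actual_score$statistic, direction = "greater")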
If I am attempting to model something and I have a training and a test set, are permutation methods the way to go when trying to establish whether categorical variables are related to the dependent variable, given that I am only interested in the relationship within the current sample and permutation methods make no assumptions about the population distribution?
Again thank you very much for your time. I feel you should send me a bill at this stage
I would only use the training set to determine whether there is a potential relationship. Also, if you are going to put this variable into a model, you should consider including that screening step within resampling to avoid circular logic (if the variable is chosen using the entire training set, the model will never have the ability to say that it is not useful). By re-using the same data for selection and modeling, you might be overfitting the terms in the model.
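As a rough sketch of what that could look like with rsample (the 10-fold setup and the purrr helper here are just for illustration, not a prescription):
library(rsample)
library(purrr)

set.seed(2020)
folds <- vfold_cv(mydf, v = 10)

# Run the screening test on the analysis (training) portion of each fold,
# so the selection step is subject to the same resampling as the model
fold_p_values <- map_dbl(
  folds$splits,
  ~ chisq_test(analysis(.x), school ~ sex)$p_value
)
fold_p_values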
See Algorithm #1 in this chapter and the example. More will be written about this in the forthcoming Chapter 10.
No parametric distributional assumptions. There are some assumptions related to how the data were sampled/collected, and others IIRC. From Wikipedia:
The basic premise is to use only the assumption that it is possible that all of the treatment groups are equivalent, and that every member of them is the same before sampling began (i.e. the slot that they fill is not differentiable from other slots before the slots are filled).