Hi guys, I'm trying to analyze some data with a multiple regression model.
So, let's assume I have this data:
"Fam.ID" "Income" "Consumption" "Status" "Qualification" "Family num" "Age" "Sex"
1 46200 40600 1 4 2 57 2
1 46200 40600 1 4 2 60 1
2 32340 30600 1 4 2 55 2
2 32340 30600 1 3 2 62 1
3 25200 20400 1 3 3 55 1
3 25200 20400 1 3 3 52 2
3 25200 20400 2 4 3 21 1
3 34100 33600 2 3 4 29 2
4 9880 10800 4 3 1 77 2
5 11950 11400 4 2 1 75 2
6 41100 20800 1 3 4 48 2
7 25596 8900 4 2 1 83 1
8 27400 18000 1 2 6 53 1
8 27400 18000 1 3 6 49 2
8 27400 18000 2 3 6 29 1
8 27400 18000 2 3 6 26 1
8 27400 18000 2 3 6 15 2
8 27400 18000 2 3 6 13 2
9 13500 13300 1 2 2 70 1
9 13500 13300 1 2 2 61 2
Of course the real data set is much bigger; this is just a sample. Anyhow, I need to analyze it, so I read it in with
data <- read.table("dataTest.txt", sep = " ", header = TRUE)
str(data)
But I need some of the columns to be factors because, for example, Sex is a category code (1 means male), not a quantity.
data$Status <- factor(data$Status)
data$Qualification <- factor(data$Qualification)
data$Family.num <- factor(data$Family.num)
data$Sex <- factor(data$Sex)
attach(data)
Are these the right columns to convert to factors? Or should Age have been a factor and Family.num not? Then I try to get a correlation matrix:
cor(data)
But I get an error: 'x' must be numeric.
OK, but then how do I get a correlation matrix that leaves out the 4 factor columns? I can make R skip one column with [,-1], but can I do that for several columns at once? Or should I create another object with just the selected columns?
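A minimal sketch of the multiple-column case, assuming a small data frame `d` built from a few rows of the sample above (the name `d` and the reduced column set are only for illustration). Negative indexing accepts a vector of positions, and `sapply(d, is.numeric)` finds the numeric columns without counting positions by hand:

```r
# A few rows copied from the sample above; `d` is a hypothetical stand-in
# for the full data set.
d <- data.frame(
  Fam.ID      = c(1, 2, 3, 4, 5, 6),
  Income      = c(46200, 32340, 25200, 9880, 11950, 41100),
  Consumption = c(40600, 30600, 20400, 10800, 11400, 20800),
  Sex         = factor(c(2, 1, 1, 2, 2, 2)),
  Age         = c(57, 62, 55, 77, 75, 48)
)

# Drop several columns at once by position (here the ID and the factor).
cor_by_position <- cor(d[, -c(1, 4)])

# Or keep only the numeric columns, then drop the ID by name.
is_num <- sapply(d, is.numeric)
num_names <- setdiff(names(d)[is_num], "Fam.ID")
cor_by_type <- cor(d[, num_names])

print(cor_by_type)
```

Both routes should give the same matrix here. Selecting by type is probably the safer choice when the column order might change.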
Another question is about the lm command itself. What should I do about the first column (the family ID)?
Should the command be
reg1=lm(Income~.,data=data[,-1])
or should it be
reg2=lm(Income~Consumption+Status+Qualification+Family.num+Age+Sex,data=data[,-1])
Or are they the same thing? And what does
reg3=lm(Income~1,data=data[,-1])
do? And should the [,-1] even be there?
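As far as I can tell, the three formulas can be compared with a sketch like the one below. It assumes a small data frame `d` built from a few rows of the sample (the name `d` and the reduced set of predictors are only for illustration). The `.` stands for every other column in whatever is passed as `data`, and `~ 1` fits a model with an intercept only:

```r
# A few rows copied from the sample above (ID column included on purpose);
# `d` is a hypothetical stand-in for the full data set.
d <- data.frame(
  Fam.ID      = c(1, 2, 3, 4, 5, 6, 7, 8),
  Income      = c(46200, 32340, 25200, 9880, 11950, 41100, 25596, 27400),
  Consumption = c(40600, 30600, 20400, 10800, 11400, 20800, 8900, 18000),
  Age         = c(57, 62, 55, 77, 75, 48, 83, 53)
)

# "." means "every other column in the data passed to lm", so the ID
# has to be dropped from the data if it should stay out of the model.
fit_dot <- lm(Income ~ ., data = d[, -1])

# Naming the predictors explicitly gives the same model; here the
# [, -1] is optional because Fam.ID is not in the formula anyway.
fit_named <- lm(Income ~ Consumption + Age, data = d)

# "~ 1" fits an intercept only, so the estimate is the mean of Income.
fit_null <- lm(Income ~ 1, data = d)

print(all.equal(coef(fit_dot), coef(fit_named)))
print(c(coef(fit_null), mean_income = mean(d$Income)))
```

So the [,-1] seems to matter only for the `.` version. With an explicit formula, columns that are not named are simply ignored.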
So many questions, guys. Thanks in advance to anyone who can help!
Another doubt is about the factors themselves. If I tell R that they are factors and not numeric, it correctly reports the levels, for example
$ Qualification: Factor w/ 3 levels
but then in the summary I get one fewer coefficient than levels for each factor:
Status2 2.255e+03 3.399e+02 6.635 3.34e-11 ***
Status3 -2.664e+03 5.933e+02 -4.490 7.15e-06 ***
Status4 -1.094e+03 4.467e+02 -2.450 0.014290 *
Where is Status1? Did R read it as a dummy variable? So many doubts, lol.
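From what I understand, R codes an unordered factor with treatment contrasts by default: the first level becomes the baseline absorbed into the intercept, and every other level gets its own 0/1 dummy column. A minimal sketch, assuming a made-up factor `status` with the same four codes (the values are hypothetical, not taken from the real data):

```r
# Hypothetical toy factor with the same four Status codes as above.
status <- factor(c(1, 1, 2, 2, 4, 4, 1, 3))

# Default treatment contrasts: one 0/1 column per level except the first.
mm <- model.matrix(~ status)
print(colnames(mm))

# relevel() picks a different baseline, which changes which level is omitted.
status_ref4 <- relevel(status, ref = "4")
print(colnames(model.matrix(~ status_ref4)))
```

If that is right, each StatusK coefficient in the summary would be the estimated difference from Status 1, not a separate intercept.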