I want to fit a model on data such as age, education, and gender, which are stored as character, and use it to predict enrolment.
Use as.numeric() to convert age to a number. Use as.factor() to convert gender to a factor. Education could go either way, depending on how it is coded.
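A minimal sketch of those conversions, assuming age is stored as ranges such as "18-25" (as in the sample rows posted in this thread). In that case as.numeric() returns NA, so a factor is the safer choice. The level order in age_levels is an assumption; adjust it to the codes that actually occur in your data.

```r
# Sample values in the style of the thread's data
age_range <- c("26-35", ">65", "18-25", ">65", "36-45", "56-65")
gender    <- c("female", "female", "male", "male", "male", "female")

# as.numeric() cannot parse a range like "26-35": every value becomes NA
age_num <- suppressWarnings(as.numeric(age_range))

# Assumed ordering of the age codes; values not listed here would become NA
age_levels <- c("18-25", "26-35", "36-45", "46-55", "56-65", ">65")
age_fac    <- factor(age_range, levels = age_levels, ordered = TRUE)

# Gender has no natural order, so a plain factor is enough
gender_fac <- as.factor(gender)

table(age_fac)
levels(gender_fac)
```

If age were stored as plain numbers in character form (for example "34"), as.numeric() would work as suggested.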
data
learner_id enrolled_at unenrolled_at role fully_participated_at purchased_statement_at gender country age_range
1 21003360-6f14-45d4-a135-f9707ec640b1 2016-12-09 16:14:52 UTC 2018-10-15 07:51:16 UTC learner female UA 18-25
2 4be21e55-bb69-46ec-85ec-71f3cd41a1cc 2016-12-21 19:27:00 UTC 2018-09-04 13:08:00 UTC learner male LT 26-35
fit_1 <- lm(enrolzesorNo ~ . , data = dataenrolment )
and then predict the coming years.
Is fit_1$fitted.values what you want? You may have to explain where you're having a problem more slowly.
I have the data below:
glimpse(enrolment4)
Rows: 398
Columns: 13
$ learner_id <chr> "46a4c71e-8819-4c5a-8164-2f786186b9fb", "091df104-705f-4ee2-a67b-9d90043c4f56", "357e0aed-eb55-4a0d-ac9b-ce5007d855d9", "09ef787d-6b55-4ed~
$ enrolled_at <chr> "2017-11-09 11:18:18 UTC", "2017-10-03 21:22:04 UTC", "2017-12-19 15:46:51 UTC", "2017-10-19 14:26:58 UTC", "2017-12-29 01:04:57 UTC", "20~
$ unenrolled_at <chr> "2018-10-19 10:31:59 UTC", "2018-10-08 16:27:33 UTC", "2018-09-29 20:27:55 UTC", "", "", "2018-08-11 12:44:38 UTC", "2018-08-10 11:49:07 U~
$ role <chr> "learner", "learner", "learner", "learner", "learner", "learner", "learner", "learner", "learner", "learner", "learner", "learner", "learn~
$ fully_participated_at <chr> "", "", "", "", "2018-09-12 06:54:19 UTC", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "2017-12-28 00:43:38 UTC", ~
$ purchased_statement_at <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""~
$ gender <chr> "female", "female", "male", "male", "male", "female", "female", "female", "male", "female", "female", "male", "male", "male", "female", "f~
$ country <chr> "GR", "GB", "EG", "GB", "CN", "GB", "UA", "NG", "NG", "MK", "AU", "BA", "IN", "NG", "EG", "GB", "ZA", "ZW", "NG", "PK", "GB", "KE", "NG", ~
$ age_range <chr> "26-35", ">65", "18-25", ">65", "36-45", "56-65", "26-35", "36-45", "26-35", "18-25", "46-55", "18-25", "18-25", "18-25", "18-25", "56-65"~
$ highest_education_level <chr> "university_masters", "university_degree", "secondary", "university_degree", "university_masters", "Unknown", "university_degree", "univer~
$ employment_status <chr> "full_time_student", "retired", "full_time_student", "retired", "working_full_time", "working_part_time", "self_employed", "working_full_t~
$ employment_area <chr> "accountancy_banking_and_finance", "Unknown", "it_and_information_services", "Unknown", "accountancy_banking_and_finance", "teaching_and_e~
$ detected_country <chr> "GR", "GB", "EG", "GB", "CN", "GB", "NO", "NG", "NG", "MK", "AU", "BA", "IN", "ZA", "EG", "GB", "GB", "ZW", "NG", "PK", "GB", "US", "NG", ~
I am wondering whether I can use one of the model-fitting approaches, such as regression, Bayes, or lasso, to predict whether I will have new enrolments in the future.
If yes, how could this be done?
Sure. As you outline, there are any number of techniques you can use. R will implement the techniques, but it won't guide you in which one you want.
My suggestion is to start by learning a little about regression (see "Linear regression" on Wikipedia; if you already know about this, my apologies) and then try out the R command lm().
Thanks,
I have used lm() with integer variables, comparing one variable against the others (Iris data).
I am wondering how I can handle the enrolment data and how to choose a response variable that does not exist yet and that I need to predict for the future (for example, whether I will have new enrolments next year).
If you run your regression with something like
results <- lm()
you can then get predictions from
predict(results, newData)
where newData has values of your independent variables for the periods for which you wish to predict your dependent variable.
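As a minimal sketch of that workflow: the yearly counts below are made up purely for illustration, and the column names year and enrolments are hypothetical. The point is only that newData must contain the same predictor columns that were used in the fit.

```r
# Hypothetical yearly enrolment counts (illustrative values, not real data)
past <- data.frame(
  year       = 2014:2018,
  enrolments = c(120, 150, 170, 210, 230)
)

# Fit a straight line: enrolments as a function of year
results <- lm(enrolments ~ year, data = past)

# New values of the independent variable for the periods to predict
newData <- data.frame(year = 2019:2020)

# One predicted value per row of newData
predicted <- predict(results, newdata = newData)
predicted
```

With only a handful of points a straight line is a rough guide at best, but the same two calls apply to any lm() fit.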
Thanks for your response.
Below is my code.
I am trying to predict whether I will have new enrolments.
modelenrolment4 <- enrolment4
glimpse(modelenrolment4)
modelenrolment4$age_range <- as.factor(modelenrolment4$age_range )
modelenrolment4$gender <- as.factor(modelenrolment4$gender )
modelenrolment4$detected_country <- as.factor(modelenrolment4$detected_country )
modelenrolment4$enrolled_at<- as.factor(modelenrolment4$enrolled_at )
results <- lm( enrolled_at ~ age_range+gender+ detected_country,data=modelenrolment4)
ind = sample(2, nrow(modelenrolment4), replace=TRUE, prob=c(0.6, 0.4))
df.traindata = modelenrolment4[ind==1,]
df.validate = modelenrolment4[ind==2,]
predict(results, df.traindata)
The result of the above is below, and I do not understand it:
predict(results, df.traindata)
2 6 8 10 14 15 17 19 21 23 25 29 32 33 34 36 37
175.36468 172.24691 171.28723 303.69985 240.47798 248.33454 189.25315 188.29347 192.37092 188.29347 271.65744 172.24691 175.36468 189.25315 265.34078 265.34078 151.94129
38 39 44 45 46 47 49 56 57 58 59 60 61 65 66 68 69
287.51108 239.00000 192.56108 253.36850 249.98660 222.16425 172.24691 192.56108 278.48926 221.56241 210.67448 260.98201 254.98614 204.55617 136.00000 113.49688 234.13653
70 71 74 76 78 80 81 82 84 86 87 88 89 90 91 93 95
164.20141 121.41779 166.46292 218.17605 143.66156 274.00000 185.87578 193.88074 86.29627 141.94773 175.55484 221.78034 236.17041 156.99511 234.13653 317.49463 210.88698
97
In traindata you have set new values of the independent variables using randomly selected values from the original data. predict() then applies the estimated coefficients from lm() to the new values and gives predicted outcomes for enrolled_at.
predict(results, df.traindata)
But I do not understand the output. How does it predict the new enrolments?
Sorry for all the questions, but I am trying to understand the predictions.
lm() estimates a number for each age range, gender, and country. predict() adds these together for the values of those variables in traindata.
You might look at summary(results) to see the estimated numbers and use head(traindata,1) to see the first set of values for traindata and then see if applying the former to the latter gets you the first predicted value.
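Here is a small sketch of that check on made-up data. The response days_enrolled and all values are hypothetical; only the column names gender and age_range echo the thread. It multiplies the coefficients by the 0/1 indicator row that lm() builds for the first observation and compares the sum with predict().

```r
# Hypothetical toy data with a numeric response (values are made up)
toy <- data.frame(
  days_enrolled = c(310, 280, 150, 400, 220, 330, 190, 260),
  gender    = factor(c("female", "male", "male", "female",
                       "male", "female", "female", "male")),
  age_range = factor(c("18-25", "26-35", "18-25", "36-45",
                       "26-35", "36-45", "18-25", "36-45"))
)

fit <- lm(days_enrolled ~ gender + age_range, data = toy)
coef(fit)                      # one number per non-baseline factor level

first_row <- head(toy, 1)

# Row of the design matrix for the first observation:
# 1 for the intercept, then 0/1 indicators for each factor level
x <- model.matrix(fit)[1, ]

# Prediction "by hand": add up the coefficients that are switched on
by_hand <- sum(x * coef(fit))

# Same value as predict() gives for that row
from_predict <- unname(predict(fit, newdata = first_row))
all.equal(by_hand, from_predict)
```

For the first row (female, 18-25) both indicators are 0 because those are the baseline levels, so the prediction is just the intercept.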
When I type summary(results) I get:
summary(results)
Call:
lm(formula = enrolled_at ~ age_range + gender + detected_country,
data = modelenrolment4)
Residuals:
Error in quantile.default(resid) : (unordered) factors are not allowed
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
For head(df.traindata,1):
learner_id enrolled_at unenrolled_at role fully_participated_at purchased_statement_at gender country age_range
2 091df104-705f-4ee2-a67b-9d90043c4f56 2017-10-03 21:22:04 UTC 2018-10-08 16:27:33 UTC learner female GB >65
highest_education_level employment_status employment_area detected_country
2 university_degree retired Unknown GB
Curious. Since summary() is not showing any coefficients, it appears that something is wrong in your specification of the regression. It isn't obvious to me what the problem is.
> results <- lm( enrolled_at ~ age_range+gender+ country,data=modelenrolment4)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
> results <- lm( enrolled_at ~ age_range ,data=modelenrolment4)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
Is there any command to check the quality of my data in these fields?
From what you posted earlier, it looks like enrolled_at is a character string representing a time. The left-hand side variable can't be a character string, and in general can't be a factor (although there are exceptions).
Are you trying to predict the time people enroll?
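A few base R commands can help check the fields before modelling. This is only a sketch on a four-row data frame built from the first rows of the glimpse() output above; the name enrol is hypothetical. On the real data you would run the same calls on modelenrolment4.

```r
# First four rows of some columns, copied from the glimpse() output
enrol <- data.frame(
  enrolled_at   = c("2017-11-09 11:18:18 UTC", "2017-10-03 21:22:04 UTC",
                    "2017-12-19 15:46:51 UTC", "2017-10-19 14:26:58 UTC"),
  unenrolled_at = c("2018-10-19 10:31:59 UTC", "2018-10-08 16:27:33 UTC",
                    "2018-09-29 20:27:55 UTC", ""),
  gender        = c("female", "female", "male", "male"),
  age_range     = c("26-35", ">65", "18-25", ">65"),
  stringsAsFactors = FALSE
)

# Storage type of every column: a "character" response cannot go into lm()
col_classes <- sapply(enrol, class)
col_classes

# Number of missing or empty values per column
empty_counts <- sapply(enrol, function(x) sum(is.na(x) | x == ""))
empty_counts

# Frequency of each code, useful for spotting odd or rare categories
table(enrol$age_range)
```

str(enrol) and summary(enrol) give a similar overview in one call.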
Yes, this is what I want, in relation to the variables age, gender, and country.
You probably want to convert enrolled_at to an R time variable. Sometimes, this can be difficult. Take a look at the {lubridate} package.
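As a rough sketch of the conversion, here is a base R version that needs no extra package; {lubridate} offers parsers for the same job. The first three strings are copied from the glimpse() output, and the trailing empty string is an assumption added to show how blanks (like those in unenrolled_at) turn into NA.

```r
enrolled_chr <- c("2017-11-09 11:18:18 UTC", "2017-10-03 21:22:04 UTC",
                  "2017-12-19 15:46:51 UTC", "")

# Drop the literal " UTC" suffix, then parse with an explicit format
cleaned       <- sub(" UTC$", "", enrolled_chr)
enrolled_time <- as.POSIXct(cleaned, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(enrolled_time)   # a date-time class rather than character
enrolled_time          # the empty string becomes NA

# Two numeric or grouping variables that could be derived from it
enrolled_month <- format(enrolled_time, "%Y-%m")
days_since_first <- as.numeric(
  difftime(enrolled_time, min(enrolled_time, na.rm = TRUE), units = "days")
)
```

A numeric quantity such as days_since_first can serve as the left-hand side of lm(), whereas the original character column cannot.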
see my email
results <- lm( age_range~ gender ,data=modelenrolment4)
summary(results)
ind = sample(2, nrow(modelenrolment4), replace=TRUE, prob=c(0.6, 0.4))
df.traindata = modelenrolment4[ind==1,]
df.validate = modelenrolment4[ind==2,]
head(df.traindata,1)
predict(results, df.traindata) ###
I get the results below:
summary(results)
Call:
lm(formula = age_range ~ gender, data = modelenrolment4)
Residuals:
Min 1Q Median 3Q Max
-3.4416 -1.2725 -0.2725 1.5584 2.7275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6106 0.2767 16.665 <2e-16 ***
gender -0.1690 0.1646 -1.027 0.305
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.638 on 396 degrees of freedom
Multiple R-squared: 0.002657, Adjusted R-squared: 0.0001389
F-statistic: 1.055 on 1 and 396 DF, p-value: 0.3049
ind = sample(2, nrow(modelenrolment4), replace=TRUE, prob=c(0.6, 0.4))
df.traindata = modelenrolment4[ind==1,]
df.validate = modelenrolment4[ind==2,]
predict(results, df.traindata)
2 6 8 10 14 15 17 19 21 23 25 29 32 33 34 36 37 38 39
4.441553 4.441553 4.441553 4.441553 4.272515 4.441553 4.272515 4.272515 4.272515 4.272515 4.272515 4.441553 4.441553 4.272515 4.272515 4.272515 4.272515 4.272515 4.272515
44 45 46 47 49 56 57 58 59 60 61 65 66 68 69 70 71 74 76
4.272515 4.272515 4.441553 4.272515 4.441553 4.272515 4.272515 4.272515 4.272515 4.272515 4.272515 4.441553 4.272515 4.441553 4.272515 4.272515 4.441553 4.441553 4.272515
78 80 81 82 84 86 87 88 89 90 91 93 95 97 98 102 103 106 107
4.272515 4.441553 4.441553 4.441553 4.272515 4.272515 4.441553 4.441553 4.272515 4.441553 4.272515 4.272515 4.272515 4.272515 4.441553 4.441553 4.272515 4.272515 4.272515
110 111 112 113 114 115 116 117 118 119 120 121 122 125 127 128 129 131 132
4.4415