I have student-level data where a single column lists for each student their unique major(s), minor(s), and any certificate(s) they may have earned while graduating.
The raw data look like this:
mydf <- structure(list(id = c("21996", "21996", "21996", "21996", "21996",
"25294", "25294", "25294", "25294", "25294", "62410", "62410",
"62410", "62410", "62410", "62656", "62656", "62656", "62656",
"62656", "00073", "00073", "00073", "00073", "00073", "31337",
"31337", "31337", "31337", "31337"), dtype = c("BB", "BB", "BB",
"CT", "OR", "BS", "CT", "OR", "OR", "OR", "BA", "BA", "CT", "CT",
"OR", "BC", "CT", "OR", "OR", "OR", "BS", "OR", "OR", "OR", "OR",
"BS", "CT", "OR", "OR", "OR"), code = c("6120", "6126", "6158",
"LEAD", "COMS", "6472", "DIAB", "2121", "3311", "6360", "4201",
"4409A", "ISLM", "LJCU", "4211", "5363", "SALE", "6125", "6127",
"BSAD", "7269", "4231", "4233", "7270", "BSAD", "6380", "SALE",
"6127", "6181", "BSAD")), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
For each unique id I would like to create columns listing their first major, second major, third major, and so on, followed by their first minor, second minor, and so on, and closing with their first certificate, second c certificate, and so on.
dtype values of CT represent a certificate
dtype values of OR represent a minor
!dtype %in% c("CT, "OR) represent a major
Each unique value of code represents the unique major/minor/certificate
The resulting data frame will be as follows:
out.df <- structure(list(id = c(21996, 25294, 62410, 62656, 73, 31337),
major1 = c(6120, 6472, 4201, 5363, 7269, 6380), major2 = c("6126",
NA, "4409A", NA, NA, NA), major3 = c(6158, NA, NA, NA, NA,
NA), major4 = c(NA, NA, NA, NA, NA, NA), minor1 = c("COMS",
"2121", "4211", "6125", "4231", "6127"), minor2 = c(NA, 3311,
NA, 6127, 4233, 6181), minor3 = c(NA, "6360", NA, "BSAD",
"7270", "BSAD"), minor4 = c(NA, NA, NA, NA, "BSAD", NA),
cert1 = c("LEAD", "DIAB", "ISLM", "SALE", NA, "SALE"), cert2 = c(NA,
NA, "LJCU", NA, NA, NA), cert3 = c(NA, NA, NA, NA, NA, NA
), cert4 = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Looks to me like I should be able to create placeholder columns that represent majors, minors, certs 1 through 4, and then pull in unique values of code for each dtype, inserting NA if the student has only one major, one minor, etc. However, the secret sauce is eluding me. I have a feeling that I can combine tidyr and dplyr to solve this but am drawing blanks. As usual, any and all leads are mucho mucho appreciated.