How to code variables into multiple categories

I am analysing data about people who join three specific groups. The raw data has the full group name, which I have recoded to single letters (A, X, and Y) through e.g.

Data$org[Data$org=='full A group name']='A'

However, many people are members of more than one of these organisations - I want to code these as being in multiple categories, but have struggled to do so.

I tried things like:

Data$org[Data$org=='full X group name,full Y group name']='X', 'Y'
Data$org[Data$org=='full X group name,full Y group name']='X'&&'Y'
Data$org[Data$org=='full X group name,full Y group name']='X'+'Y'

But none of these worked. Could someone please help me with how to do this? I have limited experience with RStudio and very little with other statistical software or coding in general.

Hi,

I think the best way to do this is to convert the data from a "long" format to a "wide" one. This way, you'll be able to do additional analysis for people in different organisations.

Here is an example:

library("dplyr")
library("tidyr")
# Generate fake data: one row per unique person/organisation pair
data = data.frame(person = sample(1:25, 50, replace = TRUE), 
                  org = sample(LETTERS[1:5], 50, replace = TRUE)) %>% 
  group_by(person, org) %>% 
  summarise()

# Mark every known person/organisation pair as present
data$present = TRUE

# Make the table wide; if a person does not belong to an organisation, set it to FALSE
data = spread(data, org, present, fill = FALSE)
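
If your real data stores the memberships as one comma-separated string per person, a rough sketch of getting to the same wide table could look like the following. The `raw` data frame, the `person` id column and the "Organisation " prefix are assumptions made up for illustration, so adjust them to your own column names and group names.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature data: one row per respondent,
# memberships stored as a comma-separated string
raw <- data.frame(
  person = 1:4,
  org = c("Organisation A", "Organisation X,Organisation A",
          "Organisation X", NA),
  stringsAsFactors = FALSE
)

wide <- raw %>%
  filter(!is.na(org)) %>%                        # respondents with no organisation drop out here
  separate_rows(org, sep = ",\\s*") %>%          # one row per person/organisation pair
  mutate(org = sub("^Organisation ", "", org),   # shorten to the single-letter code
         present = TRUE) %>%
  spread(org, present, fill = FALSE)             # one TRUE/FALSE column per organisation

wide
```

Each organisation ends up as its own TRUE/FALSE column, so a person in both A and X simply has TRUE in both.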


Let me know if this helps
PJ

Hi PJ, thanks for the reply, I tried this:

 #> data=data.frame(person=sample(1:51,43,replace=T)org=sample(LETTERS[1:3,43,replace=T))

and got this error message:

 #>  Error: unexpected symbol in "data=data.frame(person=sample(1:51,43,replace=T)org"

Not sure what I am doing wrong, but as I said I am fairly unfamiliar with all this and not sure I completely understood your comment.

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Hi,

You have to run the whole block of code in one go, from data = ... through to summarise() (it is a single statement spread over multiple lines). The group_by() and summarise() steps are there because the random generator can create duplicate person/organisation pairs, and these steps get rid of them. This is not important though, as you can replace the data variable with your real data.
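
To illustrate the de-duplication step, here is a small sketch. The seed and the sizes are arbitrary choices for the example; `distinct()` is shown only as another way of writing the same intent.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
fake <- data.frame(person = sample(1:5, 20, replace = TRUE),
                   org = sample(LETTERS[1:3], 20, replace = TRUE))

# group_by() + summarise() with no summaries keeps one row per person/org pair
deduped <- fake %>%
  group_by(person, org) %>%
  summarise()

# distinct() expresses the same intent directly
same <- fake %>% distinct(person, org)

nrow(fake)     # 20 rows, some of them repeated pairs
nrow(deduped)  # at most 15, since there are only 5 x 3 possible pairs
```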

In the code in your reply, you accidentally deleted the ] after LETTERS[1:3. This is the correct version (without the grouping though):

data=data.frame(person=sample(1:51,43,replace=T), org=sample(LETTERS[1:3],43,replace=T))

Grtz

Hi PJ,

So I re-followed your instructions, running it as one chunk of code and including the ], like so:

#> Data=data.frame(person=sample(1:51,43,replace=T)org=sample(LETTERS[1:3],43,replace=T))%>%group_by(person,org)%>%summarise()

But I am receiving this error message:

#> Error: unexpected symbol in "Data=data.frame(person=sample(1:51,43,replace=T)org"

Not sure what the unexpected symbol would be?

Thanks for all your help,
Q

I've tried to follow the reprex guide, although I am a little confused by the dataset bit. So this is an example of my data set with two important variables - lettered classification of the respondents and the organisations they are involved in.

 #>
data.frame(stringsAsFactors=FALSE,
         org = c(NA, NA, NA, "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A", NA,
                 "Organisation X,Organisation A",
                 "Organisation X,Organisation A",
                 "Organisation X,Organisation A",
                 "Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", NA, "Organisation X", "Organisation X",
                 "Organisation X"),
       Class = c(NA, NA, NA, "A", "A", "N/A", "B", "C1", "B", "D", "B", "B",
                 "D", "C1", "B", "D", "E", "B", NA, "B", "C1", "C2", "A",
                 "N/A", "S", "B", NA, "B", "S", "B")
)

Now I want to try and code the organisations each respondent is involved with (A,X, etc), but as some are involved in multiple, I need them to be coded as both A and X. I originally tried code like this:

 #>

and it wouldn't work. Following PJ, I have now been trying this:

#>Data=data.frame(person=sample(1:51,43,replace=T)org=sample(LETTERS[1:3],43,replace=T))%>%group_by(person,org)%>%summarise()

But have been receiving this error message:

 #>Error: unexpected symbol in "Data=data.frame(person=sample(1:51,43,replace=T)org"

I am wondering if it is easier to just copy the data of the respondents who are part of more than one org and code one as one org and one as the other.

Is this the result you are looking for?

library(tidyverse)

# Sample data
Data <- data.frame(stringsAsFactors=FALSE,
           org = c(NA, NA, NA, "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A", NA,
                   "Organisation X,Organisation A",
                   "Organisation X,Organisation A",
                   "Organisation X,Organisation A",
                   "Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", NA, "Organisation X", "Organisation X",
                   "Organisation X"),
           Class = c(NA, NA, NA, "A", "A", "N/A", "B", "C1", "B", "D", "B", "B",
                     "D", "C1", "B", "D", "E", "B", NA, "B", "C1", "C2", "A",
                     "N/A", "S", "B", NA, "B", "S", "B")
)

# Recoding
Data %>% 
    mutate(org = str_extract_all(string = org,
                                 pattern = "(?<=Organisation\\s)[A-Z]"))
#>     org Class
#> 1    NA  <NA>
#> 2    NA  <NA>
#> 3    NA  <NA>
#> 4     A     A
#> 5     A     A
#> 6     A   N/A
#> 7     A     B
#> 8     A    C1
#> 9     A     B
#> 10    A     D
#> 11    A     B
#> 12    A     B
#> 13    A     D
#> 14    A    C1
#> 15    A     B
#> 16    A     D
#> 17    A     E
#> 18    A     B
#> 19   NA  <NA>
#> 20 X, A     B
#> 21 X, A    C1
#> 22 X, A    C2
#> 23    A     A
#> 24 X, A   N/A
#> 25 X, A     S
#> 26 X, A     B
#> 27   NA  <NA>
#> 28    X     B
#> 29    X     S
#> 30    X     B

Created on 2019-07-13 by the reprex package (v0.3.0)
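
Note that str_extract_all() returns a list, so org becomes a list column here. If that is awkward for later analysis, the sketch below shows two possible follow-ups on a cut-down version of the data. The `respondent` id column is an assumption added for illustration and is not in the original data.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical miniature of the sample data
Data <- data.frame(
  org = c(NA, "Organisation A", "Organisation X,Organisation A", "Organisation X"),
  Class = c(NA, "A", "B", "S"),
  stringsAsFactors = FALSE
)

recoded <- Data %>%
  mutate(respondent = row_number(),   # hypothetical id so rows can be traced back
         org = str_extract_all(org, "(?<=Organisation\\s)[A-Z]"))

# Option 1: one row per respondent/organisation pair
long <- recoded %>% unnest(org)

# Option 2: keep one row per respondent, with the codes joined into one string
flat <- recoded %>%
  mutate(org = vapply(org,
                      function(x) if (all(is.na(x))) NA_character_ else paste(x, collapse = ","),
                      character(1)))
```

Option 1 is close to the "copy the respondent once per organisation" idea, so counts by organisation work directly, but the same respondent then appears in more than one row.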

As for the error you are getting with pieterjanvc's code, you are just missing a comma: replace=T)org=sample should be replace=T), org=sample

Data <- data.frame(person=sample(1:51,43,replace=T),
                   org=sample(LETTERS[1:3],43,replace=T)) %>% 
    group_by(person,org) %>% 
    summarise()

Thanks Andres, I have done this but am getting an error saying that RStudio cannot find the function "%>%". Am I missing a relevant package?

Yes, you need to load dplyr first with library("dplyr")
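
As a quick check that the pipe is available, something like this should run without the "could not find function" error once the package is attached (this assumes dplyr is already installed):

```r
# %>% is not part of base R; it becomes available once a package
# that exports it, such as dplyr, is attached
library(dplyr)   # run install.packages("dplyr") first if it is not installed

# The pipe passes the left-hand value as the first argument of the next call
result <- c(1, 4, 9) %>% sqrt() %>% sum()
result
#> [1] 6
```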
