How to code variables into multiple categories

Researcher_Q · July 12, 2019, 2:02pm

I am analysing data about people who join three specific groups. The raw data has the full group name, which I have recoded to single letters (A, X, and Y) through e.g.

Data$org[Data$org=='full A group name']='A'

However, many people are members of more than one of these organisations - I want to code these as being in multiple categories, but have struggled to do so.

I tried things like:

Data$org[Data$org=='full X group name,full Y group name']='X', 'Y'
Data$org[Data$org=='full X group name,full Y group name']='X'&&'Y'
Data$org[Data$org=='full X group name,full Y group name']='X'+'Y'

But none of these have worked - could someone please help me out on how to do this? I have limited experience with rstudio and very little with other statistical software or coding in general.
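A note on why the three attempts above fail: one element of a character column holds exactly one value, so `'X', 'Y'` is a syntax error, and `&&` and `+` are not defined for character strings. One possible alternative, sketched here in base R, is to leave `org` as it is and add one TRUE/FALSE column per organisation. The sample values and the column names `in_X` and `in_Y` are made up for illustration and are not from the real data.

```r
# Hypothetical sample data; the group names stand in for the real ones
Data <- data.frame(
  org = c("full X group name",
          "full X group name,full Y group name",
          "full Y group name",
          NA),
  stringsAsFactors = FALSE
)

# One logical column per organisation; fixed = TRUE matches the name literally
Data$in_X <- grepl("full X group name", Data$org, fixed = TRUE)
Data$in_Y <- grepl("full Y group name", Data$org, fixed = TRUE)

# grepl() returns FALSE for NA input, so restore NA where membership is unknown
Data$in_X[is.na(Data$org)] <- NA
Data$in_Y[is.na(Data$org)] <- NA

print(Data)
```

With this layout a person in both groups simply has TRUE in both columns, and each group can be analysed by filtering on its own column.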

pieterjanvc · July 12, 2019, 4:35pm

Hi,

I think the best way to do this is to convert the data from a "long" format to a "wide" one. That way, you'll be able to do additional analysis on people in different organisations.

Here is an example:

library("dplyr")
library("tidyr")
# Generate fake data: one row per unique person/organisation pair
data = data.frame(person = sample(1:25, 50, replace = TRUE), 
                  org = sample(LETTERS[1:5], 50, replace = TRUE)) %>% 
  group_by(person, org) %>% 
  summarise()

# Label every known person/organisation pair as present
data$present = TRUE

# Make the table wide; if a person does not belong to an organisation, set it to FALSE
data = spread(data, org, present, fill = FALSE)
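For readers on a newer tidyr, the same long-to-wide step can probably be written with `pivot_wider()`, which superseded `spread()` in tidyr 1.0.0. This is only a sketch under that version assumption; the `memberships` data frame and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per person/organisation pair
memberships <- data.frame(
  person = c(1, 1, 2, 3, 3),
  org    = c("A", "X", "A", "X", "Y"),
  stringsAsFactors = FALSE
)

wide <- memberships %>%
  distinct(person, org) %>%       # drop duplicate pairs
  mutate(present = TRUE) %>%      # mark every known membership
  pivot_wider(names_from = org,   # one column per organisation
              values_from = present,
              values_fill = list(present = FALSE))  # missing pair = not a member

print(wide)
```

Each row of `wide` is one person, with a TRUE/FALSE column for every organisation that appears in the data.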

Let me know if this helps
PJ

Researcher_Q · July 12, 2019, 5:34pm

Hi PJ, thanks for the reply, I tried this:

 #> data=data.frame(person=sample(1:51,43,replace=T)org=sample(LETTERS[1:3,43,replace=T))

and got this error message:

 #>  Error: unexpected symbol in "data=data.frame(person=sample(1:51,43,replace=T)org"

Not sure what I am doing wrong, but as I said I am fairly unfamiliar with all this and not sure I completely understood your comment.

andresrcs · July 12, 2019, 6:08pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners

pieterjanvc · July 12, 2019, 6:48pm

Hi,

You have to run the whole block of code, from data = ... through summarise() (it's one statement spread over multiple lines). The reason for the grouping is that the random generator can create duplicate person/organisation pairs, and group_by() followed by summarise() gets rid of them. This is not important, though, as you can replace the data variable with your real data.

In the example in your reply, you accidentally deleted the ] after LETTERS[1:3. This is the correct version (without the grouping, though):

data=data.frame(person=sample(1:51,43,replace=T), org=sample(LETTERS[1:3],43,replace=T))

Grtz

Researcher_Q · July 13, 2019, 12:17pm

Hi PJ,

So I followed your instructions again, running it as one chunk of code and including the ], like so:

#> Data=data.frame(person=sample(1:51,43,replace=T)org=sample(LETTERS[1:3],43,replace=T))%>%group_by(person,org)%>%summarise()

But I am receiving this error message:

#> Error: unexpected symbol in "Data=data.frame(person=sample(1:51,43,replace=T)org"

Not sure what the unexpected symbol would be?

Thanks for all your help,
Q

Researcher_Q · July 13, 2019, 12:46pm

I've tried to follow the reprex guide, although I am a little confused by the dataset bit. So this is an example of my data set with two important variables - lettered classification of the respondents and the organisations they are involved in.

 #>
data.frame(stringsAsFactors=FALSE,
         org = c(NA, NA, NA, "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A", "Organisation A",
                 "Organisation A",
                 "Organisation A", "Organisation A", NA,
                 "Organisation X,Organisation A",
                 "Organisation X,Organisation A",
                 "Organisation X,Organisation A",
                 "Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", NA, "Organisation X", "Organisation X",
                 "Organisation X"),
       Class = c(NA, NA, NA, "A", "A", "N/A", "B", "C1", "B", "D", "B", "B",
                 "D", "C1", "B", "D", "E", "B", NA, "B", "C1", "C2", "A",
                 "N/A", "S", "B", NA, "B", "S", "B")
)

Now I want to try and code the organisations each respondent is involved with (A,X, etc), but as some are involved in multiple, I need them to be coded as both A and X. I originally tried code like this:

#>

and it wouldn't work. Following PJ, I have now been trying this:

#>Data=data.frame(person=sample(1:51,43,replace=T)org=sample(LETTERS[1:3],43,replace=T))%>%group_by(person,org)%>%summarise()

But have been receiving this error message:

 #>Error: unexpected symbol in "Data=data.frame(person=sample(1:51,43,replace=T)org"

I am wondering if it is easier to just copy the data of the respondents who are part of more than one org and code one as one org and one as the other.

andresrcs · July 13, 2019, 1:26pm

Is this the result you are looking for?

library(tidyverse)

# Sample data
Data <- data.frame(stringsAsFactors=FALSE,
           org = c(NA, NA, NA, "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A", "Organisation A",
                   "Organisation A",
                   "Organisation A", "Organisation A", NA,
                   "Organisation X,Organisation A",
                   "Organisation X,Organisation A",
                   "Organisation X,Organisation A",
                   "Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", "Organisation X,
                Organisation A", NA, "Organisation X", "Organisation X",
                   "Organisation X"),
           Class = c(NA, NA, NA, "A", "A", "N/A", "B", "C1", "B", "D", "B", "B",
                     "D", "C1", "B", "D", "E", "B", NA, "B", "C1", "C2", "A",
                     "N/A", "S", "B", NA, "B", "S", "B")
)

# Recoding
Data %>% 
    mutate(org = str_extract_all(string = org,
                                 pattern = "(?<=Organisation\\s)[A-Z]"))
#>     org Class
#> 1    NA  <NA>
#> 2    NA  <NA>
#> 3    NA  <NA>
#> 4     A     A
#> 5     A     A
#> 6     A   N/A
#> 7     A     B
#> 8     A    C1
#> 9     A     B
#> 10    A     D
#> 11    A     B
#> 12    A     B
#> 13    A     D
#> 14    A    C1
#> 15    A     B
#> 16    A     D
#> 17    A     E
#> 18    A     B
#> 19   NA  <NA>
#> 20 X, A     B
#> 21 X, A    C1
#> 22 X, A    C2
#> 23    A     A
#> 24 X, A   N/A
#> 25 X, A     S
#> 26 X, A     B
#> 27   NA  <NA>
#> 28    X     B
#> 29    X     S
#> 30    X     B

Created on 2019-07-13 by the reprex package (v0.3.0)
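Note that `str_extract_all()` returns a list column, so each cell of `org` holds a vector of letters. If one row per respondent/organisation pair is easier to analyse, the list column can be expanded with `tidyr::unnest()`. The following is a rough sketch on a made-up four-row subset; the `id` column is an assumption added so that rows from the same respondent can still be matched up.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical subset of the sample data, with a respondent id added
Small <- data.frame(
  id    = 1:4,
  org   = c(NA, "Organisation A", "Organisation X,Organisation A", "Organisation X"),
  Class = c(NA, "A", "B", "S"),
  stringsAsFactors = FALSE
)

long <- Small %>%
  mutate(org = str_extract_all(org, "(?<=Organisation\\s)[A-Z]")) %>%
  unnest(org)   # one row per respondent/organisation pair

print(long)
```

Respondent 3 now appears twice, once for X and once for A, which is close to the "copy the respondent and code each copy separately" idea but without editing the data by hand.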

About the error you are getting with pieterjanvc's code: you are just missing a comma between the two arguments. Where you have replace=T)org=sample, it should be replace=T), org=sample

Data <- data.frame(person=sample(1:51,43,replace=T),
                   org=sample(LETTERS[1:3],43,replace=T)) %>% 
    group_by(person,org) %>% 
    summarise()
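To see where "unexpected symbol" comes from, it can help to check the syntax of a shortened call with and without the comma. This is a small base R sketch; the strings `bad` and `good` are simplified stand-ins for the real call, not the original code.

```r
# The same call with and without the comma between the two arguments, kept as text
bad  <- "data.frame(person = 1:3 org = c('A', 'X', 'Y'))"
good <- "data.frame(person = 1:3, org = c('A', 'X', 'Y'))"

# parse() only checks syntax: it fails on the first string and succeeds on the second
bad_result  <- tryCatch(parse(text = bad), error = function(e) conditionMessage(e))
good_result <- eval(parse(text = good))

cat(bad_result, "\n")   # the parser's message, which mentions "unexpected symbol"
print(good_result)
```

The parser stops at `org` because, without a comma, it reads `org` as a stray name directly after a complete argument.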

Researcher_Q · July 13, 2019, 3:54pm

Thanks Andres, I have done this but am now getting an error saying that RStudio cannot find the function "%>%". Am I missing a relevant package?

pieterjanvc · July 13, 2019, 4:07pm

Yes, you need to load dplyr with library("dplyr") before running that code.
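As a quick check that the pipe is available: `%>%` comes from the magrittr package and is re-exported by dplyr, so attaching dplyr (or the whole tidyverse) defines it. A minimal sketch, assuming dplyr is already installed:

```r
library(dplyr)   # attaching dplyr makes %>% available

# x %>% f() is the same as f(x), so this computes sum(sqrt(c(1, 4, 9)))
total <- c(1, 4, 9) %>% sqrt() %>% sum()
print(total)
```

If `library(dplyr)` itself fails, the package most likely needs to be installed first with `install.packages("dplyr")`.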
