Create two-variable table with similar categories

Hendrina · July 25, 2019, 8:02am

I am trying to create a table (for chi-square test) that presents "unadjstatus" and "status" values like so

	unadjstatus	status
low	4	5
ok	2	1

using the data below

data.frame(stringsAsFactors=FALSE,
       unadjstatus = c("low", "low", "low", "low", "ok", "ok"),
              inf1 = c("norm", "inflammed", "norm", "inflammed", "inflammed",
                       "norm"),
            status = c("low", "low", "low", "low", "ok", "low"),

I have tried most of the basic functions such as

 x=count(men,c("status","unadjstatus"))
Error: Column `c("status", "unadjstatus")` must be length 215 (the number of rows) or one, not 2
> x
  status unadjstatus freq
1    low         low   5
2     ok         low    2
3     ok          ok   0

> apply(men[cols],count)
Error in match.fun(FUN) : argument "FUN" is missing, with no default
> apply(men[cols],2,count)
$status
    x freq
1 low  5
2  ok   1

$unadjstatus
    x freq
1 low  4
2  ok   2

I am obviously missing something because it's either presenting the frequency separately, or counting them together. Any suggestions on a function that could work? I also tried ddply() and cbind() but kept getting similar results

Yarnabrina · July 25, 2019, 1:28pm

I'm providing two solutions, but I think there may be a better way.

library(magrittr)

sample_dataset <- data.frame(unadjstatus = c("low", "low", "low", "low", "ok", "ok"),
                             inf1 = c("norm", "inflammed", "norm", "inflammed", "inflammed", "norm"),
                             status = c("low", "low", "low", "low", "ok", "low"),
                             stringsAsFactors = FALSE)

sample_dataset %>%
  dplyr::summarise_at(.vars = -2,
                      .funs = ~ list(table(.))) %>%
  tidyr::unnest()
#>   unadjstatus status
#> 1           4      5
#> 2           2      1

vapply(X = sample_dataset[-2],
       FUN = table,
       FUN.VALUE = c("low" = 0, "ok" = 0))
#>     unadjstatus status
#> low           4      5
#> ok            2      1

grosscol · July 25, 2019, 1:46pm

The trailing comma in your data frame example should be a ).

Sometimes when I have to aggregate and count, I find it convenient to work in a long data format. I've left the gather and spread operations split out so you can easily see how the data changes during the intermediate steps.

library(tidyr)
library(dplyr)
library(tibble)

df <- data.frame(stringsAsFactors=FALSE,
           unadjstatus = c("low", "low", "low", "low", "ok", "ok"),
           inf1 = c("norm", "inflammed", "norm", "inflammed", "inflammed",
                    "norm"),
           status = c("low", "low", "low", "low", "ok", "low"))

# Make long data from given wide data
df_gathered  <- tidyr::gather(df, key='category', value='val')

# Filter out unused category.  Group and count remaining
count_data <- df_gathered %>% 
  filter(category != "inf1") %>%
  group_by(category, val)  %>% 
  summarize(count=n())

# Spread back to wide data
df_spread <- tidyr::spread(count_data, key='category', value='count')

# Use val column to make column names
result <- tibble::column_to_rownames(df_spread, var="val")

result

    status unadjstatus
low      5           4
ok       1           2

Hendrina · July 25, 2019, 2:14pm

Yes, that was my error. The data.frame is much longer than the sample (125 variables) and I decided to remove part of it to avoid having a lengthy question.

Hendrina · July 25, 2019, 2:43pm

Thanks for the solutions, both methods work. My data.frame has more variables than I posted (125) and I was trying to figure out how I can remove the unwanted variables in the code. the simplest here was re-ordering variable 2 and three and removing those like so (note: the counts here are for the entire data.frame)

> try%>%dplyr::summarise_at(.vars=-3:-125, .funs = ~list(table(.))) %>%tidyr::unnest()
# A tibble: 2 x 2
   unadjstatus status
low         146    142
ok           69     73

However, if I needed the variable order maintained how would I do it? I tried these below but both didn't work

> vapply(try[-2,-4:-125,],table, FUN.VALUE = c("low"=0, "ok"=0))
    unadjstatus status inf1
low         145    141   34
ok           69     73  180

> men%>%dplyr::summarise_at(.vars=-2,-4:-125, .funs = ~list(table(.))) %>%tidyr::unnest()
Error: All nested columns must have the same number of elements.
Call `rlang::last_error()` to see a backtrace

Yarnabrina · July 25, 2019, 3:40pm

Can you please share your data set (or a small part of it) in a copy-paste friendly format?

In case you don't know how to do it, there are many options, which include:

If you have stored the data set in some R object, dput function is very handy.
In case the data set is in a spreadsheet, check out the datapasta package. Take a look at this link.

You can also share it using Google Drive, Dropbox, etc.

And, please note the following:

I used negative indexing to remove the 2nd column. For your full dataset, you can use positive indexing, and it'll be easier.
Your 2nd attempt is wrong. You can't provide indices in that way. Putting those in c should work, but using positive indexing will be easier.
Please define what you mean by maintaining variable order.

Hendrina · July 25, 2019, 4:41pm

I can only share bits of it (like in the original question) not the whole dataset because it's actual observations from a certain population. I can add more variables however, I was able to figure everything out because of your help as explained below

I tried positive indexing and it worked in both methods by putting them in c (I should have thought of that to begin with)

I meant if I wanted to apply the function on variables that are not next to each other e.g. variable 1 and variable 3, without rearranging them to column 1 and 2 . So the solution you offered about positive indexing actually works for all the questions I had here.

Thanks a lot for the help

Leon · July 25, 2019, 5:21pm

Hi @Hendrina,

Would this be what you are looking for?

# Load libraries
library('tidyverse')

# Define data
d = tibble(unadjstatus = c("low", "low", "low", "low", "ok", "ok"),
           inf1 = c("norm", "inflammed", "norm", "inflammed", "inflammed", "norm"),
           status = c("low", "low", "low", "low", "ok", "low"))

# Count and join
d_c = full_join(d %>%
                  select(-inf1) %>%
                  count(unadjstatus) %>%
                  rename(level = unadjstatus, unadjstatus = n),
                d %>%
                  select(-inf1) %>%
                  count(status) %>%
                  rename(level = status, status = n),
                by = "level")

Yielding:

> d_c
# A tibble: 2 x 3
  level unadjstatus status
  <chr>       <int>  <int>
1 low             4      5
2 ok              2      1

and then:

> column_to_rownames(d_c, "level")
    unadjstatus status
low           4      5
ok            2      1

Hope it helps!

system · August 1, 2019, 5:21pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.