[R Newbie] - Error: Having issues with running regression due to "differing number of rows"

DP398 · August 11, 2023, 3:08pm

I am a researcher running a binomial regression (and coding and doing statistics) for the first time ever for work - it's been an experience! I took over this project for work midway through, so did not develop the initial coding myself. I've never coded before so I've been learning R as I go. My apologies if I've not laid out the issue below as I should have or missed out any critical information, I'm really very much a novice at all of this.

The problem: I've had to expand the dataset R is pulling from, and am getting a bunch of errors due to an apparent mismatch of rows. However I can't figure out what I need to do next to fix this.

The initial dataset was 1,276 individuals (rows), each responding to a selection from 188 questions (columns). I have since been asked to add responses to 8 further questions to this initial dataset, meaning 196 questions (columns) for the final dataset. Overall, there have only ever been 9 columns, and that remains unchanged. However, I am having an issue with adjusting my code to account for the addition of these new columns.

Any ideas welcome with respect to what might be causing the mismatch of rows!

The details:

For example, my first code, which would run:

Ans_Data = read_xlsx("DSM Data 15.2.23 IB v4.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A12:GG1290", col_names = F, col_types = c("text",rep("numeric",188)))
Question_Data = t(read_xlsx("DSM Data 15.2.23 IB v4.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A1:GG10", col_names = T))

colnames(Question_Data) = Question_Data[1,] 
Question_Data = Question_Data[-1,] 
Question_Data = data.table(Question_Data)

Ans_Data_2 = Ans_Data %>%
  pivot_longer(cols = colnames(Ans_Data)[2:189])

for (i in 1:1278) {
  if (i==1) {
    Question_Data_2 = rbind(Question_Data,Question_Data)
  } else {
    Question_Data_2 = rbind(Question_Data_2,Question_Data)
  }
}

Ans_Data_3 = cbind(Ans_Data_2, Question_Data_2)

However, my updated code:

Ans_Data = read_xlsx("DSM Data 15.2.23 DP v5.xlsx",
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A12:GO1287", col_names = F,col_types = c("text",rep("numeric",196)))
Question_Data = t(read_xlsx("DSM Data 15.2.23 DP v5.xlsx", 
  sheet = "CHANGED Tab 2 - AR weighted",
  range = "A1:GO10", col_names = T))

colnames(Question_Data) = Question_Data[1,] 
Question_Data = Question_Data[-1,] 
Question_Data = data.table(Question_Data)

Ans_Data_2 = Ans_Data %>%
  pivot_longer(cols = colnames(Ans_Data)[2:197])

for (i in 1:1278) {
  if (i==1) {
    Question_Data_2 = rbind(Question_Data,Question_Data)
  } else {
    Question_Data_2 = rbind(Question_Data_2,Question_Data)
  }
}

Ans_Data_3 = cbind(Ans_Data_2, Question_Data_2)

produces the following error:

Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 250096, 250684

technocrat · August 11, 2023, 9:42pm

The mismatch in the number of columns is to blame (you'll also run across this in other contexts as "non-conformable arrays"). The fix is straightforward and also works for the corresponding case of mismatched rows.

qd <- data.frame(
  Question_ID = c(
    "sawd4_batch2", "sawd3_batch3",
    "sand4_batch", "samd3", "samd32", "bwpx_batch", "bwd3", "bwd32",
    "bmd3_batch5", "bm3_batch2"
  ), `Media Item Subtype` = c(
    "Image",
    "Image", "Image", "Image", "Image", "Image", "Image", "Image",
    "Image", "Image"
  ), `Contains Synthetic Media?` = c(
    "Yes (Fully Synthetic)",
    "Yes (Fully Synthetic)", "Yes (Fully Synthetic)", "Yes (Fully Synthetic)",
    "Yes (Fully Synthetic)", "Yes (Fully Synthetic)", "Yes (Fully Synthetic)",
    "Yes (Fully Synthetic)", "Yes (Fully Synthetic)", "Yes (Fully Synthetic)"
  ), `Real/Fake Image` = c(
    "Fake", "Fake", "Fake", "Fake", "Fake",
    "Fake", "Fake", "Fake", "Fake", "Fake"
  ), `Real/Fake Audio` = c(
    NA_character_,
    NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
    NA_character_, NA_character_, NA_character_, NA_character_
  ),
  `Real/Fake Video` = c(
    NA_character_, NA_character_, NA_character_,
    NA_character_, NA_character_, NA_character_, NA_character_,
    NA_character_, NA_character_, NA_character_
  ), `Type of Image` = c(
    "Human",
    "Human", "Human", "Human", "Human", "Human", "Human", "Human",
    "Human", "Human"
  ), `Human or Non Human` = c(
    "Human", "Human",
    "Human", "Human", "Human", "Human", "Human", "Human", "Human",
    "Human"
  ), `Language Type` = c(
    NA_character_, NA_character_,
    NA_character_, NA_character_, NA_character_, NA_character_,
    NA_character_, NA_character_, NA_character_, NA_character_
  )
)

ad <- data.frame(...1 = c(
  "53987712fdf99b68e3a45021", "545cee6dfdf99b7f9e3254ce",
  "5484739ffdf99b0379939c95", "5588ee6ffdf99b304dd48297", "558943fafdf99b5ccd435cb3",
  "5589c7cefdf99b18bd86cf31", "558a035bfdf99b2d75651378", "558a327cfdf99b2d75651681",
  "558bbd56fdf99b2127e1f359", "5591827dfdf99b4fccbdfb21"
), ...2 = c(
  NA,
  NA, NA, 1, NA, NA, 1, NA, NA, NA
), ...3 = c(
  NA, NA, NA, 0, NA,
  NA, 0, NA, NA, NA
), ...4 = c(
  NA, NA, NA, 1, NA, NA, 0, NA, NA,
  NA
), ...5 = c(NA, NA, NA, 1, NA, NA, 0, NA, NA, NA), ...6 = c(
  NA,
  NA, NA, 1, NA, NA, 1, NA, NA, NA
), ...7 = c(
  NA, NA, NA, 0, NA,
  NA, 0, NA, NA, NA
), ...8 = c(
  NA, NA, NA, 1, NA, NA, 0, NA, NA,
  NA
), ...9 = c(NA, NA, NA, 0, NA, NA, 0, NA, NA, NA), ...10 = c(
  NA,
  NA, NA, 0, NA, NA, 0, NA, NA, NA
))

dim(qd)
#> [1] 10  9
dim(ad)
#> [1] 10 10

qd[,10] <- NA

Created on 2023-08-11 with reprex v2.0.2
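As a cross-check on the error message itself, here is a minimal sketch of the arithmetic. It assumes the updated range A12:GO1287 holds 1,276 answer rows and 196 answer columns, and that the loop (which binds two copies when i == 1) ends up stacking 1,279 copies of Question_Data. The small `questions` table and `n_people_small` are made-up stand-ins, not the real data. The last step shows one possible way to repeat the question table once per person without a hard-coded loop count.

```r
# Row counts implied by the updated script (taken from the ranges in the post)
n_people    <- 1287 - 12 + 1   # rows in A12:GO1287
n_questions <- 196             # answer columns B:GO
n_copies    <- 1278 + 1        # loop runs 1278 times; i == 1 binds two copies

rows_long      <- n_people * n_questions   # rows after pivot_longer()
rows_questions <- n_copies * n_questions   # rows in Question_Data_2
c(rows_long, rows_questions)               # the two numbers in the error message

# Stand-in data (hypothetical): 4 questions answered by 3 people
questions <- data.frame(
  Question_ID = paste0("q", 1:4),
  Subtype     = c("Image", "Image", "Audio", "Video")
)
n_people_small <- 3

# Repeat the whole question table once per person, in order, without a loop
questions_rep <- questions[rep(seq_len(nrow(questions)), times = n_people_small), ]
nrow(questions_rep)
```

If the counts line up this way in the real workbook, sizing the repeat from nrow(Ans_Data) rather than a fixed 1278 would keep the two tables in step whenever the range changes.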

I'm guessing that your regular work involves programming in a C-like language, such as Python. Here are some R tips.

Unless you are working with more than about 7 table-like objects, keep the names short (qd for "question data").
Likewise with variables. Save the descriptive names for presentation table headers. For this, I'd do

colnames(qd) <- paste0("q",1:10)

Code binary data as TRUE/FALSE
Use as.factor() to convert categorical data
If you have dates from Excel, they will probably come over as character strings.
If you have an Excel that was prepared with presentation in mind, any $1,000.00 will come over as a character string.
Always check the import with str() to see other variables that may need adjustment.
R has a global environment and local environments. Local environments are important in functions. Any object that is not found inside a function will be looked for in the enclosing environment, which for functions defined at the top level is the global environment; any object that is defined in the local environment won't escape to the global environment unless it is explicitly returned.
The reticulate package provides an interface to Python if you are more familiar with that and are in a hurry.
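Several of the tips above can be sketched on a toy table. Everything here is hypothetical: the `raw` data frame, its column names, and the `add_shift()` function are invented for illustration, and the date column assumes the text arrives in yyyy-mm-dd form, which a real Excel import may not do.

```r
# Hypothetical import, shaped like something an Excel read might return
raw <- data.frame(
  id      = c("a1", "a2", "a3"),
  correct = c(1, 0, 1),                                   # binary answer coded 0/1
  subtype = c("Image", "Audio", "Image"),                 # categorical
  fee     = c("$1,000.00", "$250.50", "$75.00"),          # presentation-formatted
  seen_on = c("2023-02-15", "2023-02-16", "2023-02-17"),  # date that came over as text
  stringsAsFactors = FALSE
)

clean <- raw
clean$correct <- as.logical(clean$correct)                 # 1/0 -> TRUE/FALSE
clean$subtype <- as.factor(clean$subtype)                  # categorical -> factor
clean$fee     <- as.numeric(gsub("[$,]", "", clean$fee))   # strip "$" and ","
clean$seen_on <- as.Date(clean$seen_on)                    # assumes yyyy-mm-dd text

str(clean)  # always inspect the result of an import or conversion

# Scoping: `shift_by` is not defined inside the function, so R finds it in
# the global environment; `local_total` exists only while the function runs.
shift_by <- 10
add_shift <- function(x) {
  local_total <- x + shift_by
  local_total   # returned explicitly; the name itself does not escape
}
result <- add_shift(5)
exists("local_total")
```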

Finally, don't take for granted that this is a problem in logistic regression. I don't see any continuous variable.

DP398 · August 16, 2023, 4:10pm

Thanks @technocrat !

Once I fixed the column mismatch there were no errors! I did a quick audit to confirm it was pulling all the necessary data by having R calculate some averages, and can confirm they matched the manual calculation I did directly in the Excel dataset.

Actually, I've never coded before full stop - taking this project on after all the initial coding had been done by someone else has been my first foray into any kind of programming! So this has been quite the learning experience to say the least. I appreciate the R coding tips though as I suspect I'll be doing more of this for work in the future.
