bind_rows() changes some column names

mcneills · September 1, 2022, 3:59am

I have some data frames that I want to bind row-wise together. The individual data frames have come from Excel and, because the Excel files contain several columns with the same column name the data when read by R has column names renamed to be unique. This is expected. Rather than provide the Excel files, I can generate the data as follows:

df1 <- data.frame(`A...1`=rnorm(5),
                  `A...2`=rnorm(5),
                  B=rnorm(5))
df2 <- data.frame(`A...1`=rnorm(5),
                  B=rnorm(5))

Thus, the data from the data frames looks like the following:

> df1
       A...1      A...2          B
1 -0.2520014  1.6274111  0.6183493
2  0.5152256 -0.9451730  1.1118546
3  0.2925951  1.1416916  0.3140801
4  0.6923316  1.5487287 -1.0121066
5 -1.2140627 -0.3060724  0.3947781
> df2
       A...1          B
1 -0.1887456 -0.9406678
2 -1.0122872 -0.8838331
3  0.2787877 -1.4449848
4 -0.2589786 -0.1398463
5 -1.1378909  0.2089307
>

Now I would like to combine these two data frames row-wise, as shown below. The resultant combined data frame is a bit strange:

> library("tidyverse")
> df <- bind_rows(df1,df2)
New names:
* A...1 -> A
> df
        A...1      A...2          B          A
1  -0.2520014  1.6274111  0.6183493         NA
2   0.5152256 -0.9451730  1.1118546         NA
3   0.2925951  1.1416916  0.3140801         NA
4   0.6923316  1.5487287 -1.0121066         NA
5  -1.2140627 -0.3060724  0.3947781         NA
6          NA         NA -0.9406678 -0.1887456
7          NA         NA -0.8838331 -1.0122872
8          NA         NA -1.4449848  0.2787877
9          NA         NA -0.1398463 -0.2589786
10         NA         NA  0.2089307 -1.1378909
>

The thing that is odd is that bind_rows() renames the column in the second data frame named "A...1" to "A", which then means the combined data frame doesn't have the correct data in the correct column.

It seems that bind_rows() is checking the column names for the second data frame (but not the first). In bind_cols() there is an option to determine the level of checking of the column names. In bind_rows() it seems as if the following is being done:

> vctrs::vec_as_names(c("A...1","B"),repair="unique")
New names:
* A...1 -> A
[1] "A" "B"
>

Changing the repair option is not possible in bind_rows() as there is no similar option as there is for bind_cols().

So, two questions:

Should I have done the row-wise binding of the data frames in a different way? I don't have control over the original Excel files, so the names might be manipulated on reading.
Is this a bug, or at least a shortcoming, of bind_rows()?

Stephen

FactOREO · September 1, 2022, 5:05am

Hello,

Iam not sure how to handle is problem within bind_rows, but for the sake of a quick solution - you might try data.table::rbindlist(). You can choose to bind by name and fill nonmatching columns with NA. This should work even if your names are those "strange" A...1 and so on.

An other option would be to create distinct and valid names beforehand, to not run into this issue.

Good luck to find a solution with bind_rows, if I have time I will try to find a solution for this as well.

Kind regards

mcneills · September 1, 2022, 9:36am

Thank you, the data.table solution is a good one! I have found work-around in the meantime, but it is not pretty.

Stephen

system · September 22, 2022, 9:37am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.