how to bind two tables with unequal rows

xyz0o · August 20, 2024, 4:28pm

I want to bind two tables based on column 1 and 2. however there are rows that are missing. I wan all bins to be included in the merged table.
I tried:

combined <- full_join(TE_eu, gene_eu, by = c("chromosome", "start_bin"))

and noticed that it only includes 'start_bin' column from the first data frame, and the bins that exist in table 2 are missing e.g. 30000, 40000

I tried using bind_rows, but it doesn't quite fit with my purpose as it basically just adds the two table in one file and does not merge the rows.

Column 1	Column 2	Column 3
chromosome	start_bin	TE_density
2L_chromosome	0	0.9
2L_chromosome	5000	0.8
2L_chromosome	10000	0.5
2L_chromosome	15000	0.2
2L_chromosome	20000	0.1
2L_chromosome	25000	0.08
2L_chromosome	35000	0.02

Column 1	Column 2	Column 3
chromosome	start_bin	gene_density
2L_chromosome	10000	0.09
2L_chromosome	15000	1.00
2L_chromosome	20000	0.1
2L_chromosome	25000	0.02
2L_chromosome	30000	0.8
2L_chromosome	40000	0.9

my desired result:

Column 1	Column 2	Column 3	Column 4
chromosome	start_bin	TE_density	gene_density
2L_chromosome	0	0.9	0.00
2L_chromosome	5000	0.8	0.00
2L_chromosome	10000	0.5	0.09
2L_chromosome	15000	0.2	1.00
2L_chromosome	20000	0.1	0.1
2L_chromosome	25000	0.08	0.02
2L_chromosome	30000	0.00	0.8
2L_chromosome	35000	0.02	0.00
2L_chromosome	40000	0.00	0.9

GeraldineK · August 20, 2024, 5:47pm

Can you post the output that you get for combined with the current line of code?
Because for me this line works perfectly on these 2 tables (assuming they are regular dataframes?).
However, instead of zeros, you will get NAs where there are no data for that row/column, so you will have to replace them with zeros. Be careful with that though, as there is a different meaning behind those (NA = not measured/do not have that info, 0 = it was measured and the actual result was 0).

xyz0o · August 21, 2024, 7:22am

sure, this is the output that I get

bin 35000 and 40000 are missing from it, while they exist in gene_eu file (sorry I am banned from posting a second image showing my gene_eu data file)

AlexisW · August 23, 2024, 5:13pm

Aren't you doing something different from your original question? When I just copy-paste your example data, it seems to work:

library(tidyverse)

tab1 <- "chromosome     start_bin   TE_density
2L_chromosome   0   0.9
2L_chromosome   5000    0.8
2L_chromosome   10000   0.5
2L_chromosome   15000   0.2
2L_chromosome   20000   0.1
2L_chromosome   25000   0.08
2L_chromosome   35000   0.02" |>
  read_tsv(show_col_types = FALSE)


tab2 <- "chromosome     start_bin   gene_density
2L_chromosome   10000   0.09
2L_chromosome   15000   1.00
2L_chromosome   20000   0.1
2L_chromosome   25000   0.02
2L_chromosome   30000   0.8
2L_chromosome   40000   0.9" |>
  read_tsv(show_col_types = FALSE)

tab_res <- "chromosome  start_bin   TE_density  gene_density
2L_chromosome   0   0.9     0.00
2L_chromosome   5000    0.8     0.00
2L_chromosome   10000   0.5     0.09
2L_chromosome   15000   0.2     1.00
2L_chromosome   20000   0.1     0.1
2L_chromosome   25000   0.08    0.02
2L_chromosome   30000   0.00    0.8
2L_chromosome   35000   0.02    0.00
2L_chromosome   40000   0.00    0.9" |>
  read_tsv(show_col_types = FALSE)


joint <- full_join(tab1, tab2) |>
  mutate(across(ends_with("density"), \(x) replace_na(x, 0))) |>
  arrange(chromosome, start_bin)
#> Joining with `by = join_by(chromosome, start_bin)`


all.equal(joint,
          tab_res |> as_tibble())
#> [1] TRUE

^{Created on 2024-08-23 with reprex v2.1.0}

xyz0o · August 26, 2024, 8:13am

I am trying to use your code, in which part of the library is replace_na?

Joining with by = join_by(chromosome, start_bin)
Error in mutate():
In argument: across(ends_with("density"), function(x) replace_na(x, 0)).
Caused by error in across():
! Can't compute column TE_density.
Caused by error in replace_na():
! could not find function "replace_na"
Run rlang::last_trace() to see where the error occurred.

AlexisW · August 27, 2024, 9:57pm

It's tidyr::replace_na(), part of {tidyverse}. But you can also do redefine it yourself if needed:

my_replace_na <- function(x, replace){
  x[is.na(x)] <- replace
  x
}

system · November 25, 2024, 9:58pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.