how to bind two tables with unequal rows

I want to bind two tables based on column 1 and 2. however there are rows that are missing. I wan all bins to be included in the merged table.
I tried:

combined <- full_join(TE_eu, gene_eu, by = c("chromosome", "start_bin"))

and noticed that it only includes 'start_bin' column from the first data frame, and the bins that exist in table 2 are missing e.g. 30000, 40000

I tried using bind_rows, but it doesn't quite fit with my purpose as it basically just adds the two table in one file and does not merge the rows.

Column 1 Column 2 Column 3
chromosome start_bin TE_density
2L_chromosome 0 0.9
2L_chromosome 5000 0.8
2L_chromosome 10000 0.5
2L_chromosome 15000 0.2
2L_chromosome 20000 0.1
2L_chromosome 25000 0.08
2L_chromosome 35000 0.02
Column 1 Column 2 Column 3
chromosome start_bin gene_density
2L_chromosome 10000 0.09
2L_chromosome 15000 1.00
2L_chromosome 20000 0.1
2L_chromosome 25000 0.02
2L_chromosome 30000 0.8
2L_chromosome 40000 0.9

my desired result:

Column 1 Column 2 Column 3 Column 4
chromosome start_bin TE_density gene_density
2L_chromosome 0 0.9 0.00
2L_chromosome 5000 0.8 0.00
2L_chromosome 10000 0.5 0.09
2L_chromosome 15000 0.2 1.00
2L_chromosome 20000 0.1 0.1
2L_chromosome 25000 0.08 0.02
2L_chromosome 30000 0.00 0.8
2L_chromosome 35000 0.02 0.00
2L_chromosome 40000 0.00 0.9

Can you post the output that you get for combined with the current line of code?
Because for me this line works perfectly on these 2 tables (assuming they are regular dataframes?).
However, instead of zeros, you will get NAs where there are no data for that row/column, so you will have to replace them with zeros. Be careful with that though, as there is a different meaning behind those (NA = not measured/do not have that info, 0 = it was measured and the actual result was 0).

1 Like

sure, this is the output that I get
image

bin 35000 and 40000 are missing from it, while they exist in gene_eu file (sorry I am banned from posting a second image showing my gene_eu data file)

Aren't you doing something different from your original question? When I just copy-paste your example data, it seems to work:

library(tidyverse)

tab1 <- "chromosome     start_bin   TE_density
2L_chromosome   0   0.9
2L_chromosome   5000    0.8
2L_chromosome   10000   0.5
2L_chromosome   15000   0.2
2L_chromosome   20000   0.1
2L_chromosome   25000   0.08
2L_chromosome   35000   0.02" |>
  read_tsv(show_col_types = FALSE)


tab2 <- "chromosome     start_bin   gene_density
2L_chromosome   10000   0.09
2L_chromosome   15000   1.00
2L_chromosome   20000   0.1
2L_chromosome   25000   0.02
2L_chromosome   30000   0.8
2L_chromosome   40000   0.9" |>
  read_tsv(show_col_types = FALSE)

tab_res <- "chromosome  start_bin   TE_density  gene_density
2L_chromosome   0   0.9     0.00
2L_chromosome   5000    0.8     0.00
2L_chromosome   10000   0.5     0.09
2L_chromosome   15000   0.2     1.00
2L_chromosome   20000   0.1     0.1
2L_chromosome   25000   0.08    0.02
2L_chromosome   30000   0.00    0.8
2L_chromosome   35000   0.02    0.00
2L_chromosome   40000   0.00    0.9" |>
  read_tsv(show_col_types = FALSE)


joint <- full_join(tab1, tab2) |>
  mutate(across(ends_with("density"), \(x) replace_na(x, 0))) |>
  arrange(chromosome, start_bin)
#> Joining with `by = join_by(chromosome, start_bin)`


all.equal(joint,
          tab_res |> as_tibble())
#> [1] TRUE

Created on 2024-08-23 with reprex v2.1.0

I am trying to use your code, in which part of the library is replace_na?

Joining with by = join_by(chromosome, start_bin)
Error in mutate():
:information_source: In argument: across(ends_with("density"), function(x) replace_na(x, 0)).
Caused by error in across():
! Can't compute column TE_density.
Caused by error in replace_na():
! could not find function "replace_na"
Run rlang::last_trace() to see where the error occurred.

It's tidyr::replace_na(), part of {tidyverse}. But you can also do redefine it yourself if needed:

my_replace_na <- function(x, replace){
  x[is.na(x)] <- replace
  x
}