I'm trying to follow this answer that join two tables with range: dplyr - Join tables based on multiple ranges in R - Stack Overflow
I want to join two tables where one has some ranges and another are numbers and I want to annotate the numbers by matching which range those numbers are in from another dataframe.
However the length of the table surprisingly increased. I expect it to stay the same as 2832. What's happening? How to trouble shoot this? The result doesn't change if I use fuzzy_left_join
> head(gene_list_selected)
chr start_pos end_pos gene_name
1 1 55013806 55100417 ACOT11
2 1 55074849 55089200 FAM151A
3 1 55107412 55175940 MROH7
4 1 55107412 55208328 MROH7-TTC4
5 1 55181494 55208328 TTC4
6 1 55222570 55230226 PARS2
> head(df)
rsid pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>%
+ fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`))
>
> nrow(df_with_gene_name)
[1] 3298
I want the end result to keep rsid, pos, from, start while having additional columns of one-hot coded binary table of gene_name.
> result
rsid, chr, pos, from, start, ACOT11, FAM151A, MROH7, MROH7-TTC4, TTC4, PARS2
1:55013860:C:T 1 55013860 55013806 55100417 1 0 0 0 0 0