You'll get better help by including a reproducible example, called a reprex.
The message `'select()' returned 1:many mapping between keys and columns` is normal. In fact, one of the examples in the documentation uses syntax identical to yours:
ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")
Both of the `toType` values there are returned by `keytypes(org.Hs.eg.db)`, as are yours.
The second message arises from the contents of your `data2` object. Confirm that you have created `data.df` by running `head(data.df)`:
> data.df <- bitr(data2, fromType="SYMBOL", toType=c("ENTREZID", "ENSEMBL"), OrgDb="org.Hs.eg.db")
'select()' returned 1:many mapping between keys and columns
> head(data.df)
  SYMBOL ENTREZID         ENSEMBL
1   GPX3     2878 ENSG00000211445
2   GLRX     2745 ENSG00000173221
3    LBP     3929 ENSG00000129988
4  CRYAB     1410 ENSG00000109846
5  DEFB1     1672 ENSG00000164825
6  DEFB1     1672 ENSG00000284881
I used the example from the documentation for `data2`:
> data2
[1] "GPX3" "GLRX" "LBP" "CRYAB" "DEFB1" "HCLS1" "SOD2" "HSPA2" "ORM1"
[10] "IGFBP1" "PTHLH" "GPC3" "IGFBP3" "TOB1" "MITF" "NDRG1" "NR1H4" "FGFR3"
[19] "PVR" "IL6" "PTPRM" "ERBB2" "NID2" "LAMB1" "COMP" "PLS3" "MCAM"
[28] "SPP1" "LAMC1" "COL4A2" "COL4A1" "MYOC" "ANXA4" "TFPI2" "CST6" "SLPI"
[37] "TIMP2" "CPM" "GGT1" "NNMT" "MAL" "EEF1A2" "HGD" "TCN2" "CDA"
[46] "PCCA" "CRYM" "PDXK" "STC1" "WARS" "HMOX1" "FXYD2" "RBP4" "SLC6A12"
[55] "KDELR3" "ITM2B"
In my version of `data2`, every input gene ID finds a mapping. The warning shows that 19.54% of yours fail to map.
I would have to know much more than I ever will to be able to guess whether this is due to the nature of the beast (the input IDs) or whether some of them are malformed, mis-transcribed, or simply outside the reference database.
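One way to see which inputs account for the failures is to compare the input vector with the `SYMBOL` column of the result. This is a minimal sketch in base R, assuming that unmapped IDs are simply absent from the returned data frame (which is my understanding of `bitr` with its default `drop = TRUE`). The names `input_ids` and `mapped` are toy stand-ins, not your data:

```r
# Toy stand-ins: `input_ids` plays the role of data2, `mapped` the role of data.df.
input_ids <- c("GPX3", "GLRX", "LBP", "gpx-3", "NOT_A_GENE")
mapped <- data.frame(
  SYMBOL   = c("GPX3", "GLRX", "LBP"),
  ENTREZID = c("2878", "2745", "3929"),
  stringsAsFactors = FALSE
)

# IDs present in the input but missing from the mapped result
unmapped <- setdiff(input_ids, mapped$SYMBOL)

# Percentage of distinct input IDs that failed to map
fail_pct <- 100 * length(unmapped) / length(unique(input_ids))

print(unmapped)
print(fail_pct)
```

With your own objects, the equivalent call would be `setdiff(data2, data.df$SYMBOL)`.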
In sum, there doesn't appear to be anything wrong with your code; the trouble springs from your data. In your position, I would take random samples of, say, 25% without replacement, give each one to your function in place of `data2`, and see how often you get the warning and whether the percentages vary. If you consistently find a fail-to-map rate of around 20%, you can be confident that the unmappable gene IDs are scattered throughout, and the challenge will be to identify them.
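The sampling idea can be sketched as follows. This is only an illustration: `fail_rate` and `toy_mapper` are hypothetical helpers, and the toy mapper stands in for your real `bitr` call so the snippet runs without any Bioconductor packages.

```r
# Hypothetical helper: fraction of distinct IDs that `map_fun` fails to map.
# `map_fun` must return a data frame with a SYMBOL column holding mapped IDs.
fail_rate <- function(ids, map_fun) {
  ids <- unique(ids)
  mapped <- map_fun(ids)
  mean(!ids %in% mapped$SYMBOL)
}

# Stand-in mapper for illustration only. In practice you would swap in
# something like:
#   function(ids) bitr(ids, fromType = "SYMBOL",
#                      toType = c("ENTREZID", "ENSEMBL"), OrgDb = "org.Hs.eg.db")
known <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2")
toy_mapper <- function(ids) {
  data.frame(SYMBOL = ids[ids %in% known], stringsAsFactors = FALSE)
}

# Toy input: 8 mappable IDs plus 2 made-up ones that cannot map
toy_ids <- c(known, "NOT_A_GENE_1", "NOT_A_GENE_2")

set.seed(42)                             # make the samples repeatable
n <- ceiling(0.25 * length(toy_ids))     # 25% sample size
# sample() draws without replacement by default
rates <- replicate(6, fail_rate(sample(toy_ids, n), toy_mapper))
print(rates)
```

With a vector this small the rates will jump around a lot; on a realistic gene list, roughly stable rates across samples would suggest that the problem IDs are spread evenly through the input.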
Let's say you take six samples, `a`, `b`, `c`, `d`, `e`, `f`, and the warnings report failure rates such as 18.04, 19.25, 18.97, 19.01 and 21.2. Run `setdiff` on each pair to find the IDs that differ between samples, `a'`, `b'`, `c'`, `d'`, `e'`, `f'`, then run those through the function and note the differences in results. Proceeding that way will help you narrow down the possible offenders; subset them out of `data2` and repeat. Eventually you can build a list of known problematic gene IDs to be either censored or corrected. If a failure rate like this is expected for the type of gene set you're working with, consult the documentation of any functions you plan to use for modeling to see whether parameters need tuning.
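The pairwise comparison could look something like this in base R. The sample contents below are made up for illustration, and three samples are used instead of six to keep the output short. Note that `setdiff(x, y)` is asymmetric: it returns the elements of `x` that are not in `y`.

```r
# Toy samples for illustration only (not real draws from data2)
samples <- list(
  a = c("GPX3", "GLRX", "BAD1"),
  b = c("GLRX", "LBP",  "BAD2"),
  c = c("GPX3", "LBP",  "BAD1")
)

# All unordered pairs of sample names: a-b, a-c, b-c
pairs <- combn(names(samples), 2, simplify = FALSE)

# For each pair, the IDs in the first sample that are absent from the second
diffs <- lapply(pairs, function(p) setdiff(samples[[p[1]]], samples[[p[2]]]))
names(diffs) <- vapply(pairs, paste, character(1), collapse = "_minus_")

print(diffs)
```

Each element of `diffs` can then be passed to your mapping function in place of `data2` to see which small sets still trigger the warning.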