failed to Biological Id Translator

technocrat · December 26, 2018, 7:09am

You'll get better help by including a reproducible example, called a reprex.

The message 'select()' returned 1:many mapping between keys and columns is normal. In fact, one of the examples in the documentation uses syntax identical to yours:

ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")

(Both of the toType values there are returned by keytypes(org.Hs.eg.db), as are yours.)

The second message arises from the contents of your data2 object. Confirm that you have created data.df with

head(data.df)
> data.df <- bitr(data2, fromType="SYMBOL", toType=c("ENTREZID", "ENSEMBL"), OrgDb="org.Hs.eg.db")
'select()' returned 1:many mapping between keys and columns
> head(data.df)
  SYMBOL ENTREZID         ENSEMBL
1   GPX3     2878 ENSG00000211445
2   GLRX     2745 ENSG00000173221
3    LBP     3929 ENSG00000129988
4  CRYAB     1410 ENSG00000109846
5  DEFB1     1672 ENSG00000164825
6  DEFB1     1672 ENSG00000284881
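To see why the 1:many message is harmless, it can help to look at which symbols occupy more than one row. This is only a sketch in base R: the data frame below is a hand-typed copy of the six rows printed above, standing in for the real bitr() result.

```r
# Hand-typed copy of the head(data.df) output above; a stand-in for the
# real result returned by bitr().
data.df <- data.frame(
  SYMBOL   = c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "DEFB1"),
  ENTREZID = c("2878", "2745", "3929", "1410", "1672", "1672"),
  ENSEMBL  = c("ENSG00000211445", "ENSG00000173221", "ENSG00000129988",
               "ENSG00000109846", "ENSG00000164825", "ENSG00000284881"),
  stringsAsFactors = FALSE
)

# Count rows per input symbol; a count above 1 means that symbol mapped
# to more than one target ID (a 1:many mapping).
hits  <- table(data.df$SYMBOL)
multi <- names(hits[hits > 1])

print(multi)
print(data.df[data.df$SYMBOL %in% multi, ])
```

Here DEFB1 shows up twice because it has two ENSEMBL IDs, which is all the message is reporting.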

For data2, I used the gene list from the documentation example:

data2
[1] "GPX3"    "GLRX"    "LBP"     "CRYAB"   "DEFB1"   "HCLS1"   "SOD2"    "HSPA2"   "ORM1"   
[10] "IGFBP1"  "PTHLH"   "GPC3"    "IGFBP3"  "TOB1"    "MITF"    "NDRG1"   "NR1H4"   "FGFR3"  
[19] "PVR"     "IL6"     "PTPRM"   "ERBB2"   "NID2"    "LAMB1"   "COMP"    "PLS3"    "MCAM"   
[28] "SPP1"    "LAMC1"   "COL4A2"  "COL4A1"  "MYOC"    "ANXA4"   "TFPI2"   "CST6"    "SLPI"   
[37] "TIMP2"   "CPM"     "GGT1"    "NNMT"    "MAL"     "EEF1A2"  "HGD"     "TCN2"    "CDA"    
[46] "PCCA"    "CRYM"    "PDXK"    "STC1"    "WARS"    "HMOX1"   "FXYD2"   "RBP4"    "SLC6A12"
[55] "KDELR3"  "ITM2B"

In my version of data2, none of the input geneIDs fails to find a mapping. The warning shows that 19.54% of yours fail.
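If it helps, the IDs that failed can also be listed directly by comparing the input vector with the first column of the result. A minimal sketch, assuming the result has a SYMBOL column as in the output above; the two NOT_A_GENE entries are made-up placeholders, and the small data frame stands in for what bitr() would return.

```r
# Hypothetical input: three real symbols plus two made-up IDs that no
# database would recognise.
data2 <- c("GPX3", "GLRX", "LBP", "NOT_A_GENE_1", "NOT_A_GENE_2")

# Stand-in for the bitr() result: only the IDs that mapped come back.
data.df <- data.frame(
  SYMBOL   = c("GPX3", "GLRX", "LBP"),
  ENTREZID = c("2878", "2745", "3929"),
  stringsAsFactors = FALSE
)

# Inputs that never appear in the result are the ones that failed to map.
unmapped   <- setdiff(data2, data.df$SYMBOL)
pct_failed <- 100 * length(unmapped) / length(unique(data2))

print(unmapped)
print(pct_failed)
```

On your own data, the percentage computed this way should be roughly comparable to the one in the warning, though I have not checked exactly how the warning calculates it.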

I would have to know much more than I ever will to be able to guess whether this is due to the nature of the beast (the inputIDs) or whether some of them may be malformed, mis-transcribed or simply outside of the reference database.

In sum, there doesn't appear to be anything wrong with your code; the trouble springs from your data. In your position, I would take random samples of, say, 25% without replacement, give each one to your function in place of data2, and see how often you get the warning and whether the percentages vary. If you consistently find a failure-to-map rate of around 20%, you can be confident that the problem gene IDs are scattered throughout, and the challenge will be to identify them.
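The sampling idea might look something like this. It is a sketch only: all_ids and known_ids are invented placeholders, and fail_rate() is a hypothetical helper that mimics the failure percentage with a plain lookup. With real data you would call bitr() on each sample instead.

```r
set.seed(1)  # only so the sketch gives the same draws each time

# Invented placeholders: 100 IDs, of which the last 20 are "unknown".
all_ids   <- sprintf("GENE%03d", 1:100)
known_ids <- all_ids[1:80]   # plays the role of the annotation database

# Hypothetical helper: percentage of ids not found in the reference.
fail_rate <- function(ids, reference) {
  100 * mean(!ids %in% reference)
}

# Six samples of 25% of the input, each drawn without replacement.
n_draw  <- ceiling(0.25 * length(all_ids))
samples <- replicate(6, sample(all_ids, n_draw, replace = FALSE),
                     simplify = FALSE)
names(samples) <- letters[1:6]

# One failure percentage per sample.
rates <- vapply(samples, fail_rate, numeric(1), reference = known_ids)
print(round(rates, 2))
```

If the rates hover around the overall figure, the failing IDs are spread through the list rather than clustered in one part of it.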

Let's say you take six samples, a, b, c, d, e, f, that produce percentages like 18.04, 19.25, 18.97, 19.01 and 21.2 in the warnings.

Run setdiff on each pair of samples to find the IDs that differ between them, a', b', c', d', e', f', then run those through the function and note the differences in results. Proceeding that way will help you narrow down the possible offenders. Subset them out of data2 and repeat, until you have built a list of known problematic gene IDs to either be censored or corrected. If a failure rate like this is an expected result for the type of gene set you're working with, consult the documentation of any functions you plan to use for modeling to see whether parameter tuning is needed.
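For the pairwise comparison, something along these lines may be a starting point. The three sample vectors are invented for illustration (the BAD_ID entries are placeholders), and the naming of the results is my own convention.

```r
# Invented samples; BAD_ID_* are placeholders for IDs that fail to map.
samples <- list(
  a = c("GPX3", "GLRX", "BAD_ID_1", "LBP"),
  b = c("GLRX", "LBP", "CRYAB", "BAD_ID_2"),
  c = c("GPX3", "CRYAB", "DEFB1", "BAD_ID_1")
)

# Every unordered pair of sample names: a-b, a-c, b-c.
pairs <- combn(names(samples), 2, simplify = FALSE)

# For each pair, the IDs present in the first sample but not the second.
# Swap the arguments to setdiff() to get the other direction.
diffs <- lapply(pairs, function(p) setdiff(samples[[p[1]]], samples[[p[2]]]))
names(diffs) <- vapply(pairs, paste, character(1), collapse = "-")

print(diffs)
```

Each element of diffs is a smaller candidate set that can be passed to the function in place of data2.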