Performance difference, python vs reticulate

mfoos · July 6, 2019, 1:13am

I'm preparing a talk about R speeds and I'm including a section benchmarking a handful of methods/packages purely on the task of reading in a 1.6GB tab-delimited file. I thought it would be interesting to include using reticulate to read the file with Python's pandas package.

Out of curiosity, I also ran it straight from Python. I wasn't surprised that the timings differed, but I was surprised that reticulate (~50 seconds, using microbenchmark, average of 5 passes) was so much slower than Python (~35 seconds, using timeit.timeit, average of 5 passes). Could anyone explain why the overhead is so high? I'm curious myself, and I'm also anticipating questions from the audience. I used Python 3.6.8 for both, with reticulate 1.12. Thanks!

Reticulate version

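library(reticulate)
library(microbenchmark)

use_python("/anaconda3/bin/python")

# setup runs untimed, so the pandas import is excluded from the measurement
reticulate_bench <- microbenchmark(reticulate_tab <- pd$read_table(filepath_or_buffer = "Brain_Amygdala.truncated.txt", sep = "\t"),
                                   times = 5,
                                   setup = pd <- import("pandas"))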

Python REPL version

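import timeit
import pandas

# timeit.timeit returns the total time for all 5 runs; divide by 5 for the average
pandas_bench = timeit.timeit('pandas_tab = pandas.read_table("Brain_Amygdala.truncated.txt", sep="\t")', number=5, setup='import pandas')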
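
To make the Python side easier to compare with microbenchmark's per-pass timings, here is a minimal sketch of an averaging helper. The name mean_seconds_per_pass and its keep_gc parameter are hypothetical and not part of my original benchmark. It relies on two documented timeit behaviours: timeit.repeat returns one timing per repeat, and timeit turns garbage collection off while timing unless the setup step re-enables it.

```python
import gc
import statistics
import timeit


def mean_seconds_per_pass(func, passes=5, keep_gc=True):
    """Return the mean wall-clock seconds of one call to func over `passes` runs."""
    # timeit disables garbage collection while timing by default;
    # running gc.enable as the setup step turns it back on for each pass.
    setup = gc.enable if keep_gc else "pass"
    # repeat=passes with number=1 yields one timing per pass,
    # rather than a single total for all passes.
    timings = timeit.repeat(func, setup=setup, repeat=passes, number=1)
    return statistics.mean(timings)


# Hypothetical usage with the file from this post (requires pandas):
#   import pandas
#   mean_seconds_per_pass(
#       lambda: pandas.read_table("Brain_Amygdala.truncated.txt", sep="\t"))
```

This only puts the two measurements on a like-for-like footing. It does not by itself explain the gap between the two timings.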
