Is there a currently recommended Python data science stack? Most importantly, for local dataframe wrangling, does Posit recommend pandas + siuba, polars, or perhaps something else? And what about plotting - is plotnine the preferred option? I'm asking because some Python developers (e.g., Wes) are now working for Posit, which might indicate a natural preference for certain packages over others.
Hello! Thank you for reaching out. Let us noodle on this for a bit.
For now, take a look at Emily Riederer's "Python Rgonomics" talk at posit::conf for some recommendations! https://www.youtube.com/watch?v=ILxK92HDtvU
Cool, thanks!
For dataframe wrangling, I think the fundamental selling point of dplyr is its SQL translation: I have personally used it on several occasions with BigQuery and Athena, and it worked beautifully. Python's siuba is supposed to mimic the same workflow with an almost identical syntax. Strangely enough, the repository seems dead - has Posit changed its mind?
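To illustrate, here's a minimal sketch of that workflow in siuba against an in-memory SQLite table (the table and column names are made up, and given the repo's state this may need pinned siuba/SQLAlchemy versions):

```python
# Hedged sketch: dplyr-style piping over a SQL backend with siuba.
# "cars", "cyl" and "mpg" are hypothetical names used for illustration.
import pandas as pd
from sqlalchemy import create_engine
from siuba import _, group_by, summarize, show_query
from siuba.sql import LazyTbl

engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"cyl": [4, 6, 8], "mpg": [30.0, 20.0, 15.0]}).to_sql(
    "cars", engine, index=False
)

cars = LazyTbl(engine, "cars")  # lazy SQL table, like dbplyr's tbl()
query = cars >> group_by(_.cyl) >> summarize(avg_mpg=_.mpg.mean())
query >> show_query()  # prints the generated SQL, like dbplyr::show_query()
```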
Re project managers: the video mentions pdm, but I think a more modern alternative is uv, as it is faster, avoids the bootstrapping problem (being written in Rust), and manages the installation of Python itself.
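For anyone who hasn't tried it, a hedged sketch of the typical uv workflow (project and script names are made up; commands as documented upstream):

```sh
uv python install 3.12     # uv installs and manages the interpreter itself
uv init myproject          # scaffold a project (pyproject.toml etc.)
cd myproject
uv add pandas polars       # declare dependencies and sync the environment
uv run python analysis.py  # run a script inside the managed environment
```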
I'd love it if someone could compile a table that maps R's packages & functions to their Python alternatives - a really modern one. Maybe user @x1o should do it.
I compiled this table - not sure how useful it is. I guess the right path would be to learn:
- The standard library
- NumPy
- SciPy, in particular scipy.stats
- pandas / polars
- statsmodels
- scikit-learn
- uv
and then other packages depending on the target domain.
R | Python |
---|---|
base & utils | |
Vector algebra (%*%, matrix(), solve(), ...) | numpy
Descriptive statistics (mean(), sd(), ...) | numpy, scipy.stats, statistics
System / OS (Sys.*, sessionInfo(), ...) | os, sys, subprocess, platform
Debugging (browser(), debug(), ...) | breakpoint(), pdb.runcall, pdb.set_trace
stats | |
Probability distributions (d*(), p*(), ...) | scipy.stats
RNG (r*(), sample(), ...) | random, numpy.random, scipy.stats
Linear / Generalised models (lm(), anova(), ...) | statsmodels, sklearn.linear_model
Statistical tests (t.test(), wilcox.test(), ...) | statsmodels.stats
Time series (arima(), acf(), ...) | statsmodels.tsa
Smoothing / interpolation (smooth(), spline(), approx(), ...) | scipy.interpolate, statsmodels.nonparametric
Clustering (kmeans(), hclust(), ...) | scipy.cluster, sklearn.cluster
Optimisation (optim(), nlm(), ...) | scipy.optimize
Spectral analysis (fft(), filter(), ...) | numpy.fft, scipy.signal
Density estimation (density(), ...) | scipy.stats.gaussian_kde, statsmodels.nonparametric
Contingency tables (xtabs(), ...) | pandas.crosstab, statsmodels.stats.contingency_tables
Power analysis (power.t.test(), ...) | statsmodels.stats.power
Factor analysis (princomp(), factanal(), ...) | sklearn.decomposition, factor_analyzer
tidyverse | |
cli | rich, click, tqdm |
dplyr, tidyr | pandas, polars, siuba, ibis |
dbplyr | siuba, sqlalchemy, ibis |
forcats | pandas.Categorical, enum |
ggplot2 | plotnine, seaborn |
httr | requests, httpx, urllib |
jsonlite | json, ujson, orjson, pandas |
lubridate | pandas, pendulum |
modelr | scikit-learn |
purrr | itertools, functools, map(), zip(), ... |
readr | pandas, polars |
rvest & xml2 | bs4, xml, lxml, requests, scrapy |
stringr | str, re |
tidymodels | |
tidymodels | scikit-learn, pycaret |
Other | |
arrow | pyarrow |
data.table | dask |
furrr | multiprocessing, joblib, ray, dask, concurrent, asyncio |
slider | pandas.rolling, numpy.lib.stride_tricks, bottleneck, pandas-ta |
testthat | pytest, unittest |
devtools | uv, pip, importlib |
roxygen | sphinx, mkdocs |
plumber | fastapi, flask |
targets | dagster, airflow, prefect, kedro, mlflow, wandb, dvc |
fs | pathlib |
logger | loguru, logging |
zoo / xts / tsibble | pandas |
forecast | Nixtla, darts |
tidyquant | quantlib |
TTR, quantmod | pandas-ta |
quantstrat | zipline |
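To make the dplyr/tidyr row concrete, here's a minimal sketch of one and the same group-and-summarise in pandas and polars (toy data; column names echo mtcars):

```python
# dplyr: mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
import pandas as pd
import polars as pl

data = {"cyl": [4, 4, 6, 8], "mpg": [33.9, 26.0, 21.0, 15.2]}

# pandas: method chaining on a DataFrame
pd_out = (
    pd.DataFrame(data)
    .groupby("cyl", as_index=False)["mpg"]
    .mean()
    .rename(columns={"mpg": "avg_mpg"})
)

# polars: expression API (group_by replaced groupby around 0.19)
pl_out = (
    pl.DataFrame(data)
    .group_by("cyl")
    .agg(pl.col("mpg").mean().alias("avg_mpg"))
)

print(pd_out)
print(pl_out)
```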
In their Academy courses, Posit teaches pandas, plotnine, and statsmodels (I'm an Academy mentor - but I am not speaking on behalf of Posit here).
That said, I think there is widespread agreement among Academy mentors that polars is better (though redesigning a course takes time).
Personally I have a preference for seaborn (for exploratory visualizations).
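For comparison, a minimal sketch of the same exploratory scatter in plotnine and seaborn (uses seaborn's bundled "tips" demo dataset, fetched on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import ggplot, aes, geom_point

tips = sns.load_dataset("tips")  # small demo dataset shipped with seaborn

# plotnine: ggplot2's grammar of graphics, almost verbatim
p = ggplot(tips, aes(x="total_bill", y="tip", color="time")) + geom_point()
p.save("tips_plotnine.png")

# seaborn: the same plot, matplotlib-flavoured
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.savefig("tips_seaborn.png")
```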
And I would add scikit-learn for supervised and unsupervised learning, and PyTorch for AI. Though that's the research side; I read somewhere that TensorFlow is more popular with developers (?).
If you are also doing mapping and/or spatial analysis, geopandas.
I would also add that R's data.table maps to Python's dask, for working with large CSV (non-parquet) datasets.
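A minimal sketch of that out-of-core pattern (the glob and column names are placeholders for your own files):

```python
# Hedged sketch: larger-than-memory CSV wrangling with dask.
# "data/*.csv", "key" and "value" are hypothetical placeholders.
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")  # lazy, partitioned read across many files
out = df.groupby("key")["value"].mean().compute()  # executes in parallel here
print(out)
```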
Thank you for your input!
Re polars & pandas: I personally like polars' syntax more, and it used to be faster than pandas. However, with recent improvements pandas isn't that much slower (I only have anecdotal evidence to support this...), and the syntax argument doesn't seem decisive:
- There's so much pandas code around that being able to read the pandas API is valuable in itself.
- The strongest selling point of dplyr isn't the elegance of its syntax - which polars mimics to some extent - but the ability to translate it into SQL and run it virtually anywhere. polars, IIRC, only offers a local number-crunching backend.
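For what it's worth, ibis (already in the table above) is one Python option that keeps that translate-to-SQL workflow; a minimal sketch, assuming a recent ibis with the DuckDB backend installed (table and column names are made up):

```python
# Hedged sketch: dataframe-style code compiled to SQL with ibis.
import ibis

cars = ibis.table({"cyl": "int64", "mpg": "float64"}, name="cars")
expr = cars.group_by("cyl").aggregate(avg_mpg=cars.mpg.mean())
print(ibis.to_sql(expr, dialect="duckdb"))  # SQL that would run on the backend
```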