Is there a currently recommended Python data science stack? Most importantly, for local dataframe wrangling, does Posit recommend pandas + siuba, polars, or perhaps something else? And what about plotting - is plotnine the preferred option? I'm asking because some Python developers (e.g., Wes) are now working for Posit, which might indicate a natural preference for certain packages over others.
Hello! Thank you for reaching out. Let us noodle on this for a bit.
For now, take a look at Emily Riederer's "Python Rgonomics" talk at posit::conf for some recommendations! https://www.youtube.com/watch?v=ILxK92HDtvU
Cool, thanks!
For dataframe wrangling, I think the fundamental selling point of dplyr is its SQL translation: I have personally used it on several occasions with BigQuery and Athena, and it worked beautifully. Python's siuba is supposed to mimic the same workflow with an almost identical syntax. Strangely enough, the repository seems dead - has Posit changed its mind?
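To illustrate, here's a minimal sketch of that workflow in siuba against an in-memory SQLite table (the table and column names are made up, and given the repo's state this may need pinned siuba/SQLAlchemy versions):

```python
# Hedged sketch: dplyr-style piping over a SQL backend with siuba.
# "cars", "cyl" and "mpg" are hypothetical names used for illustration.
import pandas as pd
from sqlalchemy import create_engine
from siuba import _, group_by, summarize, show_query
from siuba.sql import LazyTbl

engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"cyl": [4, 6, 8], "mpg": [30.0, 20.0, 15.0]}).to_sql(
    "cars", engine, index=False
)

cars = LazyTbl(engine, "cars")  # lazy SQL table, like dbplyr's tbl()
query = cars >> group_by(_.cyl) >> summarize(avg_mpg=_.mpg.mean())
query >> show_query()  # prints the generated SQL, like dbplyr::show_query()
```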
Re project managers: the video mentions pdm, but I think a more modern alternative is uv, as it is faster, avoids the bootstrapping problem (being written in Rust), and manages the installation of Python itself.
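For anyone who hasn't tried it, a hedged sketch of the typical uv workflow (project and script names are made up; commands as documented upstream):

```sh
uv python install 3.12     # uv installs and manages the interpreter itself
uv init myproject          # scaffold a project (pyproject.toml etc.)
cd myproject
uv add pandas polars       # declare dependencies and sync the environment
uv run python analysis.py  # run a script inside the managed environment
```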
I'd love it if someone could compile a table that maps R's packages & functions to their Python alternatives - a really modern one. Maybe user @x1o should do it.
I compiled this table - not sure how useful it is. I guess the right path would be to learn:
- The standard library
- NumPy
- SciPy, in particular scipy.stats
- pandas / polars
- statsmodels
- scikit-learn
- uv
and then other packages depending on the target domain.
R | Python |
---|---|
base & utils | |
Vector algebra (%*%, matrix(), solve(), ...) | numpy
Descriptive statistics (mean(), sd(), ...) | numpy, scipy.stats, statistics
System / OS (Sys.*, sessionInfo(), ...) | os, sys, subprocess, platform
Debugging (browser(), debug(), ...) | breakpoint(), pdb.runcall, pdb.set_trace
stats | |
Probability distributions (d*(), p*(), ...) | scipy.stats
RNG (r*(), sample(), ...) | random, numpy.random, scipy.stats
Linear / Generalised models (lm(), anova(), ...) | statsmodels, sklearn.linear_model
Statistical tests (t.test(), wilcox.test(), ...) | statsmodels.stats
Time series (arima(), acf(), ...) | statsmodels.tsa
Smoothing / interpolation (smooth(), spline(), approx(), ...) | scipy.interpolate, statsmodels.nonparametric
Clustering (kmeans(), hclust(), ...) | scipy.cluster, sklearn.cluster
Optimisation (optim(), nlm(), ...) | scipy.optimize
Spectral analysis (fft(), filter(), ...) | numpy.fft, scipy.signal
Density estimation (density(), ...) | scipy.stats.gaussian_kde, statsmodels.nonparametric
Contingency tables (xtabs(), ...) | pandas.crosstab, statsmodels.stats.contingency_tables
Power analysis (power.t.test(), ...) | statsmodels.stats.power
Factor analysis (princomp(), factanal(), ...) | sklearn.decomposition, factor_analyzer
tidyverse | |
cli | rich, click, tqdm |
dplyr, tidyr | pandas, polars, siuba, ibis |
dbplyr | siuba, sqlalchemy, ibis |
forcats | pandas.Categorical, enum |
ggplot2 | plotnine, seaborn |
httr | requests, httpx, urllib |
jsonlite | json, ujson, orjson, pandas |
lubridate | pandas, pendulum |
modelr | scikit-learn |
purrr | itertools, functools, map(), zip(), ... |
readr | pandas, polars |
rvest & xml2 | bs4, xml, lxml, requests, scrapy |
stringr | str, re |
tidymodels | |
tidymodels | scikit-learn, pycaret |
Other | |
arrow | pyarrow |
data.table | dask |
furrr | multiprocessing, joblib, ray, dask, concurrent, asyncio |
slider | pandas.rolling, numpy.lib.stride_tricks, bottleneck, pandas-ta |
testthat | pytest, unittest |
devtools | uv, pip, importlib |
roxygen | sphinx, mkdocs |
plumber | fastapi, flask |
targets | dagster, airflow, prefect, kedro, mlflow, wandb, dvc |
fs | pathlib |
logger | loguru, logging |
zoo / xts / tsibble | pandas |
forecast | Nixtla, darts |
tidyquant | quantlib |
TTR, quantmod | pandas-ta |
quantstrat | zipline |
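To make the dplyr/tidyr row concrete, here's a minimal sketch of one and the same group-and-summarise in pandas and polars (toy data; column names echo mtcars):

```python
# dplyr: mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
import pandas as pd
import polars as pl

data = {"cyl": [4, 4, 6, 8], "mpg": [33.9, 26.0, 21.0, 15.2]}

# pandas: method chaining on a DataFrame
pd_out = (
    pd.DataFrame(data)
    .groupby("cyl", as_index=False)["mpg"]
    .mean()
    .rename(columns={"mpg": "avg_mpg"})
)

# polars: expression API (group_by replaced groupby around 0.19)
pl_out = (
    pl.DataFrame(data)
    .group_by("cyl")
    .agg(pl.col("mpg").mean().alias("avg_mpg"))
)

print(pd_out)
print(pl_out)
```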
In their Academy courses, Posit teaches pandas, plotnine, and statsmodels (I'm an Academy mentor - but I am not speaking on behalf of Posit here).
That said, I think there is widespread agreement among Academy mentors that polars is better (though redesigning a course takes time).
Personally I have a preference for seaborn (for exploratory visualizations).
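For comparison, a minimal sketch of the same exploratory scatter in plotnine and seaborn (uses seaborn's bundled "tips" demo dataset, fetched on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import ggplot, aes, geom_point

tips = sns.load_dataset("tips")  # small demo dataset shipped with seaborn

# plotnine: ggplot2's grammar of graphics, almost verbatim
p = ggplot(tips, aes(x="total_bill", y="tip", color="time")) + geom_point()
p.save("tips_plotnine.png")

# seaborn: the same plot, matplotlib-flavoured
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.savefig("tips_seaborn.png")
```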
And I would add scikit-learn for supervised and unsupervised learning, and PyTorch for AI. Though that's the research side; I read somewhere that TensorFlow is more popular with developers (?).
If you are also doing mapping and/or spatial analysis, geopandas.
I would also add that R's data.table maps to Python's dask, for working with large CSV (non-parquet) datasets.
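A minimal sketch of that out-of-core pattern (the glob and column names are placeholders for your own files):

```python
# Hedged sketch: larger-than-memory CSV wrangling with dask.
# "data/*.csv", "key" and "value" are hypothetical placeholders.
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")  # lazy, partitioned read across many files
out = df.groupby("key")["value"].mean().compute()  # executes in parallel here
print(out)
```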
Thank you for your input!
Re polars & pandas: I personally like polars' syntax more, and it used to be faster than pandas. However, with recent improvements pandas isn't that much slower (I only have anecdotal evidence to support this...), and the syntax argument doesn't seem decisive:
- There's so much pandas code around that being able to read the pandas API is valuable in itself.
- The strongest selling point of dplyr isn't the elegance of its syntax - which polars mimics to some extent - but the ability to translate it into SQL and run it virtually anywhere. polars, IIRC, only offers a local number-crunching backend.
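For what it's worth, ibis (already in the table above) is one Python option that keeps that translate-to-SQL workflow; a minimal sketch, assuming a recent ibis with the DuckDB backend installed (table and column names are made up):

```python
# Hedged sketch: dataframe-style code compiled to SQL with ibis.
import ibis

cars = ibis.table({"cyl": "int64", "mpg": "float64"}, name="cars")
expr = cars.group_by("cyl").aggregate(avg_mpg=cars.mpg.mean())
print(ibis.to_sql(expr, dialect="duckdb"))  # SQL that would run on the backend
```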