Python data science stack recommended by Posit

Is there a currently recommended Python data science stack? Most importantly, for local dataframe wrangling, does Posit recommend pandas + siuba, polars, or perhaps something else? And what about plotting - is plotnine the preferred option? I'm asking because some Python developers (e.g., Wes) are now working for Posit, which might indicate a natural preference for certain packages over others.

Hello! Thank you for reaching out. Let us noodle on this for a bit. :thinking:

For now, take a look at Emily Riederer's "Python Rgonomics" talk at posit::conf for some recommendations! https://www.youtube.com/watch?v=ILxK92HDtvU

Cool, thanks!

For dataframe wrangling, I think the fundamental point of using dplyr is its SQL translation. I personally used it on several occasions with BigQuery and Athena, and it worked beautifully. Python's siuba is supposed to mimic the same workflow with an almost identical syntax. Strangely enough, the repository seems dead - has Posit changed its mind?
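To illustrate, siuba's verbs read almost like dplyr; here is a minimal sketch of the local, pandas-backed workflow (the data and column names are made up):

```python
import pandas as pd
from siuba import _, filter, group_by, summarize

# toy data standing in for a real table (hypothetical columns)
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "mass": [1.2, 1.5, 3.0, 2.8],
})

# dplyr-style pipeline: filter %>% group_by %>% summarise
result = (
    df
    >> filter(_.mass > 1.0)
    >> group_by(_.species)
    >> summarize(mean_mass=_.mass.mean())
)
print(result)
```

The same verbs can, in principle, be pointed at a SQL backend through SQLAlchemy, which is the dbplyr-like part I care about; I haven't re-checked how well that currently works against BigQuery or Athena, given the state of the repository.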

Re project/package managers: the video mentions pdm, but I think a more modern alternative is uv: it is faster, avoids the bootstrapping problem (being written in Rust rather than Python), and manages the installation of Python itself too.

I'd love it if someone could compile a table that maps R's packages & functions to their Python alternatives, like a really modern one :slight_smile: Maybe user @x1o should do it.

I compiled this table, not sure if it's of any use. I guess the right path would be to learn

  1. The standard library
  2. NumPy
  3. SciPy, in particular scipy.stats
  4. pandas / polars
  5. statsmodels
  6. scikit-learn
  7. uv

and then other packages depending on the target domain.

| R | Python |
|---|--------|
| **base & utils** | |
| Vector algebra (`%*%`, `matrix()`, `solve()`, ...) | numpy |
| Descriptive statistics (`mean()`, `sd()`, ...) | numpy, scipy.stats, statistics |
| System / OS (`Sys.*`, `sessionInfo()`, ...) | os, sys, subprocess, platform |
| Debugging (`browser()`, `debug()`, ...) | breakpoint(), pdb.runcall, pdb.set_trace |
| **stats** | |
| Probability distributions (`d*()`, `p*()`, ...) | scipy.stats |
| RNG (`r*()`, `sample()`, ...) | random, numpy.random, scipy.stats |
| Linear / generalised models (`lm()`, `anova()`, ...) | statsmodels, sklearn.linear_model |
| Statistical tests (`t.test()`, `wilcox.test()`, ...) | statsmodels.stats |
| Time series (`arima()`, `acf()`, ...) | statsmodels.tsa |
| Smoothing / interpolation (`smooth()`, `spline()`, `approx()`, ...) | scipy.interpolate, statsmodels.nonparametric |
| Clustering (`kmeans()`, `hclust()`, ...) | scipy.cluster, sklearn.cluster |
| Optimisation (`optim()`, `nlm()`, ...) | scipy.optimize |
| Spectral analysis (`fft()`, `filter()`, ...) | numpy.fft, scipy.signal |
| Density estimation (`density()`, ...) | scipy.stats.gaussian_kde, statsmodels.nonparametric |
| Contingency tables (`xtabs()`, ...) | pandas.crosstab, statsmodels.stats.contingency_tables |
| Power analysis (`power.t.test()`, ...) | statsmodels.stats.power |
| Factor analysis (`princomp()`, `factanal()`, ...) | sklearn.decomposition, factor_analyzer |
| **tidyverse** | |
| cli | rich, click, tqdm |
| dplyr, tidyr | pandas, polars, siuba, ibis |
| dbplyr | siuba, sqlalchemy, ibis |
| forcats | pandas.Categorical, enum |
| ggplot2 | plotnine, seaborn |
| httr | requests, httpx, urllib |
| jsonlite | json, ujson, orjson, pandas |
| lubridate | pandas, pendulum |
| modelr | scikit-learn |
| purrr | itertools, functools, map(), zip(), ... |
| readr | pandas, polars |
| rvest & xml2 | bs4, xml, lxml, requests, scrapy |
| stringr | str, re |
| **tidymodels** | |
| tidymodels | scikit-learn, pycaret |
| **Other** | |
| arrow | pyarrow |
| data.table | dask |
| furrr | multiprocessing, joblib, ray, dask, concurrent, asyncio |
| slider | pandas.rolling, numpy.lib.stride_tricks, bottleneck, pandas-ta |
| testthat | pytest, unittest |
| devtools | uv, pip, importlib |
| roxygen | sphinx, mkdocs |
| plumber | fastapi, flask |
| targets | dagster, airflow, prefect, kedro, mlflow, wandb, dvc |
| fs | pathlib |
| logger | loguru, logging |
| zoo / xts / tsibble | pandas |
| forecast | Nixtla, darts |
| tidyquant | quantlib |
| TTR, quantmod | pandas-ta |
| quantstrat | zipline |

In its Academy courses, Posit teaches pandas, plotnine, and statsmodels (I'm an Academy mentor, but I am not speaking on behalf of Posit here).
That said, I think there is widespread agreement among Academy mentors that polars is better (re-designing a course just takes time).
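To give a flavour of why plotnine is an easy sell to R folks, it is essentially ggplot2 syntax in Python; a small sketch using the mtcars example data that ships with plotnine:

```python
from plotnine import ggplot, aes, geom_point, facet_wrap, labs
from plotnine.data import mtcars  # plotnine bundles a few example datasets

# grammar of graphics, almost verbatim from ggplot2
p = (
    ggplot(mtcars, aes(x="wt", y="mpg", color="factor(cyl)"))
    + geom_point()
    + facet_wrap("~am")
    + labs(x="Weight (1000 lbs)", y="Miles per gallon", color="Cylinders")
)
p.save("mtcars.png", dpi=150)  # or just display p in a notebook
```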

Personally, I have a preference for seaborn (for exploratory visualizations).
I would also add scikit-learn for supervised and unsupervised learning, and PyTorch for AI. Though that's on the research side; I read somewhere that TensorFlow is more popular with developers (?).
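In case a concrete example helps, here is a minimal supervised-learning sketch with scikit-learn, using its bundled iris toy data (the choice of model is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# bundled toy data standing in for a real problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```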

If you are also doing mapping and/or spatial analysis, geopandas.

I would also add that R's data.table maps to Python's dask, for working with large CSV (non-parquet) datasets.
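A tiny sketch of the dask pattern I have in mind: lazy, out-of-core reads over a glob of CSVs (the paths and column names are hypothetical):

```python
import dask.dataframe as dd

# lazily scan a directory of CSVs that would not fit in memory
ddf = dd.read_csv("data/part-*.csv")

# pandas-like operations only build a task graph at this point
summary = ddf.groupby("category")["value"].mean()

# compute() triggers the actual chunked, parallel execution
print(summary.compute())
```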


Thank you for your input!

Re polars & pandas: I personally like polars' syntax more (see the quick side-by-side sketch below), and it used to be faster than pandas. However, with recent improvements pandas isn't that much slower (I only have anecdotal evidence to support this...), and the syntax argument doesn't seem decisive:

  • Because there's so much pandas code around, it helps to be able to read and understand the pandas API anyway
  • The strongest selling point of dplyr isn't the elegance of the syntax (which polars to some extent mimics) but the possibility of translating it into SQL and running it virtually anywhere. polars, IIRC, only offers a local number-crunching backend.
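For the syntax comparison, here is the same toy aggregation in both libraries; the data and column names are made up, and note that recent polars spells the verb group_by while older releases used groupby:

```python
import pandas as pd
import polars as pl

data = {"g": ["a", "a", "b"], "x": [1, 2, 3]}

# pandas: boolean filtering via query() plus groupby
out_pd = (
    pd.DataFrame(data)
    .query("x > 1")
    .groupby("g", as_index=False)["x"]
    .mean()
)

# polars: expression API, closer to dplyr's verb + column-expression style
out_pl = (
    pl.DataFrame(data)
    .filter(pl.col("x") > 1)
    .group_by("g")
    .agg(pl.col("x").mean())
)

print(out_pd)
print(out_pl)
```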