Reticulate py_to_r() fails inside a dplyr mutate()

Hi Posit users.

I'm working with protected health data, so apologies in advance for not being able to share a reproducible example.

I have some data that behaves strangely in the conversion from a pandas DataFrame to an Arrow table after being saved as a .parquet file.

Thankfully, that part is easily solved by opening the .parquet in pyarrow and flipping it back to a pandas DataFrame.

Like so:

library(tidyverse)
library(janitor)
library(arrow)
library(reticulate)
pd <- import("pandas")
pa <- import("pyarrow", convert = FALSE)
pq <- import("pyarrow.parquet")

arrowtable <- r_to_py(pq$read_table(Parquetfilevariable))
pandasframe <- clean_names(py_to_r(arrowtable$to_pandas()))
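
For completeness, a minimal sketch of the direct route I could also try (assuming arrow::read_parquet() can handle the file; it may well hit the same conversion issue, which is why I went through Python):

```r
library(arrow)
library(janitor)

# Hypothetical direct read of the same file, skipping Python entirely.
# Parquetfilevariable is the same path variable as above.
direct_frame <- clean_names(read_parquet(Parquetfilevariable))
```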

However, once I take that data frame into R, two of the columns don't convert cleanly.

In a ~6k x 7 data frame, those two columns come through as lists of environments (reticulate's R wrappers around Python objects) rather than ordinary vectors. Like so:

<environment: 0x556b61edee48>
<environment: 0x556b62459e40>
<environment: 0x556b5c1f18c0>

Now I have noticed that these values can be converted one element at a time:

> do.call(py_to_r,pandasframe$environmentvar2[1])
b'AB'

> as.character(do.call(py_to_r,pandasframe$environmentvar2[1]))
[1] "AB"

And that they have their own special class:

> class(pandasframe$environmentvar2)

[1] "list"

> class(do.call(py_to_r,pandasframe$environmentvar2[1]))

[1] "python.builtin.bytes"  "python.builtin.object"

Which explains why py_to_r() works on a single element. But when I try a dplyr call to apply it column- or row-wise across all of the cases, it either errors or runs without actually converting the data.

> Rframe <- pandasframe |> mutate(across(c(environmentvar1,environmentvar2), ~ do.call(py_to_r,.)))
Error in `mutate()`:
ℹ In argument: `across(...)`.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error:
! unused arguments (<environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>)
Run `rlang::last_trace()` to see where the error occurred.
---
Backtrace:
    ▆
 1. ├─dplyr::mutate(...)
 2. ├─dplyr:::mutate.data.frame(...)
 3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 4. │   ├─base::withCallingHandlers(...)
 5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 6. │     ├─base::withCallingHandlers(...)
 7. │     └─mask$eval_all_mutate(quo)
 8. │       └─dplyr (local) eval()
 9. └─base::do.call(py_to_r, environmentvar1)
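
I think I see why this one errors: do.call() splices every element of the list column in as a separate argument, while py_to_r() accepts only one object. A standalone illustration in plain R, no reticulate involved:

```r
f <- function(x) x

# do.call() spreads the list's elements across f's arguments:
do.call(f, list(1))             # fine: one element, one argument
try(do.call(f, list(1, 2, 3)))  # error: unused arguments (2, 3)
```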

> Rframe <- pandasframe |> mutate(across(c(environmentvar1,environmentvar2), ~ py_to_r(.)))
> Rframe$environmentvar2[1]
[[1]]
b'AB'

> class(Rframe$environmentvar2)
[1] "list"

> class(Rframe$environmentvar2[1])
[1] "list"

And with rowwise(), both versions error outright, for different reasons.

> pandasframe |> rowwise() |> mutate(across(c(environmentvar1,environmentvar2), ~ do.call(py_to_r,.)))
Error in `mutate()`:
ℹ In argument: `across(...)`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `do.call()`:
! second argument must be a list
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `across(...)`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `do.call()`:
! second argument must be a list
---
Backtrace:
     ▆
  1. ├─dplyr::mutate(...)
  2. ├─dplyr:::mutate.data.frame(...)
  3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  4. │   ├─base::withCallingHandlers(...)
  5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─mask$eval_all_mutate(quo)
  8. │       └─dplyr (local) eval()
  9. └─base::do.call(py_to_r, environmentvar1)
 10.   └─base::stop("second argument must be a list")

> pandasframe |> rowwise() |> mutate(across(c(environmentvar1,environmentvar2), ~ py_to_r(.)))
Error in `mutate()`:
ℹ In argument: `across(c(environmentvar1, environmentvar2), ~py_to_r(.))`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `dplyr_internal_error()`:
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `across(c(environmentvar1, environmentvar2), ~py_to_r(.))`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
---
Backtrace:
     ▆
  1. ├─dplyr::mutate(...)
  2. ├─dplyr:::mutate.data.frame(rowwise(pandasframe), across(c(environmentvar1, environmentvar2), ~py_to_r(.)))
  3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  4. │   ├─base::withCallingHandlers(...)
  5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─mask$eval_all_mutate(quo)
  8. │       └─dplyr (local) eval()
  9. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
 10. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 11. │   └─rlang:::signal_abort(cnd, .file)
 12. │     └─base::signalCondition(cnd)
 13. └─dplyr (local) `<fn>`(`<dpl:::__>`)
Caused by error in `dplyr_internal_error()`:
---
Backtrace:
    ▆
 1. ├─dplyr::mutate(...)
 2. ├─dplyr:::mutate.data.frame(...)
 3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 4. │   ├─base::withCallingHandlers(...)
 5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 6. │     ├─base::withCallingHandlers(...)
 7. │     └─mask$eval_all_mutate(quo)
 8. │       └─dplyr (local) eval()
 9. └─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
Run rlang::last_trace(drop = FALSE) to see 5 hidden frames.
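
Based on the single-element behavior above, my working theory is that each cell has to be converted individually. A sketch of the only shape I'd expect to work (environmentvar1/environmentvar2 are the anonymized column names from above), though map_chr() is essentially the element-wise loop I'm hoping to avoid:

```r
library(dplyr)
library(purrr)
library(reticulate)

# Convert each wrapped Python object cell by cell, mirroring the
# single-element as.character(py_to_r(...)) call that worked above.
Rframe <- pandasframe |>
  mutate(across(
    c(environmentvar1, environmentvar2),
    ~ map_chr(.x, function(obj) as.character(py_to_r(obj)))
  ))
```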

Any suggestions as to how I might do this conversion without having to loop over every element one at a time?

Thank you in advance!
