Reticulate-based py_to_r() fails in dplyr mutate framework

TPDeRamus · July 11, 2024, 8:23pm

Hi Posit users.

Working with protected health data so apologies in advance for not being able to provide examples.

I have some data that is behaving strangely in the conversion from pandas dataframe to arrow table after being saved as a .parquet file.

Thankfully, it's easily solved by just opening the .parquet in pyarrow and then just flipping it back to a pandas dataframe.

Like so:

library(tidyverse)
library(janitor)
library(arrow)
library(reticulate)
pd <- import("pandas")
pa <- import("pyarrow", convert = FALSE)
pq <- import("pyarrow.parquet")

arrowtable <- r_to_py(pq$read_table(Parquetfilevariable))
pandasframe <- clean_names(py_to_r(arrowtable$to_pandas()))

However, once I take that dataframe into R, it seems to have trouble converting some of the pandas columns.

Of a ~6k x 7 dataframe, two of the columns show up as environments instead of values. Like so:

<environment: 0x556b61edee48>
<environment: 0x556b62459e40>
<environment: 0x556b5c1f18c0>

Now I have noticed that these variables can be converted to something I can use in single cases:

> do.call(py_to_r,pandasframe$environmentvar2[1])
b'AB'

> as.character(do.call(py_to_r,pandasframe$environmentvar2[1]))
[1] "AB"

And that they have their own special class:

> class(pandasframe$environmentvar2)

[1] "list"

> class(do.call(py_to_r,pandasframe$environmentvar2[1]))

[1] "python.builtin.bytes"  "python.builtin.object"

That explains why py_to_r() works on a single element. But when I use a dplyr call to apply it column-wise or row-wise to all of the cases, it either errors or runs without converting the data.

> Rframe <- pandasframe |> mutate(across(c(environmentvar1,environmentvar2), ~ do.call(py_to_r,.)))
Error in `mutate()`:
ℹ In argument: `across(...)`.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error:
! unused arguments (<environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>)
Run `rlang::last_trace()` to see where the error occurred.
---
Backtrace:
    ▆
 1. ├─dplyr::mutate(...)
 2. ├─dplyr:::mutate.data.frame(...)
 3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 4. │   ├─base::withCallingHandlers(...)
 5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 6. │     ├─base::withCallingHandlers(...)
 7. │     └─mask$eval_all_mutate(quo)
 8. │       └─dplyr (local) eval()
 9. └─base::do.call(py_to_r, environmentvar1)
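
The "unused arguments" message above is consistent with how do.call() treats its second argument: every element of the list becomes a separate argument to the function, so passing a whole column calls the function once with thousands of arguments instead of once per row. A minimal base R sketch of that behavior follows. The names convert_one and col are hypothetical stand-ins for py_to_r() and the list column, not part of reticulate or of the data above.

```r
# Hypothetical stand-in for py_to_r(): a function that accepts exactly one argument.
convert_one <- function(x) toupper(x)

# Hypothetical stand-in for a list column with three rows.
col <- list("ab", "cd", "ef")

# A length-one list supplies a single argument, so this call succeeds.
single <- do.call(convert_one, col[1])

# The whole column supplies three arguments, i.e. convert_one("ab", "cd", "ef"),
# which fails; the error message is captured instead of stopping the script.
whole <- tryCatch(
  do.call(convert_one, col),
  error = function(e) conditionMessage(e)
)

# Applying the function element by element returns one result per row.
each <- vapply(col, convert_one, character(1))
```

Here single and each hold converted values, while whole holds the captured error message, mirroring the difference between the single-element call and the across() call.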

> Rframe <- pandasframe |> mutate(across(c(environmentvar1,environmentvar2), ~ py_to_r(.)))
> Rframe$environmentvar2[1]
[[1]]
b'AB'

> class(Rframe$environmentvar2)
[1] "list"

> class(Rframe$environmentvar2[1])
[1] "list"

Adding rowwise() makes both versions error outright, though for different reasons.

> pandasframe |> rowwise() |> mutate(across(c(environmentvar1,environmentvar2), ~ do.call(py_to_r,.)))
Error in `mutate()`:
ℹ In argument: `across(...)`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `do.call()`:
! second argument must be a list
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `across(...)`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `do.call()`:
! second argument must be a list
---
Backtrace:
     ▆
  1. ├─dplyr::mutate(...)
  2. ├─dplyr:::mutate.data.frame(...)
  3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  4. │   ├─base::withCallingHandlers(...)
  5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─mask$eval_all_mutate(quo)
  8. │       └─dplyr (local) eval()
  9. └─base::do.call(py_to_r, environmentvar1)
 10.   └─base::stop("second argument must be a list")

> pandasframe |> rowwise() |> mutate(across(c(environmentvar1,environmentvar2), ~ py_to_r(.)))
Error in `mutate()`:
ℹ In argument: `across(c(environmentvar1, environmentvar2), ~py_to_r(.))`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `dplyr_internal_error()`:
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `across(c(environmentvar1, environmentvar2), ~py_to_r(.))`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
---
Backtrace:
     ▆
  1. ├─dplyr::mutate(...)
  2. ├─dplyr:::mutate.data.frame(rowwise(pandasframe), across(c(environmentvar1, environmentvar2), ~py_to_r(.)))
  3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  4. │   ├─base::withCallingHandlers(...)
  5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─mask$eval_all_mutate(quo)
  8. │       └─dplyr (local) eval()
  9. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
 10. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 11. │   └─rlang:::signal_abort(cnd, .file)
 12. │     └─base::signalCondition(cnd)
 13. └─dplyr (local) `<fn>`(`<dpl:::__>`)
Caused by error in `dplyr_internal_error()`:
---
Backtrace:
    ▆
 1. ├─dplyr::mutate(...)
 2. ├─dplyr:::mutate.data.frame(...)
 3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 4. │   ├─base::withCallingHandlers(...)
 5. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 6. │     ├─base::withCallingHandlers(...)
 7. │     └─mask$eval_all_mutate(quo)
 8. │       └─dplyr (local) eval()
 9. └─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
Run rlang::last_trace(drop = FALSE) to see 5 hidden frames.

Any suggestions as to how I might be able to do this conversion without having to loop over all of them?
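
One possible direction, sketched under assumptions rather than tested on the real data: since as.character() on a single converted element gives a plain string, the same two steps could be applied to each element of the list column with vapply(), which returns an ordinary character vector. The helper flatten_column and the raw-vector stand-in data below are hypothetical; with the real frame the converter passed in would be py_to_r.

```r
# Hypothetical helper: run a one-element converter on every element of a
# list column and collect the results as a character vector.
flatten_column <- function(col, convert_one) {
  vapply(
    col,
    function(obj) as.character(convert_one(obj)),
    character(1),
    USE.NAMES = FALSE
  )
}

# Stand-in data: raw vectors play the role of the Python bytes objects,
# and rawToChar() plays the role of py_to_r().
df <- data.frame(id = 1:3)
df$environmentvar1 <- list(charToRaw("AB"), charToRaw("CD"), charToRaw("EF"))

# Replace the list column with its flattened character version.
df$environmentvar1 <- flatten_column(df$environmentvar1, rawToChar)
```

Inside dplyr the equivalent call would presumably be mutate(across(c(environmentvar1, environmentvar2), ~ flatten_column(.x, py_to_r))), without rowwise(). Under rowwise(), dplyr hands the formula each list element already unwrapped, which would explain both errors above: do.call() receives a single object rather than a list, and py_to_r() returns a bytes object that mutate() does not accept as a vector.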

Thank you in advance!
