Hi Posit users.
I'm working with protected health data, so apologies in advance for not being able to provide examples.
I have some data that is behaving strangely in the conversion from pandas dataframe to arrow table after being saved as a .parquet file.
Thankfully, it's easily solved by just opening the .parquet in pyarrow and then flipping it back to a pandas dataframe.
Like so:
library(tidyverse)
library(janitor)
library(arrow)
library(reticulate)
pd <- import("pandas")
pa <- import("pyarrow", convert = FALSE)
pq <- import("pyarrow.parquet")
arrowtable <- r_to_py(pq$read_table(Parquetfilevariable))
pandasframe <- clean_names(py_to_r(arrowtable$to_pandas()))
However, once I take that dataframe into R, it seems to take issue with some of the pandas values, which come through as environment variables.
In a ~6k x 7 dataframe, two of the columns appear as complex environment variables. Like so:
<environment: 0x556b61edee48>
<environment: 0x556b62459e40>
<environment: 0x556b5c1f18c0>
I have noticed that individual values can be converted to something I can use:
> do.call(py_to_r,pandasframe$environmentvar2[1])
b'AB'
> as.character(do.call(py_to_r,pandasframe$place_of_service[1]))
[1] "AB"
And that they have their own special class:
> class(pandasframe$environmentvar2)
[1] "list"
> class(do.call(py_to_r,pandasframe$environmentvar2[1]))
[1] "python.builtin.bytes" "python.builtin.object"
That explains why py_to_r() works on a single value. But when I use a dplyr call to apply it column-wise or row-wise to all of the cases, it either errors or runs but fails to convert the data.
> Rframe <- pandasframe |> mutate(across(c(environmentvar1,environmentvar2), ~ do.call(py_to_r,.)))
Error in `mutate()`:
ℹ In argument: `across(...)`.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error:
! unused arguments (<environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>, <environment>)
Run `rlang::last_trace()` to see where the error occurred.
---
Backtrace:
▆
1. ├─dplyr::mutate(...)
2. ├─dplyr:::mutate.data.frame(...)
3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
4. │ ├─base::withCallingHandlers(...)
5. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
6. │ ├─base::withCallingHandlers(...)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. └─base::do.call(py_to_r, environmentvar1)
> Rframe <- pandasframe |> mutate(across(c(environmentvar1,environmentvar2), ~ py_to_r(.)))
> Rframe$environmentvar2[1]
[[1]]
b'AB'
> class(Rframe$environmentvar2)
[1] "list"
> class(Rframe$environmentvar2[1])
[1] "list"
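For what it's worth, the "unused arguments" error above looks consistent with how do.call() works: it spreads every element of the list out as a separate argument, so a whole column becomes thousands of arguments to a function that takes one object. The sketch below reproduces that behaviour in base R only. convert_one() and col are hypothetical stand-ins for py_to_r() and the list column, not the real objects, so this illustrates the mechanism rather than confirming it for the real data.

```r
# Hypothetical stand-in for py_to_r(): it accepts exactly one object.
convert_one <- function(x) toupper(x)

# Hypothetical stand-in for a list column holding one object per row.
col <- list("ab", "cd", "ef")

# do.call() spreads the list into separate arguments, i.e. it tries
# convert_one("ab", "cd", "ef"), which fails with an "unused arguments" error.
whole_column <- tryCatch(
  do.call(convert_one, col),
  error = function(e) conditionMessage(e)
)

# A length-one list works, because it supplies exactly one argument.
single_value <- do.call(convert_one, col[1])

# Calling the function once per element converts the whole column.
per_element <- vapply(col, convert_one, character(1))
```

If the same reasoning holds for the real data, something like vapply(pandasframe$environmentvar2, function(x) as.character(py_to_r(x)), character(1)) might convert a column, but that is an untested assumption, and it is still a loop under the hood.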
And using rowwise() with either version just errors outright, though for different reasons.
> pandasframe |> rowwise() |> mutate(across(c(environmentvar1,environmentvar2), ~ do.call(py_to_r,.)))
Error in `mutate()`:
ℹ In argument: `across(...)`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `do.call()`:
! second argument must be a list
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `across(...)`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
Caused by error in `do.call()`:
! second argument must be a list
---
Backtrace:
▆
1. ├─dplyr::mutate(...)
2. ├─dplyr:::mutate.data.frame(...)
3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
4. │ ├─base::withCallingHandlers(...)
5. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
6. │ ├─base::withCallingHandlers(...)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. └─base::do.call(py_to_r, environmentvar1)
10. └─base::stop("second argument must be a list")
> pandasframe |> rowwise() |> mutate(across(c(environmentvar1,environmentvar2), ~ py_to_r(.)))
Error in `mutate()`:
ℹ In argument: `across(c(environmentvar1, environmentvar2), ~py_to_r(.))`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `original_code`.
Caused by error in `dplyr_internal_error()`:
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
ℹ In argument: `across(c(environmentvar1, environmentvar2), ~py_to_r(.))`.
ℹ In row 1.
Caused by error in `across()`:
! Can't compute column `environmentvar1`.
---
Backtrace:
▆
1. ├─dplyr::mutate(...)
2. ├─dplyr:::mutate.data.frame(rowwise(pandasframe), across(c(environmentvar1, environmentvar2), ~py_to_r(.)))
3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
4. │ ├─base::withCallingHandlers(...)
5. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
6. │ ├─base::withCallingHandlers(...)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
10. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
11. │ └─rlang:::signal_abort(cnd, .file)
12. │ └─base::signalCondition(cnd)
13. └─dplyr (local) `<fn>`(`<dpl:::__>`)
Caused by error in `dplyr_internal_error()`:
---
Backtrace:
▆
1. ├─dplyr::mutate(...)
2. ├─dplyr:::mutate.data.frame(...)
3. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
4. │ ├─base::withCallingHandlers(...)
5. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
6. │ ├─base::withCallingHandlers(...)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. └─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
Run rlang::last_trace(drop = FALSE) to see 5 hidden frames.
Any suggestions as to how I might be able to do this conversion without having to loop over all of them?
Thank you in advance!