[Arrow] RStudio hangs and R terminal segfaults when attempting to join tables

Hi Arrow Posit Users.

I'm attempting to join about three dataset pointers and tables in a single call (I have also tried joining them separately, but the same error occurs, so that's not the cause). Their schemas interpret the same variable as different types (ID as utf8() vs. large_utf8()).

However, whenever I try to do so, the code either hangs in RStudio or segfaults in the terminal/debugger.
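For reference, the type mismatch described above can be confirmed by inspecting the key column's type on each input. This is only a sketch on toy tables: `left_tbl`, `right_tbl`, and the helper `key_type()` are hypothetical names, not part of the real pipeline.

```r
library(arrow)

# Toy stand-ins for the real inputs (hypothetical data, not the actual datasets).
left_tbl <- as_arrow_table(
  data.frame(ID_Str = c("a", "b"), stringsAsFactors = FALSE),
  schema = schema(ID_Str = large_utf8())
)
right_tbl <- as_arrow_table(
  data.frame(ID_Str = c("b", "c"), stringsAsFactors = FALSE),
  schema = schema(ID_Str = utf8())
)

# Hypothetical helper: return the Arrow type of one column as a string.
key_type <- function(x, key) {
  x$schema$GetFieldByName(key)$type$ToString()
}

left_type <- key_type(left_tbl, "ID_Str")
right_type <- key_type(right_tbl, "ID_Str")

# TRUE only when both sides store the key with the same Arrow type.
types_match <- identical(left_type, right_type)
print(c(left = left_type, right = right_type))
```

The same helper should also work on a Dataset opened with open_dataset(), since it only reads the `$schema` field.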

This is the full code:

omnibus_schema <- schema(
  ID_Str = utf8(),
  IndexDt = date32(),
  Class = utf8(),
  Class_code = utf8()
)

Output <- Importing_data_pointer |>
  inner_join(Importing_data_reference_pointer, by = "RefKey") |>
  mutate(Class = "Class") |>
  select(ID_Str, Strings_of_Interest, LogDt, Class) |>
  filter(str_detect(Strings_of_Interest, "(?i)science|fantasy|reference")) |>
  right_join(
    (Sample_of_Interest |> as_arrow_table(schema = schema(
      ID_Str = large_utf8(), Department = utf8(), DDID = utf8(),
      PubDt = date32(), LogDt = date32(), CheckDt = date32(),
      DonFlag = uint64(), PurchFlag = uint64(), DamFlag = uint64(), MissFlag = uint64(),
      LastCheckDt = date32(), LongestDuration = uint64(), AuditStatus = uint64(),
      Lagged_Covariate_1_VOI = uint64(), Covariate_1_VOI = uint64(),
      Lagged_Covariate_3_VOI = uint64(), Covariate_3_VOI = uint64(),
      Lagged_Covariate_5_VOI = uint64(), Covariate_5_VOI = uint64(),
      Lagged_Covariate_6_VOI = uint64(), Covariate_6_VOI = uint64(),
      Lagged_Covariate_7_VOI = uint64(), Covariate_7_VOI = uint64(),
      Lagged_Covariate_9_VOI = uint64(), Covariate_9_VOI = uint64(),
      Lagged_Covariate_12_VOI = uint64(), Covariate_12_VOI = uint64(),
      Lagged_Covariate_22_VOI = uint64(), Covariate_22_VOI = uint64(),
      Lagged_Covariate_1_Nuisance_1 = uint64(), Covariate_1_Nuisance_1 = uint64(),
      Lagged_Covariate_6_Nuisance_1 = uint64(), Covariate_6_Nuisance_1 = uint64(),
      Lagged_Covariate_7_Nuisance_1 = uint64(), Covariate_7_Nuisance_1 = uint64(),
      Lagged_Covariate_22_Nuisance_1 = uint64(), Covariate_22_Nuisance_1 = uint64(),
      Lagged_Covariate_3_Nuisance_2 = uint64(), Covariate_3_Nuisance_2 = uint64(),
      Lagged_Covariate_9_Nuisance_2 = uint64(), Covariate_9_Nuisance_2 = uint64(),
      Lagged_Covariate_22_Nuisance_2 = uint64(), Covariate_22_Nuisance_2 = uint64(),
      ReturnDt = uint64()
    ))),
    by = "ID_Str"
  ) |>
  rename(ADate = LogDt, Class_code = Strings_of_Interest) |>
  select(ID_Str, PubDt, Class, Class_code) |>
  as_arrow_table(schema = omnibus_schema) |>
  mutate(ID_Str = cast(ID_Str, utf8()), Class_code = cast(Class_code, utf8())) |>
  as_arrow_table()

But the problem component appears to be this part:

as_arrow_table(schema = omnibus_schema) |>

It throws the following error in the debugger, whether or not the omnibus_schema argument is included, and it just hangs in RStudio:

Thread 19 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffba25fd640 (LWP 2085506)]
0x00007ffff776a94d in __memmove_evex_unaligned_erms () from /lib64/libc.so.6

Based on other threads, this gives me the impression that there's some kind of "null pointer" issue with the output, or that the schema isn't being read correctly and is malformed in some way, but I haven't been able to isolate the issue.
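One way to narrow this down is to materialize the pipeline one stage at a time and print the schema after each stage. Because a segfault kills the R process, the last label printed marks the last stage that completed. The sketch below uses a toy table, and `checkpoint()` is a hypothetical helper, not an arrow function.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: run the query so far, report row count and schema,
# and return the materialized Table so the pipeline can continue.
checkpoint <- function(query, label) {
  out <- compute(query)
  message(label, ": ", out$num_rows, " rows")
  print(out$schema)
  out
}

# Toy stand-in for the real data (hypothetical values).
toy <- arrow_table(
  ID_Str = c("a", "b", "c"),
  Strings_of_Interest = c("science", "history", "Fantasy")
)

stage1 <- toy |>
  filter(grepl("science|fantasy|reference", Strings_of_Interest, ignore.case = TRUE)) |>
  checkpoint("after filter")

stage2 <- stage1 |>
  select(ID_Str) |>
  checkpoint("after select")
```

In the real pipeline, a checkpoint() call could be placed after each join, so the crashing step is identified by the label that never prints.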

Would anyone happen to know what I might be doing wrong here?

Here's what information I have about the inputs, but unfortunately I don't have any null data I can generate at the moment.

I can try to permute something if necessary.

Importing_data_pointer:
FileSystemDataset with X Parquet files
10 columns
ID_Str: large_string
Source_ID: large_string
LogDt: date32[day]
RefKey: large_string
Length: double
Amount: double
Vendor: large_string
Transfer: large_string
Transfer_ID: large_string
Transfer_Loc: large_string

Importing_data_reference_pointer:
FileSystemDataset with X Parquet file
25 columns
RefKey: large_string
Weight: double
Publisher: string
REF_ID_1: string
REF_ID_2: string
REF_ID_3: string
REF_ID_4: string
REF_ID_5: string
REF_ID_6: string
REF_ID_7: string
Class: string
Width: string
Flagged: string
Strings_of_Interest: string
PublishingDt: date32[day]
Ingr: string
MultRefKey: double
DeprecationDt: date32[day]
Version: string
State: string
Zip: string
Network: string
Lender: double
Zone: string
Operating_area: string

Sample_of_Interest (this is a tibble that is converted to an arrow table)
FileSystemDataset with X Parquet file
ID_Str: string
Department: string
DDID: string
PubDt: date32[day]
LogDt: date32[day]
CheckDt: date32[day]
DonFlag: double
PurchFlag: double
DamFlag: double
MissFlag: double
LastCheckDt: date32[day]
LongestDuration: double
AuditStatus: double
Lagged_Covariate_1_VOI: double
Covariate_1_VOI: double
Lagged_Covariate_3_VOI: double
Covariate_3_VOI: double
Lagged_Covariate_5_VOI: double
Covariate_5_VOI: double
Lagged_Covariate_6_VOI: double
Covariate_6_VOI: double
Lagged_Covariate_7_VOI: double
Covariate_7_VOI: double
Lagged_Covariate_9_VOI: double
Covariate_9_VOI: double
Lagged_Covariate_12_VOI: double
Covariate_12_VOI: double
Lagged_Covariate_22_VOI: double
Covariate_22_VOI: double
Lagged_Covariate_1_Nuisance_1: double
Covariate_1_Nuisance_1: double
Lagged_Covariate_6_Nuisance_1: double
Covariate_6_Nuisance_1: double
Lagged_Covariate_7_Nuisance_1: double
Covariate_7_Nuisance_1: double
Lagged_Covariate_22_Nuisance_1: double
Covariate_22_Nuisance_1: double
Lagged_Covariate_3_Nuisance_2: double
Covariate_3_Nuisance_2: double
Lagged_Covariate_9_Nuisance_2: double
Covariate_9_Nuisance_2: double
Lagged_Covariate_22_Nuisance_2: double
Covariate_22_Nuisance_2: double
ReturnDt: double

Thank you in advance!

Quick update.

Changing it to the following seems to work, but I'm not sure why:

right_join((Sample_of_Interest |> mutate(ID_Str = cast(ID_Str, large_utf8()))), by = "ID_Str") |> as_arrow_table() |>
rename(ADate = LogDt, Class_code = Strings_of_Interest) |>
select(ID_Str, PubDt, Class, Class_code) |>
as_arrow_table(schema = omnibus_schema) |>
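
The idea behind this change, casting the join key on one side so both sides share a type before the join, can be sketched on toy tables as follows. `left_tbl` and `sample_tbl` are hypothetical stand-ins, and this is not a claim about why the original version crashed.

```r
library(arrow)
library(dplyr)

# Toy stand-ins (hypothetical data): the left key is large_utf8, the right key is utf8.
left_tbl <- as_arrow_table(
  data.frame(ID_Str = c("a", "b", "c"), Class_code = c("x", "y", "z")),
  schema = schema(ID_Str = large_utf8(), Class_code = utf8())
)
sample_tbl <- as_arrow_table(
  data.frame(ID_Str = c("b", "c", "d")),
  schema = schema(ID_Str = utf8())
)

# Cast the right-hand key to large_utf8 so both sides agree, then join.
joined <- left_tbl |>
  right_join(
    sample_tbl |> mutate(ID_Str = cast(ID_Str, large_utf8())),
    by = "ID_Str"
  ) |>
  compute()

# Type of the key column in the joined result, as a string.
joined_key_type <- joined$schema$GetFieldByName("ID_Str")$type$ToString()
print(joined$schema)
```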

This is a pretty heavy Arrow topic. Someone here might be able to help, but you're more likely to get help with this particular question by opening an issue on the Arrow GitHub repo, as both the R and C++ Arrow developers are more likely to see it there.
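
When opening such an issue, it usually helps to include version and build details. A minimal sketch of collecting them, assuming the arrow package is installed:

```r
library(arrow)

# Version, build options, and enabled capabilities of the installed arrow package.
info <- arrow_info()
print(info)

# R version, platform, and attached packages.
print(sessionInfo())
```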

That's the first place I posted, @thisisnic.

Was just reaching out here as well in case anyone might have known.
