[Arrow] RStudio hangs and R terminal segfaults when attempting to join tables

TPDeRamus · August 8, 2024, 3:52pm

Hi Arrow Posit Users.

I'm attempting to join about three dataset pointers and tables in a single call (I have also tried joining them separately, but the same error occurs, so that isn't the cause). Their schemas interpret the same variable as different types (ID as utf8() vs. large_utf8()).

However, whenever I try to do so, the code either hangs in RStudio or segfaults in the terminal/debugger.

This is the whole set of arguments:

omnibus_schema <- schema(
  ID_Str = utf8(),
  IndexDt = date32(),
  Class = utf8(),
  Class_code = utf8()
)

Output <- Importing_data_pointer |>
  inner_join(Importing_data_reference_pointer, by = "RefKey") |>
  mutate(Class = "Class") |>
  select(ID_Str, Strings_of_Interest, LogDt, Class) |>
  filter(str_detect(Strings_of_Interest, "(?i)science|fantasy|reference")) |>
  right_join(
    (Sample_of_Interest |> as_arrow_table(schema = schema(
      ID_Str = large_utf8(), Department = utf8(), DDID = utf8(),
      PubDt = date32(), LogDt = date32(), CheckDt = date32(),
      DonFlag = uint64(), PurchFlag = uint64(), DamFlag = uint64(), MissFlag = uint64(),
      LastCheckDt = date32(), LongestDuration = uint64(), AuditStatus = uint64(),
      Lagged_Covariate_1_VOI = uint64(), Covariate_1_VOI = uint64(),
      Lagged_Covariate_3_VOI = uint64(), Covariate_3_VOI = uint64(),
      Lagged_Covariate_5_VOI = uint64(), Covariate_5_VOI = uint64(),
      Lagged_Covariate_6_VOI = uint64(), Covariate_6_VOI = uint64(),
      Lagged_Covariate_7_VOI = uint64(), Covariate_7_VOI = uint64(),
      Lagged_Covariate_9_VOI = uint64(), Covariate_9_VOI = uint64(),
      Lagged_Covariate_12_VOI = uint64(), Covariate_12_VOI = uint64(),
      Lagged_Covariate_22_VOI = uint64(), Covariate_22_VOI = uint64(),
      Lagged_Covariate_1_Nuisance_1 = uint64(), Covariate_1_Nuisance_1 = uint64(),
      Lagged_Covariate_6_Nuisance_1 = uint64(), Covariate_6_Nuisance_1 = uint64(),
      Lagged_Covariate_7_Nuisance_1 = uint64(), Covariate_7_Nuisance_1 = uint64(),
      Lagged_Covariate_22_Nuisance_1 = uint64(), Covariate_22_Nuisance_1 = uint64(),
      Lagged_Covariate_3_Nuisance_2 = uint64(), Covariate_3_Nuisance_2 = uint64(),
      Lagged_Covariate_9_Nuisance_2 = uint64(), Covariate_9_Nuisance_2 = uint64(),
      Lagged_Covariate_22_Nuisance_2 = uint64(), Covariate_22_Nuisance_2 = uint64(),
      ReturnDt = uint64()
    ))),
    by = "ID_Str"
  ) |>
  rename(ADate = LogDt, Class_code = Strings_of_Interest) |>
  select(ID_Str, PubDt, Class, Class_code) |>
  as_arrow_table(schema = omnibus_schema) |>
  mutate(ID_Str = cast(ID_Str, utf8()), Class_code = cast(Class_code, utf8())) |>
  as_arrow_table()

But the problem component appears to be this part:

as_arrow_table(schema = omnibus_schema) |>

This throws the following error in the debugger, whether or not the omnibus_schema argument is included, and simply hangs in RStudio:

Thread 19 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffba25fd640 (LWP 2085506)]
0x00007ffff776a94d in __memmove_evex_unaligned_erms () from /lib64/libc.so.6

This gives me the impression based on other threads that there's some kind of "null pointer" issue to the output, or that the schema isn't being read correctly so it's malformed in some way, but I haven't been able to isolate the issue.
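One way to narrow this down is to compare the Arrow type of the join key on each input before joining. The following is only a minimal sketch on toy data: `left`, `right`, `key_type` and `same_key_type` are hypothetical names standing in for the real inputs, and the check itself does not confirm what causes the segfault.

```r
library(arrow)

# Hypothetical toy stand-ins for the real inputs (not the actual data).
left <- as_arrow_table(
  data.frame(ID_Str = c("a", "b")),
  schema = schema(ID_Str = large_utf8())
)
right <- as_arrow_table(data.frame(ID_Str = c("a", "c")))

# Return the Arrow type of one column as a string.
key_type <- function(x, key) {
  x$schema$GetFieldByName(key)$type$ToString()
}

# Collect the key type of each input and flag a mismatch.
types <- c(
  left = key_type(left, "ID_Str"),
  right = key_type(right, "ID_Str")
)
print(types)
same_key_type <- length(unique(types)) == 1
```

The same helper should also work on the dataset pointers, since they expose a `$schema` field as well, so each input to the join can be checked in turn.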

Would anyone happen to know what I might be doing wrong here?

Here's what information I have about the inputs, but unfortunately I don't have any null data I can generate at the moment.

I can try to permute something if necessary.

Importing_data_pointer:
FileSystemDataset with X Parquet files
10 columns
ID_Str: large_string
Source_ID: large_string
LogDt: date32[day]
RefKey: large_string
Length: double
Amount: double
Vendor: large_string
Transfer: large_string
Transfer_ID: large_string
Transfer_Loc: large_string

Importing_data_reference_pointer
FileSystemDataset with X Parquet file
25 columns
RefKey: large_string
Weight: double
Publisher: string
REF_ID_1: string
REF_ID_2: string
REF_ID_3: string
REF_ID_4: string
REF_ID_5: string
REF_ID_6: string
REF_ID_7: string
Class: string
Width: string
Flagged: string
Strings_of_Interest: string
PublishingDt: date32[day]
Ingr: string
MultRefKey: double
DeprecationDt: date32[day]
Version: string
State: string
Zip: string
Network: string
Lender: double
Zone: string
Operating_area: string

Sample_of_Interest (this is a tibble that is converted to an arrow table)
FileSystemDataset with X Parquet file
ID_Str: string
Department: string
DDID: string
PubDt: date32[day]
LogDt: date32[day]
CheckDt: date32[day]
DonFlag: double
PurchFlag: double
DamFlag: double
MissFlag: double
LastCheckDt: date32[day]
LongestDuration: double
AuditStatus: double
Lagged_Covariate_1_VOI: double
Covariate_1_VOI: double
Lagged_Covariate_3_VOI: double
Covariate_3_VOI: double
Lagged_Covariate_5_VOI: double
Covariate_5_VOI: double
Lagged_Covariate_6_VOI: double
Covariate_6_VOI: double
Lagged_Covariate_7_VOI: double
Covariate_7_VOI: double
Lagged_Covariate_9_VOI: double
Covariate_9_VOI: double
Lagged_Covariate_12_VOI: double
Covariate_12_VOI: double
Lagged_Covariate_22_VOI: double
Covariate_22_VOI: double
Lagged_Covariate_1_Nuisance_1: double
Covariate_1_Nuisance_1: double
Lagged_Covariate_6_Nuisance_1: double
Covariate_6_Nuisance_1: double
Lagged_Covariate_7_Nuisance_1: double
Covariate_7_Nuisance_1: double
Lagged_Covariate_22_Nuisance_1: double
Covariate_22_Nuisance_1: double
Lagged_Covariate_3_Nuisance_2: double
Covariate_3_Nuisance_2: double
Lagged_Covariate_9_Nuisance_2: double
Covariate_9_Nuisance_2: double
Lagged_Covariate_22_Nuisance_2: double
Covariate_22_Nuisance_2: double
ReturnDt: double

Thank you in advance!

TPDeRamus · August 8, 2024, 5:02pm

Quick update.

Changing it to the following seems to work, but I'm not sure why:

right_join((Sample_of_Interest |> mutate(ID_Str = cast(ID_Str, large_utf8()))), by = "ID_Str") |> as_arrow_table() |>
rename(ADate = LogDt, Class_code = Strings_of_Interest) |>
select(ID_Str, PubDt, Class, Class_code) |>
as_arrow_table(schema = omnibus_schema) |>
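
For anyone wanting to try the same change in isolation, here is a small sketch of the pattern on toy data. The tables `left` and `right` and their contents are made up for illustration. A plausible reading, which is an assumption rather than a confirmed diagnosis, is that the cast helps because both sides of the join key then share one type.

```r
library(arrow)
library(dplyr)

# Hypothetical toy tables; the left key is large_utf8, the right key is utf8.
left <- as_arrow_table(
  data.frame(ID_Str = c("a", "b"), Class_code = c("science", "fantasy")),
  schema = schema(ID_Str = large_utf8(), Class_code = utf8())
)
right <- as_arrow_table(
  data.frame(
    ID_Str = c("a", "b", "c"),
    PubDt = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01"))
  )
)

# Cast the right-hand key so both sides use large_utf8 before joining.
right_aligned <- right |>
  mutate(ID_Str = cast(ID_Str, large_utf8())) |>
  compute()

# Right join keeps every row of right_aligned; unmatched rows get NA.
joined <- left |>
  right_join(right_aligned, by = "ID_Str") |>
  collect()
```

Materialising the cast with `compute()` before the join also makes it easy to inspect `right_aligned$schema` and confirm the key type actually changed.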

thisisnic · August 8, 2024, 5:22pm

This is a pretty heavy Arrow topic. Someone here might be able to help, but you're more likely to get an answer to this particular question by opening an issue on the Arrow GitHub repo, since both the R and C++ Arrow developers are more likely to see it there.

TPDeRamus · August 8, 2024, 6:01pm

That was the first place I posted, @thisisnic:

github.com/apache/arrow

[R] Arrow function hangs/Segfaults during table generation request

opened 03:31PM - 08 Aug 24 UTC

TPDeramus

Component: R Type: usage


Was just reaching out here as well in case anyone might have known.
