debugging valgrind errors

slowkow · October 9, 2023, 3:27pm

Prof Brian Ripley sent out the email below to a few R package developers ( including @davis for the slider package), giving us 14 days to solve problems reported by valgrind.

I'd like to ask if anyone here might be able to help developers figure out what to do about this message. I hope that our discussion here might be helpful to all the developers who received the email, as well as future developers who might eventually receive such a message.

Here are my questions:

What is the general problem here? How important are these errors?
How can we find the source of each error? What are we (R package authors) supposed to do about it?
What's the quickest way to set up a local environment to reproduce the reported errors?

The email:

That is

Carlson MatchIt PheVis boodist ggrepel jacobi lavacreg round secsse
slider smerc

see the logs at
https://www.stats.ox.ac.uk/pub/bdr/memtests/valgrindExtra/ with
reproduction details in the README.txt

Compiling with valgrind often changes computations or spills results
from registers to memory (which on x86_64 reduces the f/p precision).
Some of you are relying on arithmetic preserving the status of NA as a
special NaN.

Reports like

   identical(pos1, pos2) is not TRUE
   zcompare(fzones1, fzones1b, lprimes) is not TRUE

are maximally unhelpful: you have no way of knowing what the values were
on a remote machine.

Please correct before 2023-10-22 to safely retain your package on CRAN.

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford

Gabor · October 9, 2023, 5:11pm

In your package, you do some arithmetic in C/C++/Fortran on double values, and you expect that this arithmetic keeps NAs, but that is unfortunately not always true, they might turn into NaN values.

You can probably reproduce these results using this container: R-hub - R-hub containers
You can use it directly, or via the rhub2 package on GitHub Actions: GitHub - r-hub/rhub2: The 'R-hub' package builder, v2

E.g. here are the results for slider: valgrind (debatable-mudpuppy) · r-lib/slider@45a5668 · GitHub

nirgrahamuk · October 9, 2023, 5:20pm

interesting issue; how are such issues typically addresses ? would it be considered too brutish to record the positions of any NA's that its feared might be converted to NaN, and after the external function call returns, forcing NA's back into those positions ?

Gabor · October 9, 2023, 5:24pm

Yeah, that seems like a good way: keep NAs NAs, and only do the arithmetic operation on the rest of the vector. Of course if the operation is not a simple mapping of a vector, then this can be more complicated.

slowkow · October 9, 2023, 8:23pm

Could I ask why it matters that NA turn into NaN? What is the problem with this?

A lot of developers would benefit from seeing an example with before and after code in Rcpp, to demonstrate the issue clearly.

Any clarifications, examples, or links are more than welcome! Thank you for your help!

Gabor · October 10, 2023, 8:25am

It matters for some applications, because NA is not the same as NaN:

❯ identical(NA_real_, NaN)
[1] FALSE

Not Rcpp, but maybe you can wait until slider is fixed, to see an example.

Btw. not really a fix, strictly speaking, but to keep your package on CRAN, you can also skip the failing tests on CRAN (temporarily?), e.g. with skip_on_cran() if you are using testthat.

davis · October 10, 2023, 12:46pm

I'll try and work on slider today or tomorrow and report back with my fix / anything helpful I found along the way

slowkow · October 10, 2023, 2:14pm

Another developer on the email list replied and said:

The code is supposed to return the degrees of freedom of a Chi² distribution, but with Valgrind it founds -NaN. That makes no sense, I have no idea how to fix that.

I feel the same way as this developer.

There is a lack of clarity about why valgrind is being used in this way, what is the underlying source of the valgrind issue, which users are affected by this issue, and what the package developers are expected to do about it.

davis · October 12, 2023, 12:15am

Okay, in slider's case the problem was that I was expecting a sum(), mean(), and prod() calculation that involved both NA and NaN to return a precise result - i.e. I was expecting sum(NA, NaN) to return NA.

But if you read the help page of ?NA then you'll see:

Numerical computations using NA will normally result in NA : a possible exception is where NaN is also involved, in which case either might result (which may depend on the R platform). However, this is not guaranteed and future CPUs and/or compilers may behave differently. Dynamic binary translation may also impact this behavior (with valgrind, computations using NA may result in NaN even when no NaN is involved).

The last line is particularly interesting, where it is saying that on valgrind you could even see sum(NA) returning NaN.

It seems that on valgrind my specific tests were returning NaN when I expected NA or vice versa.

Anyways, I was using expect_identical() in my tests, which tests for the exact difference between NA and NaN. The easiest fix I could come up with was to use expect_equal() instead, which treats NA and NaN as equivalent values, while still actually comparing any other non-missing values in the vector.

github.com/r-lib/slider

Fix `NA` vs `NaN` propagation differences on valgrind

r-lib:main ← r-lib:fix/na-nan-issue

opened 11:47PM - 11 Oct 23 UTC

DavisVaughan

+57 -54

Closes #198 Basically, if a test combined both `NA` and `NaN` in the same `s…um/mean/prod` calculation then the result is essentially up in the air as to whether it will be another `NA` or `NaN`. While I _was_ actually trying to test this difference before by using `expect_identical()`, it isn't really worth the effort since the base equivalents don't make any promises about it either. The easiest fix is to just switch to `expect_equal()` where `NA_real_` and `NaN` are considered equivalent.

I think this will probably apply to a few of us who got that email. Switching from expect_true(identical(x, y)) or expect_identical(x, y) to expect_equal(x, y) is probably an easy fix.

I don't have any good advice for the developer who responded with "The code is supposed to return the degrees of freedom of a Chi² distribution, but with Valgrind it founds -NaN." besides just using skip_on_cran() for that test and moving on with their life.

If you do want to test that your fixes pass with Valgrind, you can use this great r-hub workflow file:

github.com

r-hub/rhub2/blob/main/inst/workflow/rhub.yaml

# R-hub's generic GitHub Actions workflow file. It's canonical location is at
# https://github.com/r-hub/rhub2/blob/v1/inst/workflow/rhub.yaml
# You can update this file to a newer version using the rhub2 package:
#
# rhub2::rhub_setup()
#
# It is unlikely that you need to modify this file manually.

name: R-hub
run-name: "${{ github.event.inputs.id }}: ${{ github.event.inputs.name || format('Manually run by {0}', github.triggering_actor) }}"

on:
  workflow_dispatch:
    inputs:
      config:
        description: 'A comma separated list of R-hub platforms to use.'
        type: string
        default: 'linux,windows,macos'
      name:
        description: 'Run name. You can leave this empty now.'

This file has been truncated. show original

As the instructions show (GitHub - r-hub/rhub2: The 'R-hub' package builder, v2) you basically just need to install rhub2 with pak::pak("r-hub/rhub2") and then call rhub2::rhub_setup() to add the workflow file to your package and push that to GitHub. After that you should be able to call rhub2::rhub_check(platforms = "valgrind", branch = "my-branch-name") to start a check.

That didn't immediately work for me though, so, after pushing the workflow file to github, another way is to go to your Actions page and manually trigger the workflow with the valgrind platform and the right branch name.

Where I now get a successful check Manually run by DavisVaughan () · r-lib/slider@bf972d1 · GitHub

davis · October 12, 2023, 12:19am

And, of course, don't forget to go to the link in the email (Index of /pub/bdr/memtests/valgrindExtra) to see your package's specific issues.

I recommend copying the results out into a local text file, because they could disappear out from under you at any moment and I like to have a historical record of them. For example, I opened a slider issue and dumped it there NA vs NaN failures on BDR's valgrind machine · Issue #198 · r-lib/slider · GitHub

slowkow · October 12, 2023, 7:42pm

@davis Thank you so much! Many R package developers will learn a lot from your detailed reply.

For ggrepel, the fix was easy thanks to your detailed comments. Also thanks to the rhub2 package developers for allowing us to run valgrind on github.

Just as you suggested, the valgrind error went away when I changed expect_true(identical(x, y)) to expect_equal(x, y).

system · October 19, 2023, 7:43pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.