Compute-sanitizer not quite a drop-in replacement of cuda-memcheck

Hi,

Given the deprecation notice for cuda-memcheck, we are trying to port our tests over to compute-sanitizer. However, we have found that it is not a drop-in replacement:

  • On default launch, we get this error:

========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

  • compute-sanitizer now uses “ports” to connect to the application, which must be open. This is a huge difference w.r.t. cuda-memcheck where no ports were needed. This is hard to use in CI build farms that have a very strict network configuration.

Would it be possible to address these issues or, alternatively, un-deprecate cuda-memcheck?

Thanks!

Additionally, compute-sanitizer errors out when running on a non-CUDA binary, which cuda-memcheck didn’t do. Is it possible to disable this error and let it run on non-CUDA binaries?

EDIT: this can be addressed using --require-cuda-init=no

Thanks for getting in touch. With respect to the port issue, are the application and compute-sanitizer running on the same machine, or is there any type of remote connection happening? I think we do use ports, even locally, but that’s an easier problem to fix than communicating across machines without ports.

For the default launch issue, was this done on a system that didn’t have a port issue? If so, can you share the full command line and output you saw?

Thanks.

Thanks for the response!

The application and compute-sanitizer are running on the same machine. However, multiple tests (on the order of 10-100) are scheduled in parallel, which might lead to the same port being used for all tests, or simply to not enough ports being available. Since moving from cuda-memcheck to compute-sanitizer, our tests time out when run in parallel, whereas each one normally takes a few seconds when run in isolation. I wonder if this is due to a lack of ports or to several instances waiting on the same port.

For the “default launch” issue, the command and its output are shown below (adding --require-cuda-init=no doesn’t fix the problem):

compute-sanitizer --error-exitcode 1 --require-cuda-init=no --tool racecheck --racecheck-report all /path/to/test
========= COMPUTE-SANITIZER
Running main() from gmock_main.cc
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from Test
[ RUN      ] Test.A
[       OK ] Test.A (644 ms)
[ RUN      ] Test.B
[       OK ] Test.B (576 ms)
[ RUN      ] Test.C
[       OK ] Test.C (900 ms)
[ RUN      ] Test.D
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.
[       OK ] Test.D (11592 ms)
[ RUN      ] Test.E
[       OK ] Test.E (0 ms)
[ RUN      ] Test.F
[       OK ] Test.F (0 ms)
[ RUN      ] Test.G
[       OK ] Test.G (0 ms)
...
[----------] 11 tests from Test (13714 ms total)
[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (13714 ms total)
[  PASSED  ] 11 tests.

Lastly, I have recently found that when running our tests in parallel under compute-sanitizer with the racecheck tool, they sometimes (not always) fail with a segfault, and it’s not always the same test that fails. Example output:

========= COMPUTE-SANITIZER
Running main() from gmock_main.cc
[==========] Running 6 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 3 tests from Foo/0, where TypeParam = int
*** SIGSEGV received at time=1661336821 on cpu 2 ***
PC: @     0x7f02bc006150  (unknown)  (unknown)
    @     0x560b65a993ec         64  absl::WriteFailureInfo()
    @     0x560b65a9959d         96  absl::AbslFailureSignalHandler()
    @     0x7f02ddc51980  (unknown)  (unknown)
    @     0x7f02de88df00  (unknown)  (unknown)
========= Error: process didn't terminate successfully
=========     The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or a host debugger to catch host side errors.
========= Target application returned an error
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)

We are not using any unified memory, and these tests have been battle-tested under cuda-memcheck in CI for well over a year.

Sometimes idle compute-sanitizer processes remain running until they are killed, even though the test they were supposed to check finished long ago. Running strace on such a process shows:

futex(0x26e542c, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)

For a more concrete repro:

$ cat main.cpp 
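#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

int main(int argc, char* argv[])
{
  // The sleep duration in seconds is taken from the first argument.
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <seconds>\n";
    return 1;
  }
  std::cout << "Start...\n";
  std::this_thread::sleep_for(std::chrono::seconds(std::atoi(argv[1])));
  std::cout << "Stop\n";
  return 0;
}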

Launching:

/usr/local/cuda-11.7/bin/compute-sanitizer --error-exitcode=1 --require-cuda-init=no --launch-timeout=1 --tool=racecheck --racecheck-report=all ./a.out 10

Gives:

========= COMPUTE-SANITIZER
Start...
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.
Stop

This exits with error code 255, failing CI checks. The tool fails even though --require-cuda-init=no was specified.

Setting --launch-timeout=0 (unlimited) works, but I suspect this is why some tests get stuck when multiple instances are launched in parallel.
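
To check that suspicion outside of our CI, a small launcher along the following lines could start many instances at once and collect their exit codes. This is only a sketch: run_in_parallel is a name I made up, it assumes a POSIX system with /bin/sh, and the command string is whatever you pass in.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <string>
#include <vector>

// Hypothetical helper: launches `count` copies of `command` concurrently via
// /bin/sh and returns each child's exit code. An entry is -1 if fork failed
// or the child was terminated by a signal (for example SIGSEGV).
std::vector<int> run_in_parallel(const std::string& command, int count)
{
  std::vector<pid_t> children;
  for (int i = 0; i < count; ++i) {
    pid_t pid = fork();
    if (pid == 0) {
      // Child: replace the process image with a shell running the command.
      execl("/bin/sh", "sh", "-c", command.c_str(), static_cast<char*>(nullptr));
      _exit(127);  // only reached if execl itself failed
    }
    children.push_back(pid);  // pid is -1 here if fork failed
  }

  // Parent: wait for every child in launch order and record how it ended.
  std::vector<int> exit_codes;
  for (pid_t pid : children) {
    int status = 0;
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
      exit_codes.push_back(WEXITSTATUS(status));
    } else {
      exit_codes.push_back(-1);
    }
  }
  return exit_codes;
}
```

Calling it with the compute-sanitizer command line above (with --launch-timeout=0) and a count of, say, 50 should show whether any instance hangs or returns 255 or -1. I have not established how many instances are needed to trigger the problem.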

Those are all my findings so far. Let me know if I can provide more info!

Hi,

Could you try using the --max-connections option to increase the number of ports available?

Note that you can also use the --port option to specify the base port to use.
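
As a rough illustration of combining the two options, the sketch below gives each parallel job its own non-overlapping port range. It assumes each job knows a unique index and that --max-connections bounds the number of ports used starting at --port. The function name, the starting port 20000, and the range size 16 are hypothetical values chosen for the example.

```cpp
#include <string>

// Hypothetical helper: builds a compute-sanitizer command line whose port
// range [base_port, base_port + ports_per_job) does not overlap with the
// range of any other job index. first_port and ports_per_job are made-up
// defaults, not values taken from the tool.
std::string sanitizer_command(int job_index, const std::string& test_binary,
                              int first_port = 20000, int ports_per_job = 16)
{
  int base_port = first_port + job_index * ports_per_job;
  return "compute-sanitizer --port=" + std::to_string(base_port) +
         " --max-connections=" + std::to_string(ports_per_job) + " " +
         test_binary;
}
```

For example, job 0 would use ports starting at 20000 and job 1 ports starting at 20016. This only helps if the jobs can agree on their indices and the machine allows those ports.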

We’re investigating alternative options to ports.

Hi,

--max-connections doesn’t solve the issue.

Consider that I, as a CI user, am not in control of the CI build machines and their configuration. I cannot arbitrarily choose a port and expect the CI machine to allow that. Also, consider that the CI machines might run multiple jobs simultaneously, jobs that don’t know about each other and therefore cannot negotiate which job uses which port.

Why were ports introduced? Everything was working just fine on “cuda-memcheck” without ports - why are ports needed now?

Hi!

I had one more related question. What does --max-connections do? I can think of 2 possibilities:

  1. Unconditionally reserve N network ports, even if not all are used.
  2. If base_port is occupied, try base_port + 1, and so forth until base_port + N. If all ports are occupied, wait until one becomes free.
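
To make possibility 2 concrete, here is a rough sketch of the scan I have in mind. This is purely my guess at the behavior, not something I know compute-sanitizer does. The names first_free_port and can_bind_locally are mine, and the bind-based check assumes a POSIX system.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <functional>

// Hypothetical model of possibility 2: try base_port, base_port + 1, ...
// up to and including base_port + n, and return the first port that the
// predicate reports as free. Returns -1 if every port is occupied, in which
// case the caller would have to wait and retry.
int first_free_port(int base_port, int n, const std::function<bool(int)>& is_free)
{
  for (int port = base_port; port <= base_port + n; ++port) {
    if (is_free(port)) return port;
  }
  return -1;
}

// One possible predicate: a port counts as free if a TCP socket can be bound
// to it on the loopback interface. The socket is closed again right away, so
// another process could still grab the port before it is actually used.
bool can_bind_locally(int port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return false;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(static_cast<std::uint16_t>(port));
  bool ok = bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
  close(fd);
  return ok;
}
```

With this model, first_free_port(base_port, N, can_bind_locally) would pick a port per instance, and 10-100 parallel instances could easily exhaust a small N. Under possibility 1, on the other hand, every instance would hold N ports for its whole lifetime.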

Thanks!