Determining the cause of stochastic failures of multiple processes run under MPS

Recently I have been investigating running my CUDA code with MPS, and began to notice stochastic failures when running multiple processes on the same GPU. These failures are not on GPU calls (and no errors appear in the MPS logs); rather, it appears that the state of the code is corrupted and fails assertions in the CPU code. My initial guess was an invalid memory write, and I managed to find and fix such a bug. However, even after fixing that one bug, I am still seeing stochastic failures and have not been able to use compute-sanitizer to catch them. I have a few questions, to make sure I am not headed down the wrong path.

  1. Are there any patterns with streams or memory allocations that should be avoided when using MPS? The docs indicate no, but I want to be certain.
  2. Is shared memory isolated under MPS, the way global memory appears to be? For example, can kernels from different processes share an SM and stomp on each other’s shared memory? Since I can’t detect an invalid global memory write with compute-sanitizer on an A10G GPU, I have wondered if there is an invalid write into shared memory, which apparently can’t be detected on the 8.x series of GPUs. I am investigating invalid shared memory writes on an older card (RTX 2080), but so far no luck.
  3. Besides running compute-sanitizer --tool memcheck against these processes, are there any other tools that might help identify the issue? It currently takes about an hour to trigger the failure (if it triggers at all), and running under compute-sanitizer takes days (if I limit the force-synchronization limit to avoid a GPU OOM) while showing no issues.
  4. Post-Volta MPS is enabled on any Volta or newer card, correct?
  5. Do the environment variables for both the pipe and log directories have to be set? I have seen on Stack Overflow that you do, but the docs indicate that there are default values (which is what I am using anyway). I would assume that if this were misconfigured, failures would be immediate, but I am trying to verify all my assumptions.

Machine Setup
OS: Ubuntu 20.04.2 LTS
GPU: A10G (code compiled for CUDA arch 8.6)
Driver: 550.54.15
CUDA: 12.4.1
Compute-sanitizer command: compute-sanitizer --force-synchronization-limit 1024 --launch-timeout 120 --leak-check full --padding 128 --tool memcheck --error-exitcode 1

If you have found one such bug, there is a good chance there is another bug.

These failures are not on GPU calls […] it appears that the state of the code is corrupted and fails assertions in the CPU code

Can you do time travel debugging? WinDbg or gdb?
And then go back from the failed assertion to find the contradicting memory location, and show the stack trace of the earlier code location that modified that address?

Is shared memory isolated under MPS, the way global memory appears to be?

Or are you talking about GPU memory?

Then I would either separate the kernel into small steps and test a lot of times to find the part of the code that is not deterministic.

Or write testing code on the GPU that frequently validates the current state, or that does the computations the slow way to compare with the fast way.
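
For example, something like this rough sketch (the names validateReduction, d_data, and d_flag are made up, and it assumes a reduction-style result):

```
// Hedged sketch only: recompute a reduction the slow, serial way and
// compare against the fast-path result. All names here are hypothetical.
__global__ void validateReduction(const float *d_data, int n,
                                  float fastResult, int *d_flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float slow = 0.0f;
        for (int i = 0; i < n; ++i)        // deliberately slow reference path
            slow += d_data[i];

        // Tolerance, because float summation order differs between the paths.
        if (fabsf(slow - fastResult) > 1e-3f * fabsf(slow))
            atomicExch(d_flag, 1);          // signal a mismatch to the host
    }
}
```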

  1. I’m not aware of any.

  2. Shared memory is logically distinct between any two threadblocks (let’s leave threadblock clusters out of the discussion for now - it’s not applicable for cc8.x, and it complicates the response wording, but doesn’t change the underlying premise). It does not matter whether those threadblocks are from the same or different kernels, processes, or whatever. And it doesn’t matter whether we are talking about the actual __shared__ declaration in the code or the per-threadblock reservation (see the sketch after this list).

  3. Another possible tool since you mention shared memory, and the variability of issue occurrence, might be the compute-sanitizer racecheck tool. Presumably you have already checked it. Since you mention that there are no device-side error reports, and the actual failure is in host code, from what I can see here I would not be able to rule out host-side corruption, e.g. stack corruption. I suppose a tool to investigate that might be valgrind.

  4. AFAIK yes

  5. With a single MPS server running, I don’t think it should be necessary to override the defaults if you don’t want to.
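
To illustrate point 2 with a minimal sketch (the names sharedIsolationDemo and d_out are hypothetical): each threadblock gets its own instance of both the static __shared__ declaration and the dynamic per-launch reservation, no matter which other blocks are resident on the same SM.

```
// Minimal sketch: each thread block sees its own copy of both kinds of
// shared memory. Names (sharedIsolationDemo, d_out) are hypothetical.
__global__ void sharedIsolationDemo(int *d_out)
{
    __shared__ int staticBuf[256];        // static __shared__ declaration
    extern __shared__ int dynamicBuf[];   // dynamic per-launch reservation

    staticBuf[threadIdx.x]  = blockIdx.x; // visible only within this block
    dynamicBuf[threadIdx.x] = blockIdx.x;
    __syncthreads();

    // Each thread reads back its own block's index, regardless of which
    // other blocks (from this or another process under MPS) share the SM.
    d_out[blockIdx.x * blockDim.x + threadIdx.x] =
        staticBuf[threadIdx.x] + dynamicBuf[threadIdx.x];
}

// Example launch with 256 ints of dynamic shared memory per block:
// sharedIsolationDemo<<<numBlocks, 256, 256 * sizeof(int)>>>(d_out);
```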


IIRC, a short while ago we had a forum post stating that in practice two thread blocks on the same SM could write to (or corrupt) each other’s shared memory. That is not a feature, but a UB situation arising from out-of-bounds indices. (Logically they are still distinct, but not physically.)

I am not sure whether this actually works (I have not tested it), or on which compute capabilities and under what conditions (e.g. whether two thread blocks can write into each other’s shared memory, or only the one at lower physical addresses into the one at higher addresses; or perhaps there are memory-safety features with a coarse granularity that only trigger under certain conditions, …).
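
For concreteness, a hedged sketch of the kind of out-of-bounds shared write in question (a deliberately buggy, hypothetical kernel); whether the stray write physically lands in another resident block’s shared memory is hardware-dependent, and it is UB either way:

```
// Deliberately buggy sketch: an out-of-bounds shared-memory index is UB.
// Logically buf belongs only to this block; physically the stray write may
// or may not land in shared memory reserved for another block on the SM.
__global__ void oobSharedWrite(int *d_out, int badOffset)
{
    __shared__ int buf[128];

    // With badOffset >= 128 this write is out of bounds for buf (UB).
    buf[threadIdx.x + badOffset] = threadIdx.x;
    __syncthreads();

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}
```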


Interesting.

I don’t recall the post.

I wonder if compute-sanitizer would catch that sort of “out of bounds” access.

Sorry, I cannot find it anymore. But some older links I just saw state that there can be an exception since Fermi:

If you have found one such bug, there is a good chance there is another bug.

I am inclined to agree; I have just had a hard time tracking down the source. I want to understand the implications of MPS, since this code (which has only subtly changed in the last year) only starts to fail when running under MPS (while another process runs on the same GPU), to see if that gives me a hint on where to look.

And then go back from the failed assertion to find the contradicting memory location, and show the stack trace of the earlier code location that modified that address?

The assertion is immediately after returning from the GPU and does a very coarse check of the results (i.e. it checks that the results are within some very broad bounds). The number of kernel launches between each return can be in the thousands. I will look into time travel debugging; it would be very helpful if I could go back to the good state and figure out what changed.
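
For illustration, the check is shaped roughly like this minimal sketch (names like checkResultsAfterBatch and kMaxPlausible are placeholders, not the actual code):

```
// Hedged sketch of a coarse host-side sanity check run after a batch of
// kernel launches. Names and bounds are hypothetical.
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

void checkResultsAfterBatch(const float *d_results, size_t n, float kMaxPlausible)
{
    // Surface any pending asynchronous launch or execution error first.
    cudaError_t err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    std::vector<float> h(n);
    err = cudaMemcpy(h.data(), d_results, n * sizeof(float),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    // Very broad plausibility bounds, as described: not a bitwise check,
    // just "are the results obviously corrupted".
    for (float v : h)
        assert(std::isfinite(v) && std::fabs(v) <= kMaxPlausible);
}
```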

Or are you talking about GPU memory?

Yes, I meant GPU memory.

separate the kernel into small steps and test a lot of times to find the part of the code that is not deterministic.

We have a lot of tests that validate determinism, though most are not at the kernel level due to the difficulty of setting up the inputs without Python. The results are bitwise identical with and without MPS running, as long as running with MPS doesn’t trigger the assertion, which is why I suspect this is a memory bug (i.e. some random corruption).

IIRC, a short while ago we had a forum post stating that in practice two thread blocks on the same SM could write to (or corrupt) each other’s shared memory.

Interesting, I will look around for this.

Appreciate the thorough answers!

Another possible tool since you mention shared memory, and the variability of issue occurrence, might be the compute-sanitizer racecheck tool. Presumably you have already checked it.

I have looked at it briefly, but we have a few known races. I will go back, exclude the races I am confident about, and make sure I am not missing another data race.

Since you mention that there are no device-side error reports, and the actual failure is in host code, from what I can see here I would not be able to rule out host-side corruption, e.g. stack corruption. I suppose a tool to investigate that might be valgrind.

Good point, I will run with valgrind and verify it isn’t on the host.

Hi fyork,
can you run each of the thousands of kernels twice, with different output memory but the same input? If you do not have enough GPU memory for two output arrays, you could copy the first result to the CPU and reuse the same output array.
If the input is modified in-place, can you make copies of the input arrays first?

Then you can automatically test whether the two runs yield the same results, and when and where the stochastic failures occur.
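
Something like this rough sketch (stepKernel is just a stand-in for your real kernel, and the sizes are hypothetical), assuming enough GPU memory for two output buffers:

```
// Hedged sketch of "run each step twice on the same input and compare".
// stepKernel is a stand-in for the real kernel; names are hypothetical.
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

__global__ void stepKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder for the real computation
}

bool runTwiceAndCompare(const float *d_in, int n)
{
    float *d_outA = nullptr, *d_outB = nullptr;
    cudaMalloc(&d_outA, n * sizeof(float));
    cudaMalloc(&d_outB, n * sizeof(float));

    int block = 256, grid = (n + block - 1) / block;
    stepKernel<<<grid, block>>>(d_in, d_outA, n);   // first run
    stepKernel<<<grid, block>>>(d_in, d_outB, n);   // second run, same input
    cudaDeviceSynchronize();

    std::vector<float> a(n), b(n);
    cudaMemcpy(a.data(), d_outA, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), d_outB, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_outA);
    cudaFree(d_outB);

    // Any bitwise mismatch flags the step that went wrong on this run.
    return std::memcmp(a.data(), b.data(), n * sizeof(float)) == 0;
}
```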