P100 non-deterministic results with dynamic parallelism


I wrote some code that performs clustering in GPU memory using dynamic parallelism. A single device thread controls the procedure and launches many child kernels, using cudaDeviceSynchronize() for synchronization.
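For context, a minimal sketch of that pattern (names and the clustering step are illustrative, not the actual code; note that device-side cudaDeviceSynchronize() has since been deprecated in newer CUDA releases):

```cuda
// Hypothetical sketch: a parent kernel launched with <<<1,1>>> drives the
// procedure and launches child grids, waiting on each with a device-side
// cudaDeviceSynchronize().
// Build with relocatable device code: nvcc -arch=sm_60 -rdc=true app.cu -lcudadevrt

__global__ void assign_points(const float *points, int *labels, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... assign points[i] to its nearest cluster, write labels[i] ...
    }
}

__global__ void cluster_controller(const float *points, int *labels, int n)
{
    for (int iter = 0; iter < 10; ++iter) {
        // Parent thread launches a child grid.
        assign_points<<<(n + 255) / 256, 256>>>(points, labels, n);

        // Device-side sync: the parent waits for the child grid to finish
        // before it reads anything the child wrote.
        cudaDeviceSynchronize();

        // ... update cluster centers, check for convergence ...
    }
}
```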

The code works well on a V100 GPU and produces repeatable results: every time it is run on the same data, the same results are produced (as expected). However, when the same code is run on a P100 (on the same data), the results are not repeatable. It seems as though thread synchronization is not working as expected (or not as it does on the V100).

Is this a known issue?

What would be a good way to get to the root cause of such behavior?


It’s entirely possible to have different execution behavior (order) on different GPUs. If your code produces different results for different orderings of operations, and you find that objectionable, then you would need to remove the possibility of variance from your code, or use a different algorithm/implementation that is not sensitive to variation in processing order. This may have a significant negative performance impact on your code.

You should be able to use the profiler to confirm differences in execution order of kernels. If that is the case, you would need to study your algorithm for sensitivity to ordering (for example, floating-point addition is not associative, so a different reduction order can give different results).
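For example, assuming the nvprof-era toolchain appropriate to these GPUs (the binary name is illustrative), a timestamped per-kernel trace can be captured on each device and the kernel ordering compared:

```shell
# Print a timestamped trace of every kernel launch; run this on both the
# P100 and the V100 systems and diff the kernel ordering.
nvprof --print-gpu-trace ./my_clustering_app
```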

If you uncover no difference in execution behavior between two cases, then you should treat it as any other bug.

The CUDA programming model provides no guarantees of any sort of thread execution order or synchronization, other than those that you impose explicitly in your code.

Thanks for the reply.

The code should produce the same results regardless of different orderings of operations, aside from operations that are explicitly synchronized. This is confirmed by execution on the V100.

So one question is: is the P100 known to behave any differently from the V100, from the CUDA programmer’s perspective, with respect to synchronization under dynamic parallelism?

Synchronization associated with dynamic parallelism should not be any different between the P100 and the V100.

I think the other claims you are making are suspect, but there’s little point arguing it based on the information provided here.

I had some crazy nondeterministic behavior going on that didn’t occur on my laptop’s GPU but did happen on the server’s GPUs. Solution: now I call cudaDeviceSynchronize() after every call to a cuBLAS, cuSOLVER, etc. function, and the nondeterminism disappeared! :) It made me really crazy and angry, but apparently because those libraries use streams, you can end up reading the contents of a device pointer before the results have been completely written by those libraries’ functions.
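A minimal sketch of that fix, assuming a cuBLAS call on its current stream (all names here are hypothetical, not from the original code):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative pattern: synchronize after the library call before touching
// its output from the host or from work issued on another stream.
void gemv_then_read(cublasHandle_t handle, int m, int n,
                    const float *d_A, const float *d_x, float *d_y,
                    float *h_y)
{
    const float alpha = 1.0f, beta = 0.0f;

    // cuBLAS enqueues this asynchronously on its current stream; it may
    // still be running when the call returns to the host.
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m,
                d_x, 1, &beta, d_y, 1);

    // Block the host until all device work has completed. Without this,
    // work issued on a different stream could observe d_y before the
    // library has finished writing it.
    cudaDeviceSynchronize();

    cudaMemcpy(h_y, d_y, m * sizeof(float), cudaMemcpyDeviceToHost);
}
```

A lighter-weight alternative is cudaStreamSynchronize() on just the stream the library is using (or stream-ordering the dependent work onto that same stream), which avoids stalling the whole device.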