P100 non-deterministic results with dynamic parallelism

Hi,

I wrote some code that performs clustering in GPU memory using dynamic parallelism. A single device thread controls the procedure, launching child kernels with many threads and using cudaDeviceSynchronize() to synchronize with them.
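
In outline, the pattern is something like this (a simplified sketch with placeholder kernel names and bodies, not the actual code; dynamic parallelism requires building with nvcc -rdc=true and linking -lcudadevrt; note that device-side cudaDeviceSynchronize() was later deprecated and removed in CUDA 12, but was the standard mechanism at the time):

    #include <cuda_runtime.h>

    // Child kernel: one thread per data point (body is a placeholder).
    __global__ void assignPoints(const float *points, int *labels, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            labels[i] = 0;  // ... nearest-cluster assignment would go here ...
    }

    // Parent kernel: a single controlling thread launches child grids and
    // waits for each one with device-side cudaDeviceSynchronize().
    __global__ void controller(const float *points, int *labels, int n, int iters)
    {
        for (int it = 0; it < iters; ++it) {
            assignPoints<<<(n + 255) / 256, 256>>>(points, labels, n);
            cudaDeviceSynchronize();  // wait for the child grid to finish
        }
    }

    int main()
    {
        const int n = 1024;
        float *points; int *labels;
        cudaMalloc(&points, n * sizeof(float));
        cudaMalloc(&labels, n * sizeof(int));
        controller<<<1, 1>>>(points, labels, n, 10);  // one controlling thread
        cudaDeviceSynchronize();                      // host-side sync
        cudaFree(points);
        cudaFree(labels);
        return 0;
    }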

The code works well on a V100 GPU and produces repeatable results: every time it is run on the same data, the same results are produced (as expected). However, when the same code is run on a P100 on the same data, the results are not repeatable. It seems as though thread synchronization is not working as expected (or at least not as it does on the V100).

Is this a known issue?

What would be a good way to get to the root cause of such behavior?

Thanks.

It’s entirely possible to have different execution behavior (order) on different GPUs. If your code produces different results for different orderings of operations, and you find that objectionable, then you would need to remove the possibility of variance from your code, or use a different algorithm/implementation that is not sensitive to variation in processing order. This may have a significant negative performance impact on your code.

You should be able to use the profiler to confirm differences in kernel execution order between the two GPUs. If the order does differ, you would need to study your algorithm's numerical behavior.
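
For example, a GPU trace prints each kernel launch with its timestamp in chronological order, so traces from the two machines can be compared directly (./my_app is a placeholder for your executable):

    nvprof --print-gpu-trace ./my_app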

If you uncover no difference in execution behavior between two cases, then you should treat it as any other bug.

The CUDA programming model provides no guarantees of any sort of thread execution order or synchronization, other than those that you impose explicitly in your code.
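
As a hypothetical illustration of this kind of order sensitivity (not taken from your code): a floating-point atomicAdd reduction accumulates in whatever order the threads happen to run, and because floating-point addition is not associative, the rounded sum can differ from run to run and from GPU to GPU:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every thread adds its value into one accumulator. The order of the
    // atomic additions is unspecified, and floating-point addition is not
    // associative, so the rounded result can vary between runs and GPUs.
    __global__ void sumKernel(const float *vals, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, vals[i]);
    }

    int main()
    {
        const int n = 1 << 20;
        float *vals, *out;
        cudaMallocManaged(&vals, n * sizeof(float));
        cudaMallocManaged(&out, sizeof(float));
        for (int i = 0; i < n; ++i)
            vals[i] = 1.0f / (float)(i + 1);  // widely varying magnitudes
        *out = 0.0f;
        sumKernel<<<(n + 255) / 256, 256>>>(vals, out, n);
        cudaDeviceSynchronize();
        printf("sum = %.8f\n", *out);  // may differ slightly across runs
        cudaFree(vals);
        cudaFree(out);
        return 0;
    }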

Thanks for the reply.

The code should produce the same results regardless of the ordering of operations, aside from operations that are explicitly synchronized. This is confirmed by the runs on the V100.

So one question is: is the P100 known to behave any differently from the V100, from the CUDA programmer's perspective, with respect to synchronization when using dynamic parallelism?

Synchronization associated with dynamic parallelism should not be any different between the P100 and the V100.

I think the other claims you are making are suspect, but there’s little point arguing it based on the information provided here.

I had some crazy nondeterministic behavior going on that didn’t occur on my laptop’s GPU but did happen on the server’s GPUs. Solution: now I call cudaDeviceSynchronize() after every call to a cuBLAS, cuSOLVER, etc. function, and the nondeterministic issue disappeared! :) It drove me really crazy and angry, but apparently because those libraries launch their work on streams, you can end up reading the contents of a device pointer before the results have been completely written by those libraries’ functions.
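
For anyone hitting the same thing, here is a minimal sketch of that workaround (a hypothetical cuBLAS example, not my original code; link with -lcublas):

    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 256;
        float *A, *B, *C;
        cudaMallocManaged(&A, n * n * sizeof(float));
        cudaMallocManaged(&B, n * n * sizeof(float));
        cudaMallocManaged(&C, n * n * sizeof(float));
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        // cublasSgemm returns as soon as the work is *enqueued* on a
        // stream, not when it has finished executing.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);

        // Without this synchronization, reading C below would race with
        // the still-running GEMM kernel.
        cudaDeviceSynchronize();
        printf("C[0] = %f\n", C[0]);  // expect n * 1.0 = 256.0

        cublasDestroy(handle);
        cudaFree(A);
        cudaFree(B);
        cudaFree(C);
        return 0;
    }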