CUDA Visual Profiler

Hi everyone,

I’m using the CUDA Visual Profiler to analyze my CUDA program and I’m confused. I’m executing exactly the same code in different sessions, and the results differ. For instance, the number of instructions is different between executions. The same happens with other counters.

Is this normal, or am I doing something wrong? If this is normal, can anyone explain the reason?

Thanks everyone.

Theoretically, this shouldn’t happen…
But since it’s happening in your case, the only explanation I can think of is a race condition. You might be writing to a global memory location from all the threads and then reading it later for some condition checks.

Keep in mind that the CUDA profiler profiles only one SM (Streaming Multiprocessor), so you will get consistent results only if all SMs have the same amount of work to do and are all active (which happens only when enough thread blocks are launched).
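As a rough sketch of the point above, here is one way to ask the runtime how many SMs the card has and to pick a block count that is a whole multiple of that number. The helper name `blocksCoveringAllSMs` and its `blocksPerSM` parameter are made up for illustration. The hardware still decides which block runs on which SM, so this only makes an even load more likely; it does not guarantee it.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): returns a grid size that is a whole
// multiple of the number of SMs on `device`, or -1 if the device query fails.
int blocksCoveringAllSMs(int device, int blocksPerSM) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;  // no usable device, or invalid device index
    }
    // multiProcessorCount is the number of SMs reported by the runtime.
    return prop.multiProcessorCount * blocksPerSM;
}
```

You would then launch with something like `myKernel<<<blocksCoveringAllSMs(0, 4), 256>>>(...)`, where `myKernel` stands for your own kernel.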
Hope that helps

Exactly. That’s what I am doing. But I still don’t understand why it is happening. Could you explain it to me? I’m very interested.


Let us say that there’s a variable ‘a’ in global memory. Now, you want to write to this location from every thread.
For example, in thread ‘i’, suppose you want to do ‘a = a + i;’
Now, we all know that only one thread can write to a given memory location at any given time. Hence, if all N threads want to do this operation at the same time, then in the first cycle only one of the threads succeeds in writing to ‘a’, while the remaining N-1 threads stall! In the second cycle, some other thread succeeds while the other N-2 threads stall, and so on… until the Nth clock cycle, when the last thread succeeds!
So an instruction that ideally should have taken one clock cycle has now taken ‘N’ clocks!!! On top of that, ‘a = a + i’ is a read followed by a write, so a thread may add to a stale value of ‘a’ and overwrite another thread’s update.
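To make this concrete, here is a minimal sketch of the ‘a = a + i’ case, next to a version that uses `atomicAdd` so that no update is lost. The kernel names and the `runAdd` host helper are invented for this example, and `atomicAdd` on an `int` in global memory assumes a device of compute capability 1.1 or higher.

```cuda
#include <cuda_runtime.h>

// Racy version: every thread does an unsynchronized read-modify-write on *a,
// so updates from other threads can be lost and the result may vary per run.
__global__ void racyAdd(int *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    *a = *a + i;
}

// Atomic version: the hardware serializes the updates, so none is lost.
__global__ void atomicAddAll(int *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(a, i);
}

// Hypothetical host helper: runs one of the kernels on a zero-initialized
// 'a' and returns its final value, or -1 if a CUDA call fails.
int runAdd(bool useAtomic, int blocks, int threads) {
    int *d_a = nullptr;
    int h_a = 0;
    if (cudaMalloc(&d_a, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_a, 0, sizeof(int));
    if (useAtomic) {
        atomicAddAll<<<blocks, threads>>>(d_a);
    } else {
        racyAdd<<<blocks, threads>>>(d_a);
    }
    // cudaMemcpy waits for the kernel and also reports launch/execution errors.
    cudaError_t err = cudaMemcpy(&h_a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return (err == cudaSuccess) ? h_a : -1;
}
```

With 64 blocks of 256 threads, the atomic version should always give the sum 0 + 1 + … + 16383. The racy version will usually give something smaller, and not necessarily the same value from one run to the next.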

Let us make this condition a bit more complicated :)
Let us say, you want to write to ‘a’ only when ‘a’ is even… (just a cooked up example)
i.e., if (!(a % 2)) { a = a + i; }

Now, based on the ‘current’ value of ‘a’, the set of threads that contend for the write will change. Also, it now matters in which order the threads succeed in writing to memory!
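Here is a small sketch of why the order matters. The kernel `addIfEven` is the GPU version of the made-up condition above. The host function `applyInOrder` is only a sequential model that I am assuming for illustration: it applies the same update one "thread" at a time in a given order, which is not how the GPU really schedules threads, but it shows that two orders give two different answers.

```cuda
#include <cuda_runtime.h>

// GPU version of the example: thread i adds i to *a only if *a looks even
// at the moment this thread reads it. The read and the write are not atomic.
__global__ void addIfEven(int *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (!(*a % 2)) { *a = *a + i; }
}

// Hypothetical sequential model: applies the same update for the thread
// indices listed in `order`, one after another, starting from a = 0.
int applyInOrder(const int *order, int n) {
    int a = 0;
    for (int k = 0; k < n; ++k) {
        int i = order[k];
        if (!(a % 2)) { a = a + i; }
    }
    return a;
}
// Threads in order 0, 1, 2 give a == 1; in order 2, 1, 0 they give a == 3.
```

On the GPU you do not control this order, so both the final value of ‘a’ and the number of threads that actually take the branch can change between runs, and with them the counters the profiler reports.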

This is one of the reasons for the discrepancy you observed. Given the problems you can run into with such conflicting memory accesses, it is strongly recommended to avoid this style of programming.
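One common way to avoid the conflict, sketched here under my own assumptions rather than taken from your code, is to give each thread its own output slot and combine the values afterwards. The names `writeOwnSlot` and `sumWithoutConflicts` are made up, and the final sum is done on the host just to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes only to its own element, so no two threads ever
// touch the same address and the result does not depend on thread order.
__global__ void writeOwnSlot(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical host helper: returns 0 + 1 + ... + (n-1) computed from the
// per-thread slots, or -1 if a CUDA call fails.
long long sumWithoutConflicts(int n, int threads) {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    writeOwnSlot<<<blocks, threads>>>(d_out, n);
    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    long long sum = 0;
    for (int v : h_out) sum += v;  // combine on the host
    return sum;
}
```

Because every run does exactly the same work in every thread, the result is the same each time, and the profiler counters should be much more repeatable as well.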

Very clear example. Thank you very much!!!

This information will be very useful for the test I’m trying now. I’ll post another question referencing this post.

Once again, thanks.