clock() function on Titan X

In about 8 lines of code total, you could answer this question yourself with a simple test case. (hint: sizeof()).

CUDA generally tries to keep type sizes consistent with the host platform it is running on. On Windows, long int and long long int are different sizes (4 and 8 bytes respectively, with or without CUDA). On Linux (at least the 64-bit Linux variants that CUDA works with) they are the same size, 8 bytes.
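For reference, a minimal sketch of the kind of test case meant above. The kernel name print_device_sizes is just a placeholder I made up; the program prints the two sizes from host code and from device code so they can be compared on whatever platform you build on.

```cuda
#include <cstdio>

// Placeholder kernel: report the type sizes as seen by device code.
__global__ void print_device_sizes()
{
    printf("device: sizeof(long int) = %llu, sizeof(long long int) = %llu\n",
           (unsigned long long)sizeof(long int),
           (unsigned long long)sizeof(long long int));
}

int main()
{
    // Report the type sizes as seen by host code.
    printf("host:   sizeof(long int) = %llu, sizeof(long long int) = %llu\n",
           (unsigned long long)sizeof(long int),
           (unsigned long long)sizeof(long long int));

    print_device_sizes<<<1, 1>>>();
    cudaDeviceSynchronize();  // wait for the kernel so the device printf is flushed
    return 0;
}
```

Build it with nvcc on the platform in question and compare the host and device lines.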

Yes, I think it’s possible for the timings of two different threads, even though they are doing “the same thing”, to vary quite a bit. The GPU does not pick up a thread and bang on it until it retires. It is constantly choosing instructions from different threads (warps, threadblocks) to schedule, and there is no reason to assume any kind of lockstep order to it all. To get a more in-depth treatment of this, you might wish to review session 3 here.
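If you want to see the spread for yourself, something like the following rough sketch might do. The kernel name time_same_work, the dummy loop, and the launch configuration are all my own assumptions, not anything from the original question. Every thread times the same loop with clock64() and the host prints the smallest and largest elapsed counts. Keep in mind that the compiler is free to move the clock reads relative to the loop, so treat the numbers as indicative only.

```cuda
#include <cstdio>
#include <algorithm>

// Each thread runs the same dummy loop and records how many clocks it took.
__global__ void time_same_work(long long *elapsed, float *sink)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)tid;

    long long start = clock64();
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 0.5f;      // arbitrary arithmetic, identical per thread
    long long stop = clock64();

    sink[tid] = x;                   // store the result so the loop is not discarded
    elapsed[tid] = stop - start;
}

int main()
{
    const int blocks = 64, threads = 256, n = blocks * threads;
    long long *elapsed;
    float *sink;
    cudaMallocManaged(&elapsed, n * sizeof(long long));
    cudaMallocManaged(&sink, n * sizeof(float));

    time_same_work<<<blocks, threads>>>(elapsed, sink);
    cudaDeviceSynchronize();

    // Compare the fastest and slowest thread on the host.
    long long lo = *std::min_element(elapsed, elapsed + n);
    long long hi = *std::max_element(elapsed, elapsed + n);
    printf("min = %lld clocks, max = %lld clocks\n", lo, hi);

    cudaFree(elapsed);
    cudaFree(sink);
    return 0;
}
```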

Thank you so much for pointing this out, and especially for linking the cuda-training-series. I will watch the entire course and learn from it as soon as possible.

In the meantime…


I have one more question (hopefully the last) on this topic. In this post two approaches (listed below) are presented. Question: is one preferable to the other? I ask because using __syncthreads() seems far more convenient, yet most of the answers here focus on creating a dependency.

Once again, thank you so much.


Approach 1 - Create a dependency at both ends of the chain

Approach 2 - Use __syncthreads()

I don’t know of any obvious, critical reason to choose one over the other. __syncthreads() obviously has implications across the whole threadblock. That may or may not matter. Either approach is attempting to do battle with the compiler to prevent instruction reordering. But the compiler likes to do what it likes to do, and it is sometimes aggressive about it. YMMV. I think the designed way to attempt to avoid instruction reordering is to use inline PTX with an appropriate “clobber” or volatile decorator. But that strikes me as going to a lot of trouble for what basically amounts to throwaway instrumentation (I assume). I would probably pick whichever approach worked and seemed simplest or most to my liking, and use that.
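To make the options a bit more concrete, here is a rough sketch of the __syncthreads() variant next to the inline PTX variant. This is not the code from the linked post. The names read_clock64_ptx, work, timed_syncthreads and timed_ptx are placeholders I invented, and neither variant is guaranteed to stop the compiler from moving things around.

```cuda
#include <cstdio>

// Placeholder helper: read the %clock64 special register through inline PTX.
// "volatile" plus the "memory" clobber asks the compiler not to drop or move it.
__device__ __forceinline__ long long read_clock64_ptx()
{
    long long t;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(t) :: "memory");
    return t;
}

// Placeholder workload standing in for the code being timed.
__device__ float work(float x)
{
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// Variant using __syncthreads() around each clock read.
__global__ void timed_syncthreads(long long *elapsed, float *out, float seed)
{
    __syncthreads();
    long long start = clock64();
    __syncthreads();
    float r = work(seed + threadIdx.x);
    __syncthreads();
    long long stop = clock64();
    out[threadIdx.x] = r;
    elapsed[threadIdx.x] = stop - start;
}

// Variant using inline PTX with volatile and a memory clobber.
__global__ void timed_ptx(long long *elapsed, float *out, float seed)
{
    long long start = read_clock64_ptx();
    float r = work(seed + threadIdx.x);
    long long stop = read_clock64_ptx();
    out[threadIdx.x] = r;
    elapsed[threadIdx.x] = stop - start;
}

int main()
{
    const int threads = 64;
    long long *elapsed;
    float *out;
    cudaMallocManaged(&elapsed, threads * sizeof(long long));
    cudaMallocManaged(&out, threads * sizeof(float));

    timed_syncthreads<<<1, threads>>>(elapsed, out, 1.0f);
    cudaDeviceSynchronize();
    printf("__syncthreads(): thread 0 took %lld clocks\n", elapsed[0]);

    timed_ptx<<<1, threads>>>(elapsed, out, 1.0f);
    cudaDeviceSynchronize();
    printf("inline PTX:      thread 0 took %lld clocks\n", elapsed[0]);

    cudaFree(elapsed);
    cudaFree(out);
    return 0;
}
```

Whichever variant you try, it is probably worth checking the generated SASS (for example with cuobjdump -sass) to confirm the clock reads ended up where you expected.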