Hi,
I’ve been trying to time various CUDA operations on my GTX 770 using the device-side clock() function, but I’m having trouble making sense of the results. Here is what I’m doing:
int temp;
clock_t start, stop;
start = clock();
temp = 1;
stop = clock();
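For context, the snippet above lives inside a test kernel roughly like this (simplified; timing_kernel, d_time, and d_out are just placeholder names I’m using here, and the writes at the end are there so the compiler can’t optimize the timed code away):

    __global__ void timing_kernel(int *d_time, int *d_out)
    {
        int temp;
        clock_t start, stop;

        start = clock();
        temp = 1;        // the operation being timed
        stop = clock();

        // Write the results out so the timed code isn't dead-code eliminated
        d_out[threadIdx.x] = temp;
        if (threadIdx.x == 0)
            *d_time = (int)(stop - start);
    }

I launch it with a single warp (<<<1, 32>>>) and read the cycle count back from d_time.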
With this simple register assignment I get 32 clock cycles (1 cycle per thread across a full warp), which is exactly what one would expect. With shared and global memory, however, things stop making sense:
__shared__ int temp;
clock_t start, stop;
start = clock();
temp = 1;
stop = clock();
This returns 73 clocks, which is NOT a multiple of 32 and doesn’t seem possible. Looking at the SASS in the cubin file with nvdisasm, the compiler emits two instructions: load the immediate value 1 into a register, then store it to shared memory. That would mean the shared memory store alone takes 41 clock cycles! Do threads keep incrementing the clock counter while they are waiting on memory access?
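One thing I’m considering, to rule out overhead from the clock() calls themselves, is timing a long run of stores and dividing by the count, something like this (N is arbitrary, and volatile is there so the stores aren’t collapsed into one):

    const int N = 1024;
    volatile __shared__ int s;
    clock_t start = clock();
    for (int i = 0; i < N; ++i)
        s = i;        // N shared-memory stores
    clock_t stop = clock();
    // per-store cost ~ (stop - start) / N, amortizing the clock() overhead

I haven’t tried this yet, so I don’t know whether it changes the picture.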
I’ve done a series of tests like this with many types of operations, and the results don’t seem to make sense:
Register store/load: 32 cycles
Shared memory store: 73 cycles
Shared memory load: 64 cycles
Global memory store/load (involves 4 assembly operations): ~360 cycles
Dividing by the 32 threads in a warp, that would mean a global memory access takes only ~11 clock cycles (~360 / 32), and a shared memory load only 2 (64 / 32)! Is that accurate, or am I not understanding how clock() operates? If I’m not using it correctly, is there a better way to get the numbers I’m looking for?
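For what it’s worth, I do know about host-side timing with CUDA events, roughly like below (using the timing_kernel sketch from above), but that measures whole-kernel time including launch overhead rather than per-instruction latency, which is why I went with clock() in the first place:

    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);
    cudaEventRecord(evStart);
    timing_kernel<<<1, 32>>>(d_time, d_out);
    cudaEventRecord(evStop);
    cudaEventSynchronize(evStop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, evStart, evStop);  // elapsed time in milliseconds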
Any insight would be greatly appreciated! Thanks!