How to accurately time individual memory operations

I’ve been trying to time various CUDA operations on my GTX 770 and am having difficulty understanding the results I’m getting. I’m using the clock() device function, but the numbers it returns are pretty confusing. Here is what I am doing:

int temp;
start = clock();
temp = 1;
stop = clock();

With this simple register assignment, I’m getting 32 clock cycles (1 cycle per thread on a full warp). This makes sense and is what one would expect. However, with shared and global memory things stop making sense:

__shared__ int temp;
start = clock();
temp = 1;
stop = clock();

This returns 73 clocks. This is NOT a multiple of 32, which doesn’t seem possible. Looking at the contents of the cubin file with “nvdisasm”, it’s doing 2 operations: a move of the constant 1 into a register and a store to shared memory. This means the shared memory store takes 41 clock cycles! Do threads continue to increment the clock counter while they are waiting for memory access?

I’ve done a series of tests like this with many types of operations, and the results don’t seem to make sense:

Register store/load: 32 cycles
Shared memory store: 73 cycles
Shared memory load: 64 cycles
Global memory store/load (involves 4 assembly operations): ~360 cycles

By this logic, a global memory access takes only ~11 clock cycles per thread, and a shared memory access only 2! Is this accurate, or am I misunderstanding how clock() operates? If I am not using it correctly, is there a better way to get the results I am looking for?

Any insight would be greatly appreciated! Thanks!
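For completeness, the full measurement kernel I’m using looks roughly like this (simplified; the output array names are mine, and I check the SASS to confirm the timed operation actually sits between the two clock() reads):

```cuda
// Sketch of the measurement kernel: launched as one block of 32 threads.
// 'elapsed' and 'sink' are output arrays added so the compiler cannot
// optimize the timed operation away.
__global__ void timeRegisterStore(unsigned int *elapsed, int *sink)
{
    unsigned int start, stop;
    int temp;

    start = clock();
    temp = 1;                      // the operation under test
    stop = clock();

    elapsed[threadIdx.x] = stop - start;
    sink[threadIdx.x] = temp;      // keep 'temp' live
}
```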

The measurement might be correct, but I’m not sure that your reasoning is sound. Threads in a warp execute in parallel physically. There are no thread dependencies in the assignment of the value ‘1’ to a thread-local register, so I don’t see how warp size would dictate the cycles you measure. Also, the code you present is very simple, so the compiler may store the value ‘1’ in the register before “start = clock()” is executed. What do you see in the cubin? Do the compiler optimization flags make harmful alterations?

I would generally trust the measurements you get by reading clock(); it boils down to a single register-read operation. You may want to bookend the clock() calls with memory barriers to ensure nothing funny is going on with respect to instruction re-ordering. I think __threadfence_block() would do the trick (but block size should == warp size). Depending upon what you’re trying to measure, you may want to use __threadfence() when you measure global loads/stores.

Thanks for the reply!

I’ve tried adding __threadfence_block() and it hasn’t changed the results. Thanks for the tip, though; I’ll be sure to use it in future tests.

I’ve turned off compiler optimization, and when I look at the cubin, I see that the correct instructions are between the clock measurements. For example, for the first test (register assignment), the cubin looks like:

MOV32I R8, 0x1;

And the resulting 1-clock per thread (32 clocks) makes sense. My question was more about the shared-memory and global-memory results. For the shared-memory store test, the cubin was:

MOV32I R2, 0x1;
STS [RZ], R2;

This results in 73 clock cycles. If the register assignment takes 32 (from the last test), how is it possible that the shared-memory store takes 41 cycles? This is not a multiple of 32…

Similarly, for the global memory store test, the cubin is:

MOV R5, c[0x0][0x14c];
SHL.W R3, R2, 0x2;
MOV32I R6, 0x1;
IADD R4.CC, R3, c[0x0][0x148];
IMAD.U32.U32.HI.X R5, R2, 0x4, R5;
ST.E [R4], R6;

Clearly global memory is a much more complex operation, and it takes between 370 and 390 clock cycles. Once again, this is not a multiple of 32. Also, I’ve read that global memory operations are very slow, but this appears to be only ~12 cycles per thread.

Thanks again for the help.

There appear to be a number of misconceptions in your post.

  1. If you measure the sequence
MOV32I R2, 0x1;
... //until R2 write dependency is complete
STS [RZ], R2;

you are measuring the issue latency of the STS. This time may include the following additional cycles:

  • the warp is eligible but the warp scheduler does not pick the warp to issue
  • the warp is not eligible because the LSU unit is in use

It’s not uncommon for the warp scheduler to not pick an eligible warp for 100s-1000s of cycles.

You can avoid this by launching a single warp per warp scheduler, but this will impact your achieved shared memory latency.

  2. If you want to measure the dependent latency, you will have to add a register dependency before the second S2R (the clock read). It’s fairly difficult to add a dependency on a store. You can add a threadfence, but I think you will find this greatly increases your latency. Try measuring this on a shared LD instead. The sequence would be
... //until R2 write dependency is complete
LDS R2, [RZ];
ADD R3, R2, 1;

This will tell you an upper bound for the execution latency.

  3. In your post you state “Once again, this is not a multiple of 32.” What is your assumption? CC 2.0 and above have no such limit.

If you want to get best case timing then you should limit yourself to 1 warp per SM or 1 warp per warp scheduler. If you want average timing then you should collect the information for all warps over a period of runs.
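A CUDA C sketch that should provoke the dependent shared LD sequence above (you still need to verify in the SASS that the compiler kept the load and the dependent add between the two clock reads):

```cuda
// Sketch: measuring dependent shared-memory load latency.
// The add after the load creates a register dependency, so the second
// clock() read cannot complete until the loaded value is available.
__global__ void timeSharedLoad(unsigned int *elapsed, int *sink)
{
    __shared__ int sdata[32];
    sdata[threadIdx.x] = threadIdx.x;
    __syncthreads();

    unsigned int start = clock();
    int v = sdata[threadIdx.x];   // LDS
    int w = v + 1;                // dependent IADD forces the wait
    unsigned int stop = clock();

    elapsed[threadIdx.x] = stop - start;
    sink[threadIdx.x] = w;        // keep 'w' live
}
```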

Thanks for the reply. I’ll try measuring a shared LD and see what results I get.

Sorry for not explaining my experiments earlier. For all of the above tests, I am only running 1 block of 32 threads, so there should be only 1 warp running on the entire device. This should eliminate any timing issues caused by the warp scheduler.

I am just trying to determine the number of cycles required for different operations. For example, I have read that shared-memory operations take anywhere from 8 to 40 cycles, but I want to get a more accurate reading from my card itself.

I am not sure what you mean by “CC2.0 and above have no such limit”. If a clock timing returns a value that is not a multiple of 32, what does that mean? Are some of the threads not running for one of the cycles? I was under the impression that every cycle, all 32 threads in a warp increment the clock.


The clock() function reads a 32-bit unsigned SM cycle counter that increments at the GPU core clock frequency as long as the GPU/SM is not clock gated. This counter is not at a fixed frequency unless you are running the CUDA profiling tools. CC 3.0 and above devices also have a globaltimer that can be accessed through inline PTX. See the documentation for more information.
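For example, the globaltimer can be read through inline PTX along these lines (a sketch; the PTX special register is %globaltimer and counts nanoseconds):

```cuda
// Sketch: reading the 64-bit %globaltimer register (CC 3.0+) via inline PTX.
__device__ unsigned long long globaltimer()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}
```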

I see. I didn’t realize clock gating was done on an individual thread (or core) level. So I suppose that during the shared memory access, some of the threads are gated while they wait for the memory access, giving me a clock value that is not a multiple of 32?

Thanks for the tip! I’ll try out the globaltimer register and work with inline PTX instructions to try to isolate and time each instruction.

If a warp is allocated to the SM, then the clock will be incremented every cycle, independent of the scheduling, dispatching, and execution of instructions.

The continued reference to the warp size and instruction execution leads me to believe you have misconceptions on how a pipelined CPU or GPU executes instructions.

The execution latency of a shared memory instruction depends upon (a) the depth of the shared memory pipeline, and (b) the instruction replays required to satisfy the memory request, given bank conflicts and/or a data size exceeding the width that can be requested in one issue of the instruction.

I think my misconception regarding the clock itself is what is causing this confusion. I was under the impression that each cycle, the clock was incremented by each thread of the warp. This is why I was thinking that clock values must be a multiple of 32. Is this correct, or does the clock increment only once per cycle?

Sorry again for the confusion, I was assuming something that I should not have been.

The clock is not incremented by each thread of the warp.

Perhaps you should read the documentation, which states:

"when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. "

Hi,

I tried measuring the latency of a load instruction as illustrated in this post. I launched 1 thread block with 32 threads (1 warp). I get a latency value between 400 and 700 cycles for different stride values.
Here is part of the assembly code, followed by my kernel code.

/*0030*/  S2R R0, SR_CLOCKLO;       /* 0x2c00000140001c04 */
/*0038*/  SHL.W R4, R0, 0x1;        /* 0x6000c00004011e03 */
/*0040*/  LD.E R0, [R6];            /* 0x8400000000601c85 */
/*0048*/  FADD R0, R0, 1;           /* 0x5000cfe000001c00 */
/*0050*/  S2R R5, SR_CLOCKLO;       /* 0x2c00000140015c04 */
/*0058*/  SHL.W R5, R5, 0x1;        /* 0x6000c00004515e03 */
unsigned int clock_start, elapsed_time;
unsigned int index = (unsigned int)threadIdx.x;
float a;
float *x_add;
x_add = data_array + index;
clock_start = clock();
a = *x_add;
elapsed_time = clock() - clock_start;

Here are my questions:

  1. Is the cycles value in terms of GPU core cycles at the frequency of 0.824 GHz? I am using a K80 GPU (Kepler architecture). A 600-cycle latency would then correspond to 728 ns. Is this expected?
  2. I tried measuring the time for the entire kernel using nvprof. I get 6.2 us. However, when I try to look at the metric elapsed_cycles_sm, it gives 94000 cycles (nvprof -e elapsed_cycles_sm ./myprog).
    I am having trouble understanding the relationship between the cycles value being reported and the time in seconds. Could you please help? Thanks in advance!

Can someone please help me with this? I’ve been stuck with this for quite some time now! Thanks!

your SASS shows that the measurement is correct. you may want to ensure that no JIT compilation is used, by compiling without PTX.

in the post and above you can find a technique that ensures correct measurement without the need to analyze the SASS

about your results - i’ve heard that when there are a lot of memory reads, the delay may be up to 2000 cycles, so your results don’t look absolutely impossible. you can try to lower the amount of memory reads by making large thread blocks (using 32K shared mem per block) and reading data in only a single thread per block. of course, i expect that you perform a lot of chained reads in each thread block in order to gather reliable statistics
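a rough sketch of chained (pointer-chasing) reads - the array must be pre-filled on the host so that each element contains the index of the next one to visit:

```cuda
// Sketch: pointer-chasing loop. Every load depends on the result of the
// previous one, so n iterations measure n serialized memory latencies.
__global__ void chaseChain(const int *chain, int n,
                           unsigned int *elapsed, int *sink)
{
    int idx = 0;
    unsigned int start = clock();
    for (int i = 0; i < n; ++i)
        idx = chain[idx];          // each load waits for the previous result
    unsigned int stop = clock();

    elapsed[0] = stop - start;     // divide by n for per-load latency
    sink[0] = idx;                 // keep 'idx' live
}
```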

Hi BulatZiganshin,

Thank you for your response. What about the discrepancy between elapsed cycles and the time? I tried timing a few other kernels as well (not the ones I used for latency measurements of memory operations). The nvprof-reported time in seconds doesn’t seem to equal elapsed_cycles_sm / 824500000. For example, I get differing values for cycles computed from the summary time and from elapsed_cycles_sm.

==62290== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 65.05%  188.27ms         1  188.27ms  188.27ms  188.27ms  d_mult(int, int, int, float*, float*, float*, int)
 23.30%  67.427ms         2  33.714ms  33.711ms  33.716ms  [CUDA memcpy HtoD]
 11.65%  33.712ms         1  33.712ms  33.712ms  33.712ms  [CUDA memcpy DtoH]
==62301== Profiling application: ./mt 8192 8192 8 PINNED_COPY
==62301== Profiling result:
==62301== Event result:
Invocations                                Event Name         Min         Max         Avg
Device "Tesla K80 (0)"
        Kernel: d_mult(int, int, int, float*, float*, float*, int)
          1                         elapsed_cycles_sm  1341512569  1341512569  1341512569

1341512569 cycles / 824500 cycles/ms = 1627 ms, whereas the time reported is 188.27 ms. I checked the time using cudaEventElapsedTime as well; it is the same 188.27 ms. So why is elapsed_cycles_sm so high?
Thanks !


Can someone please help me with this?

The kernel (pasted in my first post) I used for latency measurement of a load instruction returns 600 cycles on average. I checked the CUDA documentation for instruction latency, and it seems that 300-600 cycles is expected. I looked at the post suggested by BulatZiganshin as well. However, I am not sure if this number in cycles corresponds to 728 ns (using a core clock frequency of 0.824 GHz). To check that, I measured the time in seconds and the time in elapsed_cycles_sm for different kernels. The relationship doesn’t seem to be time = elapsed_cycles_sm / core clock frequency. That is what I am confused about.

You probably want to repeat the experiment for a few different runtimes, and look for a linear relationship between runtime in ns and cycles elapsed within the kernel.

That would allow for a constant launch overhead on top of the time spent within the kernel, where only the latter is proportional to the number of cycles measured…

Thank you Tera. Here is a table of a few kernel measurements with nvprof reported time and nvprof reported elapsed_cycles_sm.

time(ms)        elapsed_cycles_sm
47.077          335589212
188.33          1341491893
95.95           667766585
5.868           42182903
2.8697          20722125

Each time, elapsed_cycles_sm is much higher than time * gpu_core_frequency, the core frequency being 0.824 GHz. So I am not sure about the overhead part being included. However, if I try to compute the frequency from the given times and elapsed cycles, the result is consistently ~7 GHz. Why is this so? Kepler GPUs operate on a single clock domain (there is no separate shader clock).

I am also pasting the measurement of a Rodinia benchmark, particlefilter.

==70419== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 36.41%  3.8404ms         9  426.71us  426.49us  426.94us  normalize_weights_kernel(double*, int, double*, double*, double*, int*)
 31.62%  3.3353ms         9  370.59us  361.95us  376.92us  likelihood_kernel(double*, double*, double*, double*, double*, int*, int*, double*, unsigned char*, double*, double*, int, int, int, int, int, int, int*, double*)
 31.11%  3.2809ms         9  364.54us  146.43us  477.75us  find_index_kernel(double*, double*, double*, double*, double*, double*, double*, int)
  0.36%  38.431us         6  6.4050us  1.5040us  25.215us  [CUDA memcpy HtoD]
  0.29%  31.040us         9  3.4480us  3.3920us  3.7760us  sum_kernel(double*, int)
  0.13%  13.984us         3  4.6610us  4.1280us  5.2160us  [CUDA memcpy DtoH]
  0.06%  6.5920us         1  6.5920us  6.5920us  6.5920us  [CUDA memset]
==70428== Profiling application: ./particlefilter -x 128 -y 128 -z 10 -np 1000
==70428== Profiling result:
==70428== Event result:
Invocations                                Event Name         Min         Max         Avg
Device "Tesla K80 (0)"
        Kernel: sum_kernel(double*, int)
          9                         elapsed_cycles_sm       62262       65937       62883
        Kernel: likelihood_kernel(double*, double*, double*, double*, double*, int*, int*, double*, unsigned char*, double*, double*, int, int, int, int, int, int, int*, double*)
          9                         elapsed_cycles_sm     2705816     2767134     2733817
        Kernel: normalize_weights_kernel(double*, int, double*, double*, double*, int*)
          9                         elapsed_cycles_sm     3129284     3132574     3130264
        Kernel: find_index_kernel(double*, double*, double*, double*, double*, double*, double*, int)
          9                         elapsed_cycles_sm     1687176     3379377     2661844

I am not sure how to interpret it.