Measuring global memory access speed

Hi, I am trying to measure the speed of global memory access by profiling the execution of the following code sample:

__global__ void test_kernel(volatile int* input_value)
{
    int temp;
    for (int i = 0; i < 1000000; i++)
    {
        temp = input_value[0]; 
    }
}

This is being run by a single thread in a single block.

However, this sample runs in approximately 10 ms, which gives an average time per global memory access of only 10 nanoseconds. That seems far too fast for global memory access; I would expect the actual value to be around 500 nanoseconds. I therefore suspect that the contents of global memory are being cached in either L1 or L2. However, my current understanding is that the volatile keyword should prevent the compiler from doing this. What am I doing wrong here?

If you don’t store temp anywhere, the compiler sees that the result is never used and will eliminate the entire for loop.

I simplified the code sample too far there. I have tested it with a copy of temp into global memory, followed by a copy of that back to the host. I know that the loop is being run, as the time does scale with the number of iterations. It is just a much smaller value than I would expect.
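For reference, a minimal sketch of that variant (the extra output_value parameter is my naming, not from the original post): writing temp back to global memory gives the loop an observable side effect, so the compiler cannot remove it.

```cuda
__global__ void test_kernel(volatile int* input_value, int* output_value)
{
    int temp = 0;
    for (int i = 0; i < 1000000; i++)
    {
        temp = input_value[0];
    }
    // Sink: keeps the loads from being optimized away entirely.
    output_value[0] = temp;
}
```

Note that this only prevents dead-code elimination; it says nothing about whether the loads are served from a cache.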

Your GPU has L1 and L2 caches. If you always read input_value[0], I believe all your requests will be served by the L1 cache. For sure, the volatile keyword would not prevent the GPU from using the L1 cache (as there is really no way to switch that off).

Whether or not the volatile qualifier has any effect on the code would have to be evaluated by looking at the PTX and the SASS code.

For a particular reason I want to be able to poll GPU memory directly during the execution of the kernel. Is it possible to achieve this with the L1 cache in the way?

If you want to measure global latency, your kernel, which should be started with one thread only, should look something like this:

__global__ void test_kernel(int* input_value /* initialized with 0 */,
                            int stride /* at least one cache line, i.e. 32 ints = 128 bytes */,
                            int size)
{
    int index = 0;
    int temp = 0;
    int iter = 0;
    while (temp != 1 && iter < 10000)
    {
        temp += input_value[index];
        index = (index + stride) % size;
        iter++;
    }
    input_value[0] = temp;  // sink: prevents the loop from being optimized away
}

Depending on the variable “size”, you measure the latency of the L1 cache, the L2 cache, or DRAM. Note that if a large value is chosen for “size”, you will also encounter some effects of DRAM being bad at random access as you increase the variable “stride”.
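A sketch of how that kernel might be launched and timed from the host using CUDA events (error checking omitted for brevity; the buffer size and stride values are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int size   = 1000000;  // number of ints traversed; vary to hit L1/L2/DRAM
    const int stride = 32;       // 32 ints = 128 bytes, one cache line

    int* d_buf;
    cudaMalloc(&d_buf, size * sizeof(int));
    cudaMemset(d_buf, 0, size * sizeof(int));  // kernel expects zero-initialized data

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    test_kernel<<<1, 1>>>(d_buf, stride, size);  // single thread, single block
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // The kernel above performs at most 10000 dependent iterations.
    printf("avg latency: %f ns per access\n", ms * 1e6 / 10000);

    cudaFree(d_buf);
    return 0;
}
```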

Thanks, that is helpful. However, my end goal here is to be able to poll an area of global memory directly for a change to a value during the kernel runtime. I am using a persistent thread model in order to avoid the overheads involved in repeated kernel launches. How would I poll a small region of global memory and ensure that I am not just polling a cache?

The GPU automatically provides cache coherency for you (as long as you do not use a read-only cache, e.g. L1/Tex on Kepler), so you do not need to worry about it.

However, you should worry about memory ordering, so do not forget to use the appropriate memory fences.
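A sketch of what such a polling loop might look like in a persistent-thread kernel (the flag/result names are illustrative; it assumes the host, or another kernel, writes a nonzero value to *flag, e.g. via zero-copy memory):

```cuda
__global__ void persistent_kernel(volatile int* flag, int* result)
{
    // Spin until the flag changes. The volatile qualifier forces a fresh
    // load on every iteration instead of reusing a value kept in a register.
    while (*flag == 0)
    {
        /* busy-wait */
    }

    // Order the flag read before any subsequent global memory accesses,
    // so data written before the flag was set is observed correctly.
    __threadfence();

    result[0] = 1;  // ... do the signalled work here ...
}
```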

So I have run some tests with the code sample that you provided above and with both a size and number of iterations of 1000000, I got a time of about 250ms which averages to 0.25us per memory access. This sounds more reasonable to me.

I also found that there was a significant jump in time taken between array sizes of 1 and 10, followed by another jump between 100,000 and 1,000,000. I assume this corresponds to different levels of caching. Can I therefore assume that past a size of 1,000,000, I am measuring the time to global memory directly?

Finally, I tried removing the temp != 1 condition from the while loop and found that it dramatically sped things up (~250 ms -> ~50 ms). I assume this clause prevents the compiler from performing some optimisation. Is this correct, and if so, what is the optimisation that the compiler otherwise performs?

You may look up the cache sizes of your GPU and try to sample around those sizes to verify this benchmark. 4 MB (a size of 1 M values) may be L2 depending on your GPU, while 400 KB (100,000 values) seems too large for L1. However, I do not know how densely you have sampled.

Also, as a side note: global memory is only a memory space; the data typically resides in the DRAM of the GPU, may be cached on the GPU, may be paged out to CPU DRAM, or may reside in CPU DRAM. Thus the statement “time to global memory directly” is, imho, kind of misleading: is an indirect global memory access a cache hit? Sounds strange to me. The best way to put it is probably to say that you are measuring the latency of a DRAM access.

The compiler may perform loop unrolling:

var_reg_1 = input_value[(index + 0*stride) % size];
var_reg_2 = input_value[(index + 1*stride) % size];
var_reg_3 = input_value[(index + 2*stride) % size];
....
temp += var_reg_1;
temp += var_reg_2;
temp += var_reg_3;
....

As a consequence, the instruction-level parallelism of the load instructions allows the in-order pipeline of the GPU to execute those loads concurrently; the pipeline only stalls at the first addition instruction while all the loads are in flight. The measured latency is therefore reduced by the unroll factor of the loop (in your case, since the duration dropped by a factor of 5, the compiler probably unrolled the loop 5 times). However, by using the result of the loop body in the loop condition (the temp != 1 check), you create a dependence that prevents this unrolling. This is pretty much one of the basics of writing micro-benchmarks.
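A classic alternative way to defeat this overlap is pointer chasing: each load address depends on the result of the previous load, so no amount of unrolling can issue the loads concurrently. A sketch, assuming the host pre-fills input_value[] so that input_value[i] holds the index of the next element to visit:

```cuda
__global__ void chase_kernel(int* input_value, int iterations)
{
    int index = 0;
    for (int i = 0; i < iterations; i++)
    {
        // True data dependence: the next address is the previous load's value,
        // so each load must wait for the one before it to complete.
        index = input_value[index];
    }
    input_value[0] = index;  // sink so the loop is not eliminated
}
```

With this pattern, the total runtime divided by the iteration count gives the latency of one dependent access at whatever level of the memory hierarchy the traversal footprint fits into.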