Measuring global memory access speed

Hi, I am trying to measure the speed of global memory access by profiling the execution of the following code sample:

__global__ void test_kernel(volatile int* input_value)
{
    int temp;
    for (int i = 0; i < 1000000; i++)
    {
        temp = input_value[0]; 
    }
}

This is being run by a single thread in a single block.

However, this sample runs in approximately 10 ms, which gives an average time per global memory access of only 10 nanoseconds. That seems far too fast for global memory access; I would expect the actual value to be around 500 nanoseconds. I therefore suspect that the contents of global memory are being cached in either L1 or L2. However, my current understanding is that the volatile keyword should prevent the compiler from doing this. What am I doing wrong here?

If you don’t store temp anywhere, the compiler sees that the result is never used and will eliminate the entire for loop.

I simplified the code sample too far there. I have tested it with a copy of temp into global memory, followed by a copy of that back to the host. I know that the loop is being run, as the time does scale with the number of iterations. It is just a much smaller value than I would expect.
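For reference, a minimal sketch of that variant (the extra output_value parameter is my naming, not from the original post): writing temp back to global memory gives the loop an observable side effect, so the compiler cannot remove it.

```cuda
__global__ void test_kernel(volatile int* input_value, int* output_value)
{
    int temp = 0;
    for (int i = 0; i < 1000000; i++)
    {
        temp = input_value[0];
    }
    // Sink: keeps the loads from being optimized away entirely.
    output_value[0] = temp;
}
```

Note that this only prevents dead-code elimination; it says nothing about whether the loads are served from a cache.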

Your GPU has L1 and L2 caches. If you always read input_value[0], I believe all your requests will be served by the L1 cache. For sure, the volatile keyword would not prevent the GPU from using the L1 cache (as there is really no way to switch that off).

Whether or not the volatile qualifier has any effect on the code would have to be evaluated by looking at the PTX and the SASS code.

For a particular reason I want to be able to poll GPU memory directly during the execution of the kernel. Is it possible to achieve this with the L1 cache in the way?

If you want to measure global latency, your kernel, which should be started with one thread only, should look something like this:

__global__ void test_kernel(int* input_value /* initialized with 0 */,
                            int stride /* at least one cache line, i.e. 32 ints = 128 bytes */,
                            int size)
{
    int index = 0;
    int temp = 0;
    int iter = 0;
    while (temp != 1 && iter < 10000)
    {
        temp += input_value[index];
        index = (index + stride) % size;
        iter++;
    }
    input_value[0] = temp;  // sink: prevents the loop from being optimized away
}

Depending on the variable “size”, you measure the latency of the L1 cache, the L2 cache, or DRAM. Note that if a large value is chosen for “size”, you will also encounter some effects of DRAM being bad at random access as you increase the variable “stride”.
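A sketch of how that kernel might be launched and timed from the host using CUDA events (error checking omitted for brevity; the buffer size and stride values are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int size   = 1000000;  // number of ints traversed; vary to hit L1/L2/DRAM
    const int stride = 32;       // 32 ints = 128 bytes, one cache line

    int* d_buf;
    cudaMalloc(&d_buf, size * sizeof(int));
    cudaMemset(d_buf, 0, size * sizeof(int));  // kernel expects zero-initialized data

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    test_kernel<<<1, 1>>>(d_buf, stride, size);  // single thread, single block
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // The kernel above performs at most 10000 dependent iterations.
    printf("avg latency: %f ns per access\n", ms * 1e6 / 10000);

    cudaFree(d_buf);
    return 0;
}
```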

Thanks, that is helpful. However, my end goal here is to be able to poll an area of global memory directly for a change to a value during the kernel runtime. I am using a persistent thread model in order to avoid the overheads involved in repeated kernel launches. How would I poll a small region of global memory and ensure that I am not just polling a cache?

The GPU automatically provides cache coherency for you (as long as you do not use a read-only cache, e.g. L1/Tex on Kepler), so you do not need to worry about it.

However, you should worry about memory ordering, so do not forget to use the appropriate memory fences.
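A sketch of what such a polling loop might look like in a persistent-thread kernel (the flag/result names are illustrative; it assumes the host, or another kernel, writes a nonzero value to *flag, e.g. via zero-copy memory):

```cuda
__global__ void persistent_kernel(volatile int* flag, int* result)
{
    // Spin until the flag changes. The volatile qualifier forces a fresh
    // load on every iteration instead of reusing a value kept in a register.
    while (*flag == 0)
    {
        /* busy-wait */
    }

    // Order the flag read before any subsequent global memory accesses,
    // so data written before the flag was set is observed correctly.
    __threadfence();

    result[0] = 1;  // ... do the signalled work here ...
}
```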

So I have run some tests with the code sample that you provided above and with both a size and number of iterations of 1000000, I got a time of about 250ms which averages to 0.25us per memory access. This sounds more reasonable to me.

I also found that there was a significant jump in time taken between array sizes of 1 and 10, followed by another jump between 100,000 and 1,000,000. I assume this corresponds to different levels of caching. Can I therefore assume that past a size of 1,000,000, I am measuring the time to global memory directly?

Finally, I tried removing the temp != 1 condition from the while loop and found that it dramatically sped things up (~250 ms -> ~50 ms). I assume this clause prevents the compiler from performing some optimisation. Is this correct, and if so, what is the optimisation that the compiler otherwise performs?

You may look up the cache sizes of your GPU and try to sample around those sizes to verify this benchmark. 4 MB (a size of 1 M values) may be L2 depending on your GPU, while 400 KB (100,000 values) seems too large for L1. However, I do not know how densely you have sampled.

Also, as a side note: global memory is only a memory space; the data typically resides in the DRAM of the GPU, may be cached on the GPU, may be paged out to CPU DRAM, or may reside in CPU DRAM. Thus the statement “time to global memory directly” is, imho, kind of misleading: is an indirect global memory access a cache hit? Sounds strange to me. The best way to put it is probably to say that you are measuring the latency of a DRAM access.

The compiler may perform loop unrolling:

var_reg_1 = input_value[(index + 0*stride) % size];
var_reg_2 = input_value[(index + 1*stride) % size];
var_reg_3 = input_value[(index + 2*stride) % size];
....
temp += var_reg_1;
temp += var_reg_2;
temp += var_reg_3;
....

As a consequence, the instruction-level parallelism of the load instructions allows the in-order pipeline of the GPU to execute those loads concurrently; the pipeline only stalls at the first addition instruction while all the loads are in flight. The measured latency is therefore reduced by the unroll factor of the loop (in your case, since the duration dropped by a factor of 5, the compiler probably unrolled the loop 5 times). However, by using the result of the loop body in the loop condition (the temp != 1 check), you create a dependence that prevents this unrolling. This is pretty much one of the basics of writing micro-benchmarks.
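A classic alternative way to defeat this overlap is pointer chasing: each load address depends on the result of the previous load, so no amount of unrolling can issue the loads concurrently. A sketch, assuming the host pre-fills input_value[] so that input_value[i] holds the index of the next element to visit:

```cuda
__global__ void chase_kernel(int* input_value, int iterations)
{
    int index = 0;
    for (int i = 0; i < iterations; i++)
    {
        // True data dependence: the next address is the previous load's value,
        // so each load must wait for the one before it to complete.
        index = input_value[index];
    }
    input_value[0] = index;  // sink so the loop is not eliminated
}
```

With this pattern, the total runtime divided by the iteration count gives the latency of one dependent access at whatever level of the memory hierarchy the traversal footprint fits into.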