Performance of loading overlapping values of a global array within a warp

Hey all,

I have a question regarding performance when the threads in a warp (or block) load overlapping values of the same array, much like this:

// float* __restrict__ dev_features is the address of the features array

const int x = blockIdx.x*blockDim.x + threadIdx.x;
const int y = blockIdx.y*blockDim.y + threadIdx.y;
const int z = blockIdx.z*blockDim.z + threadIdx.z;

// linear index of the first feature this thread needs
const int linear_featurenumber = (z*height_cells_features + y)*width_cells_features + x;
float features[10];

#pragma unroll
for (int n = 0; n < 10; ++n)
{
  features[n] = dev_features[linear_featurenumber+n];
}

Thread 0 will load values 0-9 from dev_features
Thread 1 will load values 1-10 from dev_features
Thread 2 will load values 2-11 from dev_features, etc.

As you can see, there is a lot of overlap between the values required by the threads within a warp or block. The assembly code shows a total of 10 global load instructions performed by each thread. If all of these instructions were actually performed, this would be very inefficient, since for every 32 threads there would be 32*10 = 320 load requests while only 32+9 = 41 distinct values are actually needed.

I know that global loads are served in 32-byte transactions, or 8 floats in my case, and the memory accesses are perfectly coalesced. So the first load instruction has 100% load efficiency, as every thread gets the first value it asks for (thread 0 gets dev_features[0], thread 1 gets dev_features[1], etc.). The second instruction, however, asks for values that are for the most part “already on the way”. To the best of my knowledge these instructions would be issued long before the first global values have arrived, so there would be no L1 (or even L2?) cache hit. Does the GPU recognize and optimize this in some way (and how would that work)?

When I use shared memory to optimize the global loading and stage all the required values in shared memory beforehand, I actually get worse performance than with the plain global loads described above (roughly 3.9 ms vs 3 ms). That would indicate that some kind of optimization is already going on, but I have no idea how or what it would be.
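For concreteness, the shared-memory variant I tried looks roughly like this (a simplified 1D sketch rather than my exact kernel; TILE_X, dev_out and num_features are illustrative names):

#define TILE_X 128                     // assumed blockDim.x, illustrative
#define WINDOW 10                      // values needed per thread

__global__ void load_via_smem(const float* __restrict__ dev_features,
                              float* __restrict__ dev_out,
                              int num_features)
{
    __shared__ float tile[TILE_X + WINDOW - 1];

    const int block_start = blockIdx.x * blockDim.x;
    const int gid = block_start + threadIdx.x;

    // cooperatively stage blockDim.x + 9 consecutive floats (coalesced)
    for (int i = threadIdx.x; i < TILE_X + WINDOW - 1; i += blockDim.x) {
        const int src = block_start + i;
        tile[i] = (src < num_features) ? dev_features[src] : 0.0f;
    }
    __syncthreads();

    // each thread now reads its 10 values from shared memory
    float features[WINDOW];
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n)
        features[n] = tile[threadIdx.x + n];

    // dummy use of the values so the compiler keeps the loads
    float acc = 0.0f;
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n)
        acc += features[n];
    if (gid < num_features)
        dev_out[gid] = acc;
}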

Please let me know your take on the problem, anything I can read for a better understanding, or whether you know a good (or the best) way to efficiently load all 10 values per thread.

I’m using a Titan X (Pascal) and CUDA 8.
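For the sake of discussion, here is a rough, untested sketch of another way the overlapping values could be obtained: each thread does at most two coalesced loads and the warp exchanges the rest with __shfl (the CUDA 8 intrinsic; on CUDA 9+ it would be __shfl_sync with a mask). It assumes a 1D block whose size is a multiple of 32; dev_out and num_features are illustrative names:

#define WINDOW 10

__global__ void load_via_shfl(const float* __restrict__ dev_features,
                              float* __restrict__ dev_out,
                              int num_features)
{
    const int gid       = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane      = threadIdx.x & 31;
    const int warp_base = gid - lane;        // first element handled by this warp

    // two coalesced loads per thread cover the 32 + 9 floats the warp needs
    float lo = (warp_base + lane      < num_features) ? dev_features[warp_base + lane]      : 0.0f;
    float hi = (warp_base + lane + 32 < num_features) ? dev_features[warp_base + lane + 32] : 0.0f;

    float features[WINDOW];
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n) {
        const int src = lane + n;            // lane that holds dev_features[gid + n]
        // both shuffles are executed by all lanes to avoid divergence,
        // then each lane picks the one it actually needs
        const float from_lo = __shfl(lo, src & 31);
        const float from_hi = __shfl(hi, src & 31);
        features[n] = (src < 32) ? from_lo : from_hi;
    }

    // dummy use of the values so the compiler keeps the loads
    float acc = 0.0f;
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n)
        acc += features[n];
    if (gid < num_features)
        dev_out[gid] = acc;
}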

Thanks in advance!

Caches work differently than you think. When the same value is requested again, it doesn’t matter that it hasn’t arrived yet - both requests will coalesce and only one read from memory is performed. So in your example, the extra requests are served from the L1 or L2 cache (depending on settings).

Thanks! So that means that even though the second (or third) request is issued x cycles later, the GPU knows L1 cache line Y is going to be filled with the requested data and just waits for it to arrive? How does it know it has arrived? Does it keep “polling” the desired address in the cache? Do you have a good reference for me to understand the topic a little better? I imagine this would work the same way on a CPU.

Well, to be fair, I’ve just never thought or read about it. To me it looks like “obvious”, expected behavior. Note that requests are issued with 32-byte granularity, so these are just multiple requests to the same address. Probably the GPU just removes duplicates from the request queue.

Yes, I am aware of that. But take thread 7: its first request is for float 7, which sits in one 32-byte memory ‘block’, while its second request is for float 8, which sits in the next memory ‘block’ (the one thread 8’s first request already asked for).
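Just to make the ‘block’ boundaries concrete, with 4-byte floats and 32-byte sectors (a tiny illustrative helper, not part of my kernel):

// which 32-byte sector a given float index falls into (8 floats per sector)
__host__ __device__ inline int sector_of(int float_idx)
{
    return (float_idx * (int)sizeof(float)) / 32;
}
// sector_of(7) == 0, sector_of(8) == 1, sector_of(9) == 1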

But I’m still a bit puzzled: the GPU can’t just remove duplicates, because it somehow has to remember all the threads that requested the data in order to get it to their registers.

Perhaps the first request already “reserves” a certain cache line for the requested data, which the second request can see, so it will wait for the data there instead of going to look for it in L2 or global memory?

OK, my speculation:

Memory by itself serves only 32-byte-aligned, 32-byte-wide requests, and the memory request queues probably hold requests in this form. When a warp requests 32 bytes starting at address 1, two memory requests are pushed to the queue - (0,32) and (32,32). The warp is delayed and keeps pointers to these queue entries so it can detect when it may resume. When another warp requests 32 bytes starting at address 2, that is converted into the requests (0,32) and (32,32) too, but these are already found in the queue, so this warp is also delayed, keeping pointers to the same queue entries.

Once these entries are marked as finished, both warps are resumed.