Performance of loading overlapping values of a global array within a warp

Hey all,

I have a question regarding performance when the threads in a warp (or block) load overlapping values of the same array, much like this:

// float* __restrict__ dev_features is the address of the features array

const int x = blockIdx.x*blockDim.x + threadIdx.x;
const int y = blockIdx.y*blockDim.y + threadIdx.y;
const int z = blockIdx.z*blockDim.z + threadIdx.z;

// linear index of the first feature this thread needs
const int linear_featurenumber = (z*height_cells_features + y)*width_cells_features + x;
float features[10];

#pragma unroll
for (int n = 0; n < 10; ++n)
{
  features[n] = dev_features[linear_featurenumber+n];
}

Thread 0 will load values 0-9 from dev_features
Thread 1 will load values 1-10 from dev_features
Thread 2 will load values 2-11 from dev_features, etc.

As you can see, there is a lot of overlap between the values required by the threads within a warp or block. The assembly code shows a total of 10 global load instructions performed by each thread. If all of these instructions were actually performed, this would be very inefficient, since for every 32 threads there would be 32*10 = 320 load requests while only 32+9 = 41 distinct values are actually needed.

I know that global loads are served in 32-byte transactions, or 8 floats in my case, and the memory accesses are perfectly coalesced. So the first load instruction has 100% load efficiency, as every thread gets the first value it asks for (thread 0 gets dev_features[0], thread 1 gets dev_features[1], etc.). The second instruction, however, asks for values that are for the most part “already on the way”. To the best of my knowledge these instructions would be issued long before the first global values have arrived, so there would be no L1 (or even L2?) cache hit. Does the GPU recognize and optimize this in some way (and how would that work)?

When I use shared memory to optimize the global loading and stage all the required values in shared memory beforehand, I actually get worse performance than with the plain global loads described above (roughly 3.9 ms vs 3 ms). That would indicate that some kind of optimization is already going on, but I have no idea how or what it would be.
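For concreteness, the shared-memory variant I tried looks roughly like this (a simplified 1D sketch rather than my exact kernel; TILE_X, dev_out and num_features are illustrative names):

#define TILE_X 128                     // assumed blockDim.x, illustrative
#define WINDOW 10                      // values needed per thread

__global__ void load_via_smem(const float* __restrict__ dev_features,
                              float* __restrict__ dev_out,
                              int num_features)
{
    __shared__ float tile[TILE_X + WINDOW - 1];

    const int block_start = blockIdx.x * blockDim.x;
    const int gid = block_start + threadIdx.x;

    // cooperatively stage blockDim.x + 9 consecutive floats (coalesced)
    for (int i = threadIdx.x; i < TILE_X + WINDOW - 1; i += blockDim.x) {
        const int src = block_start + i;
        tile[i] = (src < num_features) ? dev_features[src] : 0.0f;
    }
    __syncthreads();

    // each thread now reads its 10 values from shared memory
    float features[WINDOW];
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n)
        features[n] = tile[threadIdx.x + n];

    // dummy use of the values so the compiler keeps the loads
    float acc = 0.0f;
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n)
        acc += features[n];
    if (gid < num_features)
        dev_out[gid] = acc;
}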

Please let me know your take on the problem, anything I can read for a better understanding, or whether you know a good (or the best) way to efficiently load all 10 values per thread.

I’m using a Titan X (Pascal) and CUDA 8.
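For the sake of discussion, here is a rough, untested sketch of another way the overlapping values could be obtained: each thread does at most two coalesced loads and the warp exchanges the rest with __shfl (the CUDA 8 intrinsic; on CUDA 9+ it would be __shfl_sync with a mask). It assumes a 1D block whose size is a multiple of 32; dev_out and num_features are illustrative names:

#define WINDOW 10

__global__ void load_via_shfl(const float* __restrict__ dev_features,
                              float* __restrict__ dev_out,
                              int num_features)
{
    const int gid       = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane      = threadIdx.x & 31;
    const int warp_base = gid - lane;        // first element handled by this warp

    // two coalesced loads per thread cover the 32 + 9 floats the warp needs
    float lo = (warp_base + lane      < num_features) ? dev_features[warp_base + lane]      : 0.0f;
    float hi = (warp_base + lane + 32 < num_features) ? dev_features[warp_base + lane + 32] : 0.0f;

    float features[WINDOW];
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n) {
        const int src = lane + n;            // lane that holds dev_features[gid + n]
        // both shuffles are executed by all lanes to avoid divergence,
        // then each lane picks the one it actually needs
        const float from_lo = __shfl(lo, src & 31);
        const float from_hi = __shfl(hi, src & 31);
        features[n] = (src < 32) ? from_lo : from_hi;
    }

    // dummy use of the values so the compiler keeps the loads
    float acc = 0.0f;
    #pragma unroll
    for (int n = 0; n < WINDOW; ++n)
        acc += features[n];
    if (gid < num_features)
        dev_out[gid] = acc;
}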

Thanks in advance!

Caches work differently than you think. When the same value is requested again, it doesn’t matter that it hasn’t arrived yet - both requests will coalesce and only one read from memory is performed. So in your example, the extra requests are served from the L1 or L2 cache (depending on settings).

Thanks! So that means that even though the second (or third) request is issued x cycles later, the GPU knows L1 cache line Y is going to be filled with the requested data and just waits for it to arrive? How does it know it has arrived? Does it keep “polling” the desired address in the cache? Do you have a good reference for me to understand the topic a little better? I imagine this would work the same way on a CPU.

Well, to be fair, I’ve just never thought or read about it. To me it looks like “obvious”, expected behavior. Note that requests are issued with 32-byte granularity, so these are just multiple requests to the same address. Probably the GPU just removes duplicates from the request queue.

Yes, I am aware of that. But take thread 7: its first request is for float 7, which sits in one 32-byte memory ‘block’, while its second request is for float 8, which sits in the next memory ‘block’ (the one thread 8’s first request already asked for).
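Just to make the ‘block’ boundaries concrete, with 4-byte floats and 32-byte sectors (a tiny illustrative helper, not part of my kernel):

// which 32-byte sector a given float index falls into (8 floats per sector)
__host__ __device__ inline int sector_of(int float_idx)
{
    return (float_idx * (int)sizeof(float)) / 32;
}
// sector_of(7) == 0, sector_of(8) == 1, sector_of(9) == 1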

But I’m still a bit puzzled: the GPU can’t just remove duplicates, because it somehow has to remember all the threads that requested the data in order to get it to their registers.

Perhaps the first request already “reserves” a certain cache line for the requested data, which the second request can see, so it will wait for the data there instead of going to look for it in L2 or global memory?

OK, my speculation:

Memory by itself serves only 32-byte-aligned, 32-byte-wide requests, and the memory request queues probably hold requests in this form. When a warp requests 32 bytes starting at address 1, two memory requests are pushed to the queue - (0,32) and (32,32). The warp is delayed and keeps pointers to these queue entries so it can detect when it may resume. When another warp requests 32 bytes starting at address 2, that is converted into the requests (0,32) and (32,32) too, but these are already found in the queue, so this warp is also delayed, keeping pointers to the same queue entries.

Once these entries are marked as finished, both warps are resumed.