Hey all,
I have a question regarding the performance when the threads in a warp (or block) are loading overlapping values of the same array, much like this:
// float* __restrict__ dev_features points to the features array
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int z = blockIdx.z * blockDim.z + threadIdx.z;
const int linear_featurenumber = (z * height_cells_features + y) * width_cells_features + x;

float features[10];
#pragma unroll
for (int n = 0; n < 10; ++n)
{
    features[n] = dev_features[linear_featurenumber + n];
}
Thread 0 will load values 0-9 from dev_features,
Thread 1 will load values 1-10,
Thread 2 will load values 2-11, and so on.
As you can see, there is a lot of overlap between the values required by the threads within a warp (or block). The assembly shows a total of 10 global load instructions performed by each thread. If all of these loads actually go out to memory this would be very inefficient: every warp of 32 threads issues 32 * 10 = 320 loads, while only 32 + 9 = 41 distinct values are actually needed.
I know that global loads are served in 32-byte sectors, or 8 floats in my case, and my accesses are perfectly coalesced. So the first load instruction achieves 100% load efficiency: every thread gets the first value it asks for (thread 0 gets dev_features[0], thread 1 gets dev_features[1], etc.). The second instruction, however, asks for values that are for the most part "already on the way". To the best of my knowledge these instructions would be issued long before the data from the first load has arrived, so there would be no L1 (or even L2?) cache hit yet. Does the GPU recognize and optimize this in some way (and how would that work)?
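To make the numbers concrete, here is my own back-of-the-envelope sector arithmetic for one warp (assuming aligned, stride-1 accesses):

```cuda
// Per warp of 32 threads, each performing 10 overlapping stride-1 loads:
//   distinct floats touched : 32 + 9 = 41          (41 * 4 = 164 bytes)
//   32-byte sectors covered : ceil(164 / 32) = 6 sectors, ideally
//   without any reuse       : 10 instructions * 4-5 sectors each = 40-50 sectors
// So if none of the overlap is exploited, the warp would move roughly
// 7-8x more data than it actually needs.
```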
When I use shared memory instead and stage all the required values there beforehand, so that the global loading itself is optimal, I actually get worse performance than with the plain global loads described above (roughly 3.9 ms vs 3 ms). That would indicate there is some kind of optimization going on after all, but I have no idea how or what it would be.
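For reference, the shared-memory version I tried looks roughly like this (simplified here to 1D along x; BLOCK_X and the dummy output are just for illustration):

```cuda
#define BLOCK_X 128   // assumed block size along x
#define HALO 9        // 10 values per thread -> 9 extra elements at the block edge

__global__ void load_features_shared(const float* __restrict__ dev_features,
                                     float* __restrict__ out, int n)
{
    __shared__ float tile[BLOCK_X + HALO];

    const int x   = blockIdx.x * blockDim.x + threadIdx.x;
    const int tid = threadIdx.x;

    // Each thread stages one element; the first HALO threads also stage the halo.
    if (x < n)
        tile[tid] = dev_features[x];
    if (tid < HALO && x + BLOCK_X < n)
        tile[tid + BLOCK_X] = dev_features[x + BLOCK_X];
    __syncthreads();

    float features[10];
    #pragma unroll
    for (int m = 0; m < 10; ++m)
        features[m] = tile[tid + m];

    // dummy use so the compiler does not optimize the loads away
    if (x < n)
        out[x] = features[0] + features[9];
}
```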
Please let me know your take on the problem, anything I can read for a better understanding, or, if you know one, a good (or the best) way to efficiently load all 10 values per thread.
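One alternative I have been considering but have not measured yet is to issue only one load per thread (plus a small halo) and distribute the overlapping values within the warp via shuffles. An untested sketch of what I mean, using the pre-CUDA-9 __shfl (boundary handling simplified):

```cuda
// Sketch: one global load per thread plus a 9-element halo per warp,
// then warp shuffles instead of 10 global loads per thread.
__global__ void load_features_shfl(const float* __restrict__ dev_features,
                                   float* __restrict__ out, int n)
{
    const int x    = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane = threadIdx.x & 31;

    // Each lane loads its own element; lanes 0-8 also load the warp's halo.
    float mine = (x < n)                  ? dev_features[x]      : 0.0f;
    float halo = (lane < 9 && x + 32 < n) ? dev_features[x + 32] : 0.0f;

    float features[10];
    #pragma unroll
    for (int i = 0; i < 10; ++i) {
        const int src = lane + i;               // lane offset that holds value x + i
        const float v = __shfl(mine, src & 31); // valid when src < 32
        const float h = __shfl(halo, src & 31); // valid when src >= 32 (src - 32 = src & 31)
        features[i] = (src < 32) ? v : h;
    }

    // dummy use so the compiler keeps the loads
    if (x < n)
        out[x] = features[0] + features[9];
}
```

I have no idea whether the shuffle overhead would beat whatever the cache is already doing, which is part of why I am asking.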
I’m using a Titan X (Pascal) and CUDA 8.
Thanks in advance!