I have one block and ‘n’ threads.
I have a float array ‘arr’ allocated on the GPU; its size is k*n (k is known before the kernel is launched).
The t-th thread needs to access the elements of the array with indices between ‘t*k’ and ‘(t+1)*k-1’ (e.g. when k=4, the second thread, t=1, accesses elements 4, 5, 6, 7).
How can I do this fast? This solution is slow (uncoalesced):
for(int a=0;a<k;++a)
{
// Use arr[threadIdx.x * k + a]
}
I tried another solution: I read blockDim.x elements into shared memory (a coalesced read), then each thread checks whether any of the loaded elements belong to it. Then I read the next blockDim.x elements, and I repeat this until all k*n elements have been read.
The problem is that I get many divergent threads, and this solution is only about 2 times as fast as the first one.
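One possible variation on the staging idea above, as a rough sketch rather than a definitive answer: load all k*n elements into shared memory with a coalesced strided loop, synchronize once, and only then let each thread walk its own chunk, so no per-element ownership check is needed. The kernel name sumChunks, the output array out, and the per-thread sum are made up for illustration; the sketch also assumes that k*n floats fit into the shared memory available to one block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the original post): one block of n threads,
// thread t sums arr[t*k .. (t+1)*k-1] as a stand-in for "use the element".
__global__ void sumChunks(const float *arr, float *out, int k)
{
    extern __shared__ float tile[];            // holds k * blockDim.x floats
    const int total = k * blockDim.x;

    // Coalesced load: in each pass, consecutive threads read consecutive addresses.
    for (int i = threadIdx.x; i < total; i += blockDim.x)
        tile[i] = arr[i];
    __syncthreads();                           // whole array is now staged

    // Each thread reads only its own chunk, from shared memory.
    float s = 0.0f;
    for (int a = 0; a < k; ++a)
        s += tile[threadIdx.x * k + a];
    out[threadIdx.x] = s;
}

int main()
{
    const int n = 128, k = 4;                  // assumed sizes for the demo
    std::vector<float> hArr(n * k), hOut(n);
    for (int i = 0; i < n * k; ++i) hArr[i] = (float)i;

    float *dArr = nullptr, *dOut = nullptr;
    cudaMalloc(&dArr, n * k * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dArr, hArr.data(), n * k * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    sumChunks<<<1, n, n * k * sizeof(float)>>>(dArr, dOut, k);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a host-side reference sum for every thread.
    int bad = 0;
    for (int t = 0; t < n; ++t) {
        float ref = 0.0f;
        for (int a = 0; a < k; ++a) ref += hArr[t * k + a];
        if (hOut[t] != ref) ++bad;
    }
    printf("%s\n", bad == 0 ? "ok" : "mismatch");

    cudaFree(dArr);
    cudaFree(dOut);
    return bad;
}
```

Note that the shared-memory reads in the second loop have a stride of k between neighbouring threads, so for even k they can run into bank conflicts; padding the shared array is a common remedy, but whether it pays off here would need measuring.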
What is the typical size of arr, or the range of sizes that you want to work on?
The GPU works best if you have multiple blocks. So can you divide the problem up into multiple smaller blocks?
As a guideline, NVIDIA says that for best performance the GPU prefers 8k threads or more (for example, 32 blocks of 256 threads).
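To make the multi-block suggestion concrete, here is a minimal sketch of how the work might be split if the problem allows it: each block stages only its own slice of arr, and the launch uses 32 blocks of 256 threads (8192 threads in total). The kernel name, the output array and the per-thread sum are placeholders, and k=4 is just an assumed value.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical multi-block variant: global thread g = blockIdx.x * blockDim.x + threadIdx.x
// owns arr[g*k .. (g+1)*k-1]; each block stages only its own slice.
__global__ void sumChunksMultiBlock(const float *arr, float *out, int k)
{
    extern __shared__ float tile[];                    // k * blockDim.x floats per block
    const int perBlock = k * blockDim.x;
    const float *src = arr + (size_t)blockIdx.x * perBlock;

    // Coalesced load of this block's slice.
    for (int i = threadIdx.x; i < perBlock; i += blockDim.x)
        tile[i] = src[i];
    __syncthreads();

    float s = 0.0f;
    for (int a = 0; a < k; ++a)
        s += tile[threadIdx.x * k + a];
    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}

int main()
{
    const int blocks = 32, threads = 256, k = 4;       // 32 * 256 = 8192 threads
    const int nThreads = blocks * threads;
    std::vector<float> hArr(nThreads * k, 1.0f);       // all ones: each sum should equal k
    std::vector<float> hOut(nThreads);

    float *dArr = nullptr, *dOut = nullptr;
    cudaMalloc(&dArr, hArr.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dArr, hArr.data(), hArr.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Dynamic shared memory is sized per block, not for the whole array.
    sumChunksMultiBlock<<<blocks, threads, threads * k * sizeof(float)>>>(dArr, dOut, k);
    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int g = 0; g < nThreads; ++g)
        if (hOut[g] != (float)k) ++bad;
    printf("%s\n", bad == 0 ? "ok" : "mismatch");

    cudaFree(dArr);
    cudaFree(dOut);
    return bad;
}
```

Because each block only needs room for its own k * blockDim.x floats, the shared-memory limit now constrains the slice per block rather than the whole array.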