I have one block and ‘n’ threads.
I have a float array ‘arr’ allocated on the GPU; its size is k*n (k is known before the kernel is launched).
The t-th thread needs to access the elements of the array with indices from t*k to (t+1)*k-1 (e.g. when k=4, the second thread, t=1, accesses elements 4, 5, 6, 7).
How can I do this fast? The straightforward solution is slow because the reads are uncoalesced:
// each thread uses arr[threadIdx.x * k + a] for a = 0 .. k-1
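To make the slow version concrete, here is a minimal sketch of it. It assumes each thread simply sums its k elements; the kernel name `sum_naive`, the output array `out`, and the sizes n = 128, k = 4 are made up for illustration and are not part of my real code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread t sums arr[t*k] .. arr[(t+1)*k - 1].
// At loop step a, neighbouring threads read addresses k floats apart,
// so for k > 1 the loads of a warp are strided instead of contiguous.
__global__ void sum_naive(const float* arr, float* out, int k)
{
    int t = threadIdx.x;
    float acc = 0.0f;
    for (int a = 0; a < k; ++a)
        acc += arr[t * k + a];
    out[t] = acc;
}

int main()
{
    const int n = 128, k = 4;  // illustrative sizes only
    std::vector<float> h(n * k);
    for (int i = 0; i < n * k; ++i) h[i] = (float)i;

    float *d_arr = nullptr, *d_out = nullptr;
    cudaMalloc(&d_arr, n * k * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_arr, h.data(), n * k * sizeof(float), cudaMemcpyHostToDevice);

    // One block, n threads, as in the question.
    sum_naive<<<1, n>>>(d_arr, d_out, k);

    std::vector<float> r(n);
    cudaMemcpy(r.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum computed by thread 1: %f\n", r[1]);

    cudaFree(d_arr);
    cudaFree(d_out);
    return 0;
}
```

With k = 4, thread 0 reads index 0 while thread 1 reads index 4 in the same loop step, which is the stride that prevents coalescing.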
I tried one alternative: I read blockDim.x elements into shared memory (a coalesced read), and each thread then checks whether any of the loaded elements belong to it. I repeat this with the next blockDim.x elements until all k*n elements have been read.
The problem is that this causes a lot of thread divergence, and the solution is only about 2 times as fast as the first one.
Is there a better solution?
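Roughly, the shared-memory version looks like the sketch below. It is a simplified reconstruction, not my exact code: the kernel name `sum_staged`, the output array `out`, the per-thread sum, and the sizes n = 128, k = 4 are assumptions for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the block walks over arr in tiles of blockDim.x
// elements. Each tile is loaded with one coalesced read per thread, then
// every thread checks which of the loaded elements fall into its own
// index range [t*k, (t+1)*k).
__global__ void sum_staged(const float* arr, float* out, int k)
{
    extern __shared__ float tile[];  // blockDim.x floats, sized at launch
    const int n  = blockDim.x;
    const int t  = threadIdx.x;
    const int lo = t * k;            // first index owned by this thread
    const int hi = (t + 1) * k;      // one past the last owned index
    float acc = 0.0f;

    for (int base = 0; base < k * n; base += n) {
        tile[t] = arr[base + t];     // coalesced: thread t reads base + t
        __syncthreads();

        // Intersection of this thread's range with the current tile.
        // For most threads it is empty, which is where the divergence
        // comes from: only about n/k threads have work in each tile.
        int s = max(lo, base);
        int e = min(hi, base + n);
        for (int i = s; i < e; ++i)
            acc += tile[i - base];

        __syncthreads();             // tile is overwritten in the next pass
    }
    out[t] = acc;
}

int main()
{
    const int n = 128, k = 4;  // illustrative sizes only
    std::vector<float> h(n * k);
    for (int i = 0; i < n * k; ++i) h[i] = (float)i;

    float *d_arr = nullptr, *d_out = nullptr;
    cudaMalloc(&d_arr, n * k * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_arr, h.data(), n * k * sizeof(float), cudaMemcpyHostToDevice);

    // One block, n threads, n floats of dynamic shared memory.
    sum_staged<<<1, n, n * sizeof(float)>>>(d_arr, d_out, k);

    std::vector<float> r(n);
    cudaMemcpy(r.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a host-side reference of the same per-thread sums.
    bool same = true;
    for (int t = 0; t < n; ++t) {
        float ref = 0.0f;
        for (int a = 0; a < k; ++a) ref += h[t * k + a];
        if (r[t] != ref) same = false;
    }
    printf("%s\n", same ? "results match host reference" : "mismatch");

    cudaFree(d_arr);
    cudaFree(d_out);
    return 0;
}
```

In each pass only the threads whose range overlaps the current tile enter the inner loop, so the other threads in the same warps sit idle.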