I’m wondering whether my kernel accesses global memory in a coalesced way. As I understand coalescing, the threads of an active warp have to read elements of a global array that are stored right next to each other:
thread 0: element N
thread 1: element N+1
thread 2: element N+2
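To make the pattern above concrete, here is a minimal host-side sketch (plain C++, no GPU needed) that counts how many memory segments one warp would touch. The warp size of 32, the 4-byte float, the 128-byte segment size and a segment-aligned array base are all assumptions, and the helper name segmentsTouched is made up for this illustration; the exact coalescing rules depend on the compute capability of the GPU.

```cpp
#include <cstddef>
#include <set>

// Assumptions for illustration only: real values depend on the GPU.
constexpr std::size_t kWarpSize     = 32;   // threads per warp
constexpr std::size_t kElementBytes = 4;    // sizeof(float)
constexpr std::size_t kSegmentBytes = 128;  // size of one memory segment

// Hypothetical helper: counts the distinct segments touched when lane i of a
// warp accesses element firstElement + i * strideInElements of an array whose
// base address is assumed to be segment-aligned.
std::size_t segmentsTouched(std::size_t firstElement, std::size_t strideInElements)
{
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t byteAddress =
            (firstElement + lane * strideInElements) * kElementBytes;
        segments.insert(byteAddress / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, segmentsTouched(0, 1) is 1 (thread i reads element N+i from an aligned start), segmentsTouched(1, 1) is 2 (same pattern, but the start is shifted by one element), and segmentsTouched(0, 32) is 32 (every thread lands in its own segment).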
I’m using the following grid:
blockDim.x = 256, blockDim.y = 1;
gridDim.x = gridDim.y = 256;
That makes it possible to access 256 * 256 * 256 = 16,777,216 elements in a “parallel” way.
My kernel is similar to this:
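__global__ void someKernel(float* arrayInGlobalMemory)
{
    // one element per thread: blocks are numbered row by row across the 2D grid
    int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    // doing some calculations (takes around 400 ms for 256 * 256 * 256 elements)
    float someResult = 0.0f; // placeholder: the real calculation is not shown here
    // then I’m writing each result to the array in global memory:
    arrayInGlobalMemory[idx] = someResult; // this takes about 4200 ms (for 256 * 256 * 256 elements)!!!
}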
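The index expression in the kernel can be checked on the host without a GPU. The sketch below is plain C++ that mirrors the formula with the grid values from this post; the names linearIndex and neighboursAreAdjacent are made up for this illustration and are not part of any CUDA API.

```cpp
#include <cstddef>

// Grid from the post: 256 threads per block, 256 x 256 blocks.
constexpr std::size_t kBlockDimX = 256;
constexpr std::size_t kGridDimX  = 256;
constexpr std::size_t kGridDimY  = 256;

// Hypothetical host-side mirror of the kernel's idx computation.
constexpr std::size_t linearIndex(std::size_t blockIdxX,
                                  std::size_t blockIdxY,
                                  std::size_t threadIdxX)
{
    return (blockIdxY * kGridDimX + blockIdxX) * kBlockDimX + threadIdxX;
}

// Returns true if, in every block, thread t+1 maps to the element directly
// after the one used by thread t.
bool neighboursAreAdjacent()
{
    for (std::size_t by = 0; by < kGridDimY; ++by)
        for (std::size_t bx = 0; bx < kGridDimX; ++bx)
            for (std::size_t t = 0; t + 1 < kBlockDimX; ++t)
                if (linearIndex(bx, by, t + 1) != linearIndex(bx, by, t) + 1)
                    return false;
    return true;
}
```

By this arithmetic, neighbouring threads of a block get neighbouring elements and every block starts at a multiple of 256 elements, which is the pattern usually described as coalesced. The largest index is 16,777,215, so the array needs room for 256 * 256 * 256 floats.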
Isn’t that coalesced?!
There must be some serious bottleneck, but I don’t really know how to solve it!
If someone can give me a hint, I will be very thankful!
Best regards, Rob