I’m trying to get a simple kernel running with coalesced memory reads & writes:
// copy assignments
int* to = (int*)(ASSIGNMENT(&assign_in, scan_result[blockIdx.x]));
int* from = (int*)(ASSIGNMENT(&assign_out, blockIdx.x));
for (int i = threadIdx.x; i < variables + 1; i += THREADS_PER_BLOCK)
    to[i] = from[i];
That’s the complete kernel.
If I print out the addresses and the addresses modulo 64 I get:
So all the starting addresses are multiples of 64, and the array elements are ints. THREADS_PER_BLOCK is 192.
This should make both the reads and the writes coalesced… but the CUDA profiler tells me that incoherent accesses far outweigh the coherent ones.
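To double-check that reasoning on the host, here is a minimal sketch in plain C++ (it is not part of the kernel). It assumes the compute capability 1.0/1.1 coalescing rule for 4-byte words: each half-warp of 16 threads must start on a 64-byte boundary and access consecutive elements. The function name `count_misaligned_half_warps` and its parameters are made up for illustration; it replays the loop's access pattern and counts the half-warps whose first address is not 64-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original kernel).
// base_addr:         byte address of element 0 of the array
// n_elems:           number of ints copied (variables + 1 in the kernel)
// threads_per_block: stride of the copy loop (THREADS_PER_BLOCK)
// Returns the number of half-warp accesses that do not start on a
// 64-byte boundary, assuming 4-byte elements and 16-thread half-warps.
std::size_t count_misaligned_half_warps(std::uintptr_t base_addr,
                                        std::size_t n_elems,
                                        std::size_t threads_per_block) {
    assert(threads_per_block > 0);  // otherwise the outer loop never advances

    const std::size_t half_warp = 16;
    const std::size_t elem_size = sizeof(std::int32_t);   // 4 bytes
    const std::size_t segment = half_warp * elem_size;    // 64 bytes

    std::size_t misaligned = 0;
    // One pass of the kernel's for-loop covers threads_per_block elements.
    for (std::size_t start = 0; start < n_elems; start += threads_per_block) {
        // Visit the first thread of every half-warp that still has work.
        for (std::size_t t = 0;
             t < threads_per_block && start + t < n_elems;
             t += half_warp) {
            const std::uintptr_t addr = base_addr + (start + t) * elem_size;
            if (addr % segment != 0) {
                ++misaligned;
            }
        }
    }
    return misaligned;
}
```

With a base address that is a multiple of 64 and a block size of 192 (a multiple of 16), this returns 0, which matches what I expected. It only checks alignment of each half-warp's first access, so it cannot rule out other causes.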
So where did I make my mistake?