Problem with coalesced memory access


I’m trying to get a simple kernel running with coalesced memory reads & writes:

 // copy assignments

  int* to   = (int*)(ASSIGNMENT(&assign_in,scan_result[blockIdx.x]));

  int* from = (int*)(ASSIGNMENT(&assign_out,blockIdx.x));


  for(int i=threadIdx.x; i<variables+1; i+=THREADS_PER_BLOCK)

  	to[i] = from[i];

That’s the complete kernel.

If I print out the addresses and the addresses modulo 64 I get:

So all the starting addresses are multiples of 64 and the array elements are integer. The size of THREADS_PER_BLOCK is 192.

This should make both the reads and the writes coalesced… but the CUDA profiler tells me that incoherent accesses far outweigh the coherent ones.

So where did I make my mistake …?

Why are you checking modulo 64 and not 32 (however should produce the same in your case, so just curiousity)?

Tells the profiler uncoherent loads or stores? Or both?

In my opinion your code is alright. Try perhaps 64 or 128 threads per block.

The check for mod64 has no real reason, other than trying to be extra safe ;)

The profiler counts too many global loads, for example:

gld_incoherent: 4511

gld_coherent: 179

gst_incoherent: 0

gst_coherent: 700

The storing seems to work fine… which confuses me a bit, since loading and storing both use the same mechanism.

128 threads per block did not work.

The first two lines also contain one global memory access, so I tried to make this access only once per threadBlock:

 // copy assignments

 Â __shared__ int* to;

 Â __shared__ int* from;


 Â if(threadIdx.x==0){

     to  = (int*)(ASSIGNMENT(&assign_in,scan_result[blockIdx.x]));

 Â Â Â Â from = (int*)(ASSIGNMENT(&assign_out,blockIdx.x));

 Â } 


 Â __syncthreads();


 Â for(int i=threadIdx.x; i<variables+1; i+=THREADS_PER_BLOCK)

      to[i] = from[i];

But besides a compiler warning, nothing really changed.

(“Advisory: Cannot tell what pointer points to, assuming global memory space”)


I think I may have found the problem. Not quoted was another if(…) which also contained one global memory read. Using a shared value like above, this value is now only read once to a shared variable and global incoherent loads went down by a few thousand :)

But now I’d like to get rid of this compiler warning :/