sequential sum within a kernel.

To coalesce, thread 0 of a half-warp needs to read memory location [N*16 + 0], thread 1 needs to read [N*16 + 1], and so on. In your case, the threads are reading memory locations SegmentLen apart from each other.
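As an illustration (not from the original post), here is a minimal sketch contrasting the two access patterns; the kernel names CoalescedRead and StridedRead and the stride parameter are hypothetical:

__global__ void CoalescedRead(int *out, const int *in)
{
  // Thread k of a half-warp reads in[base + k]: addresses are consecutive,
  // so the 16 reads are serviced in one memory transaction.
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i];
}

__global__ void StridedRead(int *out, const int *in, int stride)
{
  // Thread k reads in[k * stride]: addresses are SegmentLen apart,
  // so the reads cannot be combined and each costs its own transaction.
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i * stride];
}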

If you read the programming manual (or check deviceQuery), you will see that you can launch a grid of size 65535 x 65535. So yes, you can easily launch such a grid.

You cannot have more than x blocks at the same time on one multiprocessor (the exact number depends on block size and the amount of registers and shared memory used). Other blocks get scheduled as soon as a block finishes. To get optimal performance out of CUDA, you normally want to launch thousands of blocks (or at least as many as possible).
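As a sketch of what such a launch might look like (the kernel name SomeKernel and the 2048 x 2048 problem size are illustrative, not from the original post):

// Hypothetical launch configuration: a 2D grid of many small blocks.
dim3 block(16, 16);                        // 256 threads per block
dim3 grid(2048 / block.x, 2048 / block.y); // 128 x 128 = 16384 blocks
// Each grid dimension stays far below the 65535 limit, and 16384 blocks
// give the scheduler plenty of work to keep the multiprocessors busy.
SomeKernel<<<grid, block>>>(/* ... */);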

Indeed, in your case there is no use for it.

Thank you for the quick reply. :)

In my 2048x2048 case, SegmentLen is 32 = 16*2; yet the profiler tells me I have fully uncoalesced reads and writes, and indeed the relative performance is no better than in the 1024x1024 case (4x worse in fact, i.e. it scales linearly with the problem size).

Anyway, this evening I'll try the new code, thanks!

Fernando

You will only coalesce if SegmentLen = 1 on anything older than the GTX 260/280. On the GTX 260/280, the behavior is approximately like this:

SegmentLen = 2 → 2 times as slow

and so on, up to SegmentLen = 16. Beyond that, SegmentLen = 32 and higher should run at the same (slow) speed as SegmentLen = 16.

Great! 10x the speed and Occupancy = 1!!

Thanks a lot Denis!

For reference, I ended up with:

__global__ void ArraySumInt(int *ResArray, const int *Array1, const int *Array2, const unsigned int iSize)
{
  // One thread per element: consecutive threads touch consecutive addresses,
  // so reads and writes coalesce.
  unsigned int index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

  // Guard against the last, partially filled block.
  if (index < iSize)
    ResArray[index] = Array1[index] + Array2[index];
}
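For completeness, a minimal host-side launch sketch; the device pointers d_Res, d_A, d_B (assumed to be allocated with cudaMalloc) and the sizes are assumptions, not from the original post:

// Hypothetical launch: one thread per element, 256 threads per block.
const unsigned int iSize = 2048 * 2048;
const unsigned int blockSize = 256;
// Round up so every element is covered; the kernel's bounds check
// takes care of the extra threads in the last block.
const unsigned int gridSize = (iSize + blockSize - 1) / blockSize;
ArraySumInt<<<gridSize, blockSize>>>(d_Res, d_A, d_B, iSize);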

Fernando