My concrete example for global coalescing, please help

I am not sure if I understand global coalescing for compute capability 1.2 from the programming guides.

I have a useless vector addition kernel I have been messing with to test out what I have learned.

const char programSource[] =
		"__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
		"{"
		"for(int i=0; i<10; i++){"
		"	a[i] = b[i] + c[i];}"
		"}";

It seems like the big idea with 1.2 is that a half-warp’s reads and writes can be coalesced into a single memory transaction of 32, 64, or 128 bytes. Also, when more than 128 bytes are requested, the number of memory transactions is kept to a minimum.

The thing that I can’t quite figure out is how do I look at my code and figure out the details. Is my half-warp causing extraneous memory transactions when loading or storing memory? How do I know if my memory access is misaligned, non-contiguous, random, or contiguous? If anybody can discuss this with me I’d appreciate it :)
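One rough way to reason about the question above is to model it on the host. The sketch below counts how many distinct 128-byte segments the 16 threads of a half-warp touch for a given access pattern. It is a simplified model, not the hardware's exact algorithm. It assumes 4-byte ints, a buffer that starts on a 128-byte boundary, and it ignores the step where a transaction can shrink to 64 or 32 bytes. The name `segments_touched` and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16      /* threads serviced together on compute capability 1.x */
#define SEGMENT_BYTES 128 /* segment size assumed here for 4-byte words */

/* Hypothetical helper: count the distinct 128-byte segments touched when
   thread t of a half-warp accesses element (first_elem + t * stride_elems)
   of an int buffer. Assumes the buffer starts on a 128-byte boundary. */
int segments_touched(size_t first_elem, size_t stride_elems)
{
    size_t seen[HALF_WARP];
    int count = 0;

    for (int t = 0; t < HALF_WARP; t++) {
        size_t byte_addr = (first_elem + (size_t)t * stride_elems) * sizeof(int32_t);
        size_t segment = byte_addr / SEGMENT_BYTES;

        /* Record the segment only if no earlier thread already hit it. */
        int known = 0;
        for (int k = 0; k < count; k++) {
            if (seen[k] == segment) {
                known = 1;
                break;
            }
        }
        if (!known)
            seen[count++] = segment;
    }

    /* At least one segment, and never more than one per thread. */
    assert(count >= 1 && count <= HALF_WARP);
    return count;
}
```

In this model, `segments_touched(0, 1)` (contiguous and aligned) gives 1, `segments_touched(24, 1)` (contiguous but straddling a segment boundary) gives 2, and `segments_touched(0, 32)` (a stride of 32 ints) gives 16. A stride of 0, as in `segments_touched(5, 0)`, corresponds to the posted loop, where every work-item reads the same `b[i]` on each iteration: it gives 1 segment, but all work-items are doing the same work.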

The main way to tell is to look at the generated PTX code. But even that may not tell the whole story, since PTX is an intermediate form that gets compiled again… so use decuda to convert the cubin back into assembly language and see what load ops are REALLY being used. For example, you may see your memory read broken down into 2 or 3 loads, which shows that the access is not properly aligned.

It’s a good way to understand CUDA and your code… you’re asking the right low-level questions.

s/decuda/cuobjdump/ :)

Does all this apply to OpenCL? Are there any references or tutorials on how to figure out what is going on in the generated assembly code?