why is it uncoalesced ? SDK example simpleGL

hello everyone,

today, while testing cuda with a cc 1.1 GPU, i decided to do some cuda visual profiling on some SDK examples.
the thing is that we all know that packed float4 arrays should be coalesced when the k-th thread accesses the k-th float4 element. However, if you run the visual profiler on the simpleGL example
you can notice that you get all read and writes uncoalesced!!!

i can confirm that this happens at least under 1.1 compute capability.
does someone know why?

The float4 array also needs to be aligned to 128 bytes.

tera, im using a plain array of float4 vectors (the cuda ones), i thought that these came well aligned, i am missing something?

Depends on where it comes from. If allocated with cudaMalloc(), alignment should be fine. If you just have the array as a global variable, I think alignment is only guaranteed to 16 bytes.

i put the buffer on the GPU by using openGL glBufferData instructions, i’m looking into that at the moment.

also, the nbody simulation (from the sdk) is 100% coalesced and uses openGL buffers, so im trying to understand what is the difference on the aligment.

regards

Cristobal

I can’t help you there, but if unsure you may just print out what the actual alignment is in the kernel you are profiling.

thanks tera, but how can i do that?

printf("Pointer is %lx\n", (unsigned long)ptr);

and check that it either ends on [font=“Courier New”]00[/font] or [font=“Courier New”]80[/font].

its weird, the vbo buffer shows “00” on both uncoalesced cases.

thanks anyway tera, i wont take more time from you you have helped already enough :)

i will keep investigating until i solve this, and will post the solution back for the records.

regards

Cristobal

ok after looking for many things, the problem was at the very basic level, the SDK example was using blocks of size (8, 8, 1), and i think the minimum amount of threads is 16/dimension to start talking about coalesced memory.

fast solution → use blocks of (16, 16, 1)

edit: moderator can change this to solved.