Why is it uncoalesced? SDK example simpleGL

neoideo · January 29, 2011, 6:48pm

Hello everyone,

Today, while testing CUDA on a compute capability 1.1 GPU, I decided to run the CUDA Visual Profiler on some SDK examples.
As far as I know, accesses to a packed float4 array should be coalesced when the k-th thread accesses the k-th float4 element. However, if you run the Visual Profiler on the simpleGL example, it reports all reads and writes as uncoalesced!

I can confirm that this happens at least under compute capability 1.1.
Does anyone know why?
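
The rule quoted in the question can be written down as a small host-side check. This is only a sketch, assuming the compute capability 1.0/1.1 conditions described in the CUDA programming guide for 128-bit words: the 16 threads of a half-warp must access 16 consecutive words, and those words must lie in one segment of 16 × 16 = 256 bytes. The function name `halfWarpCoalescedFloat4` is made up for illustration and is not part of any CUDA API.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper, not part of the CUDA toolkit: models the compute
// capability 1.0/1.1 coalescing conditions for one half-warp (16 threads)
// in which every thread accesses one float4 (16 bytes).
bool halfWarpCoalescedFloat4(const std::array<std::uint64_t, 16>& addr) {
    const std::uint64_t kWordBytes = 16;                  // sizeof(float4)
    const std::uint64_t kSegmentBytes = 16 * kWordBytes;  // 256-byte segment

    // All 16 words must sit in one aligned segment, so thread 0 has to
    // start exactly on a segment boundary.
    if (addr[0] % kSegmentBytes != 0) return false;

    // Thread k must access the k-th word of that segment.
    for (std::uint64_t k = 0; k < 16; ++k) {
        if (addr[k] != addr[0] + k * kWordBytes) return false;
    }
    return true;
}
```

Feeding it the byte addresses that one half-warp would touch shows quickly whether a given indexing scheme has any chance of coalescing on this hardware generation.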

tera · January 30, 2011, 2:10am

The float4 array also needs to be aligned to 128 bytes.

neoideo · January 30, 2011, 3:40am

tera, I'm using a plain array of float4 vectors (the CUDA ones). I thought these came well aligned; am I missing something?

tera · January 30, 2011, 4:00am

Depends on where it comes from. If allocated with cudaMalloc(), alignment should be fine. If you just have the array as a global variable, I think alignment is only guaranteed to 16 bytes.

neoideo · January 30, 2011, 10:57pm

I put the buffer on the GPU using the OpenGL glBufferData call; I'm looking into that at the moment.

Also, the nbody simulation (from the SDK) is 100% coalesced and uses OpenGL buffers, so I'm trying to understand what the difference in alignment is.

Regards

Cristobal

tera · January 31, 2011, 1:30am

I can’t help you there, but if unsure you may just print out what the actual alignment is in the kernel you are profiling.

neoideo · January 31, 2011, 2:42am

Thanks tera, but how can I do that?

tera · January 31, 2011, 10:39am

printf("Pointer is %p\n", (void *)ptr);

and check that it ends in either 00 or 80.
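
Reading the last two hex digits by eye is the same as testing the address against a 128-byte boundary, since 128 is 0x80. A minimal sketch of that test on the host side follows; `isAligned` is a hypothetical helper name, and 128 is simply the figure used in this thread. For 16-byte words the compute capability 1.x rules in the programming guide refer to a segment of twice the 128-byte transaction size, so testing against 256 as well may be worthwhile.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `ptr` is a multiple of `alignment` bytes.
// `alignment` must be a power of two (the thread uses 128).
bool isAligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const auto address = reinterpret_cast<std::uintptr_t>(ptr);
    // For a power of two, the low bits of an aligned address are all zero.
    return (address & (alignment - 1)) == 0;
}
```

A device pointer is just a number on the host side, so something like `isAligned(ptr, 128)` can be called on it without dereferencing it.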

neoideo · January 31, 2011, 1:38pm

It's weird, the VBO buffer shows “00” in both uncoalesced cases.

Thanks anyway tera, I won't take up more of your time; you have already helped enough :)

I will keep investigating until I solve this, and will post the solution back for the record.

Regards

Cristobal

neoideo · February 3, 2011, 9:47pm

OK, after looking into many things, the problem turned out to be at a very basic level: the SDK example was using blocks of size (8, 8, 1), and I think you need at least 16 threads per dimension before you can start talking about coalesced memory.

Quick fix → use blocks of (16, 16, 1).
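
One way to see why the block shape matters: threads in a block are numbered x-first, and on compute capability 1.x coalescing is evaluated per half-warp of 16 consecutive threads. The sketch below maps one half-warp to array elements under the assumption that the kernel indexes its buffer as `pos[y * width + x]` with x and y derived from the thread indices (this indexing, the function names, and the width of 256 in the usage note are assumptions for illustration, not taken from the thread).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element indices touched by one half-warp (16 consecutive threads) of the
// block at blockIdx = (0, 0). Assumes the kernel indexes its array as
// pos[y * width + x], with x and y taken from threadIdx.
std::vector<unsigned> halfWarpIndices(unsigned blockDimX, unsigned halfWarp,
                                      unsigned width) {
    assert(blockDimX > 0);
    std::vector<unsigned> indices;
    for (unsigned lane = 0; lane < 16; ++lane) {
        unsigned tid = halfWarp * 16 + lane;  // linear thread id in the block
        unsigned x = tid % blockDimX;         // threadIdx.x
        unsigned y = tid / blockDimX;         // threadIdx.y
        indices.push_back(y * width + x);
    }
    return indices;
}

// True if lane k touches element first + k, which coalescing requires.
bool isSequential(const std::vector<unsigned>& indices) {
    for (std::size_t k = 1; k < indices.size(); ++k) {
        if (indices[k] != indices[0] + k) return false;
    }
    return true;
}
```

With a width of 256, `halfWarpIndices(8, 0, 256)` covers elements 0 to 7 of row 0 and then jumps to elements 256 to 263 of row 1, so the half-warp is not sequential. `halfWarpIndices(16, 0, 256)` yields elements 0 to 15 of a single row, which matches the (16, 16, 1) fix.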

edit: moderator can change this to solved.
