Hi,
I was modifying the marching cubes example for my own purposes and came across some strange behaviour.
I’m writing my results to a volume array (bound to a 1D texture, which is later fetched from to generate the mesh). In the first version each thread wrote out a uchar, which resulted in uncoalesced memory stores, since G80 doesn’t coalesce accesses to data types smaller than 32 bits. I changed the element type to float to enable coalescing, and the profiler confirmed that the stores are now coalesced.
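To make the two versions concrete, here is a minimal sketch of what I mean. It is not the SDK code: `sampleField`, the kernel names, and `runStoreVariants` are hypothetical placeholders, and the field function is just a cheap stand-in for the real per-voxel calculation.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the expensive per-voxel calculation; returns a value in [0, 1].
__device__ float sampleField(unsigned int i)
{
    return 0.5f + 0.5f * sinf(0.001f * (float)i);
}

// Version 1: each thread stores one 8-bit value (not coalesced on G80).
__global__ void writeVolumeUchar(unsigned char *volume, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) volume[i] = (unsigned char)(sampleField(i) * 255.0f);
}

// Version 2: each thread stores one 32-bit value (coalesced when the
// threads of a half-warp write consecutive, aligned addresses).
__global__ void writeVolumeFloat(float *volume, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) volume[i] = sampleField(i);
}

// Runs both versions and checks that they agree up to 8-bit quantization.
bool runStoreVariants(unsigned int n)
{
    unsigned char *dU = 0;
    float *dF = 0;
    if (cudaMalloc(&dU, n) != cudaSuccess) return false;
    if (cudaMalloc(&dF, n * sizeof(float)) != cudaSuccess) { cudaFree(dU); return false; }

    unsigned int threads = 256;
    unsigned int blocks = (n + threads - 1) / threads;
    writeVolumeUchar<<<blocks, threads>>>(dU, n);
    writeVolumeFloat<<<blocks, threads>>>(dF, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<unsigned char> hU(n);
    std::vector<float> hF(n);
    ok = ok && cudaMemcpy(hU.data(), dU, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(hF.data(), dF, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;

    // The uchar result is a truncated copy of the float result, so the two
    // should differ by at most one quantization step (plus a small tolerance).
    for (unsigned int i = 0; ok && i < n; ++i) {
        float diff = std::fabs(hF[i] - hU[i] / 255.0f);
        if (diff > 1.0f / 255.0f + 1e-4f) ok = false;
    }

    cudaFree(dU);
    cudaFree(dF);
    return ok;
}
```

The only difference between the two kernels is the width of the store, which is the change I made in my code.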
So the stores are now coalesced, but there is no performance improvement.
Does this mean that the kernel is ALU bound in this case, and that in the first version the latency of the uncoalesced stores was hidden by the computationally expensive part of the kernel (the actual calculation of the value that gets stored to global memory)?
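One rough way I could test this hypothesis is to time three variants: the full kernel, the calculation with the store suppressed, and the store alone. The sketch below assumes a made-up workload (`expensiveValue`) and hypothetical names (`computeKernel`, `storeOnlyKernel`, `timeVariant`); it is not my real kernel. If the full kernel takes about as long as the compute-only variant, the store is probably being hidden behind the arithmetic.

```cuda
#include <cuda_runtime.h>

// Hypothetical arithmetic-heavy workload standing in for the real calculation.
__device__ float expensiveValue(unsigned int i)
{
    float v = (float)i * 1e-6f;
    for (int k = 0; k < 64; ++k) v = v * 0.999f + 0.001f;
    return v;
}

// doStore is a runtime flag. When it is 0 the store only happens if v == -1.0f,
// which never occurs here, but the compiler cannot easily prove that, so the
// calculation should not be optimized away.
__global__ void computeKernel(float *out, unsigned int n, int doStore)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = expensiveValue(i);
    if (doStore || v == -1.0f) out[i] = v;
}

// Store with no calculation: measures the memory side alone.
__global__ void storeOnlyKernel(float *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

// variant 0: compute + store, 1: compute only, 2: store only.
// Returns the elapsed time in milliseconds, or -1.0f on a CUDA error.
float timeVariant(int variant, unsigned int n)
{
    float *dOut = 0;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    unsigned int threads = 256;
    unsigned int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    if (variant == 0)      computeKernel<<<blocks, threads>>>(dOut, n, 1);
    else if (variant == 1) computeKernel<<<blocks, threads>>>(dOut, n, 0);
    else                   storeOnlyKernel<<<blocks, threads>>>(dOut, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return ms;
}
```

A single launch is noisy, so in practice each variant would need a warm-up launch and an average over several runs before the numbers mean much.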
What could I do to improve performance in this case?
Thanks!