Coalescing with no performance improvement

Hi,

I was modifying the marching cubes example for my purposes and came across some strange behaviour.
I’m writing my results to a volume array (later bound to a 1D texture that is fetched from to generate the mesh). In the first version each thread wrote out a uchar, which resulted in uncoalesced stores, since G80 doesn’t coalesce accesses to data types smaller than 32 bits. I changed the type to float to enable coalescing, and the profiler confirms that the stores are now coalesced.
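For readers who want to reproduce the two store patterns, here is a minimal sketch. It is not the marching cubes code: `classify` is a hypothetical stand-in for the real per-voxel computation, and the kernel names are made up. It also shows a third variant, assuming four bytes can be packed into one `uchar4` per thread, which gives a 32-bit store without quadrupling the size of the volume.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-voxel computation in the real kernel.
__device__ unsigned char classify(unsigned int i)
{
    return (unsigned char)(i & 0xFF);
}

// One byte per thread: 8-bit stores, which G80 does not coalesce.
__global__ void storeUchar(unsigned char *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = classify(i);
}

// One float per thread: 32-bit stores, but four times the memory.
__global__ void storeFloat(float *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)classify(i);
}

// Four bytes per thread packed into a uchar4: 32-bit stores, original footprint.
__global__ void storeUchar4(uchar4 *out, unsigned int n4)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        unsigned int base = 4 * i;
        out[i] = make_uchar4(classify(base), classify(base + 1),
                             classify(base + 2), classify(base + 3));
    }
}

int main()
{
    const unsigned int n = 1 << 20;            // number of voxels, a multiple of 4
    const unsigned int threads = 256;
    unsigned char *d_bytes = 0;
    float *d_floats = 0;
    uchar4 *d_packed = 0;
    cudaMalloc(&d_bytes, n);
    cudaMalloc(&d_floats, n * sizeof(float));
    cudaMalloc(&d_packed, n);                  // n bytes = n / 4 uchar4 elements

    storeUchar<<<(n + threads - 1) / threads, threads>>>(d_bytes, n);
    storeFloat<<<(n + threads - 1) / threads, threads>>>(d_floats, n);
    storeUchar4<<<(n / 4 + threads - 1) / threads, threads>>>(d_packed, n / 4);
    cudaDeviceSynchronize();

    // The packed buffer should have the same byte layout as the uchar buffer.
    std::vector<unsigned char> a(n), b(n);
    cudaMemcpy(a.data(), d_bytes, n, cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), d_packed, n, cudaMemcpyDeviceToHost);
    printf("byte layouts %s\n", memcmp(a.data(), b.data(), n) == 0 ? "match" : "differ");

    cudaFree(d_bytes);
    cudaFree(d_floats);
    cudaFree(d_packed);
    return 0;
}
```

Because the packed buffer keeps the uchar layout, it could in principle still be bound to a texture of uchar elements afterwards, though each thread then has to produce four neighbouring values, which may or may not suit the real kernel.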

So the stores are coalesced now, but there is no performance improvement: the kernel runs at the same speed as before.

Does this mean that I’m ALU bound in this case, and that in the first version the latency of the uncoalesced stores was hidden by the computationally expensive part of the kernel (the actual calculation of the value that gets stored to global memory)?

What would be a way to improve performance in this case?

Thanks!

Either you’re computation bound, you’re using a lot of texture lookups, or you’re being crippled by warp serialization or divergent threads.
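To make the "divergent threads" case concrete, here is a small sketch with made-up kernels (`divergent`, `warpUniform`) and made-up work functions (`pathA`, `pathB`). In the first kernel neighbouring threads take different branches, so a warp has to execute both paths one after the other. In the second the condition is the same for every thread of a warp. Whether the compiler really emits a branch here rather than predicated code is an assumption, so treat it as something to run under the profiler and compare, not as a guaranteed demonstration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two hypothetical code paths with enough work that a real branch is likely.
__device__ float pathA(float x)
{
    for (int k = 0; k < 64; ++k) x = x * 1.0001f + 0.5f;
    return x;
}

__device__ float pathB(float x)
{
    for (int k = 0; k < 64; ++k) x = x * 0.9999f - 0.5f;
    return x;
}

// Odd and even threads disagree, so every warp runs both paths.
__global__ void divergent(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x & 1) out[i] = pathA((float)i);
    else                 out[i] = pathB((float)i);
}

// The condition depends only on the warp index, so each warp runs one path.
__global__ void warpUniform(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) & 1) out[i] = pathA((float)i);
    else                              out[i] = pathB((float)i);
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *d_out = 0;
    cudaMalloc(&d_out, n * sizeof(float));

    divergent<<<blocks, threads>>>(d_out, n);
    warpUniform<<<blocks, threads>>>(d_out, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Running both kernels under the profiler and comparing the divergent branch counter should show the difference, if the counter is available on your card.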

Sorry, I’m not sure I got your point regarding texture lookups. The kernel that now has only coalesced stores doesn’t do any texture fetches (those happen only in the kernels that run afterwards). And this kernel still runs at the same speed.

How can I find out whether I have a problem with warp serialization or am really compute bound?
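One common way to approach this is to time stripped-down variants of the kernel: one that computes and stores, one that only stores, and one that only computes. Below is a minimal sketch of that experiment. `expensiveValue`, `variant`, `timeVariant` and the `neverTrue` flag are hypothetical names, and the loop is only a stand-in for the real computation. In the compute-only variant the store is guarded by a runtime flag that is always 0, on the assumption that the compiler would otherwise remove the whole computation as dead code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the expensive per-voxel evaluation.
__device__ float expensiveValue(unsigned int i, int iters)
{
    float v = (float)i;
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.25f;
    return v;
}

// MODE 0: compute and store. MODE 1: store only. MODE 2: compute only.
template <int MODE>
__global__ void variant(float *out, unsigned int n, int iters, int neverTrue)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = (MODE == 1) ? 1.0f : expensiveValue(i, iters);
    if (MODE == 2) {
        // neverTrue is 0 at run time, but the compiler cannot know that,
        // so the computation is kept even though nothing is written.
        if (neverTrue) out[i] = v;
    } else {
        out[i] = v;
    }
}

// Times one launch of a variant with CUDA events, after a warm-up launch.
template <int MODE>
float timeVariant(float *d_out, unsigned int n, int iters)
{
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    variant<MODE><<<blocks, threads>>>(d_out, n, iters, 0);   // warm-up
    cudaEventRecord(start);
    variant<MODE><<<blocks, threads>>>(d_out, n, iters, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const unsigned int n = 1 << 22;
    const int iters = 256;                     // arbitrary amount of fake work
    float *d_out = 0;
    cudaMalloc(&d_out, n * sizeof(float));

    printf("compute + store: %.3f ms\n", timeVariant<0>(d_out, n, iters));
    printf("store only:      %.3f ms\n", timeVariant<1>(d_out, n, iters));
    printf("compute only:    %.3f ms\n", timeVariant<2>(d_out, n, iters));

    cudaFree(d_out);
    return 0;
}
```

If the compute-only time is close to the full time, the store is being hidden behind the arithmetic and the kernel is most likely compute bound, which would explain why coalescing changed nothing. If the store-only time dominates instead, memory is the bottleneck. For warp serialization and divergence, the profiler counters are probably the more direct check.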

Thanks!