I have a fairly small routine that converts an array of unsigned short values to a complex array. The imaginary part is filled with 0.
I read in another thread that someone needed about 2 ms to read 16 MB of unsigned chars.
But my code is really slow: it takes about 1 second to convert fewer than 400 blocks of data, each containing 256k elements. All data was allocated via cublasAlloc, so it should be well aligned. I also tried using a texture of ushort2 values, thinking it might avoid bank conflicts, but it didn’t help.
If I don’t write the data back to global memory, though, the time is about 10 ms. So the global memory writes account for nearly all of the time.
I bet __syncthreads() is killing your performance. Instead, you should restructure your program so that each thread writes to a unique memory address (so multiple threads won’t collide writing the same value)…
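To make the "one thread, one unique address" idea concrete, here is a minimal sketch. It is not the original poster's code: the kernel name `ushortToComplexDirect`, the host helper `checkDirect`, and the block size of 256 are all made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: thread i reads in[i] and writes out[i], so no two
// threads touch the same address and no __syncthreads() is needed.
__global__ void ushortToComplexDirect(const unsigned short* in, float2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_float2((float)in[i], 0.0f);  // real = sample, imag = 0
}

// Hypothetical host helper: converts n samples on the GPU and returns the
// number of mismatching elements, or -1 if a CUDA call failed.
int checkDirect(int n)
{
    std::vector<unsigned short> h_in(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = (unsigned short)(i % 65536);

    unsigned short* d_in = 0;
    float2* d_out = 0;
    if (cudaMalloc(&d_in, n * sizeof(unsigned short)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float2)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(unsigned short), cudaMemcpyHostToDevice);

    const int threads = 256;                        // assumed block size
    const int blocks = (n + threads - 1) / threads; // round up to cover all n
    ushortToComplexDirect<<<blocks, threads>>>(d_in, d_out, n);

    std::vector<float2> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float2),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i].x != (float)h_in[i] || h_out[i].y != 0.0f) ++bad;
    return bad;
}
```

Calling `checkDirect(1000)` should return 0 if the conversion is correct. Timing is a separate question and has to be measured on the actual card.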
I just changed some things and now have two working versions of the code…
The times went down a little by using float4 and ushort4 (I also tried the combinations ushort4-float2 and ushort2-float). But a further speedup would be really nice, because the time is still much too high (0.75 seconds without texture and 0.70 seconds with texture).
coalescing again…
You should hard-code a multiple-of-32 block size, write your result to shared memory, sync, and then write to global memory coalesced. It’s a 2x to 8x difference, but people so often overlook it.
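A rough sketch of that recipe, under some assumptions: the names `ushortToComplexStaged`, `checkStaged`, and `BLOCK` are invented here, the block size of 256 is just one multiple of 32, and the complex output is treated as a plain float array of length 2*n (interleaved real/imag). Whether this beats a direct float2 store depends on the GPU generation, so it needs to be measured.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // hard-coded block size, a multiple of 32 (assumed value)

// Hypothetical kernel: stage interleaved (real, 0) pairs in shared memory,
// sync, then let consecutive threads write consecutive 32-bit words.
__global__ void ushortToComplexStaged(const unsigned short* in, float* out, int n)
{
    __shared__ float stage[2 * BLOCK];
    int base = blockIdx.x * BLOCK;       // first sample handled by this block
    int i = base + threadIdx.x;

    stage[2 * threadIdx.x]     = (i < n) ? (float)in[i] : 0.0f;  // real part
    stage[2 * threadIdx.x + 1] = 0.0f;                           // imaginary part
    __syncthreads();  // all threads reach this; there is no early return

    // Each thread writes two floats, BLOCK apart, so neighbouring threads
    // always hit neighbouring addresses.
    int o = 2 * base + threadIdx.x;
    if (o < 2 * n) out[o] = stage[threadIdx.x];
    o += BLOCK;
    if (o < 2 * n) out[o] = stage[threadIdx.x + BLOCK];
}

// Hypothetical host helper: returns the number of mismatches, or -1 on a CUDA error.
int checkStaged(int n)
{
    std::vector<unsigned short> h_in(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = (unsigned short)(i % 65536);

    unsigned short* d_in = 0;
    float* d_out = 0;
    if (cudaMalloc(&d_in, n * sizeof(unsigned short)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, 2 * n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(unsigned short), cudaMemcpyHostToDevice);

    ushortToComplexStaged<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);

    std::vector<float> h_out(2 * n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, 2 * n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[2 * i] != (float)h_in[i] || h_out[2 * i + 1] != 0.0f) ++bad;
    return bad;
}
```

The launch must use exactly `BLOCK` threads per block, since the shared array and the output indexing both rely on it. `checkStaged(1000)` should return 0, which also exercises the partial last block.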