I have been reading through a number of topics to check some code I’ve been working on against best practices.
These have been a few of my references:
https://devtalk.nvidia.com/default/topic/1037101/cuda-programming-and-performance/-float4-bandwidth-advantages-over-plain-float1/
https://devtalk.nvidia.com/default/topic/802048/memory-copy-for-max-coalescing/
https://stackoverflow.com/questions/42451832/cuda-profiler-reports-inefficient-global-memory-access/42451933#42451933
Some of my __global__ functions work on a cufftComplex/float2 array, but I return separate arrays to the host (say, real and imaginary), because splitting the data in a kernel is much faster than using the interleaved 2D copy from/to the host (in my particular case). Because of this, and because after the inverse C2C FFT I mostly use only its real part and the original input, I decided to keep everything in SoA storage.
For example, before the functions that depend on the result of the inverse FFT, I run this:
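__global__ void cuda_Copy_C2R(const cufftComplex * __restrict__ complex, float *real, const int LENGTH)
{
    // Grid-stride loop: each thread starts at its global index and advances by the total thread count
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;
    while (tid < LENGTH)
    {
        // Keep only the real part; cuFFT transforms are unnormalized, so scale by 1/LENGTH
        real[tid] = complex[tid].x / LENGTH;
        tid += offset;
    }
}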
Then I pass data to subsequent functions this way:
Some_Func <<< GRID_SIZE, BLOCK_SIZE >>> (input, real, output, LENGTH);
Here, input is the original data that went into the forward FFT, real is the output of the inverse FFT after I did some operations in the frequency domain, output is where I store the result of Some_Func, and LENGTH is the number of elements to process.
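To make the access pattern concrete, here is a minimal sketch of what such an SoA kernel could look like. The real body of Some_Func is not shown in this post, so the element-wise difference below is only a stand-in, and the names Some_Func_SoA and check_soa are hypothetical helpers for illustration.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical stand-in for Some_Func: the real computation is not shown in the post.
__global__ void Some_Func_SoA(const float * __restrict__ input,
                              const float * __restrict__ real,
                              float * __restrict__ output,
                              const int LENGTH)
{
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;
    while (tid < LENGTH)
    {
        // Consecutive threads read and write consecutive floats in each array
        output[tid] = input[tid] - real[tid];
        tid += offset;
    }
}

// Illustrative host helper: runs the kernel on synthetic data and returns the
// largest absolute error, or -1.0f if any CUDA call fails.
float check_soa(int length)
{
    std::vector<float> h_in(length), h_re(length), h_out(length);
    for (int i = 0; i < length; ++i) { h_in[i] = 2.0f * i; h_re[i] = 1.0f * i; }

    const size_t bytes = static_cast<size_t>(length) * sizeof(float);
    float *d_in = nullptr, *d_re = nullptr, *d_out = nullptr;

    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_re, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_re, h_re.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
    {
        Some_Func_SoA<<<64, 256>>>(d_in, d_re, d_out, length);
        err = cudaGetLastError();
    }
    // The blocking device-to-host copy also waits for the kernel to finish
    if (err == cudaSuccess) err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_re); cudaFree(d_out);
    if (err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < length; ++i)
    {
        const float e = std::fabs(h_out[i] - 1.0f * i);  // expected: 2i - i = i
        if (e > max_err) max_err = e;
    }
    return max_err;
}
```

Calling check_soa(1 << 20) from a small main should be enough to confirm the kernel behaves as intended; the grid and block sizes here are arbitrary placeholders, not my actual GRID_SIZE and BLOCK_SIZE.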
Notice that I am not packing input, real and output into a float3. If I understood these readings (plus others) correctly, it doesn’t make much of a difference on compute capability 2.x and later. But we are not working with GPUs to get SOME performance out of them; we want to make them scream.
So my question is: should I rearrange my code to pack everything into an AoS, on the assumption that no optimization done by the compiler can match or outperform a properly aligned, coalesced data structure? Correct anything you want; I have no problem stepping back if it turns out I got my basics wrong.
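For comparison, this is roughly what the packed alternative would look like. It is only a sketch under my own assumptions: the field mapping (x = input, y = real, z = output), the stand-in computation, and the names Some_Func_AoS and check_aos are all hypothetical. Note that float3 is 12 bytes with 4-byte alignment, so consecutive threads access each field with a 12-byte stride.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical AoS variant: x = input, y = real, z = output (assumed mapping).
__global__ void Some_Func_AoS(float3 *packed, const int LENGTH)
{
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;
    while (tid < LENGTH)
    {
        // Each thread touches one 12-byte element; neighbouring threads are 12 bytes apart
        packed[tid].z = packed[tid].x - packed[tid].y;
        tid += offset;
    }
}

// Illustrative host helper: returns the largest absolute error,
// or -1.0f if any CUDA call fails.
float check_aos(int length)
{
    std::vector<float3> h(length);
    for (int i = 0; i < length; ++i) h[i] = make_float3(2.0f * i, 1.0f * i, 0.0f);

    const size_t bytes = static_cast<size_t>(length) * sizeof(float3);
    float3 *d = nullptr;

    cudaError_t err = cudaMalloc(&d, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
    {
        Some_Func_AoS<<<64, 256>>>(d, length);
        err = cudaGetLastError();
    }
    // The blocking device-to-host copy also waits for the kernel to finish
    if (err == cudaSuccess) err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    if (err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < length; ++i)
    {
        const float e = std::fabs(h[i].z - 1.0f * i);  // expected: 2i - i = i
        if (e > max_err) max_err = e;
    }
    return max_err;
}
```

With this layout the three values for one element sit next to each other, whereas in my current SoA code each array is read contiguously across the warp. Which of the two is actually faster is exactly what I am unsure about.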