[SOLVED] Considerations on SoA vs AoS on the device for coalescing (individual arrays vs float2/float3)

I have been reading through a number of topics to compare best practices with some code I've been working on.
These have been a few of my references:
https://devtalk.nvidia.com/default/topic/1037101/cuda-programming-and-performance/-float4-bandwidth-advantages-over-plain-float1/
https://devtalk.nvidia.com/default/topic/802048/memory-copy-for-max-coalescing/
https://stackoverflow.com/questions/42451832/cuda-profiler-reports-inefficient-global-memory-access/42451933#42451933

Some of my global functions work on a cufftComplex/float2 array, but I return separate arrays to the host (for, say, real and imaginary), because splitting the data in a kernel is much faster than using the interleaved 2D copy from/to the host in my particular case. Because of this, after I do an inverse C2C FFT I will mostly be using its real part and the original input, so I decided to split everything into SoA storage.
For example, before my functions that depend on the result of the inverse FFT, I do this:

__global__ void cuda_Copy_C2R(const cufftComplex * __restrict__ complex, float *real, const int LENGTH)
    {
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;

    // Grid-stride loop: extract the real part and normalize by LENGTH,
    // since cuFFT inverse transforms are unnormalized
    while(tid < LENGTH)
        {
        real[tid] = complex[tid].x / LENGTH;    // Scaling
        tid += offset;
        }
    }

Then I pass data to subsequent functions this way:

Some_Func <<< GRID_SIZE, BLOCK_SIZE >>> (input, real, output, LENGTH);

Where input is the original data that went into the forward FFT, real is the output of the inverse FFT after I did some operations in the frequency domain, output is where I store the result of Some_Func, and LENGTH is how many elements are to be processed.

Notice that I am not packing input, real and output into a float3. If I understood these readings (plus others) correctly, it doesn't make much of a difference after CC 2.x. But we are not working with this GPU thing just to get SOME performance out of it; we want to make it scream.

So my question is: should I rearrange my code to pack everything into an AoS, on the grounds that no optimization done by the compiler can match/outperform a properly aligned/coalesced data structure? Correct anything you want; I have no problem stepping back if I realize I got my basics wrong.
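By "pack everything into an AoS" I mean something roughly like the sketch below (the struct and kernel names are made up for illustration, and the computation is just a placeholder):

struct Sample                     // hypothetical packed element, 12 bytes
    {
    float input;                  // original time-domain sample
    float real;                   // scaled real part of the inverse FFT
    float output;                 // result of Some_Func
    };

__global__ void Some_Func_AoS(Sample *data, const int LENGTH)
    {
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;

    while(tid < LENGTH)
        {
        Sample s = data[tid];                   // strided 4-byte loads, 12 bytes apart per thread
        data[tid].output = s.input + s.real;    // placeholder for the actual computation
        tid += offset;
        }
    }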

The benefit of your Copy_C2R routine (if any) will depend on how often you actually have to load from real in Some_Func (and in any other functions like it). If you only need to load it once, your copy routine is a waste of time and is slowing things down. If you need to load it multiple times, it may help. I wouldn't be able to say exactly when it becomes a benefit; that would require benchmarking IMO. Even with the code for Some_Func in hand, it's not always trivial to determine how many times a load must be made, because the compiler may optimize loads into registers. Alternatively, with some coding effort, you may be able to "guarantee" that the load from real is done only once in Some_Func, and therefore obviate the need for Copy_C2R.
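For example, a minimal sketch of that idea (assuming each thread only ever needs the real part of one element; the computation itself is a placeholder):

__global__ void Some_Func_Fused(const float * __restrict__ input, const cufftComplex * __restrict__ complex, float *output, const int LENGTH)
    {
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;

    while(tid < LENGTH)
        {
        float real = complex[tid].x / LENGTH;   // scale once, keep in a register
        output[tid] = input[tid] + real;        // placeholder for the actual computation
        tid += offset;
        }
    }

That way the real array never has to exist in global memory at all.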

Regarding the float3 question, I see little point in trying to go to a float3 anywhere in CUDA. You may be able to convert 3 individual float loads into one float2 vector load plus a float load (still not ideal from a coalescing point of view), but I doubt this would yield much of an efficiency gain (IMHO a coalesced warp-wide load of 128 bytes is already pretty efficient), and there are code-complexity and canonical reasons not to do it.

These opportunities look like small potatoes to me. Furthermore, this sort of "ask the experts" approach is not a good way to learn, IMO. A better way to learn is to benchmark the options, and then see if your knowledge is sufficient to explain the results. If not, research an explanation. If you then still need to ask the experts, you have a specific test case to offer, and the results (both ways) are much more credible.

My intuition is far less valuable than cold, hard profiler facts. Plus, I am human and make mistakes more often than the profiler.

Thanks a lot for your considerations.
What I take from this is that, unless there is a blatant problem in the code, in the end things have to be profiled, since the loads can't be predicted just from the source. In my mind it was black and white and I had possibly chosen the wrong path, but apparently it can be a gray area.

Apologies for the title; I will fix it. I certainly want to hear various opinions, and I can see now that readers might just ignore the topic the way it is.

Let me just add that I second everything Robert Crovella said in #2.

There are a few copies that I can’t escape, since I am returning split arrays to the host.
According to NVVP, a kernel runs slightly faster with the data packed into a float2 plus a second array for the output, instead of 3 separate arrays, but the time needed to populate the float2 with the appropriate data for a given function offsets the gain from its shorter run time.
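For reference, the packed variant I benchmarked looks roughly like this (a sketch; the name is illustrative and the computation is a placeholder):

__global__ void Some_Func_Packed(const float2 * __restrict__ in_pair, float *output, const int LENGTH)
    {
    int tid    = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;

    while(tid < LENGTH)
        {
        float2 p = in_pair[tid];        // .x = original input, .y = scaled inverse-FFT real part
        output[tid] = p.x + p.y;        // placeholder for the actual computation
        tid += offset;
        }
    }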
Well, I think that is just how it is, given the choices I had to make.
Of course, I'm thankful for the feedback above and for any that comes next.