I have been reading through a number of topics to check some code I’ve been working on against best practices.

These have been a few of my references:

https://devtalk.nvidia.com/default/topic/1037101/cuda-programming-and-performance/-float4-bandwidth-advantages-over-plain-float1/

https://devtalk.nvidia.com/default/topic/802048/memory-copy-for-max-coalescing/

https://stackoverflow.com/questions/42451832/cuda-profiler-reports-inefficient-global-memory-access/42451933#42451933

Some of my `__global__` functions work on a cufftComplex/float2 array, but I return separate arrays to the host (say, real and imaginary), because in my particular case splitting the data in a kernel is much faster than using the interleaved 2D copy from/to the host. Since I mostly use the real part of an inverse C2C FFT together with the original input, I decided to split everything into SoA storage.
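For reference, here is a rough sketch of what the strided alternative could look like, assuming "interleaved 2D copy" means `cudaMemcpy2D` treating each complex value as a one-float-wide row. The helper name `copy_real_strided` is made up for illustration and is not part of my actual code.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: copy only the real parts of n interleaved complex
// values from device memory into a contiguous host array.
// Each complex value is treated as one "row" of the 2D copy:
//   source pitch      = sizeof(cufftComplex) (8 bytes between rows)
//   destination pitch = sizeof(float)        (rows packed back to back)
//   row width         = sizeof(float)        (only the .x member is copied)
cudaError_t copy_real_strided(float *h_real, const cufftComplex *d_complex, size_t n)
{
    return cudaMemcpy2D(h_real, sizeof(float),
                        d_complex, sizeof(cufftComplex),
                        sizeof(float), n,
                        cudaMemcpyDeviceToHost);
}
```

The kernel approach replaces this with one pass on the device followed by a plain contiguous `cudaMemcpy`.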

For example, before my functions that depend on the result of the inverse FFT, I do this:

```
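__global__ void cuda_Copy_C2R(const cufftComplex * __restrict__ complex, float *real, const int LENGTH)
{
    // Grid-stride loop: each thread handles tid, tid + offset, tid + 2 * offset, ...
    int tid = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;
    while (tid < LENGTH)
    {
        // Keep only the real part; cuFFT does not normalize, so scale by 1 / LENGTH
        real[tid] = complex[tid].x / LENGTH;
        tid += offset;
    }
}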
```
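To put the kernel in context, this is a minimal sketch of the step around it, assuming a 1D in-place inverse transform. The helper name `inverse_fft_to_real`, the block size of 256 and the grid size calculation are placeholders for illustration, not the values from my real code.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Kernel from the post, repeated so the sketch is self-contained.
__global__ void cuda_Copy_C2R(const cufftComplex * __restrict__ complex, float *real, const int LENGTH)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x,
        offset = gridDim.x * blockDim.x;
    while (tid < LENGTH)
    {
        real[tid] = complex[tid].x / LENGTH; // Scaling
        tid += offset;
    }
}

// Hypothetical host helper: run an in-place inverse C2C FFT on d_freq, then
// write the scaled real part into d_real. Returns true on success.
bool inverse_fft_to_real(cufftComplex *d_freq, float *d_real, int length)
{
    cufftHandle plan;
    if (cufftPlan1d(&plan, length, CUFFT_C2C, 1) != CUFFT_SUCCESS)
        return false;

    // cuFFT transforms are unnormalized, hence the division by LENGTH in the kernel.
    bool ok = cufftExecC2C(plan, d_freq, d_freq, CUFFT_INVERSE) == CUFFT_SUCCESS;
    if (ok)
    {
        const int block = 256;                         // assumed block size
        const int grid = (length + block - 1) / block; // assumed: one thread per element
        cuda_Copy_C2R<<<grid, block>>>(d_freq, d_real, length);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    cufftDestroy(plan);
    return ok;
}
```

Both the FFT and the kernel run on the default stream here, so the kernel only starts after the transform has finished.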

Then I pass data to subsequent functions this way:

```
Some_Func <<< GRID_SIZE, BLOCK_SIZE >>> (input, real, output, LENGTH);
```

Here *input* is the original data that went into the forward FFT, *real* is the output of the inverse FFT after I did some operations in the frequency domain, *output* is where I store the result of Some_Func, and *LENGTH* is the number of elements to process.

Notice that I am not packing *input*, *real* and *output* into a float3. If I understood these readings (plus others) correctly, it doesn’t make *much* of a difference on compute capability (CC) 2.x and later. But we are not working with this GPU thing to get SOME performance out of it, we want to make it scream.
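To make the two layouts concrete, here is a small sketch of both variants. The kernel names, the subtraction in the body and the mapping of x/y/z to input/real/output are placeholders chosen for illustration; the real Some_Func does something else. In the SoA version neighbouring threads touch neighbouring 4-byte floats in each array, while in the float3 version each thread's element starts 12 bytes after its neighbour's.

```cuda
#include <cuda_runtime.h>

// SoA: three separate float arrays, same argument order as the Some_Func call.
// The subtraction is a placeholder body, not the real computation.
__global__ void some_func_soa(const float * __restrict__ input,
                              const float * __restrict__ real,
                              float * __restrict__ output, const int length)
{
    const int stride = gridDim.x * blockDim.x;
    for (int tid = blockDim.x * blockIdx.x + threadIdx.x; tid < length; tid += stride)
        output[tid] = input[tid] - real[tid]; // neighbouring threads, neighbouring floats
}

// AoS: one float3 per element, assumed mapping x = input, y = real, z = output.
__global__ void some_func_aos(float3 *packed, const int length)
{
    const int stride = gridDim.x * blockDim.x;
    for (int tid = blockDim.x * blockIdx.x + threadIdx.x; tid < length; tid += stride)
        packed[tid].z = packed[tid].x - packed[tid].y; // neighbouring threads are 12 bytes apart
}

// float3 is 12 bytes with 4-byte alignment, so it cannot be moved as a single
// aligned vector access; float4 is 16 bytes with 16-byte alignment.
static_assert(sizeof(float3) == 12 && alignof(float3) == 4, "float3 layout");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 layout");
```

Timing both kernels in a profiler on the target GPU would be the way to see how much the layout matters for a given workload.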

So my question is: should I rearrange my code to pack everything into an AoS, on the assumption that no compiler optimization can match or outperform a properly aligned, coalesced data structure? Correct anything you want; I have no problem stepping back if it turns out I got my basics wrong.