__global__ void func(..., float2 * m, int w, ...)
{
...
for (i += n; i < w; i += n)
{
float2 t = m[i];
...
Loading 8 bytes in a loop sounds simple. But when I inspected the -ptx output, to my surprise, instead of a single 8-byte load (ld.global.v2.f32) the compiler generated two 4-byte load instructions (ld.global.f32).
Playing around with the code, I managed to make nvcc do what I want. The following code works as expected, with a single 8-byte load instruction in the loop:
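For reference, this is roughly how the PTX can be inspected (a sketch: the file name kernel.cu is an assumption, and nvcc must be on the PATH):

```shell
# Emit PTX instead of building a binary.
nvcc -ptx kernel.cu -o kernel.ptx
# A vectorized load appears as ld.global.v2.f32;
# the split case appears as two separate ld.global.f32 instructions.
grep "ld.global" kernel.ptx
```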
__global__ void func(..., float2 * m, int w, ...)
...
for (i += n, m += n; i < w; i += n, m += n)
{
float2 t = *m;
...
It depends on when you use t.x and t.y. If there is a lot of code produced between the usages of t.x and t.y, the compiler may decide to save a register and not load the 2nd half until later. It is unfortunate that this breaks coalescing, but one could imagine circumstances where there are enough computations between the two memory reads that there is actually a latency-hiding benefit to split them.
I’ve found that using the volatile keyword on the variable receiving the load:
for (i += n; i < w; i += n)
{
volatile float2 t = m[i];
will usually convince the compiler to load the full type in one read.
As for the issue with stores, I’ve never seen the compiler break them up before.
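Putting the volatile workaround into a complete kernel, a minimal sketch might look like this (the kernel name, the grid-stride indexing, and the dummy computation are illustrative assumptions, not from the original post; only the load pattern matters):

```cuda
// Sketch: volatile on the destination variable to encourage a single
// vectorized ld.global.v2.f32 instead of two scalar loads.
__global__ void func(float2 *m, float *out, int w)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n   = blockDim.x * gridDim.x;        // grid-stride step
    float acc = 0.0f;
    for (int i = tid; i < w; i += n)
    {
        volatile float2 t = m[i];            // hint: load both halves at once
        acc += t.x * t.y;                    // dummy use of both components
    }
    if (tid < w)
        out[tid] = acc;
}
```

Whether the compiler honors the hint can still vary between nvcc versions, so it is worth re-checking the PTX after any toolkit upgrade.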
I assume that you are accessing global memory. Personally, I think this is an example of the problem that CUDA does not recognize the address space of the pointer from the kernel argument alone. Could you try declaring the pointer explicitly, like this:
__device__ float2 *m;
__global__ void func(..., float2 * m_arg, int w, ...)
{
m = m_arg;  // __device__ qualifies the declaration above; it is not a valid cast qualifier
...
for (i += n; i < w; i += n)
{
float2 t = m[i];
...