I have been profiling the following simple kernel:

```
__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        // Multiply each element by (-1)^(i+j) to shift the zero
        // frequency component to the center of the spectrum
        double a = pow(-1.0, (i+j)&1);
        data[j*blockDim.x*gridDim.x+i].x *= a;
        data[j*blockDim.x*gridDim.x+i].y *= a;
    }
}
```

and discovered that, due to the “struct” nature of double2, I was getting only 50% global memory load/store efficiency.

The following solution was suggested at http://stackoverflow.com/questions/14246592/coalesced-memory-access-and-global-memory-load-store-efficiency-with-complex-ari

```
__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 * 2 && j < N2) {
        // data is now a flat array of doubles: the real and imaginary
        // parts of one complex value share the same sign, hence i / 2
        double a = pow(-1.0, (i / 2 + j)&1);
        data[j*blockDim.x*gridDim.x+i] *= a;
    }
}
```

which re-established 100% global memory load/store efficiency, but turned out to be much slower overall.

Does anyone have an explanation for this?

Thank you very much in advance.