Coalesced memory access and global memory load/store efficiency with complex arithmetic in CUDA

I had been profiling the simple kernel

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i + j) & 1);

        data[j * blockDim.x * gridDim.x + i].x *= a;
        data[j * blockDim.x * gridDim.x + i].y *= a;
    }
}

and discovered that, due to the “struct” nature of double2, I was achieving only 50% global memory load/store efficiency.
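(For illustration, one common way to keep the 16-byte struct access in a single transaction is to load and store the double2 as a whole value rather than touching .x and .y through memory separately. This is only a sketch, not the code benchmarked in this thread, and it assumes N1 equals blockDim.x * gridDim.x as in the original indexing:)

```cuda
__global__ void fftshift_2D_vec(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i + j) & 1);

        // read/modify/write the double2 as one value, so the compiler can
        // emit a single 16-byte load and a single 16-byte store per thread
        double2 v = data[j * blockDim.x * gridDim.x + i];
        v.x *= a;
        v.y *= a;
        data[j * blockDim.x * gridDim.x + i] = v;
    }
}
```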

The following solution was suggested:

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 * 2 && j < N2) {
        double a = pow(-1.0, (i / 2 + j) & 1);
        data[j * blockDim.x * gridDim.x + i] *= a;
    }
}

which re-established 100% global memory load/store efficiency, but is much slower.

Does anyone have an explanation for this?

Thank you very much in advance.

Double precision pow is not very fast. Especially if you are testing on a GeForce card.

On a GeForce card this is almost certainly limited by how fast the pow can be computed, not by the actual memory bandwidth. That may even be the case on a Tesla card, but I’m not sure.

Note that in the second kernel you will compute twice as many pows as the first one…

Try removing the pow to see what effect that has.

Also it looks like you are using pow simply to toggle between -1 and 1?! Using an if statement would be waaay faster. You could even directly flip the sign bit of the floating point number if you were feeling fancy.

Thank you very much eelsen. Two good observations in your reply:

  1. pow is slow;
  2. the second kernel will compute twice as many pows as the first one.

So, I changed the

double a = pow(-1.0, (i / 2 + j)&1);

instruction to

double a = 1-2*((i/2+j)&1);

to toggle between -1 and +1.

Unfortunately, and surprisingly, the second kernel is still slower, and it becomes even slower relative to the first one as the size of the problem increases.

Hmm…you should basically have a memcpy kernel now. What is the actual bandwidth being achieved by the two kernels (in GB/sec)? What card are you using? How many elements are you processing and what is the block size? And what is the result of the bandwidth test from the SDK on your card?

Eelsen, thank you very much for your further questions.

Here are the results of the SDK bandwidth test:

Device 0: GeForce GT 540M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     3820.1

 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     3730.1

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     23871.5

[bandwidthTest.exe] test results...

Here is the performance of the two kernels:

First (double2) kernel:

Execution time: 451 microseconds
Blocksize: 16x16
Gridsize: 32x32
Global Store Throughput: 17.91 GB/s
Global Load Throughput: 17.86 GB/s

Second (double) kernel:

Execution time: 916 microseconds
Blocksize: 16x16
Gridsize: 64x32
Global Store Throughput: 5.63 GB/s
Global Load Throughput: 5.62 GB/s

In both the cases:

  1. I’m compiling with “Maximize Speed [/O2]” and “Favour Fast Code [/Ot]”;
  2. data is a 512-element array, while N1=N2=256.

Please note that the Gridsizes differ because the two kernels access global memory differently: in one case I fetch double2 data, while in the other case double data.

I would recommend you make the blockSize (32, 8) or (32, 16) so that the x dimension is the warpsize.

The timing numbers are also interesting. The double2 kernel takes about half the time of the double kernel, yet shows three times the bandwidth. Since they are reading the same amount of data, this is impossible.

Also the bandwidth numbers themselves don’t compute:

(array is 256x256, double2 is 16 bytes, read / write)

256 * 256 * 16 * 2 = 2MB

2MB / 451 usec = 4.65 GB/sec
2MB / 916 usec = 2.29 GB/sec

I would recommend simply timing the kernel yourself and calculating the performance figures since the profiler numbers don’t seem to make much sense.
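(A minimal way to do that is with CUDA events; the following is a hypothetical host-side sketch using the kernel and sizes from this thread, deriving the effective bandwidth from bytes moved divided by elapsed time:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N1 = 512, N2 = 512;
    double2 *d_data;
    cudaMalloc(&d_data, N1 * N2 * sizeof(double2));

    dim3 block(16, 16);
    dim3 grid(N1 / block.x, N2 / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fftshift_2D<<<grid, block>>>(d_data, N1, N2);   // double2 kernel from above
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // each element is read once and written once
    double bytes = 2.0 * N1 * N2 * sizeof(double2);
    printf("time: %f ms, effective bandwidth: %f GB/s\n",
           ms, bytes / (ms * 1.0e6));

    cudaFree(d_data);
    return 0;
}
```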

Hi Eelsen,
I apologize. In my previous post, I mentioned the wrong numbers. Actually, N1=N2=512.

I have timed the two kernels by myself.

double2 kernel: 450 microseconds (coinciding with the Visual Profiler)
double kernel: 679 microseconds (which, surprisingly, is less than the Visual Profiler’s indication)

I do not understand why you count reads and writes together (you multiply by 2). I am assessing the Global Load and Store Throughputs independently. Could you please explain?

Could the difference be due to the fact that

double a = 1-2*((i/2+j)&1);

is performed twice for the “double” kernel?


Those extra couple of operations shouldn’t matter. Did you try changing the blockSize to (32, 8)? My feeling is that that is the culprit.

Yes, but unfortunately nothing changed in the double case timing when changing the blockSize from (16,16) to (32,8).

As one more piece of information, I took the two timings after removing the calculation of the constant a from both kernels. These are the results:

double: 372us
double2: 436us

So the double2 kernel now takes the longest time. Should I conclude, from this result, that the calculation of the value of a is the main problem? If so, I wonder what the main issue is: the computation of a double precision number, or the access to the register containing it?

I have just made another experiment. I calculated the value of a “on the fly”, without storing it in a register, and obtained the following timings:

double: 690us
double2: 438us

Note that in the double2 case, the value of a is

double a = 1-2*((i+j)&1);

while in the double case, the value of a is

double a = 1-2*((i/2+j)&1);
(there is an extra integer division /). If I remove, just for testing purposes, the division in the double kernel, the double timing drops to 400us.