why reading doubles is much faster than reading floats?

Here’s a simple kernel:

__global__ void reading0 (T* __restrict__ target, const T * __restrict__ src, int len, int n)
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  T x ;
  for (int i = tid; i < len; i += stride)
    if (i % 32  < n)
      x = __ldg(src + i);
    // we assume x is always == 0, so the next GMEM assignment should never take place:
    // this is to fool the compiler so as it does not optimze the whole kernel out
    if (x < 0)  {
      target[i] = x;

The idea is that n controls how many threads in a warp actually request data from the global memory.
This code is run for T=float and T=double on K20M, ECC off for the same number of array elements.
One might expect that for double the execution time should be twice the time for float. It is not.

The results are here:

As expected, the overall execution time for floats is almost independent of n, although 32-byte granularity of L2 is visible. For double this time is also almost constant and equal to that for float if n <= 16; for larger n one might expect the time to increase by a factor of two (in theory the memory is read in 128-byte segments and for n > 16 an extra segment is required). However,
the time increases VERY SLOWLY, for n<24 only by 10% rather than 100%. Again one can see the 32-byte granurality of L2. Do we also see the 384-bit (=48 byte, 12 floats, 6 doubles) memory bus?

Now look at this:

It was computed using a similar kernel that reads TWO streams of data, float/double + int.
Note a very similar pattern to that in the previous figure.
Note also that reading an array of double is MUCH faster tha reading 2 independent arrays of
4-byte data (float+int)

Now have a look at this:

The green starcase is what one would expect after reading nvidia CUDA documentation (the GMEM is accessed in 128-byte segments); the blue straight line shows that an array of doubles is effectively accesed in 256-byte segements.

Can someone from nvidia comment on this? How accessing doubles differs from accessing 4-byte streams of data? Do you use any “magic” to speed up 8-byte transfers? Do you use half-warps? Why the second half-warp behaves so differently than the first one? The most important question for me: will it be OK to assume that effectively the memory segements are 256-byte long for arrays of 8-byte data?

Z Koza

I can’t see the figures in my post; so here are the links:
Fig. 1
Fig. 2
Fig. 3

It could have to do with the fact that there are only 64 double-precision units compared to the 192 single-precision units. That would mean that there would be 66% the performance of the single-precision and not half.

Edit 1:
Also I forgot to ask, but isn’t the bus width 320-bit and not 384 for the K20? The documentation for it says that at least.

Edit 2:
Also looking at your second graph I do believe the reason that it looks that way is because you do two memory fetches, one right after the other. So you just add the time fetch that. Of course the time to add a double and an integer will take time but it will be the same for each thread. For the third graph I am not so sure. I have done some computation on the GPU and notice that you don’t get that jump. I think it might have to do with the fact that while you are grabbing memory you have more warps that you can swap in?

Ignore this, I didn’t see the edit button :/