Why is reading doubles much faster than reading floats?

Here’s a simple kernel:

template <typename T>
__global__ void reading0 (T* __restrict__ target, const T* __restrict__ src, int len, int n)
{
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  T x = 0;
  for (int i = tid; i < len; i += stride)
  {
    // only the first n threads of each warp issue a global-memory read
    if (i % 32 < n)
      x = __ldg(src + i);
    // src is zero-filled, so x is always == 0 and the next GMEM store should
    // never take place: it is there only to fool the compiler so that it does
    // not optimize the whole kernel away
    if (x < 0)  {
      target[i] = x;
      printf("ERROR\n");
    }
  }
}

The idea is that n controls how many threads in a warp actually request data from global memory.
This code was run for T = float and T = double on a K20M with ECC off, for the same number of array elements.
One might expect that for double the execution time should be twice the time for float. It is not.
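For reference, here is a minimal host-side driver along these lines (a sketch: the launch configuration, the zero-filled source array, and the cudaEvent-based timing are my assumptions, not taken from the original benchmark):

#include <cstdio>
#include <cuda_runtime.h>

// Times one kernel launch for a given element type T and a given n.
// The source array is zero-filled so that x stays == 0 inside the kernel.
template <typename T>
float time_reading0(int len, int n)
{
  T *src, *target;
  cudaMalloc(&src, len * sizeof(T));
  cudaMalloc(&target, len * sizeof(T));
  cudaMemset(src, 0, len * sizeof(T));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  reading0<T><<<1024, 256>>>(target, src, len, n);  // launch config is a guess
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0;
  cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(src);
  cudaFree(target);
  return ms;
}

// Usage: sweep n from 1 to 32 for both element types, e.g.
//   for (int n = 1; n <= 32; ++n)
//     printf("%d %f %f\n", n, time_reading0<float>(len, n), time_reading0<double>(len, n));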

The results are here:

As expected, the overall execution time for floats is almost independent of n, although the 32-byte granularity of L2 is visible. For double this time is also almost constant, and equal to that for float, as long as n <= 16; for larger n one might expect the time to increase by a factor of two (in theory the memory is read in 128-byte segments, and for n > 16 an extra segment is required). However, the time increases VERY SLOWLY: for n < 24 only by 10% rather than 100%. Again one can see the 32-byte granularity of L2. Do we also see the 384-bit (= 48-byte, 12 floats, 6 doubles) memory bus?
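To make that expectation explicit: with the first n threads of a warp reading consecutive elements of sizeof(T) bytes each, the warp should touch ceil(n * sizeof(T) / 128) 128-byte segments. A back-of-the-envelope sketch (my formulation of the naive model, not measured data):

// Number of 128-byte segments one warp should touch when its first n
// threads read consecutive elements of 'elem' bytes each.
static int segments128(int n, int elem)
{
  return (n * elem + 127) / 128;   // integer ceil(n * elem / 128)
}

// float  (elem = 4): segments128(n, 4)  == 1 for every n <= 32
// double (elem = 8): segments128(16, 8) == 1, but segments128(17, 8) == 2,
// so this model predicts a 2x jump in traffic (and time) at n = 17.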

Now look at this:

It was computed using a similar kernel that reads TWO streams of data, float/double + int.
Note a very similar pattern to that in the previous figure.
Note also that reading an array of doubles is MUCH faster than reading 2 independent arrays of
4-byte data (float + int).
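The two-stream kernel itself is not shown in the post; a minimal sketch of what it might look like (my reconstruction using the same never-taken-branch trick, not the author's exact code):

template <typename T>
__global__ void reading2 (T* __restrict__ target, const T* __restrict__ src,
                          const int* __restrict__ src2, int len, int n)
{
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  T x = 0;
  int y = 0;
  for (int i = tid; i < len; i += stride)
  {
    if (i % 32 < n) {
      x = __ldg(src + i);    // stream 1: float or double
      y = __ldg(src2 + i);   // stream 2: int
    }
    // never taken for zero-filled inputs; keeps both loads alive
    if (x + y < 0) {
      target[i] = x;
      printf("ERROR\n");
    }
  }
}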

Now have a look at this:

The green staircase is what one would expect after reading the NVIDIA CUDA documentation (GMEM is accessed in 128-byte segments); the blue straight line shows that an array of doubles is effectively accessed in 256-byte segments.
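In code, the two models plotted there amount to something like this (my formulation; traffic per warp in units of segments, with time assumed proportional to it):

// Green staircase: doubles fetched in 128-byte segments, as the CUDA
// documentation suggests - one segment up to n = 16, two segments beyond.
static int model_128B(int n) { return (n * 8 + 127) / 128; }

// Blue straight line: what the measurements suggest instead - a single
// 256-byte segment covers all 32 doubles of the warp, for any n <= 32.
static int model_256B(int n) { return (n * 8 + 255) / 256; }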

Can someone from NVIDIA comment on this? How does accessing doubles differ from accessing 4-byte streams of data? Do you use any “magic” to speed up 8-byte transfers? Do you use half-warps? Why does the second half-warp behave so differently from the first one? The most important question for me: will it be OK to assume that the memory segments are effectively 256 bytes long for arrays of 8-byte data?

Z Koza

I can’t see the figures in my post, so here are the links:
Fig. 1
Fig. 2
Fig. 3

It could have to do with the fact that there are only 64 double-precision units compared to the 192 single-precision units. That would mean you would get 66% of the single-precision performance, not half.

Edit 1:
Also, I forgot to ask: isn’t the bus width of the K20 320-bit rather than 384-bit? At least that is what the documentation says.

Edit 2:
Also, looking at your second graph, I believe it looks that way because you do two memory fetches, one right after the other, so you simply add the time to fetch the second stream. Of course adding a double and an integer takes time, but it is the same for each thread. For the third graph I am not so sure. I have done some computation on the GPU and noticed that you don’t get that jump. I think it might have to do with the fact that while you are fetching memory you have more warps that you can swap in?
