Kernel faster in double precision than in single?

Hi,

I have a curious behavior with a simple kernel:

template< typename T >
__global__ void kernel(int m, T *A, T *u1, T *u2) {
   // N and M are compile-time constants defined in the attached test case
   // (M is the row stride of A)
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if(i<N) {
     T u1_scalar = u1[i];
     T u2_scalar = u2[i];
     int j;
     for(j=0; j<m; j++) {
       A[i*M+j]=A[i*M+j]+u1_scalar+u2_scalar;
     }
   }
}

I run it with 2048 threads (32 blocks of 64 threads per block). The float version runs in 3.5 ms while the double version runs in 2.5 ms. I time it using CUDA events and/or with the CUDA_PROFILE=1 environment variable, on a Tesla C2050.
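For reference, the timing follows the usual CUDA event pattern, roughly like this (variable names here are illustrative placeholders, not necessarily those in the attached file):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<float><<<32, 64>>>(m, d_A, d_u1, d_u2);   // same launch for <double>
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);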

The PTX is exactly the same for both versions (except that 64-bit instructions are used instead of 32-bit ones).

I attached the test case to reproduce it; compile it with:

nvcc float_vs_double.cu -arch=sm_20 -o float_vs_double

Is this a commonly encountered phenomenon? What is the explanation?
float_vs_double.cu (2.7 KB)

Your kernel is entirely bandwidth bound and follows a suboptimal memory access pattern that relies heavily on the cache for decent performance. At the same time, its cache footprint is 4× the actual cache size, so small code changes may lead to large variations in performance.

Rearrange the work so that the threads of a block access consecutive memory locations; the loop in the kernel can then use a larger increment. You should see a decent speedup relative to both the current single and double precision versions.
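Something along these lines, as a rough sketch (it keeps the N and M constants from your code, assigns one row of A per block, and would therefore be launched with N blocks instead of 32):

template< typename T >
__global__ void kernel_coalesced(int m, T *A, T *u1, T *u2) {
   // one block per row: threads of the block touch consecutive elements
   // of that row, so each warp reads/writes contiguous 128-byte segments
   int i = blockIdx.x;
   T s = u1[i] + u2[i];
   for(int j = threadIdx.x; j < m; j += blockDim.x) {
     A[i*M+j] = A[i*M+j] + s;
   }
}

// launched as, e.g.: kernel_coalesced<float><<<N, 64>>>(m, d_A, d_u1, d_u2);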

I know this is a bad version; I wanted to use it as a pedagogical example, “to be analyzed and optimized”. I should have been clear about that in my first post.

Still, I don’t understand the difference between the float and double versions.

Mmmmm, the cache footprint of the double version will be twice that of the float version, so the result is still counterintuitive to me.

You don’t seem to be doing enough error checking after calls like cudaEventSynchronize or cudaMemcpy, so it’s not clear whether your kernels are really working at all.
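For example, a minimal checking pattern could look like this (the macro name is arbitrary; requires stdio.h/stdlib.h):

#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

CUDA_CHECK(cudaEventSynchronize(stop));
CUDA_CHECK(cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost));

// kernel launches return void, so query the error state after the launch:
kernel<float><<<32, 64>>>(m, d_A, d_u1, d_u2);
CUDA_CHECK(cudaGetLastError());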

To be sure, you could also check the results to see whether things are really working properly.
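A rough fragment for the float version (host-side names are placeholders; h_A_init would hold a copy of A before the kernel ran, and it assumes math.h/stdio.h):

int mismatches = 0;
for(int i = 0; i < N; i++) {
  for(int j = 0; j < m; j++) {
    float expected = h_A_init[i*M+j] + h_u1[i] + h_u2[i];
    if(fabsf(h_A[i*M+j] - expected) > 1e-5f * fabsf(expected) + 1e-6f)
      mismatches++;
  }
}
printf("%d mismatches\n", mismatches);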

In both cases, a full 128-byte cache line is fetched for each array element: consecutive threads in a warp access addresses M elements apart, so each access lands on a different cache line regardless of the element size. Whether 4 or 8 of those 128 bytes are actually used has no direct speed implication.