Kernel faster in double precision than in single?

Hi,

I have a curious behavior with a simple kernel:

template< typename T >
__global__ void kernel(int m, T *A, T *u1, T *u2) {
   // N and M are compile-time constants defined in the attached test case
   // (M is the row stride of A)
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if(i<N) {
     T u1_scalar = u1[i];
     T u2_scalar = u2[i];
     int j;
     for(j=0; j<m; j++) {
       A[i*M+j]=A[i*M+j]+u1_scalar+u2_scalar;
     }
   }
}

I run it with 2048 threads (32 blocks of 64 threads per block). The float version runs in 3.5 ms while the double version runs in 2.5 ms. I time it using CUDA events and/or with the CUDA_PROFILE=1 environment variable, on a Tesla C2050.
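For reference, the timing follows the usual CUDA event pattern, roughly like this (variable names here are illustrative placeholders, not necessarily those in the attached file):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<float><<<32, 64>>>(m, d_A, d_u1, d_u2);   // same launch for <double>
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);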

The PTX is exactly the same for both versions (except that 64-bit instructions are used instead of 32-bit ones).

I attached the test case to reproduce it; compile it with:

nvcc float_vs_double.cu -arch=sm_20 -o float_vs_double

Is this a commonly encountered phenomenon? What is the explanation?
float_vs_double.cu (2.7 KB)

Your kernel is entirely bandwidth bound and follows a suboptimal memory access pattern that relies heavily on the cache for decent performance. At the same time, its cache footprint is 4× the actual cache size, so small code changes may lead to large variations in performance.

Rearrange the work so that the threads of a block access consecutive memory locations; the loop in the kernel can then use a larger increment. You should see a decent speedup relative to both the current single and double precision versions.
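Something along these lines, as a rough sketch (it keeps the N and M constants from your code, assigns one row of A per block, and would therefore be launched with N blocks instead of 32):

template< typename T >
__global__ void kernel_coalesced(int m, T *A, T *u1, T *u2) {
   // one block per row: threads of the block touch consecutive elements
   // of that row, so each warp reads/writes contiguous 128-byte segments
   int i = blockIdx.x;
   T s = u1[i] + u2[i];
   for(int j = threadIdx.x; j < m; j += blockDim.x) {
     A[i*M+j] = A[i*M+j] + s;
   }
}

// launched as, e.g.: kernel_coalesced<float><<<N, 64>>>(m, d_A, d_u1, d_u2);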

I know this is a bad version; I wanted to use it as a pedagogical example, “to be analyzed and optimized”. I should have been clear about that in my first post.

Still, I don’t understand the difference between the float and double versions.

Mmmmm, the cache footprint of the double version will be twice that of the float version, so the result is still counterintuitive to me.

You don’t seem to be doing enough error checking after calls like cudaEventSynchronize or cudaMemcpy, so it’s not clear whether your kernels are really working at all.
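For example, a minimal checking pattern could look like this (the macro name is arbitrary; requires stdio.h/stdlib.h):

#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

CUDA_CHECK(cudaEventSynchronize(stop));
CUDA_CHECK(cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost));

// kernel launches return void, so query the error state after the launch:
kernel<float><<<32, 64>>>(m, d_A, d_u1, d_u2);
CUDA_CHECK(cudaGetLastError());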

To be sure, you could also check the results to see whether things are really working properly.
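A rough fragment for the float version (host-side names are placeholders; h_A_init would hold a copy of A before the kernel ran, and it assumes math.h/stdio.h):

int mismatches = 0;
for(int i = 0; i < N; i++) {
  for(int j = 0; j < m; j++) {
    float expected = h_A_init[i*M+j] + h_u1[i] + h_u2[i];
    if(fabsf(h_A[i*M+j] - expected) > 1e-5f * fabsf(expected) + 1e-6f)
      mismatches++;
  }
}
printf("%d mismatches\n", mismatches);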

In both cases, a full 128-byte cache line is fetched for each array element: consecutive threads in a warp access addresses M elements apart, so each access lands on a different cache line regardless of the element size. Whether 4 or 8 of those 128 bytes are actually used has no direct speed implication.