Is float computation really so slow?


I’m quite new to CUDA, so I hope to find some help here.

I’m trying to implement an image processing algorithm that computes with float values, and it is really slow, even without writing the result back to global memory (so coalescing doesn’t play a role here). The details of the implementation are not important, because I have reduced the issue to the following small kernel:

__global__ void kernel(float *a)
{
    __shared__ float temp[1024][2];

    float value1 = a[threadIdx.x] * a[threadIdx.x+1] * a[threadIdx.x+2] * a[threadIdx.x+3];
    float value2 = a[threadIdx.x] / a[threadIdx.x+1] * a[threadIdx.x+3] / a[threadIdx.x+2];

    temp[threadIdx.x][0] = value1;
    temp[threadIdx.x][1] = value2;
}

As you can see, the kernel computes value1 and value2 (the formulas don’t make sense; they just ensure that value1 and value2 are different) from the array ‘a’ (which is initialized with random floats between 0 and 1) and stores these values in the shared array ‘temp’.

The configuration for this kernel call is as follows (which is quite close to the configuration in my algorithm):

dim3 block_size;
block_size.x = 1024; //max threads per block of my gpu

dim3 grid_size; //524,288 blocks in total; surely quite a lot
grid_size.x = 128;
grid_size.y = 64;
grid_size.z = 64;

The kernel call is followed by a cudaThreadSynchronize().
In the main method of my program I do a time measurement. If I run the program with all the lines in the kernel commented out, the execution time is 0.04 seconds. If I comment out the computation of value2 (so only value1 is computed and stored in ‘temp’), it takes 0.3 seconds. With both values computed and stored in ‘temp’ it takes 0.5 seconds.
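Host-side wall-clock timing can also pick up launch and synchronization overhead; a common alternative is to time the kernel itself with CUDA events. A minimal sketch (the names kernel, d_a, grid_size, and block_size are assumed from the post, not tested code):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid_size, block_size>>>(d_a);   // d_a: device copy of 'a'
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```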

What causes this enormous increase in execution time?
Is it the read access to array ‘a’, which resides in global memory? On the other hand, if I do the whole thing with int instead of float (which is also 4 bytes in size), execution is almost twice as fast as with float. Can float multiplication (and division) really be that slow on the GPU?

My GPU is a Quadro K1100M (SM 3.0). Does anyone have an explanation for these execution times? This prevents my algorithm from running in real time.

Thanks a lot!

Are you on windows or linux? If on windows, are you using visual studio? If using visual studio, are you building a debug or release project?

I’m using Windows 7 64-bit and Visual Studio 2008 (I have to compile the whole code as 32-bit due to the requirements of the project I’m currently working on). The measurement was done in debug mode; in release mode the measured time is the same.

Your kernel as presented will be fully optimized away by the compiler (regardless of float vs. int) if you compile the project in release mode, because you are not modifying any externally visible state.
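One way to keep the computation from being eliminated is to write a result derived from it back to global memory. A sketch of the kernel above with such a sink (the extra parameter ‘out’ and its indexing are my additions, for illustration only):

```cuda
__global__ void kernel(float *a, float *out)
{
    __shared__ float temp[1024][2];

    float value1 = a[threadIdx.x] * a[threadIdx.x+1] * a[threadIdx.x+2] * a[threadIdx.x+3];
    float value2 = a[threadIdx.x] / a[threadIdx.x+1] * a[threadIdx.x+3] / a[threadIdx.x+2];

    temp[threadIdx.x][0] = value1;
    temp[threadIdx.x][1] = value2;
    __syncthreads();

    // Externally visible write: the compiler can no longer discard the math above.
    out[blockIdx.x * blockDim.x + threadIdx.x] = temp[threadIdx.x][0] + temp[threadIdx.x][1];
}
```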

In my opinion, there is little utility in trying to explain debug mode timing. It is not reflective of actual timings in release mode, nor should you be creating debug mode projects for production purposes.

I suggest that you craft a representative comparison in a project compiled in release mode.
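If it helps, a release-mode comparison can also be built from the command line with optimizations enabled; a sketch, assuming a single source file main.cu (the filename and output name are assumptions):

```shell
# Release build: optimized, targeting the Quadro K1100M (SM 3.0), 32-bit as required
nvcc -O2 -arch=sm_30 -m32 -o bench main.cu
```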