Hello,
I’m quite new to CUDA, so I hope to find some help here.
I’m trying to implement an image processing algorithm that computes with float values, and it is really slow, even when I don’t write the result back to global memory (so coalescing of the writes doesn’t play a role here). The details of the implementation are not important, because I narrowed the issue down to the following small kernel:
__global__ void kernel(float *a)
{
    __shared__ float temp[1024][2];
    float value1 = a[threadIdx.x] * a[threadIdx.x+1] * a[threadIdx.x+2] * a[threadIdx.x+3];
    float value2 = a[threadIdx.x] / a[threadIdx.x+1] * a[threadIdx.x+3] / a[threadIdx.x+2];
    temp[threadIdx.x][0] = value1;
    temp[threadIdx.x][1] = value2;
}
As you can see, the kernel computes value1 and value2 from array ‘a’ (which is initialized with random floats between 0 and 1) and stores them in a shared array ‘temp’. The formulas themselves don’t mean anything; they only ensure that value1 and value2 are different.
The configuration for this kernel call is as follows (which is quite close to the configuration in my algorithm):
dim3 block_size;
block_size.x = 1024; //max threads per block of my gpu
dim3 grid_size; //128 * 64 * 64 = 524,288 blocks in total; surely quite a lot
grid_size.x = 128;
grid_size.y = 64;
grid_size.z = 64;
After the kernel call follows a cudaThreadSynchronize().
In the main function of my program I measure the time. If I run the program with all the lines in the kernel commented out, the execution time is 0.04 seconds. If I comment out only the computation of value2 (so only value1 is computed and written to ‘temp’), it takes 0.3 seconds. With both values computed and written to ‘temp’, it takes 0.5 seconds.
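A minimal sketch of a self-contained timing harness for the kernel above, in case someone wants to reproduce the numbers. It is not my original measurement code: the array length of 1027 (enough for the a[threadIdx.x+3] read of the last thread), the host-side names h_a and d_a, and the use of CUDA events instead of a host timer are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel(float *a)
{
    __shared__ float temp[1024][2];
    float value1 = a[threadIdx.x] * a[threadIdx.x+1] * a[threadIdx.x+2] * a[threadIdx.x+3];
    float value2 = a[threadIdx.x] / a[threadIdx.x+1] * a[threadIdx.x+3] / a[threadIdx.x+2];
    temp[threadIdx.x][0] = value1;
    temp[threadIdx.x][1] = value2;
}

int main()
{
    // Assumed size: thread 1023 reads up to a[1026], so 1027 floats are needed.
    const int n = 1024 + 3;
    const size_t bytes = n * sizeof(float);

    // Fill the host array with random floats between 0 and 1.
    float *h_a = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_a[i] = rand() / (float)RAND_MAX;

    float *d_a = NULL;
    cudaMalloc((void **)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    // Same launch configuration as in the post.
    dim3 block_size(1024);
    dim3 grid_size(128, 64, 64);

    // Time only the kernel with CUDA events (assumption: the original
    // measurement used a host timer around the launch and the sync).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<grid_size, block_size>>>(d_a);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    free(h_a);
    return 0;
}
```

Timing with events measures only the kernel itself, so context creation and the host-to-device copy are excluded; a host timer around the whole program would include them.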
What causes this enormous increase in execution time?
Is it because of the read access to array ‘a’, which resides in global memory? On the other hand, if I do the whole thing with int instead of float (which is also 4 bytes in size), the kernel runs almost twice as fast as with float. Can float multiplication (and division) really be that slow on the GPU?
My GPU is a Quadro K1100M (SM 3.0). Does anyone have an explanation for these execution times? The slowdown prevents my algorithm from running in real time.
Thanks a lot!