Division problem in CUDA kernel

Hi, I’ve just started using CUDA and have come across a problem that’s got me stuck. Hopefully someone here can help out!
I have a host function that sets up three arrays (d_A, d_B and d_C) on the device, then loads data from the host (h_A) into one of these arrays (d_A). I then call a kernel function to try to calculate a rolling mean. d_A is unmodified, d_B is set equal to the sum of the 9 relevant elements of d_A and d_C is d_B/9.
Here’s the kernel code:

__global__ void test_func(float* data,float* otherdata,float* outdata)
{
	int i=blockDim.x * blockIdx.x + threadIdx.x;
	otherdata[i]=0;
	outdata[i]=data[i];
	if (i>=4)
	{
		int j=0;
		for (j=-4;j<=4;j++)otherdata[i]=otherdata[i]+data[i+j];
		outdata[i]=otherdata[i]/9.0;
	}
}

I call it with this:

test_func<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);

Afterwards I use CUDA_CHECK_RETURN (from the nsight default code) on a cudaThreadSynchronize and a cudaGetLastError. My problem is that the division ( outdata[i]=otherdata[i]/9.0 ) does not work and gives me this error:

Error too many resources requested for launch at line 80 in file ../src/MAIN_FUNCS.cu

Line 80 is cudaGetLastError. If I switch the /9.0 to *9.0 then the code works fine. It also works fine if I change the line to read: outdata[i]=1.0/9.0

I’m running on Ubuntu 12.04 with Intel i7 processor, 24GiB RAM, a Tesla C2070 and the Cuda developers kit v5.5. I wrote/compiled my code in the nsight GUI.

Any ideas what I’m doing wrong? I assume I’m making a newbie mistake somewhere.
Thanks!

The resource limit that is exceeded is the number of registers. There is no hardware support for floating-point divisions (or integer divisions, for that matter) on the GPU, so these operations are implemented as software subroutines that require additional registers for temporary storage. You can observe the difference in register usage between multiplication and division versions of your code by adding -Xptxas -v to the nvcc invocation.

Since you are processing ‘float’ data, you would want to use the single-precision constant 9.0f instead of the doubl-precision constant 9.0. The use of the double-precision constant causes the division to be performed in double precision, requiring even more instructions and more registers. If, after that change, the launch still fails due to lack of resources, you will have to reduce the number of threads specified in the kernel launch.

Thanks for your fast reply. Changing from double to float fixed the problem and I’ll keep in mind the register limit for more complex code in the future.

It is possible that the /9 operation requires so many regiters that the are not enough. Try decreaseing the numerb of threads pe block. Also use the flags -Xptxas -v to find out the resources each kenel uses. multiply the numer of registers with the numebr of threads per block and compare with the number of registers per SMP.