Kernel works in single precision but not in double

I have a kernel that works perfectly on multiple cards in single precision (GTX 280 and Tesla C1060). If I switch to double precision, literally by replacing ‘float’ with ‘double’, the kernel runs but gives all-zero results. This happens on multiple cards, under both Linux and Windows (both 64-bit). I’m using CUDA 2.1 and compiling on the GTX 280. I added ‘-arch sm_13’ to the nvcc command line, but it appears to make no difference to the final output.

Has anyone else seen something similar? Is there something extra that needs to be done to get double precision working? The kernel is simple, and I can post the code if that would be helpful.

Ron

Compile with the command-line option -arch=sm_13.
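If it helps, here is a minimal sanity check I would build both with and without the flag (the file and kernel names here are made up for illustration). On the 2.x toolchain, compiling double-precision code without -arch=sm_13 demotes the doubles to single precision, and ptxas prints a warning to that effect, so the build log is worth a look:

// dbltest.cu -- build with: nvcc -arch=sm_13 dbltest.cu -o dbltest
#include <cstdio>

__global__ void dbl_test(double *out)
{
    // 1/3 is not exactly representable, so the stored digits reveal the precision
    out[threadIdx.x] = 1.0 / 3.0;
}

int main()
{
    double h_out[32];
    double *d_out;
    cudaMalloc((void **)&d_out, sizeof(h_out));
    dbl_test<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("%.17g\n", h_out[0]);  // expect 0.33333333333333331 with real doubles
    cudaFree(d_out);
    return 0;
}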

Already tried that before posting (‘-arch sm_13’ and ‘-arch=sm_13’), and it makes no difference to the output. Since the only change between the two versions is replacing ‘float’ with ‘double’ and recompiling (with -arch=sm_13), I’m thinking there is something else in the compiler, like another switch that needs to be set.

Ron

Are you calling cudaGetLastError() immediately after launching the kernel (before synchronizing or doing a memcpy) and checking that it returns cudaSuccess?

That isn’t the only change, though: switching the floating-point representation to double increases the number of registers the kernel uses and the amount of shared memory it requires for a given buffer size. It is entirely possible that your kernel is failing to launch because the double-precision version, run with your single-precision execution parameters, exceeds a resource limit. You may need to reduce either the shared memory usage or the threads per block.
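If you want to rule that out programmatically, a quick check along these lines will tell you whether the launch configuration fits the card (device 0 and the shared-memory figure are placeholders; substitute whatever your kernel actually allocates):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 as a placeholder

    // hypothetical numbers: swap in your real block size and per-thread smem use
    const int threadsPerBlock = 256;
    const unsigned long smemPerBlock = threadsPerBlock * sizeof(double);

    if (threadsPerBlock > prop.maxThreadsPerBlock)
        printf("block size %d exceeds the limit of %d\n",
               threadsPerBlock, prop.maxThreadsPerBlock);
    if (smemPerBlock > prop.sharedMemPerBlock)
        printf("smem request %lu exceeds the per-block limit of %lu\n",
               smemPerBlock, (unsigned long)prop.sharedMemPerBlock);
    return 0;
}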

I added -Xptxas -v to the nvcc command line and rebuilt the kernel:

ptxas info : Compiling entry function ‘_Z8k_filterPdPhi’

ptxas info : Used 8 registers, 36+32 bytes smem, 41000 bytes cmem[0], 4 bytes cmem[1]

The single-precision version uses 6 registers, the same amount of smem, and 21000 bytes of cmem[0]. My single-precision version runs 256 threads per block; if I drop the double version to 64, I still get the same output. I am checking for errors after all CUDA calls and none are reported.

This seems rather strange to me. I don’t see a resource limit being exceeded (and if one were, shouldn’t there be a warning?).

Ron

Seriously, you should be doing something like this:

kernelCall<<<whatever>>>(whatever here too);

cudaError_t result = cudaGetLastError();
if (result != cudaSuccess)
    printf("oh no everything is ruined forever--the kernel didn't even launch\n");

result = cudaThreadSynchronize();
if (result != cudaSuccess)
    printf("oh no now the kernel itself broke in the middle of execution\n");

I was checking for errors; there weren’t any. However, I did find what was causing the problem: the declaration of constant memory. I had:

#define MAX_LIMITS 5
#define MAX_PARAMS 1000
#define MAX_PARAM_SPACE (MAX_PARAMS*MAX_LIMITS)

__device__ __constant__ double d_params[MAX_PARAM_SPACE];
__device__ __constant__ unsigned char d_index[MAX_PARAMS];

and when I changed it to:

__device__ __constant__ float d_params[MAX_PARAM_SPACE];

the kernel now works properly, with the rest of the data remaining double. With ‘float’ I’m using 21000 bytes of constant memory; with ‘double’ it is 41000 bytes, which is still well under the 64 KB limit, so I’m not sure why there would be a difference.
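For completeness, here is roughly how a constant array like that would be declared and filled; the host-side upload isn’t shown anywhere in this thread, so the cudaMemcpyToSymbol call and h_params below are my assumptions rather than the actual code. One thing worth double-checking in a situation like this is that the byte count passed to the copy tracks the element type, since it has to double when the symbol goes from float to double:

#define MAX_LIMITS 5
#define MAX_PARAMS 1000
#define MAX_PARAM_SPACE (MAX_PARAMS*MAX_LIMITS)

__device__ __constant__ double d_params[MAX_PARAM_SPACE];

int main()
{
    // hypothetical host-side upload (not from the thread)
    static double h_params[MAX_PARAM_SPACE];
    // the count is in bytes: with sizeof(float) here, only half of the
    // symbol's storage would be written and reads would come back wrong
    cudaMemcpyToSymbol(d_params, h_params, MAX_PARAM_SPACE * sizeof(double));
    return 0;
}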