I have a kernel that works perfectly on multiple cards in single precision (GTX-280 and Tesla C1060). If I switch to double precision, literally by replacing ‘float’ with ‘double’, the kernel runs but gives all-zero results. This happens on multiple cards and under both Linux and Windows (both 64-bit). I’m using CUDA 2.1 and compiling on the GTX-280. I added ‘-arch sm_13’ to the nvcc command line but it appears to make no difference to the final output.
Have other people seen anything similar? Is there something extra that needs to be done to get double precision to work? The kernel is simple and I can post the code if it is deemed helpful.
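In the meantime, here is a stripped-down sketch of the general shape of the kernel (the names and the operation are made up for illustration, not my actual code):

// illustrative only -- kernel and variable names are made up, not the real code
// built roughly as:
//   nvcc -arch=sm_13 -Xptxas -v -c kernel.cu
// (as far as I know, without -arch=sm_13 the compiler demotes double to float and prints a warning)

__global__ void k_scale(double *out, const double *in, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;   // all-double arithmetic; the float version just swaps the types
}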
Already tried that before posting (‘-arch sm_13’ and ‘-arch=sm_13’) and it makes no difference to the output. Since the only change between the two versions is replacing ‘float’ with ‘double’ and then recompiling (with -arch=sm_13), I’m thinking there is something else on the compiler side, like another switch that needs to be set.
Are you calling cudaGetLastError() immediately after launching the kernel (before synchronizing or doing a memcpy) and checking that it returns cudaSuccess?
That isn’t the only change, though. Switching the floating-point representation to double increases the number of registers the kernel uses and the amount of shared memory it requires for a given buffer size. It is entirely possible that your kernel is failing to launch because the double precision version, run with your single precision execution parameters, is exceeding a resource limit. You might need to reduce either the shared memory per block or the threads per block.
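If you want to rule that out, something like this quick (untested) sketch will print the per-block limits of your card so you can compare them against your launch configuration and the ptxas resource report:

// rough, untested sketch -- prints the per-block resource limits of device 0
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("registers per block  : %d\n", prop.regsPerBlock);
    printf("shared mem per block : %u bytes\n", (unsigned)prop.sharedMemPerBlock);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("total constant mem   : %u bytes\n", (unsigned)prop.totalConstMem);
    return 0;
}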
I added -Xptxas -v to the nvcc command line and rebuilt the kernel:
ptxas info : Compiling entry function ‘_Z8k_filterPdPhi’
ptxas info : Used 8 registers, 36+32 bytes smem, 41000 bytes cmem[0], 4 bytes cmem[1]
The single precision version uses 6 registers, the same amount of smem, and 21000 bytes of cmem[0]. My single precision version runs 256 threads per block; if I drop the double version to 64 threads per block I still get the same all-zero output. I am checking for errors after every CUDA call and none are reported.
This seems rather strange to me. At 8 registers × 256 threads that is only 2048 registers per block, and 36+32 bytes of smem is nowhere near the 16 KB limit, so I don’t see a resource limit being exceeded (and if one were, shouldn’t an error be reported?).
// first check that the kernel launched at all
kernelCall<<<whatever>>>(whatever here too);
cudaError_t result = cudaGetLastError();
if (result != cudaSuccess)
    printf("oh no everything is ruined forever--the kernel didn't even launch\n");

// then check that it actually ran to completion
result = cudaThreadSynchronize();
if (result != cudaSuccess)
    printf("oh no now the kernel itself broke in the middle of execution\n");
If I keep the constant memory as ‘float’, the kernel now works properly with the rest of the data being double. With ‘float’ I’m using 21000 bytes of constant memory; with ‘double’ it is 41000 bytes, which is still well under the 64 KB limit, so I’m not sure why there would be a difference.
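For reference, the constant-memory piece looks roughly like this when switched over to double (names and sizes are made up for illustration, not my actual code); the only change besides the type is that the byte count in the host-side copy doubles:

// illustrative sketch only -- names and sizes are made up
#include <cuda_runtime.h>

#define N_COEFFS 5000                     // ~20000 bytes as float, ~40000 bytes as double

__constant__ double d_coeffs[N_COEFFS];   // was: __constant__ float d_coeffs[N_COEFFS];

void upload_coeffs(const double *h_coeffs)
{
    // the byte count has to track the element type, so it doubles going float -> double
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, N_COEFFS * sizeof(double));
}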