The kernel compiles and runs correctly with single precision data types.
The same kernel with double precision data types causes ptxas to crash.
It’s a template function that uses either float/float3 or double/double3 types. (I had to define my own double3, of course.)
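A minimal double3 definition along these lines should illustrate what I mean (this is a sketch mirroring CUDA's float3 layout; the exact struct and helper names here are illustrative, not necessarily the ones in my code):

```cpp
// Minimal stand-in for the missing double3 vector type:
// three packed doubles, with field names mirroring CUDA's float3.
struct double3 {
    double x, y, z;
};

// Convenience constructor, analogous to make_float3.
inline double3 make_double3(double x, double y, double z) {
    double3 v;
    v.x = x;
    v.y = y;
    v.z = z;
    return v;
}
```

With a struct like this, the same templated kernel body can be instantiated for both float3 and double3 element types.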
1>ptxas info : Compiling entry function ‘Z30MiCudaComputeGradientArray_GPUId7double3EvjjPKT_S3_P
1>nvcc error : ‘ptxas’ died with status 0xC0000005 (ACCESS_VIOLATION)
Environment: CUDA 2.3, Visual Studio 2005, Windows Server 2003 64bit
It appears to be a code complexity limit. If I comment out a few lines, it compiles; in fact, commenting out almost any two or three lines makes it compile. Obviously the answers are not correct at that point :-) but at least I can get some approximate timing results.
The kernel isn’t really very big or complex (in my opinion). The single precision version reports:
1>ptxas info : Used 48 registers, 96+0 bytes lmem, 24+16 bytes smem, 48 bytes cmem, 20 bytes cmem
While the double precision version (hacked to avoid ptxas crash) reports:
1>ptxas info : Used 94 registers, 192+0 bytes lmem, 24+16 bytes smem, 48 bytes cmem, 4 bytes cmem
The register count reported in the cubin file matches these numbers. Roughly double the registers and local memory seems like the expected difference between single and double precision.
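The roughly 2x growth is consistent with the raw storage sizes: each element is twice as wide, so a packed three-component struct doubles from 12 to 24 bytes (a quick host-side illustration; struct names here are made up, and the sizes assume no padding, which holds for these layouts on typical ABIs):

```cpp
// Packed three-component vectors (no padding on typical ABIs).
struct float3_t  { float  x, y, z; };  // 3 * 4 bytes = 12 bytes
struct double3_t { double x, y, z; };  // 3 * 8 bytes = 24 bytes
```

Since every vector temporary needs twice the storage, ptxas reporting about twice the registers and lmem for the double precision instantiation is what one would expect.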
I’ve tried both the 32 and 64 bit CUDA 2.3 toolkits, and also tried going back to CUDA 2.2.