ptxas crashes on double precision kernel

Kernel compiles and runs correctly using single precision data types.
Same kernel with double precision data types causes ptxas to crash.
It’s a template function that uses either float/float3 or double/double3 types. (I had to define my own double3 of course.)
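
For illustration, it's roughly this shape (a heavily simplified sketch with made-up parameter names and placeholder arithmetic, not the real code; the struct is renamed my_double3 here so it doesn't clash with toolkits that already provide a double3):

    // Hand-rolled three-component double vector, standing in for the
    // self-defined double3 mentioned above.
    struct my_double3 { double x, y, z; };

    // One template kernel, instantiated for <float, float3> and for
    // <double, my_double3>. Parameter names and body are placeholders.
    template <typename T, typename T3>
    __global__ void ComputeGradientArray_GPU(unsigned int        n,
                                             unsigned int        stride,
                                             const T*            values,
                                             const T*            weights,
                                             const unsigned int* indices,
                                             T3*                 gradients)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        T3 g;
        g.x = values[indices[i]] * weights[i];   // placeholder math only
        g.y = g.x;
        g.z = g.x;
        gradients[i] = g;
    }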

1>ptxas info : Compiling entry function ‘_Z30MiCudaComputeGradientArray_GPUId7double3EvjjPKT_S3_PKjPS1_’

1>Internal error
1>nvcc error : ‘ptxas’ died with status 0xC0000005 (ACCESS_VIOLATION)

Environment: CUDA 2.3, Visual Studio 2005, Windows Server 2003 (64-bit)

It appears to be a code complexity limit. If I comment out a few lines of code, it compiles. In fact I can comment out almost any 2 or 3 lines and it compiles. Obviously the answers are not correct at that point :-) but at least I can get some approximate timing results.

The kernel isn’t really very big or complex (in my opinion). The single precision version reports:
1>ptxas info : Used 48 registers, 96+0 bytes lmem, 24+16 bytes smem, 48 bytes cmem[0], 20 bytes cmem[1]
While the double precision version (hacked to avoid ptxas crash) reports:
1>ptxas info : Used 94 registers, 192+0 bytes lmem, 24+16 bytes smem, 48 bytes cmem[0], 4 bytes cmem[1]
The number of registers reported in the cubin file is the same as in the ptxas output. Roughly double the registers and local memory seems like the expected difference between single and double precision.

I’ve tried both the 32-bit and 64-bit CUDA 2.3 toolkits, and also tried going back to CUDA 2.2.

Any ideas?

Work-around:

Set -maxrregcount=61 (or less) to avoid the ptxas crash.

Any higher value (or omitting the flag entirely) still triggers the crash.
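
For reference, the compile line looks something like this (the file name and -arch value here are just placeholders; in Visual Studio the flag goes into the CUDA custom build rule command line):

    nvcc -c -arch=sm_13 -maxrregcount=61 --ptxas-options=-v kernel.cu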

Now the result is:

1>ptxas info : Compiling entry function ‘_Z34MiCudaComputeGradientArray_KernelDjjPKdS0_PKjPd’

1>ptxas info : Used 60 registers, 312+0 bytes lmem, 24+16 bytes smem, 48 bytes cmem[0], 4 bytes cmem[1]

Presumably the increased use of local memory means lower performance, but that’s better than not compiling at all. :-)

-Mike

If you could post or send me a repro, it would be appreciated.