running code from cudatoolkit 3.2 to 4.0 -- ptxas error

Hi,

I am new to CUDA. I have a MATLAB-CUDA application written with CUDA toolkit 3.2, and I am trying to run it on a machine with toolkit 4.0. When the code tries to create a MEX file, I get the following error:
nvcc error : ‘ptxas’ died due to signal 11 (Invalid memory reference)
nvcc error : ‘ptxas’ core dumped
I used --ptxas-options=-v with nvcc and these are the results:

ptxas info : Compiling entry function ‘_Z18_kernel_scale_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 25 registers, 256+16 bytes smem, 65536 bytes cmem[0], 40 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z23_kernel_ssGRBFNorm_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 23 registers, 256+16 bytes smem, 65536 bytes cmem[0], 28 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_ssGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 22 registers, 256+16 bytes smem, 65536 bytes cmem[0], 24 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z18_kernel_sdNDP_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
nvcc error : ‘ptxas’ died due to signal 11 (Invalid memory reference)
nvcc error : ‘ptxas’ core dumped
CUDA preprocessing [nvcc] failed

I used a GeForce GTX 590 with CUDA toolkit 4.0.

The same code runs correctly on a Tesla with CUDA toolkit 3.2:
ptxas info : Compiling entry function ‘_Z18_kernel_scale_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 25 registers, 256+16 bytes smem, 65536 bytes cmem[0], 40 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z23_kernel_ssGRBFNorm_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 25 registers, 256+16 bytes smem, 65536 bytes cmem[0], 28 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_ssGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 25 registers, 256+16 bytes smem, 65536 bytes cmem[0], 24 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z18_kernel_sdNDP_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 39 registers, 256+16 bytes smem, 65536 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_sdGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 37 registers, 256+16 bytes smem, 65536 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_sdConv_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 38 registers, 256+16 bytes smem, 65536 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_nLen_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 21 registers, 256+16 bytes smem, 65536 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_inhib2_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 13 registers, 256+16 bytes smem, 65536 bytes cmem[0], 8 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_inhib1_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 23 registers, 256+16 bytes smem, 65536 bytes cmem[0], 16 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_gMax_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 21 registers, 256+16 bytes smem, 65536 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_cMax_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 21 registers, 256+16 bytes smem, 65536 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_cAvg_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 22 registers, 256+16 bytes smem, 65536 bytes cmem[0], 28 bytes cmem[1]
CUDA preprocessing successful

Can anyone please help me with this error? Thank you!

This error message indicates an error internal to the compiler: PTXAS appears to be accessing memory out of bounds. It would be helpful if you could file a bug against the compiler, attaching a self-contained repro case. Sorry for the inconvenience, and thank you for your help.
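For anyone assembling such a repro case, it may help to pin down which kernel ptxas was working on when it died. The following is only a rough sketch, assuming the verbose log has been saved as text: it looks for the last "Compiling entry function" line that has no "Used N registers" line after it. The function name `find_unfinished_kernel` is made up for illustration and is not part of any NVIDIA tool.

```python
import re

# Matches e.g.: ptxas info : Compiling entry function '_Z...' for 'sm_10'
# Straight and typographic quotes are both accepted, since forum software
# may have converted them.
ENTRY_RE = re.compile(r"Compiling entry function\s+['‘’]([^'‘’]+)['‘’]")
USED_RE = re.compile(r"Used\s+\d+\s+registers")


def find_unfinished_kernel(log_text):
    """Return the mangled name of the kernel that was announced most recently
    without a following 'Used N registers' line, or None if there is none."""
    pending = None
    for line in log_text.splitlines():
        match = ENTRY_RE.search(line)
        if match:
            # A new kernel was announced; remember it until it completes.
            pending = match.group(1)
        elif USED_RE.search(line):
            # Resource usage was reported, so the pending kernel finished.
            pending = None
    return pending
```

Applied to the failing log in the first post, this should point at the `_kernel_sdNDP_dflt` kernel, which is the last entry function listed before the crash.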

Thank you, I’ll do that. When I compiled the code with the ‘-g -G’ debug options to nvcc, the compilation succeeded and I could create a MEX file. With ‘-g -G’, the ptxas info output is the same on both versions of the CUDA toolkit.

From the GTX 590 with CUDA toolkit 4.0:

ptxas info : Compiling entry function ‘_Z18_kernel_scale_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 23 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z23_kernel_ssGRBFNorm_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 24 registers, 544+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_ssGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 24 registers, 544+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z18_kernel_sdNDP_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 27 registers, 448+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_sdGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 25 registers, 448+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_sdConv_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 24 registers, 448+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_nLen_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 23 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_inhib2_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 17 registers, 208+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_inhib1_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 17 registers, 272+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_gMax_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 30 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_cMax_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 27 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_cAvg_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 28 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
CUDA preprocessing successful

From the Tesla with CUDA toolkit 3.2:

compiling…
ptxas info : Compiling entry function ‘_Z18_kernel_scale_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 23 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z23_kernel_ssGRBFNorm_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 24 registers, 544+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_ssGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 24 registers, 544+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z18_kernel_sdNDP_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 27 registers, 448+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_sdGRBF_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 25 registers, 448+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_sdConv_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 24 registers, 448+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_nLen_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 23 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_inhib2_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 17 registers, 208+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z19_kernel_inhib1_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 17 registers, 272+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_gMax_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 30 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 52 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_cMax_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 27 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z17_kernel_cAvg_dfltjjPKfS0_9_OutTablejPK7ushort8jjjPK10_LayerData’ for ‘sm_10’
ptxas info : Used 28 registers, 336+0 bytes lmem, 256+16 bytes smem, 65536 bytes cmem[0], 44 bytes cmem[1]
CUDA preprocessing successful

Why does the compilation succeed with these options? Thanks!

PTXAS is more than a simple assembler; it is an actual compiler that performs optimizations, register allocation, and instruction scheduling. By default, it compiles with full optimizations. I am fairly sure that all optimizations are turned off when generating debuggable code with -G. This apparently avoids hitting the bug you encountered, but it likely has a significant impact on the performance of the generated code.

If you are looking for a workaround, you could try to manually adjust the optimization level used by PTXAS until the compilation succeeds. To do this, add the following compiler flag to the NVCC command line: -Xptxas -O{0|1|2|3}. -O3 is the default, so I would suggest working backwards from there. You might also want to mention in your bug report the last level that works, and the first level that fails. This may help the compiler team to zero in more quickly on the source of the problem. Note that component-specific compiler flags are typically unsupported and therefore I would not recommend using them in a production environment.
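The "work backwards from -O3" procedure could be scripted roughly as follows. This is a minimal sketch, not a supported tool: the helper names `ptxas_opt_command` and `bisect_ptxas_opt` are invented here, and the base nvcc command line is whatever your own build already uses. The only flag it relies on is the `-Xptxas -O<n>` form mentioned above.

```python
import subprocess


def ptxas_opt_command(base_cmd, level):
    """Return base_cmd with the ptxas optimization flag for one level appended."""
    return list(base_cmd) + ["-Xptxas", "-O%d" % level]


def bisect_ptxas_opt(base_cmd, run=None):
    """Try -O3, -O2, -O1, -O0 in that order.

    Returns (working, failing): the highest level that compiled (or None
    if none did) and the list of levels that failed before it.
    """
    if run is None:
        # Default runner: invoke the real compiler; exit code 0 means success.
        run = lambda cmd: subprocess.call(cmd) == 0
    failing = []
    for level in (3, 2, 1, 0):
        if run(ptxas_opt_command(base_cmd, level)):
            return level, failing
        failing.append(level)
    return None, failing
```

A call might look like `bisect_ptxas_opt(["nvcc", "-c", "kernels.cu", "-o", "kernels.o"])`, where `kernels.cu` is a placeholder for your own source file. The returned pair gives the two facts worth putting in the bug report: the highest level that works and the levels that fail.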