nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)

Hi, I am running into a problem with a K20 GPU. When I try to compile some code on the cluster, nvcc fails with signal 11: nvcc error : ‘ptxas’ died due to signal 11 (Invalid memory reference). The device is a Tesla K20m. The CUDA driver version and runtime version are 5.5 and 5.0, respectively.

I tried to debug the code by commenting everything out and then uncommenting it piece by piece. The error appears as soon as I uncomment the first kernel function. The strange thing is that the same code compiles successfully on my local machine, which runs Windows 7 with a Quadro 2000 and CUDA 4.2.

Any suggestions would be appreciated. In case you are interested, the code can be downloaded here: https://dl.dropboxusercontent.com/u/40786218/FlockingBoid.rar

This is the kind of error that should not happen, and it usually points to an internal error in the compiler (an out-of-bounds memory access). However, I am aware of one scenario where the problem is not with the compiler per se. You will want to look into it, given that you are using several different versions of CUDA.

If the tool chain from a given CUDA version is used with CUDA header files from a previous version, bad things can happen including the kind of “segfault” you are observing here. So you would want to ensure that in moving the code from a machine with CUDA 4.2 to a machine with CUDA 5.x you did not inadvertently copy over one of the many internal CUDA header files as well.
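One rough way to run that check, as a sketch only: the helper below (the name find_stray_cuda_headers is made up for illustration) lists headers in a project tree whose file names also exist in the toolkit's include directory. It compares file names at the top level of the include directory only, so a hit is a hint to look closer, not proof of a stale copy.

```shell
# Hypothetical helper: print every .h/.hpp file under the project directory
# whose file name also exists in the toolkit's include directory.
# $1: project directory, $2: toolkit include directory (both supplied by you).
find_stray_cuda_headers() {
    project_dir=$1
    toolkit_include=$2
    find "$project_dir" -type f \( -name '*.h' -o -name '*.hpp' \) |
    while IFS= read -r header; do
        name=$(basename "$header")
        # Same file name as a toolkit header: possibly an old copy.
        if [ -e "$toolkit_include/$name" ]; then
            printf '%s\n' "$header"
        fi
    done
}
```

Usage would be something like find_stray_cuda_headers . /path/to/cuda/include, where the second argument is a placeholder for the include directory of the toolkit that nvcc actually uses.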

If such a check turns up nothing suspicious, consider filing a bug report (via the form linked from the registered developer website). Alternatively you may want to try the CUDA 6.0 release candidate.
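For a bug report it may help to show that ptxas fails on its own, outside the nvcc driver. The sketch below (the function name isolate_ptxas and the NVCC/PTXAS override variables are my own invention) first stops nvcc after PTX generation with -ptx, then runs ptxas directly on the result. The default sm_35 matches the K20 discussed here; adjust it for other devices.

```shell
# Hypothetical helper: split the build into "generate PTX" and "assemble PTX"
# so the failing stage can be identified and the .ptx attached to a report.
# $1: .cu source file, $2: target architecture (defaults to sm_35).
isolate_ptxas() {
    src=$1
    arch=${2:-sm_35}
    ptx=${src%.cu}.ptx
    # Stage 1: run the front end only and stop after writing PTX.
    "${NVCC:-nvcc}" -arch="$arch" -ptx -o "$ptx" "$src" || {
        echo "front end failed"
        return 1
    }
    # Stage 2: run the assembler alone on that PTX file.
    if "${PTXAS:-ptxas}" -arch="$arch" -o "${src%.cu}.cubin" "$ptx"; then
        echo "ptxas succeeded"
    else
        # A crash from signal 11 typically shows up as exit status 139.
        echo "ptxas failed on $ptx (exit status $?)"
        return 1
    fi
}
```

If the second stage fails, the generated .ptx file is a much smaller reproducer than the whole project.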

Thank you for the reply. I am indeed moving the code from my local machine to the lab cluster. I checked the files as you suggested, and none of them are internal CUDA header files. Besides, I don’t have permission to overwrite the CUDA toolkit directory, so I don’t think I could have damaged the internal header files. I may consider submitting a bug report. Thanks again for your suggestions.

Before submitting the bug report, I would gather one additional data point: try compiling a known-good CUDA sample (e.g. vectorAdd; don’t use deviceQuery) on that cluster. If you receive the same error, it almost certainly points to a machine configuration issue, as njuffa alluded to. But if the sample compiles fine and only your code produces the error, then it may be a bug.
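The decision rule above can be written down as a small script. This is only a sketch: the function names are made up, the NVCC variable is an assumed way to select a specific toolkit, and -arch=sm_35 is hard-coded to match the K20.

```shell
# Compiler command; override NVCC to point at a specific toolkit (assumption).
NVCC=${NVCC:-nvcc}

# Return 0 if the given source file compiles to an object file, non-zero
# otherwise. Compiler output is discarded; only the exit status is used.
compiles() {
    workdir=$(mktemp -d) || return 1
    "$NVCC" -arch=sm_35 -c -o "$workdir/out.o" "$1" >/dev/null 2>&1
    status=$?
    rm -rf "$workdir"
    return $status
}

# Compare a known-good sample with the failing source.
# $1: known-good sample (e.g. vectorAdd.cu), $2: the source that fails.
triage() {
    sample=$1
    failing=$2
    if ! compiles "$sample"; then
        echo "sample fails too: suspect the machine configuration"
    elif ! compiles "$failing"; then
        echo "only your code fails: possible compiler bug"
    else
        echo "both compile: problem not reproduced"
    fi
}
```

Something like triage vectorAdd.cu kernel.cu, run on the machine that shows the error, would give the data point asked for here.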

Hi, thanks for your post. The sample code is fine, and my other programs are fine as well.

I tried compiling the code in your FlockingBoid.rar archive. On a RHEL 5.5 machine with CUDA 5.0, compiling with:

nvcc -arch=sm_35 -o kernel kernel.cu

I get a variety of warnings (e.g. variables declared but never referenced) but no errors. The compile succeeded in a few seconds and the executable was created.

I had the same results (no problems) on a RHEL 6.2/CUDA 5.5 machine. So I think your problem is not easily reproducible, and more specifics are needed: either how you are compiling the app, or the configuration of the machine that is failing.

On my CUDA 5.5 machine, if I do deviceQuery, I see that CUDA driver version and CUDA runtime version are both reported as 5.5. So I’m not sure why your machine would be reporting driver version 5.5 and runtime version 5.0.

What is the result of deviceQuery on the failing machine and also the result of:

nvidia-smi -a

?
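On the driver/runtime version question: as far as I know, the "driver version" that deviceQuery prints is the newest CUDA version the installed driver supports, and the "runtime version" is the toolkit the deviceQuery binary was built with. A driver that is newer than the runtime is a supported combination; the reverse is not. The sketch below (function name invented for illustration) pulls both numbers out of the summary line that deviceQuery prints at the end and checks that ordering.

```shell
# Hypothetical helper: read deviceQuery output on standard input, find the
# summary line containing "CUDA Driver Version = X, CUDA Runtime Version = Y",
# and report whether the driver is at least as new as the runtime.
check_cuda_versions() {
    awk -F', ' '
    /CUDA Driver Version = / {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^CUDA Driver Version = /)  { d = $i; sub(/.*= /, "", d) }
            if ($i ~ /^CUDA Runtime Version = /) { r = $i; sub(/.*= /, "", r) }
        }
        # Compare major, then minor, as numbers.
        split(d, dv, "."); split(r, rv, ".")
        ok = (dv[1] + 0 > rv[1] + 0) || (dv[1] + 0 == rv[1] + 0 && dv[2] + 0 >= rv[2] + 0)
        printf "driver=%s runtime=%s %s\n", d, r, (ok ? "ok" : "driver older than runtime")
    }'
}
```

It would be used as deviceQuery | check_cuda_versions, with deviceQuery being whatever path the sample binary lives at on the machine in question.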

Thanks for your reply! A formatted version of the following results can be found here:
https://dl.dropboxusercontent.com/u/40786218/post.txt

I compiled with ‘nvcc -arch=sm_35 -o kernel kernel.cu’ and got essentially the same error:

[user@cluster FlockingBoid]$ nvcc -arch=sm_35 -o kernel kernel.cu
/cm/shared/apps/cuda50/toolkit/5.0.35/bin/…/include/curand_kernel.h(403): warning: missing return statement at end of non-void function "__curand_uint32_as_float"

nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)

Here is the complete result of deviceQuery:

/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K20Xm"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 5760 MBytes (6039339008 bytes)
(14) Multiprocessors x (192) CUDA Cores/MP: 2688 CUDA Cores
GPU Clock rate: 732 MHz (0.73 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = Tesla K20Xm

Here is the result of nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Wed Mar 12 11:55:08 2014
Driver Version : 319.23

Attached GPUs : 4
GPU 0000:03:00.0
Product Name : Tesla K20Xm
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 128
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0325112050314
GPU UUID : GPU-820e2210-9933-734f-cff4-67c4be40ef81
VBIOS Version : 80.10.17.00.02
Inforom Version
Image Version : 2081.0200.01.09
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x102110DE
Bus Id : 0000:03:00.0
Sub System Id : 0x097D10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 5759 MB
Used : 3337 MB
Free : 2422 MB
Compute Mode : Default
Utilization
Gpu : 42 %
Memory : 31 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
Gpu : 44 C
Power Readings
Power Management : Supported
Power Draw : 106.75 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Min Power Limit : 150.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 732 MHz
SM : 732 MHz
Memory : 2600 MHz
Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Default Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 784 MHz
SM : 784 MHz
Memory : 2600 MHz
Compute Processes
Process ID : 11791
Name : /home/edwlin/codes/CudaMiner-master/cm
Used GPU Memory : 3321 MB

GPU 0000:04:00.0
Product Name : Tesla K20Xm
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 128
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0325112050317
GPU UUID : GPU-713e942e-5e0e-31f0-cb1f-42c826e11cea
VBIOS Version : 80.10.17.00.02
Inforom Version
Image Version : 2081.0200.01.09
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x102110DE
Bus Id : 0000:04:00.0
Sub System Id : 0x097D10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 5759 MB
Used : 3569 MB
Free : 2190 MB
Compute Mode : Default
Utilization
Gpu : 39 %
Memory : 30 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
Gpu : 44 C
Power Readings
Power Management : Supported
Power Draw : 107.88 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Min Power Limit : 150.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 732 MHz
SM : 732 MHz
Memory : 2600 MHz
Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Default Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 784 MHz
SM : 784 MHz
Memory : 2600 MHz
Compute Processes
Process ID : 12682
Name : /home/edwlin/codes/CudaMiner-master/cm
Used GPU Memory : 3553 MB

GPU 0000:83:00.0
Product Name : Tesla K20Xm
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 128
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0325112050235
GPU UUID : GPU-93916275-382d-a31f-0a39-2060695e62c1
VBIOS Version : 80.10.17.00.02
Inforom Version
Image Version : 2081.0200.01.09
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x83
Device : 0x00
Domain : 0x0000
Device Id : 0x102110DE
Bus Id : 0000:83:00.0
Sub System Id : 0x097D10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 5759 MB
Used : 13 MB
Free : 5746 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
Gpu : 29 C
Power Readings
Power Management : Supported
Power Draw : 17.50 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Min Power Limit : 150.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Default Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 784 MHz
SM : 784 MHz
Memory : 2600 MHz
Compute Processes : None

GPU 0000:84:00.0
Product Name : Tesla K20Xm
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 128
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0325112050322
GPU UUID : GPU-07543f69-00ff-e069-383c-ac16c2127757
VBIOS Version : 80.10.17.00.02
Inforom Version
Image Version : 2081.0200.01.09
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x102110DE
Bus Id : 0000:84:00.0
Sub System Id : 0x097D10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 5759 MB
Used : 13 MB
Free : 5746 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
Gpu : 27 C
Power Readings
Power Management : Supported
Power Draw : 17.81 W
Power Limit : 235.00 W
Default Power Limit : 235.00 W
Min Power Limit : 150.00 W
Max Power Limit : 235.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Default Applications Clocks
Graphics : 732 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 784 MHz
SM : 784 MHz
Memory : 2600 MHz
Compute Processes : None

What is the result of:

nvcc --version

Your machine might be messed up in some non-obvious way if nvidia-smi reports 4 GPUs and deviceQuery reports 1. Are you running this in a batch environment? It may be time for a reboot of that server or else a reload of GPU driver and CUDA toolkit. 319.23 is an old driver that I wouldn’t recommend using. 319.72 or 319.82 are good current choices, and upgrading the server to CUDA 5.5 might be a good idea as well.
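One possible benign explanation for the mismatch, offered as an assumption and not something established in this thread: nvidia-smi lists every GPU in the node, while the CUDA runtime only enumerates the devices left visible by the CUDA_VISIBLE_DEVICES environment variable, which job schedulers commonly set. The sketch below (function names invented for illustration) extracts both counts from the two tools' output and reports whether that variable is set.

```shell
# Read `nvidia-smi -a` output on stdin and print the "Attached GPUs" count.
smi_gpu_count() {
    awk -F': *' '/^ *Attached GPUs/ { print $2; exit }'
}

# Read deviceQuery output on stdin and print the detected device count.
devicequery_gpu_count() {
    sed -n 's/^ *Detected \([0-9][0-9]*\) CUDA Capable device.*/\1/p' | head -n 1
}

# Compare the two counts and mention CUDA_VISIBLE_DEVICES if it is set.
# $1: count from nvidia-smi, $2: count from deviceQuery.
report_gpu_visibility() {
    smi=$1
    dq=$2
    if [ "$smi" -eq "$dq" ]; then
        echo "counts agree ($smi)"
    elif [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES limits the runtime to $dq of $smi GPUs"
    else
        echo "mismatch: nvidia-smi sees $smi, deviceQuery sees $dq"
    fi
}
```

If the variable turns out to be unset and the counts still differ, the reboot or driver reload suggested above looks like the more likely remedy.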

Hi, thank you very much for your reply.

Here are the results of ‘nvcc --version’:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

nvidia-smi shows four GPUs because each node of the cluster in my lab has four GPUs. The cluster is managed by other people, so I don’t have permission to install or update anything. The way to use the cluster is to first compile the CUDA code on the master node and then run the binary on the worker nodes through a job scheduler called SLURM.

The structure, configuration, and user manual of the cluster in my lab can be found here:
http://pdcc.ntu.edu.sg/content/multi-cores-cluster-k20x-k20-intel-xeon-phi
http://pdcc.ntu.edu.sg/sites/default/files/Cluster-info.pdf
http://pdcc.ntu.edu.sg/sites/default/files/EndUser-brief.v2.pdf
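For readers unfamiliar with the compile-on-master, run-through-SLURM workflow described above, here is a minimal sketch of the job script side. Everything specific in it is an assumption: the job name, the output file name, and in particular the --gres=gpu:1 request, which only works if the cluster's SLURM is configured with GPU resources. The function make_job_script is a made-up helper that just prints the script.

```shell
# Hypothetical helper: print a SLURM batch script to standard output.
# $1: path to a binary that was already compiled on the master node.
make_job_script() {
    binary=$1
    cat <<EOF
#!/bin/sh
#SBATCH --job-name=flocking
#SBATCH --gres=gpu:1
#SBATCH --output=flocking-%j.out

# Show which GPUs the scheduler exposed to this job, then run the program.
echo "CUDA_VISIBLE_DEVICES=\${CUDA_VISIBLE_DEVICES:-unset}"
srun "$binary"
EOF
}
```

A typical use would be make_job_script ./kernel > job.sh followed by sbatch job.sh. Running deviceQuery the same way shows what a job actually sees on a worker node, which can differ from the master node where the code is compiled.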