I’m using the Drive PX 2 as target platform and a PC with Ubuntu 14.04 as host, DriveWorks SDK v0.3 and PDK v4.1.6.1.
My host PC has a NVS 510, which has only 3.0 compute capability (shouldn’t be a problem, right?).
I can cross-compile the DriveNet application and even run my compiled sample directly on the Drive PX 2 without problems.
When I try to run it on the Drive PX 2 in a remote debugging session with Nsight Eclipse on my host PC, I am able to step through the sample until dwInitialize, but as soon as execution enters this function the application ends abruptly and I cannot debug it.
Within the gdb traces I can see the following error:
error,msg="fatal: No CUDA capable device was found. (error code = CUDBG_ERROR_NO_DEVICE_AVAILABLE(0x27))"
The remote shell shows the following:
Warning: Adjusting return value of linux_common_core_of_thread (pid=1391, tid=1391).
core = 17 >= num_cores = 6!
Program Arguments:
--camera-index=0
--camera-type=ar0231-rccb-ssc
--csi-port=ab
--input-type=video
--slave=0
--stopFrame=0
--video=/usr/local/driveworks/data/samples/raw/rccb.raw
Initialize DriveWorks SDK v0.3.400
Release build with GNU 4.9.2 from v0.3.0-rc8-0-g3eeebea against PDK v4.1.6.1
SDK: Resources mounted from /usr/local/driveworks/data/resources
Killing all inferiors
logout
• CUDA Toolkit 8.0 or higher
• NVIDIA® CUDA® version 8.0 or later
• NVIDIA® Vibrante™ PDK installation for DRIVE PX 2 on the Linux host
• You may also need to install (using apt-get install) the following packages: libx11-dev, libxrandr-dev, libxcursor-dev, libxxf86vm-dev, libxinerama-dev, libxi-dev, libglu1-mesa-dev
Desktop development relies on NVCUVID for video decoding, which is included with the NVIDIA drivers. In general, the cmake build scripts can find the NVCUVID installation. However, if this fails, you must set a symbolic link /usr/lib/nvidia-current pointing to your NVIDIA driver libraries, for example /usr/lib/nvidia-367.
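For reference, the symlink step can be done like this (nvidia-367 is only the example version from the docs; substitute whatever driver directory is actually installed under /usr/lib on your host):

```shell
# Point /usr/lib/nvidia-current at the installed NVIDIA driver libraries.
# nvidia-367 is an example version; replace it with your installed one.
sudo ln -sfn /usr/lib/nvidia-367 /usr/lib/nvidia-current

# Verify that the link resolves to the driver library directory:
readlink /usr/lib/nvidia-current
```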
All requirements that you mention are met, except that I’m not using SDK/PDK 4.1.8.0 but 4.1.6.1. However, I checked the release notes of v4.1.8.0 and I didn’t see anything that could explain/solve the problem I’m having. Moreover, like I mentioned, my compiled sample runs without problems if I start it directly on the Drive PX 2.
Do you have any other suggestions of what I could check or modify to solve this problem?
I checked what you asked me, including editing .bashrc, and everything looks fine.
To be honest, I fail to see how this is related to the problem. As I mentioned, there is no problem running the sample directly on the board, only when trying to execute it in a remote debugging session.
The output of nvcc --version and deviceQuery looks like this:
nvidia@nvidia:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Mar_20_17:07:33_CDT_2017
Cuda compilation tools, release 8.0, V8.0.72
nvidia@nvidia:~$ ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/deviceQuery
/home/nvidia/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GP10B"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 6660 MBytes (6983643136 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
Memory Clock rate: 1600 MHz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GP10B
Result = PASS
We didn’t observe this issue while testing with Nsight IDE.
I suspect the NVS 510 (Kepler) is not supported by the host PC's cuda-gdb (8.0).
I would suggest trying a more recent Maxwell / Parker GPU rather than the NVS 510. Thanks.
I tried cuda-gdb directly on the Drive PX 2 and it works; it is slow, but it works.
It seems the suspicion from SteveNV might be right: a GPU with sufficient compute capability on the host also appears to be necessary (even though the application runs on the target).
I'll try to get a graphics card with compute capability of at least 5.0 for the host so I can remote-debug CUDA/DriveWorks applications running on the Drive PX 2.
I finally got a better graphics card for my host PC, a GTX 1060.
I updated the Drive PX board and the PC to the latest SDK/PDK version (5.0.5.0).
With the 1060 I am able to run and debug the samples on the host PC.
Unfortunately the problem remains the same: the compiled sample runs on the board only if I start the binary directly on the Drive PX 2, but the remote debug session aborts with the error "fatal: No CUDA capable device was found. (error code = CUDBG_ERROR_NO_DEVICE_AVAILABLE(0x27))".
Do you have any other ideas on how to solve this problem?
“When I try to run it on the Drive PX 2 using a remote debugging session with Nsight-Eclipse on my host PC, I am able to step the sample until dwInitialize”
You mean debugging basically works on the Drive PX 2, and the error happens only when your code calls dwInitialize? What does this function do? Something related to graphics display?
Do you see the same behavior with cuda-gdb, or does only Nsight EE have the problem?
Yes, debugging basically works on the Drive PX 2. The error comes specifically when cudaFree(0) is called in the constructor of DriveWorksSample.
With cuda-gdb directly on the Drive PX 2 there is no problem.
I'm using the DriveNet sample that is included in the latest SDK/PDK version (5.0.5.0).
The problem occurs with or without breakpoints, I just let it run in a remote debug session.
We checked with Drive 5.0.5.0 SDK/PDK, and we are able to run the sample application successfully.
We did not see the issue you reported, but we did see some sluggishness when the application is run remotely on the target.
We also now have some known issues with Nsight EE that may be related to your problem.
Since cuda-gdb works for you, can you use it as a temporary workaround (WAR)?
cuda-gdb works for remote debugging using it directly on the command line, but not with Nsight.
Like you mention, it is (extremely) slow.
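In case it helps anyone else, the command-line remote session I'm describing looks roughly like this (one possible setup; the port, IP address, and binary path are placeholders, not my exact values):

```
# On the Drive PX 2 (target): launch the sample under the debug server.
cuda-gdbserver :12345 ./sample_drivenet

# On the host: attach the cross cuda-gdb to the target session.
cuda-gdb ./sample_drivenet -ex 'target remote 192.168.10.10:12345'
```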
The problem I have with this approach is that I get the following warnings on the debug session:
warning: Cuda API error detected: cudaGetLastError returned (0xb)
warning: Cuda API error detected: cudaHostGetFlags returned (0xb)
If I let DriveNet run (remotely, on the command line), I get these warnings several times per second, so I cannot really use this approach.
Do you also get these warnings?
The dev has already confirmed that this is a demo app issue; it needs to be fixed so that it does not make these invalid calls in the first place.
There is also something to address in cuda-gdb: it should bypass these API failures when the "break on API" setting is disabled. Our dev is working on this.
So for now, I think you can just wait for the new DriveInstall release.
Sorry for that.