Hello everyone, I have a problem profiling with Nsight Systems. The CLI is:
CUDA_VISIBLE_DEVICES=2 nsys profile --trace=cuda,cudnn,cublas -o p1 ./main
but an error was returned. I tried changing the --trace arguments, but that didn't help. Also, without nsys, ./main runs fine.
How did you resolve this error? I didn’t understand what you meant by “Just using the nsys in the system path (installed with the CUDA driver), not standalone installed.”
Thank you
I believe, based on what they were saying, that they had multiple versions of Nsight Systems installed and that they were using the installed version that did not match their driver version.
Technically it is the CUDA toolkit version that needs to match the driver version. Most people achieve this by getting their drivers from the CUDA toolkit. Nsys will work on any driver/CUDA combination from CUDA 8.0 on (although we only test back to 10.0).
I am pretty sure my driver and toolkit match; both cudaRuntimeGetVersion and cudaDriverGetVersion give 11020 (a minimal check is sketched at the end of this post). I cannot use the nsys that comes with the driver, since it was simply not installed with the driver (by my system administrator); running
/usr/local/cuda/bin/nsys
simply gives
Error: Nsight Systems 2020.4.3 hasn't been installed with CUDA Toolkit 11.2
I have tried installing nsys 2020.4.1 and the latest 2023.2; both give a similar error.
I am on CentOS 7, running onnxruntime built from C++ source on an A10 GPU. The program runs normally when not launched under nsys profile.
There are also several more “Cannot find string for an exterior index” errors identical to the one posted above. The error list is always one “Unknown runtime API function index: 406” followed by several “Cannot find string for an exterior index” messages.
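For reference, the version check I mentioned is just two CUDA runtime calls; a minimal sketch (standard CUDA runtime API, the file name is my own), built with nvcc:
// check_versions.cu - print the CUDA runtime and driver versions (e.g. 11020 means 11.2)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtimeVer = 0, driverVer = 0;
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime version the app links against
    cudaDriverGetVersion(&driverVer);    // latest CUDA version supported by the installed driver
    std::printf("runtime: %d, driver: %d\n", runtimeVer, driverVer);
    return 0;
}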
Thank you. This is a bit strange: looking up the mapping between function index and CUDA runtime API, 406 corresponds to cudaGetDriverEntryPoint, which has only existed since CUDA 11.3 (a short illustration of the API is at the end of this post). I'm not sure how your application could trigger this API while your system is on CUDA 11.2. (That is also why nsys reports the error: this function index is unexpected under your driver version.)
Could you also check nvcc --version? Is there any chance that the app was built with CTK 11.3 or higher even though the driver is CUDA 11.2?
Also, is it possible for you to update the CUDA driver to 11.3 or higher?
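For context, cudaGetDriverEntryPoint is the runtime API for looking up a CUDA driver symbol at run time (added in CUDA 11.3). A rough illustration of how a library might call it, using the CUDA 11.3 signature; this is only a sketch, not code from your app:
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime for the address of a driver API symbol (available since CUDA 11.3).
    void *fn = nullptr;
    cudaError_t st = cudaGetDriverEntryPoint("cuDriverGetVersion", &fn, cudaEnableDefault);
    if (st != cudaSuccess || fn == nullptr) {
        std::printf("symbol lookup failed: %s\n", cudaGetErrorString(st));
        return 1;
    }
    // Call the driver API through the returned pointer.
    int driverVersion = 0;
    reinterpret_cast<CUresult (*)(int *)>(fn)(&driverVersion);
    std::printf("driver CUDA version: %d\n", driverVersion);
    return 0;
}
A library built against CUDA 11.3 or newer can make calls like this internally, which is the kind of thing that would show up as an unexpected function index on an 11.2 system.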
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
I tried to find cudaGetDriverEntryPoint in my code, but there is no matching result. I’ll try later to find out if there are any third-party libraries that might call this API.
Speaking of the “mapping between the function index and CUDA runtime API”, has the mapping been documented somewhere?
Updating the CUDA driver to 11.3 is possible on my development machine, but not on the Kubernetes cluster where my application will be deployed, and I don’t want my development environment to differ from the production servers.
You can search for the cupti_runtime_cbid.h header in your CTK installation folder. For example, on my system, it’s at /usr/local/cuda-12.1/targets/x86_64-linux/include/cupti_runtime_cbid.h.
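The CBIDs in that header should be a plain enum with explicit numeric values, so a text search for the index is enough; if you want to stay in C++, a throwaway scan like this works (the path is the one from my system, adjust it for yours):
// cbid_lookup.cpp - print the cupti_runtime_cbid.h lines that assign a given CBID value
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
    const std::string needle = "= " + std::string(argc > 1 ? argv[1] : "406");  // index to look up
    std::ifstream hdr("/usr/local/cuda-12.1/targets/x86_64-linux/include/cupti_runtime_cbid.h");
    std::string line;
    while (std::getline(hdr, line))
        if (line.find(needle) != std::string::npos)
            std::cout << line << '\n';  // the enum entry assigned that value
    return 0;
}
A plain text search for "= 406" in the same file gives the same answer.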
It turns out the cuDNN lib was not built against CUDA 11.2. Downgrading cuDNN solved the problem. Thanks very much for your quick reply and detailed explanation :)
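In case it helps anyone else: one way to confirm which CUDA toolkit a cuDNN build targets is cudnnGetCudartVersion() from the cuDNN API; a minimal sketch (link it against the cuDNN build you want to check):
// cudnn_build_check.cpp - report the cuDNN version and the CUDA runtime it was built against
#include <cstdio>
#include <cudnn.h>

int main() {
    std::printf("cuDNN version: %zu\n", cudnnGetVersion());                     // e.g. 8201 for 8.2.1
    std::printf("built against CUDA runtime: %zu\n", cudnnGetCudartVersion());  // e.g. 11020 for 11.2
    return 0;
}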
output of nvidia-smi:
NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2
output of nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
output of nsys --version:
NVIDIA Nsight Systems version 2023.1.1.127-32365746v0
There is indeed a mismatch between the CUDA versions reported by nvidia-smi and nvcc --version. However, if I use only one process (I run a PyTorch model and use --nproc_per_node=N to set the number of processes and GPUs), the aforementioned error does not occur. If I set --nproc_per_node to more than one, the error always occurs.
Could you give a hint about the cause of the error? Thanks a lot!