I am trying to deploy my project in Docker with nsys and ncu for profiling, and the TensorRT and CUDA Toolkit versions are the same as on my host machine: CUDA 11.6.55 and TensorRT 8.5.2.2.
But when I profile with the nsys command nsys profile --trace=cuda,osrt,nvtx --force-overwrite true --output=xx ./my_bin, nsys on the host captures 73 kernels while nsys in Docker captures only 67. The 6 missing kernels are all called splitKreduce_kernel. What could cause this problem?
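In case it is relevant, this is roughly how I compare the kernel counts between the two reports (just a sketch; the report name cuda_gpu_kern_sum may be spelled differently on older nsys versions):
# summarize the GPU kernels recorded in the report and look for the missing one
nsys stats --report cuda_gpu_kern_sum xx.nsys-rep | grep splitKreduce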
Here is a more detailed description of the environment:
GPU: RTX3070
Driver Version: 535.179
Docker: Docker version 26.1.3; CUDA Toolkit 11.6.55; TensorRT 8.5.2.2; cuDNN 8.4.0
Host: CUDA Toolkit 11.6.55; TensorRT 8.5.2.2; cuDNN 8.9.6
I run my Docker image with the command docker run --cap-add=SYS_ADMIN --name xx --runtime=nvidia --gpus all -it image_name /bin/bash.
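For reference, the toolkit the container actually sees can be sanity-checked with the usual commands (nothing project-specific here):
# driver and GPU visible inside the container
nvidia-smi
# CUDA toolkit version inside the container
nvcc --version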
Can you upgrade to the latest version of Nsight Systems and try these collections again? You are using a version that is about a year old. You can find the latest version at Nsight Systems - Get Started | NVIDIA Developer
If it continues to fail after upgrading to the latest version of the tool, can you run a collection in the docker? Before the collection, add a file called "config.ini" to your target-linux-x64 directory. In the config.ini file, add this line:
CudaSkipSomeApiCalls = false
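For example, something along these lines should create the file (the install path here is only an illustration; use wherever your nsys target-linux-x64 directory actually is):
echo 'CudaSkipSomeApiCalls = false' > /opt/nvidia/nsight-systems/<version>/target-linux-x64/config.ini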
Also, does your workload work as expected when run inside the docker? In other words, does it seem like the kernels missing in the trace are actually getting executed?
Finally, does the missing kernel originate in its own module?
Thanks for your reply. I have upgraded my Nsight Systems to this version:
NVIDIA Nsight Systems version 2024.5.1.113-245134619542v0 (from the command nsys --version)
It is installed in /opt/nvidia/nsight-systems/2024.5.1. I have also added CudaSkipSomeApiCalls = false to a file called config.ini in /opt/nvidia/nsight-systems/2024.5.1/target-linux-x64, but the issue still exists.
I also found that when I use trtexec inside Docker to generate the engine (used by both host and Docker), the issue vanishes and the inference results are identical: both Docker and host capture 57 kernels.
But when I use trtexec on the host to generate the engine, the issue happens again and the inference results are slightly different. That is weird, because the TensorRT and CUDA versions in Docker are the same as on the host.
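For reference, this is roughly the trtexec flow I am describing, with model.onnx and the engine file name as placeholders for my actual model:
# build the serialized engine (run either inside Docker or on the host)
trtexec --onnx=model.onnx --saveEngine=model.engine
# reuse the same engine file in both environments for inference
trtexec --loadEngine=model.engine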