I have nsight-compute-2022.1.0 installed under my cuda directory path.
I am trying to run the ncu -o profile CuVectorAddMulti.exe command for analysis, because I feel that the kernel launch time is too high on my SXM4 machine.
But there is no file named “CuVectorAddMulti.exe” anywhere in the CUDA directory.
How can I get this file?
“CuVectorAddMulti.exe” mentioned in the Nsight Compute document is just an example CUDA application name. It is not distributed with the CUDA Toolkit.
You could pick one of the samples from “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version>\extras\demo_suite”. Provide the full path of your application in the ncu command line, e.g.:
- ncu -o profile "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\<version>\extras\demo_suite\vectorAdd.exe"
Or you can download and build one of the CUDA samples from the NVIDIA/cuda-samples repository on GitHub (samples for CUDA developers that demonstrate features in the CUDA Toolkit).
Thank you @Sanjiv.Satoor for your reply. The CUDA samples did help. I am using a Linux machine.
I ran ncu -o profile Samples/0_Introduction/concurrentKernels/concurrentKernels.
I have two machines; both have the same processor and the same number of A100 GPUs with similar GPU memory.
The first machine uses PCIe for data transfer, and the result of the above test on it is the following:
Measured time for sample = 12.488s
The second machine uses SXM4, and the result of the above test on it is the following:
Measured time for sample = 13.240s
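To put the two numbers side by side, the gap can be expressed as a relative slowdown. The snippet below is only a small arithmetic sketch using the two times quoted above; the function and variable names are illustrative and not part of any NVIDIA tool.

```python
# Wall-clock times reported above, in seconds (copied from the two results).
pcie_time_s = 12.488
sxm4_time_s = 13.240


def relative_slowdown(baseline_s: float, other_s: float) -> float:
    """Return how much slower other_s is than baseline_s, as a fraction."""
    return (other_s - baseline_s) / baseline_s


diff_s = sxm4_time_s - pcie_time_s
slowdown = relative_slowdown(pcie_time_s, sxm4_time_s)
print(f"SXM4 is {diff_s:.3f} s slower ({slowdown:.1%}) than PCIe")
# prints: SXM4 is 0.752 s slower (6.0%) than PCIe
```

A difference of this size comes from a single run on each machine, so it may or may not be larger than normal run-to-run variation.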
The reason for this test is that when I trained an identical deep learning application on both machines, machine one (PCIe) seemed to take less time than machine two (SXM4 form factor), which is surprising. Profiling showed that machine two (SXM4 form factor) has a high kernel launch time.
Also, please let me know if there are any other tests I can run to compare the two machines.
Why is machine two (SXM4 form factor) taking longer, and what solutions do you suggest?
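One possible way to make such a comparison less sensitive to noise is to run the same binary several times on each machine without the profiler attached and compare the median wall-clock time. Time measured under ncu generally includes profiling overhead, so it may not reflect normal runtime. The sketch below is a hypothetical helper, not an NVIDIA utility; `time_command` is an illustrative name, and the command shown is a harmless placeholder that should be replaced with the real application path.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run cmd `repeats` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits with an error.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Placeholder command (assumption): swap in the sample binary, for example
# ["Samples/0_Introduction/concurrentKernels/concurrentKernels"].
placeholder_cmd = [sys.executable, "-c", "pass"]
runs = time_command(placeholder_cmd, repeats=5)
print(f"median {statistics.median(runs):.3f} s, "
      f"min {min(runs):.3f} s, max {max(runs):.3f} s over {len(runs)} runs")
```

Running the same script on both machines and comparing medians gives a rough indication of whether a gap of a few percent is consistent or falls within run-to-run variation.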