I am trying to profile an application that asynchronously launches CUDA kernels on the GPU, but the profiling fails with the following error:
==PROF== Profiling "potrf_alg2_set_info" - 1: 0%
==WARNING== Backing up device memory in system memory. Kernel replay might be slow. Consider using "--replay-mode application" to avoid memory save-and-restore.
==WARNING== Backing up device memory in system memory. Kernel replay might be slow. Consider using "--replay-mode application" to avoid memory save-and-restore.
....50%....100% - 73 passes
==PROF== Profiling "potrf_alg2_cta_upper" - 2: 0%....50%....100% - 71 passes
==ERROR== LaunchFailed
==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==PROF== Report: /home/mannaparambil/dplasma/build/profile.ncu-rep
Hello Joseph,
Thank you for your question on Nsight, and I'm sorry you ran into this problem. I just want to clarify which Nsight product you are using. Is it Nsight Graphics, or a different Nsight product such as Nsight Systems or Nsight Compute?
Regards,
Nsight Compute saves and restores the kernel's device memory state so that it can replay the kernel multiple times. That can double the memory footprint. To avoid this, you can switch to application replay with "--replay-mode application", which reruns the whole application for each pass instead of saving and restoring memory around each kernel. Let me know if that solves your issue.
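As a rough sketch of the suggestion above, the command line might look like the following. The application path and report name are placeholders, not taken from the thread, and the snippet only prints the command so it can be reviewed before running.

```shell
# Placeholders (assumptions, not from the thread): application path and report name.
APP="./my_app"
REPORT="profile"

# Kernel replay is the default. With application replay, ncu reruns the whole
# program once per pass instead of backing up and restoring device memory.
NCU_CMD="ncu --replay-mode application -f -o $REPORT $APP"

# Print the command for review; run it with: eval "$NCU_CMD"
echo "$NCU_CMD"
```

Application replay needs the application to launch the same kernels in the same order on every run, so it is likely a poor fit for non-deterministic workloads.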
My app seems to run fine on its own, or at least it does not show any obvious error.
If I run my app with ncu --set full, I get:
# ncu -f -o ktranspose --import-source on --set full test_kernels --gtest_filter=design/test_transpose.time/0
Note: Google Test filter = design/test_transpose.time/0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from design/test_transpose
[ RUN ] design/test_transpose.time/0
==PROF== Connected to process 193819 (/home/dongwei/Workspace/lightnet/build/tests/test_kernels/test_kernels)
==PROF== Profiling "ktranspose" - 0: 0%....50%....100% - 34 passes
ktranspose: 430889 us
==PROF== Profiling "ktranspose_smem" - 1: 0%....50%....100% - 2 passes
==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==PROF== Report: /home/dongwei/Workspace/lightnet/ktranspose.ncu-rep
If I remove --set full, it goes well:
ncu -f -o ktranspose --import-source on test_kernels --gtest_filter=design/test_transpose.time/0
Note: Google Test filter = design/test_transpose.time/0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from design/test_transpose
[ RUN ] design/test_transpose.time/0
==PROF== Connected to process 193540 (/home/dongwei/Workspace/lightnet/build/tests/test_kernels/test_kernels)
==PROF== Profiling "ktranspose" - 0: 0%....50%....100% - 9 passes
ktranspose: 147706 us
==PROF== Profiling "ktranspose_smem" - 1: 0%....50%....100% - 9 passes
ktranspose_smem: 49922 us
==PROF== Profiling "ktranspose_smem_nbkcft" - 2: 0%....50%....100% - 9 passes
ktranspose_smem_nbkcft: 57446 us
==PROF== Profiling "transpose_readWrite_alignment..." - 3: 0%....50%....100% - 9 passes
cublasSgeam: 55954 us
[ OK ] design/test_transpose.time/0 (3051 ms)
[----------] 1 test from design/test_transpose (3051 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3051 ms total)
[ PASSED ] 1 test.
==PROF== Disconnected from process 193540
==PROF== Report: /home/dongwei/Workspace/lightnet/ktranspose.ncu-rep
Because there are 4 kernels to profile, I ran the crashing kernels with --set full one at a time, and each of them profiled successfully.
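One way to profile a single kernel per run is ncu's kernel-name filter. The sketch below is an assumption about how this could be done, not the exact command used in the post; the binary and kernel names are taken from the log above, and the report name is made up for illustration.

```shell
# Names from the log above; the report name is a hypothetical choice.
APP="./test_kernels"
KERNEL="ktranspose_smem"

# --kernel-name restricts profiling to kernels with this name,
# --launch-count 1 stops after the first matching launch.
NCU_CMD="ncu --set full --kernel-name $KERNEL --launch-count 1 -f -o $KERNEL $APP"

# Print the command for review; run it with: eval "$NCU_CMD"
echo "$NCU_CMD"
```

Repeating this with each kernel name gives one report per kernel while keeping the full metric set.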
I removed my kernel code line by line to find out whether the failure was caused by my own misuse. Even after I removed all the code, so that the kernel was completely empty,
the crash still happened, until I changed the launch config
from <<<GRID, BLOCK, shared_mem, cudaStreamDefault>>>
to <<<GRID, BLOCK, 0, cudaStreamDefault>>>.
After that, all 4 kernels finished profiling in a single ncu run.
So, as far as I can tell, ncu fails when profiling multiple kernels that are launched with dynamic shared memory in a single app …
# ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.2.0 (build 33188574) (public-release)
I encountered similar issues and figured out that a RAM limitation in my system was causing this error code (9). I think that enabling --set full increases the required peak RAM.
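A quick way to sanity-check this is to compare the device memory in use with the host RAM that is available while the application is running. The sketch below is only a heuristic under that assumption: the helper name fits_in_host_ram is hypothetical, and the real peak RAM needed by ncu may be higher than the device memory in use.

```shell
# fits_in_host_ram GPU_USED_MIB HOST_AVAILABLE_MIB  (hypothetical helper)
# Succeeds when available host RAM is at least the device memory in use.
fits_in_host_ram() {
    [ "$2" -ge "$1" ]
}

# Read live values only when both tools are installed.
if command -v nvidia-smi >/dev/null 2>&1 && command -v free >/dev/null 2>&1; then
    # Device memory in use on the first GPU, in MiB.
    gpu_used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)
    # "available" column of free, in MiB.
    host_avail=$(free -m | awk '/^Mem:/ {print $7}')
    if fits_in_host_ram "$gpu_used" "$host_avail"; then
        echo "host RAM looks sufficient for a device-memory backup"
    else
        echo "host RAM may be too small; consider --replay-mode application"
    fi
fi
```

Run it from a second terminal while the application holds its GPU allocations, otherwise the reported device memory usage will not be representative.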
I have also run into kernel replay failing to profile in scenarios with large device memory usage. I had to resort to application replay, but it is too slow, and the results can differ from one replay to the next. I hope that NCU can address this problem in the future so that kernel replay works properly in scenarios with large memory usage.
Thanks for providing these inputs. We’re always trying to improve the stability and user experience of our tools and this type of input is very helpful. We recently released version 2023.3 with several bug fixes. Please try it out and let us know if the issue still occurs.