Nsight Compute Error

wen1664379144 · July 27, 2024, 5:50am

Hello, I am debugging my program using Nsight Compute, but I am having some trouble.

When my replay mode is set to kernel, it reports error 9.

When my replay mode is set to application, it reports error 127.

My program works fine when I run it without ncu.

Is there something I’m doing wrong? Could you tell me how to fix it?

veraj · July 29, 2024, 2:15am

Sorry for the issue you ran into.
Can you please tell us which Nsight Compute version, driver, and GPU you are using?
Can you also provide a repro for us to check?

wen1664379144 · July 29, 2024, 10:19am

Hi, veraj

Thanks for your reply!
The GPU is a Tesla V100.
The report file (kernel replay) is attached. (When I set the replay mode to application, it doesn’t produce a report file.)

The program is FVCOM, which has tens of thousands of lines of code. I’m not sure exactly which function is at fault, so I may not be able to provide a repro.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM3-32GB           Off | 00000000:3B:00.0 Off |                    0 |
| N/A   55C    P0              75W / 350W |    762MiB / 32768MiB |     31%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3-32GB           Off | 00000000:86:00.0 Off |                    0 |
| N/A   44C    P0              50W / 350W |      2MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     35757      C   ./fvg4                                      760MiB |
+---------------------------------------------------------------------------------------+

replay_kernel.zip (796.5 KB)

veraj · July 30, 2024, 7:28am

Hi, @wen1664379144

From the screenshot below, your application has a memory issue. I suggest using the compute-sanitizer tool to check the details and fix it: NVIDIA Compute Sanitizer | NVIDIA Developer.
It is installed with the CUDA Toolkit.
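A minimal sketch of that suggestion follows. It assumes the application binary is ./fvg4 (the name shown in the nvidia-smi process list earlier in the thread) and that compute-sanitizer is on the PATH. memcheck is the sanitizer's default tool, so naming it only makes the intent explicit.

```shell
# Assumption: ./fvg4 is the application binary (taken from the nvidia-smi output).
APP=./fvg4

# memcheck reports out-of-bounds and misaligned memory accesses.
SANITIZER_CMD="compute-sanitizer --tool memcheck $APP"

# Only run when both the tool and the binary are available; otherwise show the command.
if command -v compute-sanitizer >/dev/null 2>&1 && [ -x "$APP" ]; then
    $SANITIZER_CMD
else
    echo "would run: $SANITIZER_CMD"
fi
```

Expect the application to run considerably slower under the sanitizer than it does natively.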

felix_dt · July 30, 2024, 7:48am

As mentioned, this error points to an illegal memory access. There are three possible causes:

1. Your application/kernel has a memory access bug. As linked, you may be able to find this with compute-sanitizer.
2. Your application/kernel has a memory race condition that only triggers when run under the profiler. You may be able to find this with compute-sanitizer’s racecheck tool, but it may also show nothing, as it can’t cover all possible scenarios.
3. There is a bug in ncu’s SASS-patching code, which is used to collect certain software metrics (i.e. many of those you would see on the Source page, like Instructions Executed).

Since you are not collecting any SASS-patched metrics explicitly, the only related data collection is for the SW pre-pass. You can disable this by setting the env var NV_COMPUTE_PROFILER_DISABLE_SW_PRE_PASS=1. Note that if you run under sudo, you need to set it within the privileged environment. Should it turn out that this is the issue, you would need to switch to a newer ncu version that ideally has this issue fixed. If none exists (yet), I would encourage you to file a bug with your code as a repro example so that we can debug and fix the issue internally.
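The two checks described above could be sketched as follows. The binary name ./fvg4 and the report name report_no_prepass are assumptions for illustration; the environment variable is the one named in this post.

```shell
# Assumption: ./fvg4 is the application binary; report_no_prepass is a hypothetical report name.
APP=./fvg4

# Check for race conditions with compute-sanitizer's racecheck tool.
RACECHECK_CMD="compute-sanitizer --tool racecheck $APP"

# Disable the SW pre-pass for this shell and every ncu run started from it.
export NV_COMPUTE_PROFILER_DISABLE_SW_PRE_PASS=1
NCU_CMD="ncu --replay-mode kernel -o report_no_prepass $APP"

# Under sudo the variable has to be set inside the privileged environment, e.g.:
#   sudo env NV_COMPUTE_PROFILER_DISABLE_SW_PRE_PASS=1 ncu --replay-mode kernel -o report_no_prepass ./fvg4

if command -v ncu >/dev/null 2>&1 && [ -x "$APP" ]; then
    $RACECHECK_CMD
    $NCU_CMD
else
    echo "would run: $RACECHECK_CMD"
    echo "would run: $NCU_CMD"
fi
```

If the error disappears with the variable set, that would point toward the SASS-patching path rather than the application.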

wen1664379144 · July 31, 2024, 11:38am

Hi, veraj and felix_dt

Thanks for your reply.
compute-sanitizer worked! I found the problem and solved it, thank you.

But when I went on to analyze the program with Nsight Compute with the replay mode set to application, the profiler has now been running for 8 hours and shows no sign of stopping.
An intermediate file has also been created in the /tmp directory, and it is already 80 GB. Is this normal?

Without Nsight Compute, my program takes only 3 minutes to execute on the GPU.
Is there something wrong with my Nsight Compute setup? Should I keep it running?

wen1664379144 · July 31, 2024, 2:57pm

The intermediate file is now 174 GB and still growing…

felix_dt · July 31, 2024, 3:06pm

By default, ncu profiles every kernel in your application. Given the number of kernels your app launches, this is likely not what you want. You should set appropriate filters to limit the type/count/… of kernels to collect data for.

The documentation (4. Nsight Compute CLI, NsightCompute 12.5 documentation) describes all available options, as does ncu --help.
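As a rough sketch of such filtering, the command below restricts profiling to one kernel and a handful of its launches. The kernel name my_kernel, the counts, and the report name filtered_report are hypothetical placeholders, and ./fvg4 is assumed to be the application binary.

```shell
# Assumptions: ./fvg4 is the application binary; my_kernel is a hypothetical kernel name.
APP=./fvg4
KERNEL_NAME=my_kernel
SKIP=10    # launches to skip before profiling starts (e.g. warm-up iterations)
COUNT=5    # number of launches to profile before ncu stops collecting

NCU_FILTERED_CMD="ncu --kernel-name $KERNEL_NAME --launch-skip $SKIP --launch-count $COUNT -o filtered_report $APP"

if command -v ncu >/dev/null 2>&1 && [ -x "$APP" ]; then
    $NCU_FILTERED_CMD
else
    echo "would run: $NCU_FILTERED_CMD"
fi
```

Profiling a few representative launches per kernel is usually enough to characterize it, and it keeps both run time and report size bounded.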

wen1664379144 · August 2, 2024, 8:29am

Hi, felix_dt

Thanks for your suggestion, but I would like to profile the whole application with Nsight Compute first.

However, the file stored in the /tmp directory is already 440 GB (the file name is nsight-compute-9552-9add), and the root filesystem is running out of free space.

Can you tell me how to set the location of this intermediate file? I want to put it on a disk with more space.

felix_dt · August 2, 2024, 9:15am

You can specify the temporary directory by setting the TMPDIR environment variable. Note that you will not be able to open this file in the UI. You can also limit the list of collected metrics to reduce the file size, using the options described in the documentation linked previously.
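A minimal sketch of both suggestions, under stated assumptions: NCU_TMP is a hypothetical variable that should point at a directory on the larger disk (it defaults to ./ncu_tmp here), ./fvg4 is assumed to be the application binary, and SpeedOfLight is just one example of a section to restrict collection to.

```shell
# Assumption: NCU_TMP is a hypothetical variable naming a directory on a disk with enough space.
NCU_TMP="${NCU_TMP:-$PWD/ncu_tmp}"
mkdir -p "$NCU_TMP"

# ncu places its intermediate files in the directory named by TMPDIR.
export TMPDIR="$NCU_TMP"

# Check how much free space the chosen location has.
df -h "$TMPDIR"

# Assumption: ./fvg4 is the application binary; collecting one section keeps the data smaller.
APP=./fvg4
NCU_TMPDIR_CMD="ncu --replay-mode application --section SpeedOfLight -o app_replay_report $APP"

if command -v ncu >/dev/null 2>&1 && [ -x "$APP" ]; then
    $NCU_TMPDIR_CMD
else
    echo "would run: $NCU_TMPDIR_CMD"
fi
```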

system · August 16, 2024, 9:15am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
