Thanks for your reply!
The GPU is a Tesla V100.
The rep file (replay mode: kernel) is attached. (When I set the replay mode to application, it doesn’t produce rep files.)
The program is FVCOM. It has tens of thousands of lines of code and I’m not sure exactly which function is at fault, so I may not be able to provide a standalone program that reproduces the problem.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM3-32GB           Off | 00000000:3B:00.0 Off |                    0 |
| N/A   55C    P0              75W / 350W |    762MiB / 32768MiB |     31%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3-32GB           Off | 00000000:86:00.0 Off |                    0 |
| N/A   44C    P0              50W / 350W |      2MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     35757      C   ./fvg4                                      760MiB |
+---------------------------------------------------------------------------------------+
From the screenshot below, your sample has a memory issue. I suggest using the compute-sanitizer tool to check the details and fix it: NVIDIA Compute Sanitizer | NVIDIA Developer.
It is installed with the CUDA toolkit.
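For reference, a minimal sketch of a memcheck run. It assumes the binary is ./fvg4 (the process name in the nvidia-smi output above) and that it can be launched directly; adjust the path and add your usual arguments or MPI launcher as needed.

```shell
# Assumed path to the application binary; change it to match your build.
APP=./fvg4

# memcheck reports out-of-bounds and misaligned memory accesses.
# Expect the run to be considerably slower than a normal run.
if command -v compute-sanitizer >/dev/null 2>&1; then
    compute-sanitizer --tool memcheck "$APP"
else
    echo "compute-sanitizer not found on PATH; check the CUDA toolkit install" >&2
fi
```

Building the device code with line information (for nvcc, the -lineinfo option) usually makes the reported locations easier to map back to source.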
As mentioned, this error points to an illegal memory access. It can have one of three causes:
Your application/kernel has a memory access bug. As linked, you may be able to find this with compute-sanitizer.
Your application/kernel has a memory race condition that only triggers when run under the profiler. You may be able to find this with compute-sanitizer’s racecheck tool, but it’s also possible that it won’t show anything, as it can’t cover all possible scenarios.
There is a bug in ncu’s SASS-patching code that is used to collect certain software metrics (i.e. many of those you would see on the Source page, like Instructions Executed).
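To probe the second cause listed above, racecheck can be selected instead of the default memcheck tool. This is only a sketch, again assuming the binary is ./fvg4. Keep in mind that racecheck targets shared-memory hazards, so a clean report does not rule out every race.

```shell
# Assumed path to the application binary; change it to match your build.
APP=./fvg4

# racecheck looks for shared-memory data access hazards between threads.
if command -v compute-sanitizer >/dev/null 2>&1; then
    compute-sanitizer --tool racecheck "$APP"
else
    echo "compute-sanitizer not found on PATH; check the CUDA toolkit install" >&2
fi
```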
Since you are not collecting any SASS-patched metrics explicitly, the only related data collection is for the SW pre-pass. You can disable this by setting the environment variable NV_COMPUTE_PROFILER_DISABLE_SW_PRE_PASS=1. Note that if you run under sudo, you need to set it within the privileged environment. Should it turn out that this is the issue, you would need to switch to a newer ncu version that ideally has this issue fixed. If none exists (yet), I would encourage you to file a bug with your code as a repro example so that we can debug and fix the issue internally.
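A rough sketch of how the variable could be set for a single profiling session. The binary path ./fvg4 and the report name "report" are placeholders, not something taken from your setup.

```shell
# Assumed binary path and a hypothetical report name.
APP=./fvg4
REPORT=report

# Disable the SW pre-pass for every ncu launched from this shell.
export NV_COMPUTE_PROFILER_DISABLE_SW_PRE_PASS=1

if command -v ncu >/dev/null 2>&1; then
    ncu -o "$REPORT" "$APP"
else
    echo "ncu not found on PATH" >&2
fi
```

When profiling under sudo, the variable can be passed into the privileged environment explicitly, for example with `sudo env NV_COMPUTE_PROFILER_DISABLE_SW_PRE_PASS=1 ncu -o report ./fvg4`, since sudo typically does not forward the caller's environment.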
Thanks for your reply.
The compute-sanitizer worked! I found the problem and solved it, thank you.
But when I continued analyzing the program with Nsight Compute, with the replay mode set to application, the profiler has been running for 8 hours so far and shows no sign of stopping.
An intermediate file has also been created in the /tmp directory and is already 80GB. Is this normal?
Without nsight-compute, my program only takes 3 minutes to execute on the GPU.
Is there something wrong with my nsight-compute? Should I keep running it?
By default, ncu will profile every kernel in your application. It appears that given the number of kernels your app is launching, this is likely not what you want. You should set some appropriate filters to limit the type/count/… of kernels to collect data for.
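As an illustration of such filters, here is a sketch that restricts collection to a few launches of kernels matching a name pattern and to one section. The pattern "advect", the skip and count values, and the report name are made up for the example; substitute names from your own application.

```shell
# Assumed binary path; the filter values below are hypothetical.
APP=./fvg4
KERNEL_FILTER='regex:advect'   # only kernels whose name matches this regex
SKIP=100                       # ignore the first 100 matching launches
COUNT=10                       # then profile 10 launches and stop collecting

if command -v ncu >/dev/null 2>&1; then
    ncu --kernel-name "$KERNEL_FILTER" \
        --launch-skip "$SKIP" --launch-count "$COUNT" \
        --section SpeedOfLight \
        -o filtered_report "$APP"
else
    echo "ncu not found on PATH" >&2
fi
```

Limiting both the number of profiled launches and the collected sections should cut run time and report size substantially compared with profiling every kernel with the default set.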
Thanks for your suggestion, but I would like to profile the whole application with Nsight Compute first.
However, the file stored in the /tmp directory is already 440GB (file name is nsight-compute-9552-9add) and the root directory is running out of free space.
Can you tell me how to set the location of this intermediate file? I want to put it on a disk with more space.
You can specify the temporary directory by setting the TMPDIR environment variable. Note that you will not be able to open this intermediate file in the UI. You can also limit the list of collected metrics to reduce the file size, using the filter options documentation linked previously.
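A small sketch of redirecting the temporary files. NCU_SCRATCH is a hypothetical variable used here only to pick the target directory; point it at a mount with enough free space. The binary path and report name are placeholders as well.

```shell
# Assumed binary path.
APP=./fvg4

# Hypothetical scratch location; set NCU_SCRATCH to a directory on the larger disk.
export TMPDIR="${NCU_SCRATCH:-$HOME/ncu-scratch}"
mkdir -p "$TMPDIR"

if command -v ncu >/dev/null 2>&1; then
    ncu --replay-mode application -o report "$APP"
else
    echo "ncu not found on PATH" >&2
fi
```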