Why does this part have such long latency, and which factors affect the latency?

huang_shifei · September 12, 2022, 10:50am

liuyis · September 14, 2022, 3:18am

Hi @huang_shifei, I’m happy to help.

Could you share your report file?

Could you collect one more report, with GPU metrics and GPU context switch enabled? If you are using Windows, could you also enable WDDM trace?

huang_shifei · September 14, 2022, 4:46am

Hi @liuyis, thanks for your reply. We can give you some pictures, but sharing the report file may not be convenient. Could you tell me how to enable GPU metrics and GPU context switch?

liuyis · September 14, 2022, 4:59am

Got it. Could you zoom in on the timeline and show us the full name of the CUDA API circled by the red box?

could you tell me how to enable the GPU metrics and GPU context switch?

How were you collecting the current report? Was it through GUI or CLI?

huang_shifei · September 14, 2022, 6:40am

The API name is cudaSignalExternalSemaphoresAsync_v2, and we use the CLI to debug. If we want to solve this issue, what can we do? Which part should we look at?

liuyis · September 14, 2022, 11:03am

The screenshot seems to indicate it’s something beginning with “cudaW…”. Could you double-check the report?

To enable the GPU metrics feature, add the CLI switch “--gpu-metrics-device=all”.

To enable the GPU context switch feature, add “--gpuctxsw=true”.

Note that both features may require root permission.
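
For example, the two switches can be combined in a single collection. This is only a sketch: `./my_app` and the report name `gpu_metrics_report` are placeholders rather than anything from your setup, and `sudo` is there only because these features may need root permission.

```shell
# Placeholders (assumptions): ./my_app is your application,
# gpu_metrics_report is an arbitrary output name.
# --gpu-metrics-device=all : sample GPU metrics on all supported GPUs
# --gpuctxsw=true          : trace GPU context switches
sudo nsys profile --gpu-metrics-device=all --gpuctxsw=true --output=gpu_metrics_report ./my_app
```

If a switch is rejected, `nsys profile --help` lists the options that your installed version supports.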

huang_shifei · September 18, 2022, 12:39pm

I have tried these input parameters, but they are not supported yet. We have taken some photos of the report.
report photos.zip (803.3 KB)

Would you mind giving me some advice or direction for solving this issue? Thanks very much.

liuyis · September 19, 2022, 2:47am

From the screenshot, before triggering GPU activities from the CPU thread, you’ve called cudaWaitExternalSemaphoresAsync(). According to the documentation, cudaWaitExternalSemaphoresAsync() will cause the GPU stream to wait for a set of externally allocated semaphore objects before moving on, and that could be why the GPU didn’t immediately execute your model.

Please check your code to see which semaphore objects the cudaWaitExternalSemaphoresAsync() call makes the GPU wait for. The documentation explains the different types of objects and the actual behavior when you wait on them.

Could you clarify a bit more about “not supported yet”? What’s the Nsys version you were using? What are the output messages when you try those switches?

huang_shifei · September 19, 2022, 5:43am

I get your point, but I do not call cudaWaitExternalSemaphoresAsync directly; maybe this API is called by NVStream. Also, in the nsys GPU timeline, it seems that the GPU is not blocked.

So I still do not know how to investigate this issue. Could you give me any suggestions?

nsys version: 2021.4.2.37-0d7f8f7

liuyis · September 19, 2022, 10:25am

but I do not call cudaWaitExternalSemaphoresAsync directly; maybe this API is called by NVStream

I’m not so familiar with NVStream. Could you provide more information about your application? Is it possible to share a minimal reproducer for us to run on our side? If not, it might be helpful to share the corresponding code block for the screenshot you captured.

A Google search suggests NVStream is related to the NvSci library, and I saw the following in the doc I shared in my last reply:

If the semaphore object is of the type cudaExternalSemaphoreHandleTypeNvSciSync then, waiting on the semaphore will wait until the cudaExternalSemaphoreSignalParams::params::nvSciSync::fence is signaled by the signaler of the NvSciSyncObj that was associated with this semaphore object. By default, waiting on such an external semaphore object causes appropriate memory synchronization operations to be performed over all external memory objects that are imported as cudaExternalMemoryHandleTypeNvSciBuf. This ensures that any subsequent accesses made by other importers of the same set of NvSciBuf memory object(s) are coherent. These operations can be skipped by specifying the flag cudaExternalSemaphoreWaitSkipNvSciBufMemSync, which can be used as a performance optimization when data coherency is not required. But specifying this flag in scenarios where data coherency is required results in undefined behavior.

Could that be related?

in the nsys GPU timeline, it seems that the GPU is not blocked.

The GPU might be just waiting for the semaphore objects and not doing anything.

nsys version: 2021.4.2.37-0d7f8f7

Please try our latest version from https://developer.nvidia.com/nsight-systems, and use that to capture a report with GPU metrics sampling and GPU context switch trace.
