Why does this part have such long latency, and which factors affect the latency?

Hi @huang_shifei, I’m happy to help.

Could you share your report file?

Could you collect one more report, with GPU metrics and GPU context switch enabled? If you are using Windows, could you also enable WDDM trace?

Hi @liuyis, thanks for your reply. We can give you some pictures, but sharing the report file may not be convenient. Also, could you tell me how to enable the GPU metrics and GPU context switch?

Got it. Could you zoom in on the timeline and show us the full name of the CUDA API circled by the red box?

could you tell me how to enable the GPU metrics and GPU context switch?

How were you collecting the current report? Was it through GUI or CLI?

The API name is cudaSignalExternalSemaphoresAsync_v2, and we use the CLI to debug. What can we do to solve this issue, and which part should we work on?

The screenshot seems to indicate it’s something beginning with “cudaW…”. Could you double-check the report?

To enable the GPU metrics feature, add the CLI switch “--gpu-metrics-device=all”.

To enable the GPU context switch feature, add “--gpuctxsw=true”.

Note that both features may require root permission.
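
As a rough illustration of how the two switches fit into a full command line, here is a minimal sketch. The report name `report_gpu` and the application `./my_app` are placeholders (assumptions, not taken from this thread), and the script only prints the command unless `RUN_PROFILE=1` is set:

```shell
#!/bin/sh
# Print an nsys command line with GPU metrics sampling and
# GPU context switch trace enabled.
# $1: output report name (placeholder); remaining args: application and its arguments.
build_nsys_cmd() {
    report_name="$1"
    shift
    echo "nsys profile --gpu-metrics-device=all --gpuctxsw=true -o $report_name $*"
}

# ./my_app is a placeholder for the real application.
cmd=$(build_nsys_cmd report_gpu ./my_app)
echo "$cmd"

# Execute only when explicitly requested and nsys is on PATH.
# Both features may require root, so this may need to be run with sudo.
if [ "${RUN_PROFILE:-0}" = "1" ] && command -v nsys >/dev/null 2>&1; then
    $cmd
fi
```

Printing the command first makes it easy to paste the exact line back into this thread if a switch is rejected.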

I have tried these input parameters, but they are not supported yet. We have taken some photos of the report.
report photos.zip (803.3 KB)

Would you mind giving me some advice or direction for solving this issue? Thanks very much.

From the screenshot, before triggering GPU activities from the CPU thread, you’ve called cudaWaitExternalSemaphoresAsync(). According to the documentation, cudaWaitExternalSemaphoresAsync() causes the GPU stream to wait for a set of externally allocated semaphore objects before moving on, and that could be why the GPU didn’t immediately execute your model.

Please check your code to see which semaphore objects the cudaWaitExternalSemaphoresAsync() call made the GPU wait for. The documentation explains the different types of objects and the actual behavior when you wait on them.

Could you clarify a bit more about “not supported yet”? What’s the Nsys version you were using? What are the output messages when you try those switches?
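
One way to gather that information in a single file is sketched below. The log file name is a placeholder, and the use of `--gpu-metrics-device=help` to list the GPUs that support metrics sampling is based on my reading of the CLI documentation, so treat it as an assumption for older versions:

```shell
#!/bin/sh
# Save the nsys version and the messages printed by the GPU metrics
# switch to a log file that can be attached to a forum reply.
LOG="${LOG:-nsys_diagnostics.log}"   # placeholder log file name

collect_diagnostics() {
    if ! command -v nsys >/dev/null 2>&1; then
        echo "nsys not found on PATH"
        return 0
    fi
    nsys --version
    # Assumption: 'help' asks nsys to list the GPUs that support metrics sampling.
    nsys profile --gpu-metrics-device=help 2>&1 || true
}

collect_diagnostics > "$LOG" 2>&1
echo "diagnostics written to $LOG"
```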

I get your point, but I do not call cudaWaitExternalSemaphoresAsync directly; maybe it is called by NVStream. Also, in the nsys GPU timeline, it seems that the GPU is not blocked.

So I still do not know how to investigate this issue. Could you give me any suggestions?

nsys version: 2021.4.2.37-0d7f8f7

but I do not call cudaWaitExternalSemaphoresAsync directly; maybe it is called by NVStream

I’m not very familiar with NVStream. Could you provide more information about your application? Is it possible to share a minimal reproducer for us to run on our side? If not, it might be helpful to share the code block corresponding to the screenshot you captured.

A Google search suggests NVStream is related to the NvSci library, and I saw the following in the doc I shared in my last reply:

  • If the semaphore object is of the type cudaExternalSemaphoreHandleTypeNvSciSync then, waiting on the semaphore will wait until the cudaExternalSemaphoreSignalParams::params::nvSciSync::fence is signaled by the signaler of the NvSciSyncObj that was associated with this semaphore object. By default, waiting on such an external semaphore object causes appropriate memory synchronization operations to be performed over all external memory objects that are imported as cudaExternalMemoryHandleTypeNvSciBuf. This ensures that any subsequent accesses made by other importers of the same set of NvSciBuf memory object(s) are coherent. These operations can be skipped by specifying the flag cudaExternalSemaphoreWaitSkipNvSciBufMemSync, which can be used as a performance optimization when data coherency is not required. But specifying this flag in scenarios where data coherency is required results in undefined behavior.

Could that be related?

in the nsys GPU timeline, it seems that the GPU is not blocked.

The GPU might just be waiting for the semaphore objects and not doing anything, which would look idle rather than blocked on the timeline.

nsys version: 2021.4.2.37-0d7f8f7

Please try our latest version from https://developer.nvidia.com/nsight-systems, and use that to capture a report with GPU metrics sampling and GPU context switch trace.