cudaMemcpyAsync() stays blocked for over 100 ms

Please provide the following info (tick the boxes after creating this topic):
Software Version
[o] DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)

Target Operating System
[o] Linux

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
[o] other : 940-63710-0010-300

SDK Manager Version

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
[o] native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers

Hello, I’m newly trying to run our inference software on a DRIVE AGX Orin.
But I ran into an unexplained problem: our software sometimes pauses in an idle state.
I found that this problem is caused by cudaMemcpyAsync(), which blocks after the function call.
I checked it with the Nsight tool and attached a captured image.

You can also see the red boxes named cudaMemcpyAsync.
They take a long time in our processing: 273 ms, 104 ms, and 1016 ms.

I also found some error messages in the Linux dmesg output that were printed at the same time.

[Feb17 07:42] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:90   [ERR]  Reported err_id(0x2e0) to Safety_Services
[  +0.000008] nvgpu: 17000000.ga10b      ga10b_fifo_ctxsw_timeout_isr:277  [ERR]  Host pfifo ctxsw timeout error

I can’t think of any solution for this problem. Please have a look.

Dear @cheolwoong.eom,
What is the size of those memcpy calls? Is the direction H2D, D2H, or D2D? Also, does this time indicate CUDA API launch overhead or the memory copy time itself?

Hi, @SivaRamaKrishnaNV

I just tested my S/W again; a 2048-byte D2H cudaMemcpy takes 975 ms.
But I believe it is not related to the copy size, and all of the problematic operations are D2H.
I think this time is spent in the memory copy itself, but I’m not sure. I only found this issue from outside the given CUDA API functions.
When I replaced cudaMemcpyAsync() with cudaMemcpy(), I got the same result.

Are the ioctl() calls related to cudaMemcpy()? In the timeline these functions appear in the same place.

Dear @cheolwoong.eom,
Could you attach the full nsys report here?

Dear @SivaRamaKrishnaNV
OK. I attached nsys-rep file here.
Report1.nsys-rep.tgz (82.0 MB)

Dear @cheolwoong.eom,
We will look into the report. Just want to confirm whether the cudaMemcpyAsync() is using pinned memory buffers.
Also, could you share the Nsight Systems version used to create the report?

Dear @SivaRamaKrishnaNV

We don’t use pinned memory; we just allocate the buffers with the `new` operator.
And the Nsight version is:
Version: 2022.3.2.34-3e5e9a1 Linux.
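[Editor's note, not part of the original thread: cudaMemcpyAsync() falls back to a synchronous, staged copy when the host buffer is pageable memory from `new`. A pinned buffer from cudaMallocHost() is what allows the copy to run asynchronously. A minimal D2H sketch, using the 2048-byte size reported above:]

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t kBytes = 2048;        // size reported in this thread
    void* d_buf = nullptr;
    void* h_pinned = nullptr;

    cudaMalloc(&d_buf, kBytes);
    // Pinned (page-locked) host memory: required for cudaMemcpyAsync
    // to be truly asynchronous with respect to the host.
    cudaMallocHost(&h_pinned, kBytes); // instead of: new char[kBytes]

    cudaMemcpyAsync(h_pinned, d_buf, kBytes, cudaMemcpyDeviceToHost, 0);
    cudaDeviceSynchronize();           // wait before reading h_pinned

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```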

Thank you for your support.

Dear @cheolwoong.eom,
It is strange; I could not load your report in my nsys 2022.3.2.

We don’t use pinned memory, just make buffer using “new” operation.

Is cudaMemcpyAsync() used to overlap data transfer with computation? Also, is it possible to move the memory allocations to an initialization module, so that they do not cause latencies like these during the execution of the application?
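[Editor's note: a minimal sketch of the suggestion above, with assumed names. The idea is to allocate a pinned buffer once at startup, sized for the worst case, and reuse it, rather than re-allocating per frame:]

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocate once at init, reuse for every copy.
struct PostProcBuffers {
    void*  h_out = nullptr;
    size_t capacity = 0;

    void init(size_t max_bytes) {            // called once at startup
        cudaMallocHost(&h_out, max_bytes);   // pinned, worst-case size
        capacity = max_bytes;
    }
    void copy_back(const void* d_src, size_t bytes, cudaStream_t s) {
        // bytes may vary per frame; no re-allocation needed as long
        // as bytes <= capacity.
        cudaMemcpyAsync(h_out, d_src, bytes, cudaMemcpyDeviceToHost, s);
    }
    void destroy() { cudaFreeHost(h_out); }
};
```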

Dear @SivaRamaKrishnaNV

I don’t understand why you failed to load my report file. I could reload it without problems.

We are using cudaMemcpyAsync() to bring results back for post-processing on the CPU after the ML processing on the GPU.
Because the buffer size is not always the same, our software often re-allocates memory before cudaMemcpyAsync().
We have no plan to re-architect our software right now.
I have already tested the same code on other NVIDIA development kits such as the Jetson AGX Orin, DRIVE AGX Pegasus, etc., and I’ve never experienced a problem like this.

Thank you for your support.

Dear @cheolwoong.eom,
I could open the report. Could you confirm whether the cudaMemcpyAsync() calls are launched on the default stream in your code? Did you set the cudaDeviceScheduleBlockingSync flag?

Dear @SivaRamaKrishnaNV

Good to hear that you were able to open my report file.
I’m using the default stream only, and I do not set the cudaDeviceScheduleBlockingSync flag.

Thank you for your support.

Dear @cheolwoong.eom ,
Can you try moving these cudaMemcpyAsync() calls to a non-default stream?
Also, is it possible to share a simple repro code?
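[Editor's note: a minimal sketch of the non-default-stream suggestion, with assumed buffer sizes. A non-blocking stream avoids the implicit synchronization that the default stream has with other streams:]

```cuda
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    // Non-blocking: does not implicitly synchronize with the default stream.
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    void* h_pinned = nullptr;
    void* d_buf = nullptr;
    cudaMallocHost(&h_pinned, 2048);   // pinned host buffer
    cudaMalloc(&d_buf, 2048);

    cudaMemcpyAsync(h_pinned, d_buf, 2048, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);     // wait on this stream only

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```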

I could not reproduce this issue in DRIVE OS 6.0.6.
Thank you for your support.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.