Please provide the following info (tick the boxes after creating this topic):
Software Version
[o] DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other
Target Operating System
[o] Linux
QNX
other
Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
[o] other : 940-63710-0010-300
SDK Manager Version
[o] 1.9.1.10844
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
[o] native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other
Hello, I’m trying to run our inference software on the DRIVE AGX Orin for the first time.
However, I ran into an unexplained problem: our S/W sometimes pauses in an idle state.
I found that this problem is caused by the cudaMemcpyAsync() function, which blocks after the call.
I checked it with the Nsight tool and attached a captured image.
Dear @cheolwoong.eom,
What is the size of those memcpy calls? Is the direction H2D, D2H, or D2D? Also, does this time indicate CUDA API launch overhead or memory copy time?
I just tested my S/W again; it takes 975 msec for a 2048-byte D2H cudaMemcpy.
But I believe it is not related to the copy size, and all of the problematic operations are D2H.
I think this time is caused by the memory copy itself, but I’m not sure; I only found this issue outside of the given CUDA API functions.
When I replaced cudaMemcpyAsync() with cudaMemcpy(), I got the same result.
Do the ioctl() calls have any relation to cudaMemcpy()? In the timeline these functions appear in the same place.
Dear @cheolwoong.eom,
We will look into the report. Just want to confirm: is cudaMemcpyAsync() using pinned memory buffers?
Also, could you share the Nsight Systems version used to create the report?
Dear @cheolwoong.eom,
It is strange; I could not load your report in my nsys 2022.3.2.
We don’t use pinned memory; we just allocate the buffers with the “new” operator.
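For context, cudaMemcpyAsync() only behaves fully asynchronously when the host buffer is pinned (page-locked); with a pageable buffer from `new`/`malloc`, a D2H copy is staged and the call returns only after the copy completes. Below is a minimal sketch of the two allocation styles (buffer names and the 2048-byte size are illustrative, not taken from the original code):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 2048;           // example size mentioned in this thread
    unsigned char* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);           // device source buffer

    // Pageable host memory (what "new" gives you): the D2H cudaMemcpyAsync
    // is staged through a driver buffer and effectively blocks the caller.
    unsigned char* h_pageable = new unsigned char[bytes];

    // Pinned (page-locked) host memory: allows a truly asynchronous D2H copy.
    unsigned char* h_pinned = nullptr;
    cudaMallocHost(&h_pinned, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(h_pageable, d_buf, bytes, cudaMemcpyDeviceToHost, stream); // may block
    cudaMemcpyAsync(h_pinned,   d_buf, bytes, cudaMemcpyDeviceToHost, stream); // overlappable
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_pinned);
    delete[] h_pageable;
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```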
Is cudaMemcpyAsync() used to overlap data transfer with computation? Also, is it possible to move the memory allocations to an initialization module so that they do not cause latencies like these during execution of the application?
We are using cudaMemcpyAsync() to bring results back for CPU post-processing after ML processing on the GPU.
Because the buffer size is not always the same, our software often re-allocates memory before cudaMemcpyAsync().
We have no plans to re-architect our software right now.
I have already tested the same code on other NVIDIA development kits such as the Jetson AGX Orin, DRIVE AGX Pegasus, etc., and I’ve never experienced a problem like this.
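For readers with the same varying-size constraint, one common way to follow the initialization suggestion above without a larger re-architecture is to allocate a single pinned staging buffer once, sized to a worst-case bound, and reuse it for every copy. A sketch assuming a hypothetical `kMaxBytes` upper bound (not from the original code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical upper bound on the post-processing output size.
constexpr size_t kMaxBytes = 1 << 20;

struct D2HStage {
    unsigned char* host = nullptr;    // pinned staging buffer, allocated once
    cudaStream_t   stream = nullptr;

    void init() {
        cudaMallocHost(&host, kMaxBytes);   // allocate at startup, not per inference
        cudaStreamCreate(&stream);
    }

    // Copy 'bytes' (<= kMaxBytes) from device memory without re-allocating.
    void copyBack(const void* d_src, size_t bytes) {
        cudaMemcpyAsync(host, d_src, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);      // wait only when the result is needed
    }

    void shutdown() {
        cudaStreamDestroy(stream);
        cudaFreeHost(host);
    }
};
```

Reusing one pinned buffer avoids both the per-iteration allocation cost and the pageable-memory synchronization behavior described above.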
Dear @cheolwoong.eom,
I could open the report. Could you confirm whether the cudaMemcpyAsync() calls are launched on the default stream in your code? Did you set the cudaDeviceScheduleBlockingSync flag in your code?
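For anyone checking the same two things in their own code, here is a small sketch (buffer names and sizes are illustrative) of querying the device scheduling flags and issuing the copy on a user-created, non-default stream:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Check whether the blocking-sync scheduling flag is set for this device.
    unsigned int flags = 0;
    cudaGetDeviceFlags(&flags);
    if (flags & cudaDeviceScheduleBlockingSync)
        std::printf("cudaDeviceScheduleBlockingSync is set\n");

    // Use a non-default stream so the copy is not ordered behind
    // unrelated work on the default (legacy) stream.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    const size_t bytes = 2048;            // example size from this thread
    void* d_buf = nullptr;
    void* h_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_buf, bytes);        // pinned host buffer

    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```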