[Jetson AGX Orin] Kernel execution is frequently delayed after kernel launch

I am using iGPU on Jetson AGX Orin for Matrix Vector Multiplication.

The issue is captured in the nsys screenshots below.
Top: Overview of the 2 processes using the CUDA HW (Orin).
Lower-left: The kernel is executed right after the kernel launch, which is normal behavior.
Lower-right: Kernel execution is delayed after the kernel launch. The delay happens frequently and appears to recur in roughly 100 ms cycles.

How can I stop this kernel execution delay from occurring?

Condition is:

  • Zero copy (mapped memory) is used with cudaHostAllocMapped in cudaHostAlloc(); a minimal allocation sketch follows this list. The same issue is observed when using cudaMemcpy or unified memory.
  • sudo nvpmodel -m 0 and sudo jetson_clocks are executed.
  • MPS is enabled.
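
For reference, here is a minimal sketch of the kind of zero-copy setup described above, assuming the small 4x2 matrix and 2x1 vector mentioned later in this thread; the names and sizes are illustrative, not copied from the attached source.

/* zero_copy_sketch.cu (illustrative) - build with: nvcc zero_copy_sketch.cu -o zero_copy_sketch */
#include <cuda_runtime.h>

int main(void)
{
    const int rows = 4, cols = 2;      /* assumed 4x2 matrix, 2x1 vector */
    float *hA, *hx, *hy;               /* host pointers (mapped, pinned) */
    float *dA, *dx, *dy;               /* device aliases of the same memory */

    /* Enable mapped pinned allocations before any other CUDA call. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc((void **)&hA, rows * cols * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&hx, cols * sizeof(float),        cudaHostAllocMapped);
    cudaHostAlloc((void **)&hy, rows * sizeof(float),        cudaHostAllocMapped);

    /* Get device pointers that alias the host allocations (zero copy). */
    cudaHostGetDevicePointer((void **)&dA, hA, 0);
    cudaHostGetDevicePointer((void **)&dx, hx, 0);
    cudaHostGetDevicePointer((void **)&dy, hy, 0);

    /* ... fill hA/hx, call cublasSgemv() with dA/dx/dy, then synchronize ... */

    cudaFreeHost(hA); cudaFreeHost(hx); cudaFreeHost(hy);
    return 0;
}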

Hi,

It looks like your GPU resources might be fully occupied.
Is it possible that the kernel needs to wait for resources before it can execute?

Are you able to verify this?
For example, reduce the task load with a simpler kernel to see if the same behavior occurs, as in the sketch below.
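
For instance, something as small as the following (illustrative only, not taken from your application) can be run under nsys to check whether the launch-to-execution gaps still appear without cuBLAS:

/* noop_test.cu (illustrative) - build with: nvcc noop_test.cu -o noop_test */
#include <cuda_runtime.h>

__global__ void noop_kernel(void) { }      /* does no work; isolates launch latency */

int main(void)
{
    for (int i = 0; i < 10000; ++i) {
        noop_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();           /* complete each launch before the next */
    }
    return 0;
}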

Thanks.

Hi @AastaLLL ,

As far as I can tell from jtop, the GPU is not fully occupied.
There are still GPU resources as well as memory available.

In my application, cublasSgemv() is used for Matrix Vector Multiplication.
The same issue is observed when reducing the size of matrix and vector (just 4x2 matrix and 2x1 vector).
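
For reference, the call is of the following form; this is only a sketch with illustrative names, not the exact code from the attachment.

/* y = alpha * A * x + beta * y for a 4x2 column-major A (no transpose). */
#include <cublas_v2.h>

void gemv_once(cublasHandle_t handle,
               const float *dA, const float *dx, float *dy)
{
    const int m = 4, n = 2;                /* 4x2 matrix, 2x1 vector */
    const float alpha = 1.0f, beta = 0.0f;

    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, dA, m,             /* lda = m for a dense column-major A */
                dx, 1, &beta, dy, 1);
}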

Do you have any thoughts about this occurrence, which appears to be approximately cyclical?

Hi,

One possible cause is that the tasks are waiting in the queue when submitting to the GPU.
Please try to increase the work queue number to see if it helps:

$ export CUDA_DEVICE_MAX_CONNECTIONS=32

Here is the corresponding document for your reference:

If this doesn’t help, could you share a reproducible source and steps so we can try it in our environment as well?

Thanks.

Hi @AastaLLL ,

Setting this variable did not resolve the issue.

export CUDA_DEVICE_MAX_CONNECTIONS=32

I tried with a simple application that does something similar to the original one.

From the nsys report, ioctl appears to periodically block kernel execution after the kernel launch.
What is ioctl running periodically, and is there any way to stop it?

The reproducible source is packed in the attachment below.
source.zip (2.8 KB)

For compilation,

nvcc process1.c -o process1 -lcublas
nvcc process2.c -o process2 -lcublas

For run with MPS and nsys,

CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process1 ./process1 10000
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process2 ./process2 10000

Hi,

It looks like you are using MPS on Jetson.
Is there any dependency between processes?

Do you also meet the same issue without using MPS?
For example, two standalone GPU tasks?

Thanks.

Hi @AastaLLL

Is there any dependency between processes?

One process uses the outputs of the other one.
This is to simulate the inter-process communication in the original application.
MPS is used to reduce the overhead of GPU context switches when the two processes request GPU calculations.

Do you also meet the same issue without using MPS?

Yes, the same issue is observed without MPS too.

A similar issue is reported in the thread below as well:
When executing cuda API occasionally, ioctl blocks for tens of milliseconds - CUDA / CUDA NVCC Compiler - NVIDIA Developer Forums

Hi,

If the same issue also occurs without MPS, could you share the source without using it?
Thanks.

Hi @AastaLLL ,

I ran only one process without MPS.
Blocking by ioctl is still observed.
Could you help check?

Nsys report:
process_wo_mps.zip (776.3 KB)

Source:
process_wo_mps.zip (1.3 KB)

Command:
nvcc process_wo_mps.c -o process_wo_mps -lcublas
nsys profile -o ./process_wo_mps ./process_wo_mps 10000

Hi,

Thanks for sharing the resources.
We tried to reproduce this issue locally but hit a missing-file error.

Could you also share the source with us?

$ nvcc process_wo_mps.c -o process_wo_mps -lcublas -I /usr/local/cuda-12.6/include
process_wo_mps.c:9:10: fatal error: common.h: No such file or directory
    9 | #include "common.h"
      |          ^~~~~~~~~~
compilation terminated.

Thanks.

Hi @AastaLLL ,

Sorry for the inconvenience.
Could you fetch common.h from this package?
source.zip (2.8 KB)

Thanks,

Hi,

We double-checked the source and the profiling output shared on Apr 9.
Setting CUDA_DEVICE_MAX_CONNECTIONS does help.

The below experiment is done with the non-MPS sample.
Before increasing the max queue number, we do see the launch latency increase from ~12 µs to ~1 ms.

This matches the value of the nsys output you shared on Apr 9.

Once we set CUDA_DEVICE_MAX_CONNECTIONS to 32 (the maximum number), the periodic long latency disappears and the launch latency stays at around 11-13 µs.

We also validated the same with the MPS sample, and the launch latency is around 13x µs.

Attaching nsys output with default and 32 max queue numbers for your reference.
process_wo_mps_default.nsys-rep (1.3 MB)
process_wo_mps_queue32.nsys-rep (1.3 MB)

Thanks.

Hello @AastaLLL ,

Thank you very much for the result.

Just to be sure, were the commands executed as follows in your environment?

Without MPS:

export CUDA_DEVICE_MAX_CONNECTIONS=32
nvcc process_wo_mps.c -o process_wo_mps -lcublas
nsys profile -o ./process_wo_mps ./process_wo_mps 10000

With MPS:

export CUDA_DEVICE_MAX_CONNECTIONS=32
nvcc process1.c -o process1 -lcublas
nvcc process2.c -o process2 -lcublas
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process1 ./process1 10000
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process2 ./process2 10000

Is there any way to check the total number of queues submitted to the GPU?
It may be that other applications in my environment are submitting more queues than in your environment, which leads to the ioctl blocking even though the limit is raised with CUDA_DEVICE_MAX_CONNECTIONS=32.

Hi,

1.
Yes, but we ran the profiler with root privileges.

export CUDA_DEVICE_MAX_CONNECTIONS=32
sudo /opt/nvidia//nsight-systems/2024.5.4/bin/nsys profile -o ./process_wo_mps_no_max ./process_wo_mps 10000

2.
Unfortunately, this is controlled by the low-level GPU scheduler and no such info can be retrieved.
Usually, this behavior comes from many small kernels, such as those in DNNs.

But you can try whether a CUDA graph helps for your use case; the feature is designed to minimize launch latency.
A capture sketch is shown below.
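
For example, the per-iteration work can be captured once and replayed. The code below is only a sketch, assuming a single cublasSgemv() is launched each cycle; the parameters are illustrative and the 3-argument cudaGraphInstantiate() is the CUDA 12 signature.

/* Capture one iteration into a CUDA graph, then replay it each cycle. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void run_with_graph(cublasHandle_t handle,
                    const float *dA, const float *dx, float *dy, int iterations)
{
    const int m = 4, n = 2;
    const float alpha = 1.0f, beta = 0.0f;

    cudaStream_t    stream;
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    cudaStreamCreate(&stream);
    cublasSetStream(handle, stream);        /* make cuBLAS launch into the captured stream */

    /* Record one iteration of work into a graph. */
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, 0);

    for (int i = 0; i < iterations; ++i) {
        cudaGraphLaunch(graphExec, stream); /* one lightweight launch per cycle */
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}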

Thanks.

Hello @AastaLLL ,

Thank you for the link.
Using a tracing tool (strace), I found that Xorg and gnome-shell are calling the ioctl() system call, so they are likely the processes interrupting my application.
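
A command along these lines can be used for that kind of check (the exact invocation here is illustrative, not a copy of what I ran):

sudo strace -f -e trace=ioctl -p $(pidof Xorg) -o xorg_ioctl.log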

Setting up the Jetson in CUI (console-only) mode disables those applications, which use the GPU for the GUI, and this reduced the latency caused by ioctl().

Closing the item.

Command to set up CUI mode:

sudo systemctl set-default multi-user.target
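
To switch back to the desktop GUI later, the default target can be restored with:

sudo systemctl set-default graphical.target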

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.