[Jetson AGX Orin] Kernel execution is frequently delayed after kernel launch

I am using iGPU on Jetson AGX Orin for Matrix Vector Multiplication.

The issue is captured in the nsys screenshots below.
Top: Overview of the 2 processes using the CUDA HW (Orin).
Lower-left: The kernel is executed right after the kernel launch, which is normal behavior.
Lower-right: Kernel execution is delayed after the kernel launch. The delay happens frequently and appears to recur in roughly 100 ms cycles.

How can I stop this kernel execution delay from occurring?

Condition is:

  • Zero copy (mapped memory) is used with cudaHostAllocMapped in cudaHostAlloc(); a minimal allocation sketch follows this list. The same issue is observed when using cudaMemcpy or unified memory.
  • sudo nvpmodel -m 0 and sudo jetson_clocks are executed.
  • MPS is enabled.
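
For reference, here is a minimal sketch of the kind of zero-copy setup described above, assuming the small 4x2 matrix and 2x1 vector mentioned later in this thread; the names and sizes are illustrative, not copied from the attached source.

/* zero_copy_sketch.cu (illustrative) - build with: nvcc zero_copy_sketch.cu -o zero_copy_sketch */
#include <cuda_runtime.h>

int main(void)
{
    const int rows = 4, cols = 2;      /* assumed 4x2 matrix, 2x1 vector */
    float *hA, *hx, *hy;               /* host pointers (mapped, pinned) */
    float *dA, *dx, *dy;               /* device aliases of the same memory */

    /* Enable mapped pinned allocations before any other CUDA call. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc((void **)&hA, rows * cols * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&hx, cols * sizeof(float),        cudaHostAllocMapped);
    cudaHostAlloc((void **)&hy, rows * sizeof(float),        cudaHostAllocMapped);

    /* Get device pointers that alias the host allocations (zero copy). */
    cudaHostGetDevicePointer((void **)&dA, hA, 0);
    cudaHostGetDevicePointer((void **)&dx, hx, 0);
    cudaHostGetDevicePointer((void **)&dy, hy, 0);

    /* ... fill hA/hx, call cublasSgemv() with dA/dx/dy, then synchronize ... */

    cudaFreeHost(hA); cudaFreeHost(hx); cudaFreeHost(hy);
    return 0;
}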

Hi,

It looks like your GPU resources might be fully occupied.
Is it possible that the kernel needs to wait for resources before it can execute?

Are you able to verify this?
For example, reduce the task load with a simpler kernel to see if the same behavior occurs, as in the sketch below.
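
For instance, something as small as the following (illustrative only, not taken from your application) can be run under nsys to check whether the launch-to-execution gaps still appear without cuBLAS:

/* noop_test.cu (illustrative) - build with: nvcc noop_test.cu -o noop_test */
#include <cuda_runtime.h>

__global__ void noop_kernel(void) { }      /* does no work; isolates launch latency */

int main(void)
{
    for (int i = 0; i < 10000; ++i) {
        noop_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();           /* complete each launch before the next */
    }
    return 0;
}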

Thanks.

Hi @AastaLLL ,

As far as I can tell from jtop, the GPU is not fully occupied.
There are still GPU resources as well as memory available.

In my application, cublasSgemv() is used for Matrix Vector Multiplication.
The same issue is observed when reducing the size of matrix and vector (just 4x2 matrix and 2x1 vector).
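
For reference, the call is of the following form; this is only a sketch with illustrative names, not the exact code from the attachment.

/* y = alpha * A * x + beta * y for a 4x2 column-major A (no transpose). */
#include <cublas_v2.h>

void gemv_once(cublasHandle_t handle,
               const float *dA, const float *dx, float *dy)
{
    const int m = 4, n = 2;                /* 4x2 matrix, 2x1 vector */
    const float alpha = 1.0f, beta = 0.0f;

    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, dA, m,             /* lda = m for a dense column-major A */
                dx, 1, &beta, dy, 1);
}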

Do you have any thoughts about this occurrence, which appears to be approximately cyclical?

Hi,

One possible cause is that the tasks are waiting in the queue when submitting to the GPU.
Please try to increase the work queue number to see if it helps:

$ export CUDA_DEVICE_MAX_CONNECTIONS=32

Here is the corresponding document for your reference:

If this doesn’t help, could you share a reproducible source and steps so we can try it in our environment as well?

Thanks.

Hi @AastaLLL ,

Setting this variable did not resolve the issue.

export CUDA_DEVICE_MAX_CONNECTIONS=32

I tried with a simple application that does something similar to the original one.

From the nsys report, ioctl appears to periodically block kernel execution after the kernel launch.
What is ioctl running periodically, and is there any way to stop it?

The reproducible source is packed in the attachment below.
source.zip (2.8 KB)

For compilation,

nvcc process1.c -o process1 -lcublas
nvcc process2.c -o process2 -lcublas

For run with MPS and nsys,

CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process1 ./process1 10000
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process2 ./process2 10000

Hi,

It looks like you are using MPS on Jetson.
Is there any dependency between processes?

Do you also meet the same issue without using MPS?
For example, two standalone GPU tasks?

Thanks.

Hi @AastaLLL

Is there any dependency between processes?

One process uses the outputs of the other one.
This is to simulate the inter-process communication in the original application.
MPS is used to reduce the overhead of GPU context switches when the two processes request GPU calculations.

Do you also meet the same issue without using MPS?

Yes, the same issue is observed without MPS too.

A similar issue is reported in the thread below as well:
When executing cuda API occasionally, ioctl blocks for tens of milliseconds - CUDA / CUDA NVCC Compiler - NVIDIA Developer Forums

Hi,

If the same issue also occurs without MPS, could you share the source without using it?
Thanks.

Hi @AastaLLL ,

I ran only one process without MPS.
Blocking by ioctl is still observed.
Could you help check?

Nsys report:
process_wo_mps.zip (776.3 KB)

Source:
process_wo_mps.zip (1.3 KB)

Command:
nvcc process_wo_mps.c -o process_wo_mps -lcublas
nsys profile -o ./process_wo_mps ./process_wo_mps 10000

Hi,

Thanks for sharing the resources.
We tried to reproduce this issue locally but hit a missing-file error.

Could you also share the source with us?

$ nvcc process_wo_mps.c -o process_wo_mps -lcublas -I /usr/local/cuda-12.6/include
process_wo_mps.c:9:10: fatal error: common.h: No such file or directory
    9 | #include "common.h"
      |          ^~~~~~~~~~
compilation terminated.

Thanks.

Hi @AastaLLL ,

Sorry for the inconvenience.
Could you fetch common.h from this package?
source.zip (2.8 KB)

Thanks,

Hi,

We double-checked the source and the profiling output shared on Apr 9.
Setting CUDA_DEVICE_MAX_CONNECTIONS does help.

The below experiment is done with the non-MPS sample.
Before increasing the max queue number, we do see the launch latency increase from ~12 µs to ~1 ms.

This matches the value of the nsys output you shared on Apr 9.

Once we set CUDA_DEVICE_MAX_CONNECTIONS to 32 (the maximum number), the periodic long latency disappears and the launch latency stays at around 11-13 µs.

We also validated the same with the MPS sample, and the launch latency is around 13x µs.

Attaching nsys output with default and 32 max queue numbers for your reference.
process_wo_mps_default.nsys-rep (1.3 MB)
process_wo_mps_queue32.nsys-rep (1.3 MB)

Thanks.

Hello @AastaLLL ,

Thank you very much for the result.

Just to be sure, were the commands executed as follows in your environment?

Without MPS:

export CUDA_DEVICE_MAX_CONNECTIONS=32
nvcc process_wo_mps.c -o process_wo_mps -lcublas
nsys profile -o ./process_wo_mps ./process_wo_mps 10000

With MPS:

export CUDA_DEVICE_MAX_CONNECTIONS=32
nvcc process1.c -o process1 -lcublas
nvcc process2.c -o process2 -lcublas
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process1 ./process1 10000
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log nsys profile -o ./nsys/process2 ./process2 10000

Is there any way to check the total number of queues submitted to the GPU?
It may be that other applications in my environment are submitting more queues than in your environment, which leads to the ioctl blocking even though the limit is raised with CUDA_DEVICE_MAX_CONNECTIONS=32.

Hi,

1.
Yes, but we ran the profiler with root privileges.

export CUDA_DEVICE_MAX_CONNECTIONS=32
sudo /opt/nvidia//nsight-systems/2024.5.4/bin/nsys profile -o ./process_wo_mps_no_max ./process_wo_mps 10000

2.
Unfortunately, this is controlled by the low-level GPU scheduler and no such info can be retrieved.
Usually, this behavior comes from many small kernels, such as those in DNNs.

But you can try whether a CUDA graph helps for your use case; the feature is designed to minimize launch latency.
A capture sketch is shown below.
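
For example, the per-iteration work can be captured once and replayed. The code below is only a sketch, assuming a single cublasSgemv() is launched each cycle; the parameters are illustrative and the 3-argument cudaGraphInstantiate() is the CUDA 12 signature.

/* Capture one iteration into a CUDA graph, then replay it each cycle. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void run_with_graph(cublasHandle_t handle,
                    const float *dA, const float *dx, float *dy, int iterations)
{
    const int m = 4, n = 2;
    const float alpha = 1.0f, beta = 0.0f;

    cudaStream_t    stream;
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    cudaStreamCreate(&stream);
    cublasSetStream(handle, stream);        /* make cuBLAS launch into the captured stream */

    /* Record one iteration of work into a graph. */
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, 0);

    for (int i = 0; i < iterations; ++i) {
        cudaGraphLaunch(graphExec, stream); /* one lightweight launch per cycle */
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}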

Thanks.

Hello @AastaLLL ,

Thank you for the link.
Using a tracing tool (strace), I found that Xorg and gnome-shell are calling the ioctl() system call, so they are likely the processes interrupting my application.
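
A command along these lines can be used for that kind of check (the exact invocation here is illustrative, not a copy of what I ran):

sudo strace -f -e trace=ioctl -p $(pidof Xorg) -o xorg_ioctl.log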

Setting up the Jetson in CUI (console-only) mode disables those applications, which use the GPU for the GUI, and this reduced the latency caused by ioctl().

Closing the item.

Command to set up CUI mode:

sudo systemctl set-default multi-user.target
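
To switch back to the desktop GUI later, the default target can be restored with:

sudo systemctl set-default graphical.target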

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.