CUDA Kernels Not Executing on GPU - Jetson AGX Orin with JetPack 6

CALL FOR HELP

My CUDA application was working perfectly on cloud-based H100 GPUs, but after migrating to a Jetson AGX Orin with JetPack 6, the GPU kernels appear to be running on CPU cores instead of the GPU. Simple CUDA test programs (matrix multiplication) work correctly on the same hardware.

System Information

  • Hardware: Jetson AGX Orin
  • OS: JetPack 6 (flashed for C++20 support)
  • CUDA Version: 12.2
  • Driver: 540.3.0

Build Configuration

  • CMake with -DUSE_GPU=ON
  • CUDA compute capability: sm_87 (architecture 87)
  • Mixed C++20/CUDA C++17 compilation
  • NVCC flags: --std=c++17 -arch=sm_87

Symptoms

  1. Simple CUDA test: Matrix multiplication kernels execute on GPU correctly (verified with nsys profiler)
  2. My application: Kernels appear to launch without errors but nsys shows no CUDA trace data
  3. Profiler output: nsys profile shows only CPU activity, reports “does not contain CUDA trace data”
  4. No error messages: All CUDA API calls return success, kernels launch without reported errors
  5. Full profiler log:

```
Generating '/tmp/nsys-report-4783.qdstrm'
[1/8] [========================100%] report4.nsys-rep
[2/8] [========================100%] report4.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ------
    100.0           29,024          1   29,024.0   29,024.0    29,024    29,024          0.0  fwrite

[5/8] Executing 'cuda_api_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain CUDA trace data.
[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain CUDA kernel data.
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain GPU memory data.
Generated:
```
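For reference, this is a sketch of how I invoke the profiler, requesting CUDA tracing explicitly rather than relying on defaults (`./my_app` is a placeholder for the actual binary name):

```shell
# Explicitly enable CUDA and NVTX tracing; --output names the report file.
nsys profile --trace=cuda,nvtx --output=report_gpu ./my_app

# Summarize just the CUDA API and kernel activity from the report:
nsys stats --report cuda_api_sum,cuda_gpu_kern_sum report_gpu.nsys-rep
```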

Code Structure

  • Template-based CUDA kernels with explicit instantiation
  • Singleton GPU manager for device memory management
  • Proper error checking after kernel launches
  • Memory transfers and kernel launches appear successful
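For context, my error checking follows the usual two-step pattern: kernel launches are asynchronous, so a check that only inspects the launch call can report success even when the kernel never actually runs. A minimal sketch of what I do after each launch (macro name and dummy kernel are illustrative; requires the CUDA toolkit to build):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative helper: report both launch-time and execution-time errors.
// A missing device binary for sm_87 (cudaErrorNoKernelImageForDevice) or a
// bad launch configuration only surfaces through these checks.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err));                    \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

__global__ void dummy_kernel() {}

int main() {
    dummy_kernel<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());        // launch-time error
    CUDA_CHECK(cudaDeviceSynchronize());   // execution-time (async) error
    std::puts("kernel executed");
    return 0;
}
```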

What I've Tried

  • Verified CUDA installation with working test programs
  • Checked compute capability matches hardware (8.7)
  • Added comprehensive error checking - no CUDA errors reported
  • Confirmed proper linking and compilation flags
  • Verified GPU memory allocation and transfers work correctly

Question

What could cause CUDA kernels to silently not execute on GPU while appearing to launch successfully, especially when transitioning from cloud GPU to Jetson hardware? Are there Jetson-specific configuration issues or runtime differences I should investigate?

Any diagnostic approaches or common pitfalls with JetPack 6 and template-based CUDA code would be appreciated.

I'm not able to answer what is required to solve this, but most CUDA programs are designed to work with a discrete GPU on the PCI bus (a dGPU). Jetsons instead have an integrated GPU wired directly to the memory controller (an iGPU). Software usually detects a dGPU via nvidia-smi, and the iGPU tends not to show up as available to code that was set up for a dGPU. There is a minimal nvidia-smi program on the Jetson, but it might not be sufficient. Check whether you have that minimal nvidia-smi via "which nvidia-smi" (or just run the command and look at what it puts out).
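One quick way to confirm the runtime actually sees the integrated GPU is a small device query; on Jetson-class iGPUs `cudaDeviceProp::integrated` is nonzero, and an AGX Orin should report compute capability 8.7. A minimal sketch (requires the CUDA toolkit to build):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        std::printf("No CUDA device visible: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On an AGX Orin this should print sm_87 and integrated: 1
    // (an iGPU sharing system memory with the CPU).
    std::printf("device: %s, sm_%d%d, integrated: %d\n",
                prop.name, prop.major, prop.minor, prop.integrated);
    return 0;
}
```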

Also, installing CUDA libraries or drivers intended for a dGPU would break the iGPU.

Someone else would need to answer any specific questions on how to transition from dGPU to iGPU.

Hi,

Do you get the expected result when running on the Orin?
Or is the kernel somehow being skipped on the Jetson?

If the latter, please check whether you have added the corresponding synchronization call.
Thanks.

I wondered if cuda-samples might have some pointers, but the current version uses cuda-12.9.

If you want a newer CUDA and Nsight Systems, you could install JetPack 6.2 or the recently released 6.2.1.

Here are the cuda-12.9 nvcc docs, and there's also:

`nvcc --cuda --help`

The kernels are not being skipped, because the output is being computed. But when I profile, I see the kernels are not being run on the GPU, but maybe on another CPU core? I'm not sure I understand this behaviour. And yes, I have the right synchronization call.

Hi,

If the output is correct and your implementation is CUDA code, it must be running on the GPU. (The CPU cannot run a CUDA implementation.)

Or do you use a high-level API that contains both a C++ and a CUDA implementation?

How about the performance?
Depending on the setting, some kernels might run slower on the Jetson.
In such a case, the GPU utilization can be low due to the long idle time (for example, a memory-bound task).
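As a quick cross-check outside the profiler, `tegrastats` ships with JetPack and reports GPU (GR3D) utilization, so you can watch whether the device is doing any work at all while the application runs:

```shell
# Print stats once per second (interval is in milliseconds); run this in a
# second terminal while the application executes and watch the GR3D_FREQ field.
sudo tegrastats --interval 1000
```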

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.