CALL FOR HELP
My CUDA application worked perfectly on cloud-based H100 GPUs, but after migrating to a Jetson AGX Orin with JetPack 6, its kernels no longer appear to execute on the GPU. Simple CUDA test programs (matrix multiplication) work correctly on the same hardware.
System Information
- Hardware: Jetson AGX Orin
- OS: JetPack 6 (flashed for C++20 support)
- CUDA Version: 12.2
- Driver: 540.3.0
Build Configuration
- CMake with `-DUSE_GPU=ON`
- CUDA compute capability: `sm_87` (architecture 8.7)
- Mixed compilation: C++20 for host code, C++17 for CUDA code
- NVCC flags: `--std=c++17 -arch=sm_87`
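For context, my CMake setup looks roughly like this (a simplified sketch; the target name `myapp` and the source file names are placeholders, not my real project layout):

```cmake
option(USE_GPU "Build with CUDA support" ON)

if(USE_GPU)
  enable_language(CUDA)
  add_executable(myapp main.cpp kernels.cu)
  set_target_properties(myapp PROPERTIES
    CXX_STANDARD 20                 # host code is C++20
    CUDA_STANDARD 17                # device code is C++17
    CUDA_ARCHITECTURES 87           # Jetson AGX Orin (compute capability 8.7)
    CUDA_SEPARABLE_COMPILATION ON)  # kernels span translation units
endif()
```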
Symptoms
- Simple CUDA test: Matrix multiplication kernels execute on GPU correctly (verified with nsys profiler)
- My application: Kernels appear to launch without errors but nsys shows no CUDA trace data
- Profiler output: `nsys profile` shows only CPU activity and reports "does not contain CUDA trace data"
- No error messages: all CUDA API calls return success, and kernels launch without reported errors
- Full `nsys` report:

```
Generating '/tmp/nsys-report-4783.qdstrm'
[1/8] [========================100%] report4.nsys-rep
[2/8] [========================100%] report4.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
 100.0     29,024           1          29,024.0   29,024.0   29,024    29,024    0.0          fwrite
[5/8] Executing 'cuda_api_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain CUDA trace data.
[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain CUDA kernel data.
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /home/build_gpu/report4.sqlite does not contain GPU memory data.
Generated:
```
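In case the profiler invocation itself matters: I have been running `nsys` roughly as below, with the CUDA trace requested explicitly (`./myapp` is a placeholder for my real binary):

```shell
# Explicitly request CUDA, NVTX, and OS runtime tracing
nsys profile --trace=cuda,nvtx,osrt --output=report5 ./myapp

# Re-run the stats reports against the new capture
nsys stats report5.nsys-rep
```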
Code Structure
- Template-based CUDA kernels with explicit instantiation
- Singleton GPU manager for device memory management
- Proper error checking after kernel launches
- Memory transfers and kernel launches appear successful
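To make the structure above concrete, here is a minimal sketch of the pattern I mean. The kernel, names, and macro are illustrative, not my real code; the error-checking part shows that I check both the launch and a subsequent `cudaDeviceSynchronize()`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative template kernel; the real kernels are more complex.
template <typename T>
__global__ void scaleKernel(T* data, T factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Explicit instantiation so the definition can live in a .cu file
// separate from the code that launches it.
template __global__ void scaleKernel<float>(float*, float, int);

// Error checking after every CUDA call / kernel launch.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
        }                                                              \
    } while (0)

void runScale(float* d_data, int n) {
    scaleKernel<float><<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors
}
```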
What I’ve Tried
- Verified CUDA installation with working test programs
- Checked compute capability matches hardware (8.7)
- Added comprehensive error checking - no CUDA errors reported
- Confirmed proper linking and compilation flags
- Verified GPU memory allocation and transfers work correctly
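For the linking/flags check specifically, I inspected the binary to confirm it actually embeds device code for `sm_87`, roughly like this (`./myapp` again stands in for the real binary):

```shell
# List the SASS (compiled device code) embedded in the binary;
# entries mentioning sm_87 should appear for the Orin to run them natively.
cuobjdump --list-elf ./myapp

# List any embedded PTX, which could JIT-compile as a fallback
cuobjdump --list-ptx ./myapp
```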
Question
What could cause CUDA kernels to silently not execute on GPU while appearing to launch successfully, especially when transitioning from cloud GPU to Jetson hardware? Are there Jetson-specific configuration issues or runtime differences I should investigate?
Any diagnostic approaches or common pitfalls with JetPack 6 and template-based CUDA code would be appreciated.