|
Verify AI performance with cutlass_profiler, but it was too slow. Why?
|
|
1
|
18
|
March 4, 2026
|
|
Custom FP4 CUDA Kernel - 129 TFLOPS on DGX Spark with Pre-Quantized Weight Cache
|
|
4
|
244
|
February 25, 2026
|
|
OpenACC use_device / OpenMP use_device_ptr / use_device_addr in combination with cuBLAS
|
|
5
|
44
|
February 19, 2026
|
|
Anyone know if CUDA 12.6.2 is coming to JetPack?
|
|
1
|
29
|
February 16, 2026
|
|
PyTorch matmul vs cudaTensorCoreGemm on Jetson Orin NX
|
|
2
|
41
|
February 12, 2026
|
|
Which tool can accurately obtain kernel performance, ncu or nsys?
|
|
1
|
29
|
February 9, 2026
|
|
Results divergence between cuBLAS Sgemm and cuSPARSE BLOCKED-ELL SpMM
|
|
1
|
29
|
February 9, 2026
|
|
Performance Inquiry: Optimizing Qwen3-VL 2B Inference for 2 QPS Target on Orin Nano Super
|
|
3
|
140
|
February 9, 2026
|
|
cuBLASDx large matrix multiplication performance
|
|
3
|
48
|
February 9, 2026
|
|
Suboptimal PyTorch Performance on Jetson Orin Nano Super
|
|
2
|
65
|
February 5, 2026
|
|
Consistent "CUDA error: an illegal memory access was encountered" Error
|
|
8
|
554
|
January 29, 2026
|
|
cudaErrorIllegalAddress Encountered: "CUDA error: an illegal memory access was encountered"
|
|
2
|
425
|
January 20, 2026
|
|
Is cuBLASDx compatible with per-block global-pitch or stride values in a batched-gemm kernel?
|
|
3
|
24
|
January 15, 2026
|
|
NvRmMemInitNvmap failed / NVMAP permission denied when launching nvcr.io/nvidia/vllm:25.11-py3 container on Jetson Orin NX + JetPack 6.2 (L4T 36.4.3)
|
|
5
|
139
|
January 21, 2026
|
|
Nsys profile not showing any GPU data
|
|
2
|
101
|
January 5, 2026
|
|
Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s)
|
|
22
|
610
|
January 21, 2026
|
|
Support for per-multiplication m, n, k, lda, ldb, and ldc in batched gemm
|
|
0
|
21
|
January 3, 2026
|
|
Will cuBLAS support arbitrary (row-major) pitched memory for A, B, C matrices in the future?
|
|
4
|
50
|
January 16, 2026
|
|
Does cuBLASLt batch mode for pointer arrays apply to scaling factors as well?
|
|
0
|
24
|
January 2, 2026
|
|
Understanding Tensor Pipe Throughput and Throttle Stalls
|
|
4
|
127
|
January 29, 2026
|
|
Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS
|
|
2
|
63
|
December 17, 2025
|
|
WMMA vs WGMMA on H100 GPU
|
|
5
|
169
|
December 15, 2025
|
|
Conditions on NVJet kernels on Jetson Thor
|
|
14
|
311
|
December 30, 2025
|
|
Running a repo using JAX on Jetson Orin AGX 64GB GPU
|
|
2
|
124
|
December 31, 2025
|
|
Example code of Outer Vector Scaling for FP8 data types
|
|
0
|
33
|
December 1, 2025
|
|
Pointer alignment requirements for the cublasGemmBatchedEx API
|
|
1
|
54
|
November 26, 2025
|
|
Accessing kernel call stack
|
|
10
|
141
|
December 9, 2025
|
|
cuSPARSELt: Strict Output Layout Constraints for Optimal Performance in Sparse-Dense GEMM
|
|
2
|
90
|
November 21, 2025
|
|
CMake Linking Issues
|
|
1
|
129
|
November 20, 2025
|
|
cuBLAS failing on JetPack 6.2 + dGPU
|
|
6
|
142
|
November 19, 2025
|