| cublasSgemmGroupedBatched requires host-side synchronization after preceding TRSM on A5000 (device-side ordering insufficient) | | 0 | 9 | April 17, 2026 |
| RTX Pro 6000 Blackwell Card Crash | | 4 | 81 | April 17, 2026 |
| Gemma 4 VLM VRAM/Host Memory Leak — Full Investigation Report | | 1 | 224 | April 10, 2026 |
| cuBLAS batched FP32 SGEMM dispatcher picks suboptimal kernel on RTX 5090 (sm_120) | | 0 | 27 | April 10, 2026 |
| CUDA 13.2 DGX Spark impact | | 8 | 1273 | March 29, 2026 |
| GB10 (SM12.1) vLLM FP8 inference — any progress on native SM12.1 kernels? | | 4 | 590 | March 27, 2026 |
| cuBLASDx batched gather gemm | | 2 | 33 | March 26, 2026 |
| PyTorch CUDA Incompatibility on NVIDIA Thor (L4T 38.4, CUDA 13) | | 2 | 69 | March 23, 2026 |
| Jetson AGX Thor: official PyTorch 25.08 container works for Conv2d and ResNet18, but pip-installed PyTorch 2.12.0.dev+cu128 fails with "no kernel im | | 2 | 71 | March 21, 2026 |
| Issues generating 64T64R testMAC vectors via cuMAC (thread-block limit & 32-bit integer overflow) | | 0 | 27 | March 17, 2026 |
| Verify AI performance by cutlass_profiler, but it was too slow, why? | | 2 | 34 | March 4, 2026 |
| Custom FP4 CUDA Kernel - 129 TFLOPS on DGX Spark with Pre-Quantized Weight Cache | | 4 | 476 | February 25, 2026 |
| OpenACC use_device / OpenMP use_device_ptr / use_device_addr in combination with cuBLAS | | 5 | 55 | February 19, 2026 |
| Anyone know if CUDA 12.6.2 is coming to JetPack? | | 1 | 39 | February 16, 2026 |
| PyTorch matmul vs cudaTensorCoreGemm on Jetson Orin NX | | 2 | 49 | February 12, 2026 |
| Which tool can accurately obtain kernel performance, ncu or nsys? | | 2 | 59 | March 30, 2026 |
| Results divergence between cuBLAS Sgemm and cuSPARSE BLOCKED-ELL SpMM | | 1 | 36 | February 9, 2026 |
| Performance Inquiry: Optimizing Qwen3-VL 2B Inference for 2 QPS Target on Orin Nano Super | | 4 | 223 | February 9, 2026 |
| cuBLASDx large matrix multiplication performance | | 3 | 75 | February 9, 2026 |
| Suboptimal PyTorch Performance on Jetson Orin Nano Super | | 3 | 87 | February 5, 2026 |
| Consistent "CUDA error: an illegal memory access was encountered" Error | | 8 | 916 | January 29, 2026 |
| cudaErrorIllegalAddress Encountered: "CUDA error: an illegal memory access was encountered" | | 2 | 728 | January 20, 2026 |
| Is cuBLASDx compatible with per-block global-pitch or stride values in a batched-gemm kernel? | | 3 | 35 | January 15, 2026 |
| NvRmMemInitNvmap failed / NVMAP permission denied when launching nvcr.io/nvidia/vllm:25.11-py3 container on Jetson Orin NX + JetPack 6.2 (L4T 36.4.3) | | 5 | 163 | January 21, 2026 |
| Nsys profile not showing any GPU data | | 2 | 116 | January 5, 2026 |
| Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s) | | 22 | 701 | January 21, 2026 |
| Support for per-multiplication m, n, k, lda, ldb, and ldc in batched gemm | | 0 | 29 | January 3, 2026 |
| Will cuBLAS support arbitrary (row-major) pitched memory for A, B, C matrices in future? | | 4 | 57 | January 16, 2026 |
| Does cuBLASLt batch mode for Pointer Arrays apply for scaling factors as well? | | 0 | 31 | January 2, 2026 |
| Understanding Tensor Pipe Throughput and Throttle Stalls | | 4 | 187 | January 29, 2026 |