| Topic | Replies | Views | Activity |
|---|---|---|---|
| Architectural insights needed: Why is the MIG 3g.71gb instance consistently the "Efficiency Sweet Spot" on H200? | 5 | 212 | February 18, 2026 |
| Why does SW128 / Swizzle<3,4,3> produce identical bank patterns across all 8 rows? | 3 | 69 | February 17, 2026 |
| Grace Hopper CPU-GPU bandwidth with MIG | 5 | 487 | February 17, 2026 |
| SMs busy vs achieved occupancy | 4 | 53 | March 2, 2026 |
| How to achieve the functionality of `stmatrix` on devices below SM90 while avoiding issues like non-coalesced memory access? | 1 | 91 | February 12, 2026 |
| Pytorch matmul vs cudaTensorCoreGemm on Jetson Orin NX | 2 | 42 | February 12, 2026 |
| Is it expected on to see many NOPs in double precision code on Blackwell CC 12? | 16 | 175 | February 12, 2026 |
| cudaMemcpyBatchAsync | 3 | 74 | February 11, 2026 |
| Unstable CUDA timing on Jetson AGX Orin compared to Windows GPU | 3 | 58 | February 11, 2026 |
| Assessing the Impact of High Launch Latency in CUDA Applications | 14 | 94 | February 10, 2026 |
| cudaMemcpyAsync (P2P D2D) serializes with kernel execution | 1 | 52 | February 8, 2026 |
| The flag -gencode is not recognized | 4 | 39 | February 7, 2026 |
| Distributed Shared Memory | 0 | 30 | February 7, 2026 |
| Single-Bit Corruption Detected by Device-Side Compare in Trivial Global Copy Kernel on RTX 3060 Ti (memcheck/racecheck clean) | 6 | 41 | February 20, 2026 |
| Sequential SM Resource Splitting with CUDA Green Contexts | 0 | 32 | February 6, 2026 |
| Clarification: bank_conflicts metric vs wavefronts for shared memory LDS.128 | 1 | 33 | February 6, 2026 |
| LSU Wavefront Scheduling and Shared Memory Bank Utilization on Blackwell | 7 | 104 | February 20, 2026 |
| CUDA - Make a specific memory access skip the cache | 2 | 65 | February 4, 2026 |
| Understanding warp scheduling on a Streaming multiprocessor | 3 | 96 | February 4, 2026 |
| Disable Logging of CUDA APIs | 0 | 39 | February 3, 2026 |
| CUDA-Vulkan image interop broken on Windows | 3 | 144 | February 2, 2026 |
| Clarification on legacy CUDA Toolkit EOL/EOS policy? | 1 | 92 | February 2, 2026 |
| CUDA Programming Guide v13.1: Missing kernel argument in 2.1.5 “Explicit Memory Management” example | 2 | 38 | February 2, 2026 |
| Getting started with parallel programming Suggested reading | 9 | 29623 | February 2, 2026 |
| cudaMemPrefetchAsync does not migrate managed memory back to host (device -> host) | 3 | 70 | February 1, 2026 |
| BUG: workqueue lockup - pool cpus=7 stuck for 37589s | 4 | 85 | January 29, 2026 |
| cudaExecutionCtxGetDevResource VS cudaStreamGetDevResource difference? | 0 | 17 | January 29, 2026 |
| Can you post a PDF of "CUDA Techniques to Maximize Memory Bandwidth and Hide Latency"? | 1 | 74 | January 28, 2026 |
| Unresolved externals when using thrust | 4 | 43 | January 28, 2026 |
| Usage of CU_STREAM_NON_BLOCKING | 2 | 83 | January 27, 2026 |