| How to set feature for 64 Bits from 32 Bits | 1 | 45 | February 26, 2026 |
| Is there documentation about CUTLASS showing which instructions are used to store a tile to global memory with TMA on SM90a (in C++)? | 0 | 39 | February 26, 2026 |
| Performance state switches from P0 to P2 when starting program | 17 | 14171 | February 26, 2026 |
| Custom FP4 CUDA Kernel - 129 TFLOPS on DGX Spark with Pre-Quantized Weight Cache | 4 | 458 | February 25, 2026 |
| Registering a NUMA-allocated host buffer with the GPU DMA engine for peak transfer performance in OpenCL | 2 | 66 | February 25, 2026 |
| Implement all supported matrix shapes for wmma::bmma_sync | 2 | 58 | February 24, 2026 |
| BUG: CUDA Programming Guide memcpy_async pipeline example is incorrect | 1 | 49 | February 24, 2026 |
| Using constant memory in template kernels, undefined behaviour | 4 | 54 | February 22, 2026 |
| Dynamic shared memory for more than one array | 5 | 59 | March 7, 2026 |
| 'setmaxnreg' ignored; unable to determine register count at entry | 3 | 53 | March 6, 2026 |
| Visual Studio not compiling a kernel call | 0 | 33 | February 18, 2026 |
| Does SMEM swizzle mode affect tcgen05.mma throughput? (SM100, fp8 SS MMA) | 4 | 98 | March 4, 2026 |
| cudaGraph_t and multiple devices | 2 | 49 | February 18, 2026 |
| L2 cache line misaligned impact on memory-bound kernels | 2 | 57 | February 18, 2026 |
| Architectural insights needed: Why is the MIG 3g.71gb instance consistently the "Efficiency Sweet Spot" on H200? | 5 | 237 | February 18, 2026 |
| Why does SW128 / Swizzle<3,4,3> produce identical bank patterns across all 8 rows? | 3 | 107 | February 17, 2026 |
| Grace Hopper CPU-GPU bandwidth with MIG | 5 | 502 | February 17, 2026 |
| SMs busy vs achieved occupancy | 4 | 65 | March 2, 2026 |
| How to achieve the functionality of `stmatrix` on devices below SM90 while avoiding issues like non-coalesced memory access? | 1 | 99 | February 12, 2026 |
| PyTorch matmul vs cudaTensorCoreGemm on Jetson Orin NX | 2 | 48 | February 12, 2026 |
| Is it expected to see many NOPs in double precision code on Blackwell CC 12? | 16 | 204 | February 12, 2026 |
| cudaMemcpyBatchAsync | 3 | 140 | February 11, 2026 |
| Unstable CUDA timing on Jetson AGX Orin compared to Windows GPU | 3 | 66 | February 11, 2026 |
| Assessing the Impact of High Launch Latency in CUDA Applications | 14 | 135 | February 10, 2026 |
| cudaMemcpyAsync (P2P D2D) serializes with kernel execution | 1 | 66 | February 8, 2026 |
| The flag -gencode is not recognized | 4 | 57 | February 7, 2026 |
| Distributed Shared Memory | 0 | 39 | February 7, 2026 |
| Single-Bit Corruption Detected by Device-Side Compare in Trivial Global Copy Kernel on RTX 3060 Ti (memcheck/racecheck clean) | 6 | 53 | February 20, 2026 |
| Sequential SM Resource Splitting with CUDA Green Contexts | 0 | 41 | February 6, 2026 |
| Clarification: bank_conflicts metric vs wavefronts for shared memory LDS.128 | 1 | 49 | February 6, 2026 |