| How to report a bug | 2 | 19557 | May 27, 2024 |
| cudaMemPrefetchAsync does not migrate managed memory back to host (device -> host) | 2 | 34 | February 1, 2026 |
| CUDA-Vulkan image interop broken on Windows | 0 | 14 | February 1, 2026 |
| BUG: workqueue lockup - pool cpus=7 stuck for 37589s | 4 | 56 | January 29, 2026 |
| cudaExecutionCtxGetDevResource VS cudaStreamGetDevResource difference? | 0 | 13 | January 29, 2026 |
| Can you post a PDF of "CUDA Techniques to Maximize Memory Bandwidth and Hide Latency"? | 1 | 30 | January 28, 2026 |
| Unresolved externals when using thrust | 4 | 34 | January 28, 2026 |
| Usage of CU_STREAM_NON_BLOCKING | 2 | 51 | January 27, 2026 |
| Is it expected on to see many NOPs in double precision code on Blackwell CC 12? | 7 | 85 | January 26, 2026 |
| How to create green context with both SM and work queue partition | 0 | 20 | January 24, 2026 |
| A simple CUDA thrust project not compiling | 3 | 50 | January 24, 2026 |
| The description of cuCtxFromGreenCtx | 1 | 40 | January 23, 2026 |
| Understanding the CTA-local Requirements of fence.proxy.async, as documented by "Mixed-proxy extensions for the NVIDIA PTX memory consistency model" | 0 | 26 | January 21, 2026 |
| A new GPU-accelerated prime sieve using constant-cost structural elimination to overcome memory bandwidth limits at massive scales | 5 | 104 | January 21, 2026 |
| CUDA Graph Conditional Nodes implementation | 2 | 29 | January 21, 2026 |
| Behavior of cuEventSynchronize() | 0 | 20 | January 21, 2026 |
| Performance impact when CU_EVENT_DISABLE_TIMING isn't specified | 0 | 18 | January 21, 2026 |
| Exploring what it means to embed CUDA directly into a high-level language runtime | 1 | 45 | January 20, 2026 |
| CUDA memory permission | 3 | 38 | January 20, 2026 |
| cudaErrorIllegalAddress Encountered: "CUDA error: an illegal memory access was encountered" | 1 | 166 | January 20, 2026 |
| Nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified | 1 | 30 | January 20, 2026 |
| Why aren't there explicit async proxy<->generic proxy fences in the cuda guide TMA prefetching example? | 4 | 56 | January 19, 2026 |
| At what point do model reload time and GPU memory pressure become more significant than compute in multi-model inference on DGX Spark? | 0 | 11 | January 19, 2026 |
| Error propagation between different thread using different context | 7 | 51 | January 19, 2026 |
| PCIe5 P2P GPU via NICs faster than PCIe switch? | 4 | 60 | January 19, 2026 |
| FP64 Performance - Power Limitation - H100 vs A100 | 13 | 206 | January 19, 2026 |
| Clarification on cudaMemcpy synchronization behavior with pageable memory and non-blocking streams | 6 | 93 | January 19, 2026 |
| I made 64 swarm agents compete to write gpu kernels | 3 | 54 | January 17, 2026 |
| No overlap between communication and computation across CUDA streams in PyTorch | 1 | 45 | January 15, 2026 |
| Why does CUDA Graph improve end-to-end performance more for Triton kernels than for custom CUDA kernels? | 0 | 35 | January 15, 2026 |