| How to report a bug | 2 | 19898 | May 27, 2024 |
| Amber24 GPU run fails on RTX 5090 – “no kernel image is available for execution on the device” | 1 | 67 | May 15, 2026 |
| Pinned memory uploads not being asynchronous on RTX 5060 Ti | 5 | 22 | May 15, 2026 |
| Stream sync behaving like a device sync on first use of device API fns printf, cudaMalloc etc | 15 | 217 | May 14, 2026 |
| Native Time-Slicing vs vGPU latency due to context switching | 0 | 26 | May 14, 2026 |
| Jetson Orin Nano Super hard resets when WiFi drops under CUDA load | 3 | 45 | May 14, 2026 |
| Is there a disadvantage to compile against an architecture family rather than a single arch | 0 | 19 | May 13, 2026 |
| Can MPI_Scatter scatter from a pinned host pointer to GPU memory? | 0 | 18 | May 12, 2026 |
| Why is cuda Synchronize() taking so long even with batched GPU→CPU copies, and how can I profile what in the stream queue is causing the delay? | 5 | 84 | May 12, 2026 |
| Clarification on cooperative_groups::tiled_partition<64>::sync() behavior in a 128-thread block | 5 | 47 | May 11, 2026 |
| RTX 4070 (AD104) GSP firmware crash (Xid 120 @ pc:0x1a92c96) under sustained CUDA workload — Windows BSOD + Linux GPU reset | 0 | 46 | May 11, 2026 |
| Cycle reduction in chained SHA-256/RIPEMD-160 device function (Ada / sm_89) | 14 | 116 | May 9, 2026 |
| About green context in cuda13.2.1 | 8 | 129 | May 8, 2026 |
| About thrust in cuda 13.2 | 3 | 108 | May 8, 2026 |
| Squeezing the last 17.5% out of a compute-bound 256-bit modular arithmetic kernel (sm_89, 82.5% SM throughput) | 55 | 310 | May 8, 2026 |
| What is the inter-SM linkage of DSM(cluster)? | 9 | 776 | May 8, 2026 |
| Cuda graphs issue when updating kernel node dynamic shared memory size - Cooperative group synchronization | 10 | 171 | May 8, 2026 |
| Implement all supported matrix shapes for wmma::bmma_sync | 5 | 125 | May 6, 2026 |
| First batch after idle / between workloads is much slower (even with preloaded data) | 1 | 26 | May 5, 2026 |
| St.cg getting cached in L1 cache | 0 | 31 | May 4, 2026 |
| Difference between cudamallocmanaged and malloc/new | 3 | 385 | May 3, 2026 |
| Why is cuda Synchronize() taking so long even with batched GPU→CPU copies, and how can I profile what in the stream queue is causing the delay? | 0 | 26 | May 2, 2026 |
| Throughput degradation under sustained load in distributed AI workloads | 4 | 49 | April 30, 2026 |
| Detect highest supported PTX version | 11 | 1861 | April 29, 2026 |
| cudaDeviceProp.sharedMemPerBlockOptin returns incorrect value (0x100000001) on RTX 5090 (SM120) | 3 | 63 | April 29, 2026 |
| Does CUDA GPU Enforce W^X? | 1 | 47 | April 28, 2026 |
| Fortran derived type on device | 2 | 762 | April 27, 2026 |
| NCCL P2P hang on dual RTX PRO 6000 Blackwell Workstation Edition (WRX90E-SAGE SE) | 7 | 314 | May 10, 2026 |
| PRPLL NTT now supports both CUDA and OpenCL | 0 | 64 | April 26, 2026 |
| Maxing out dense FP16 FMA/FP32 accumulation (TFLOPS) on H200 GPU | 8 | 110 | April 24, 2026 |