How to report a bug

To report bugs to NVIDIA, you will first need to register with our developer program here. Registering enables you to file bugs and get feedback on them at the following link.

Please be prepared to provide the following details:

  • Summary
  • Relevant Area
  • Description
  • NVIDIA GPU or System
  • NVIDIA Software Version
  • OS
  • Other Details
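
One way to make sure none of the details above are forgotten is to assemble them into a plain-text block before filing. The following is only a minimal sketch: the helper name `build_bug_report` is hypothetical and not part of any NVIDIA tool, the field names are copied from the list above, and the OS line is filled from Python's standard `platform` module when you do not supply it. Treating every field except "Other Details" as required is an assumption of this sketch, not a stated rule of the bug tracker.

```python
import platform

# Field names copied from the checklist above, in the same order.
FIELDS = [
    "Summary",
    "Relevant Area",
    "Description",
    "NVIDIA GPU or System",
    "NVIDIA Software Version",
    "OS",
    "Other Details",
]


def build_bug_report(details):
    """Hypothetical helper: format the checklist fields as plain text.

    `details` maps field names to strings. "OS" defaults to the value
    reported by platform.platform(); "Other Details" may be left empty.
    """
    filled = dict(details)
    filled.setdefault("OS", platform.platform())

    # Assumption: everything except "Other Details" is required.
    missing = [
        name for name in FIELDS
        if name != "Other Details" and not filled.get(name)
    ]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))

    return "\n".join(f"{name}: {filled.get(name, '')}" for name in FIELDS)
```

Calling `build_bug_report({...})` with the fields filled in returns text you can paste into the bug form; a `ValueError` lists any required field that is still empty.
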
Calling cuSparse library on Tesla A100 with CUDA11.1 is much slower than that on Tesla P100 with CUDA9.0
NPP - functions that perform an operation where a constant is on the device?
CUDA Toolkit 11.3 could not find Visual Studio 2019 Community
Got out of memory from cudaMemcpy
Dynamic SM with Dynamic Parallelism
Cuda memory pool performance issue
RGB to YUV conversion Color conversion
Can we specify a CUDA core dump location?
Bug in cudaMemsetAsync or in Nsight VS Edition when visualizing cudaMemsetAsync execution
Where is the ptxas documentation?
Creating a compressed texture object
Cannot use Stream Ordered Async Memory Allocator with CUDA MPS
We are going to Abandon Cuda without Mingw support on windows
Malloc in Kernel Complexity (10.2)
Bug Report for nppiNV32ToBGR_8u_P2C4R_Ctx and nppiNV21ToRGB_8u_P2C4R_Ctx
Does CUDA/NVCC support precompiled headers?
OMP offloading crash with nvc
Libdevice functions causing PTXAS segfault
Cusparse cholesky & structural zeros - preconditioned conjugate gradient
Sharing Cuda contexts among Linux processes
Debug segfault in libnvvm
Performance varies greatly with different nvcc compilers
cudaExternalMemoryGetMappedMipmappedArray for ID3D11Texture3D fails in most cases
Inconsistent performance on the A100
Order of registers in MMA calls
Local memory layout and 32-bit words
Ubuntu 20.04, GCC 9.3, Cuda Toolkit 11.3 - not a supported combination?
Impact of cudaMalloc() on CPU LLC
Is cudaMemcpyDeviceToDevice between a WDDM device and a TCC device possible?
VPI Gaussian Blur Max Kernel Size Limitation?
Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ kernel problem or driver issue?
CUDA sample bicubicTexture not working
cuModuleLoadData Segment Fault Using cuda 11.4, Driver 470.57.02
Why is this CUDA kernel repeating indices with a 3D grid?
Peaks and slow performance with cudaDeviceSynchronize
Best practices for cudaDeviceScheduleBlockingSync usage pattern on Linux
Consuming a populated JIT cache with read-only permissions
Different output of code when not unrolling loop
Cuda slow performance after process sleep/wait
Single cudaMemcpy across multiple allocations
cuBLAS GEMM INT8 is much slower than FP16 in T4
Issue with cooperative_groups::memcpy_async
cudaOccupancyMaxActiveBlocks returns the blocks by taking into account other co-running kernels?
Nppi resize doesn't work with 1x1px
Are there any branch non-divergence hints for the compiler?
What is the stream-ordered equivalent of cudaMallocPitch?
NPPI Label MarkersUF Return Incorrect results in Cuda 11.4
cudaMemset in 11.4: what causes it to give cudaErrorInvalidValue?
cufftPlan creation deadly slow on CUDA 11+
Get function/global name from pointer using CUDA Driver API
nvmlDeviceGetMigDeviceHandleByIndex return wrong MIG devices when some MIG devices deleted
C++ 17 variadic template folding expressions in device methods
Calling NPP helper with large image gives kernel execution error
All CUDA-capable devices busy or unavailable
Suq.*.b32 other than suq.width.b32 and suq.height.b32 causes cudaError 801/500
Reducing binary size while using accelerated libs
cudaArray, used size and layout
Question about getting libcuda debug symbols
Compute Capability support in desktop NVIDIA RTX A2000
cusolverDnSgetrf() fails on A100 (but not on A10) when called in a tight loop
__nanosleep not working as expected
Cannot peek at last error after a call to a dlsym()-ed function
Significant speedup of OpenCL vs CUDA
Speed difference between different driver versions
Task Manager GPU usage disabled: Windows Server 2019, Tesla V100
Feature Request: Host and Heap allocated memory transfers
NVJPEG issues and inconsistencies with transcoding
Very poor performance with NPP CrossCorrValid
Theoretical TFLOPS for FP16, BF16 and TF32 for tensor and non-tensor
Cublas Bug
Constexpr Partial Function Specialization
Only static version of nvptxcompiler installed with CUDA 11.6 on Linux?
A100 Hardware NVJPEG Batch Decoding takes ~5ms before decoding and why
cuBLAS gemv incx != 0 restriction
cufftCreateAsync / cufftDestroyAsync
Register spilling
Newer Drivers fail when allocating Memory Chunks of 2MB + 1 byte on multiple devices
CUFFT_INTERNAL_ERROR during creation of a 1D Plan in CUFFT
Ampere 16x8x256 BMMA
Macros disabling half-precision functionality - how to use them?
Can I call cuModuleLoadData in a Non-blocking way?
Some inconsistencies in the CUDA documentation?
cuSolver handle GPU memory use