Unable to run several CUDA samples.

I have CUDA 10.0.130 installed on Windows 10 with a GTX 1070 Ti. I ran a few of the sample tests and some of them failed. The results are not consistent: the same test can give a different result each time I rerun it. I have no idea what is wrong and would very much appreciate your help.


reduction.exe
reduction.exe Starting…

GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

Using Device 0: GeForce GTX 1070 Ti

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

reduction.cpp(264) : getLastCudaError() CUDA error : Kernel execution failed : (77) an illegal memory access was encountered.

********************************************** (With cuda-memcheck, execution takes very long to complete and the result is inconsistent; it sometimes fails.)
cuda-memcheck reduction.exe
========= CUDA-MEMCHECK
reduction.exe Starting…

GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

Using Device 0: GeForce GTX 1070 Ti

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

Reduction, Throughput = 0.0554 GB/s, Time = 1.21219 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

GPU result = 2139095040
CPU result = 2139095040

Test passed


simpleCUBLAS.exe
GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

simpleCUBLAS test running…
!!! device access error (read C)

********************************************** (2nd time)
simpleCUBLAS.exe
GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

simpleCUBLAS test running…
simpleCUBLAS test failed.

********************************************** (Adding cuda-memcheck)
cuda-memcheck simpleCUBLAS.exe
========= CUDA-MEMCHECK
GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

simpleCUBLAS test running…
!!! device access error (read C)
========= Invalid shared read of size 16
========= at 0x00000090 in sgemm_32x32x32_NN
========= by thread (0,0,0) in block (2,2,0)
========= Address 0x00004200 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x1b9ff9) [0x1c8735]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x1e37c) [0x45de8c]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x21334) [0x460e44]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasZtrttp + 0x80d4b) [0x39b51b]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x6434) [0x445f44]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasZhpr2_v2 + 0x5672) [0x206862]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasSgemm_v2 + 0x5dd) [0x20788d]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (main + 0x4bd) [0xf26dd]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (invoke_main + 0x34) [0xf37e4]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (__scrt_common_main_seh + 0x127) [0xf3687]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (__scrt_common_main + 0xe) [0xf354e]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (mainCRTStartup + 0x9) [0xf3809]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x181f4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a251]

========= Invalid shared read of size 0
========= at 0x00000060 in sgemm_32x32x32_NN
========= by thread (113,0,0) in block (1,0,0)
========= Address 0x00004200 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x1b9ff9) [0x1c8735]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x1e37c) [0x45de8c]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x21334) [0x460e44]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasZtrttp + 0x80d4b) [0x39b51b]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x6434) [0x445f44]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasZhpr2_v2 + 0x5672) [0x206862]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasSgemm_v2 + 0x5dd) [0x20788d]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (main + 0x4bd) [0xf26dd]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (invoke_main + 0x34) [0xf37e4]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (__scrt_common_main_seh + 0x127) [0xf3687]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (__scrt_common_main + 0xe) [0xf354e]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (mainCRTStartup + 0x9) [0xf3809]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x181f4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a251]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x2fa12f) [0x30886b]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x217a0) [0x4612b0]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGetVector + 0x224) [0x103444]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (main + 0x558) [0xf2778]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (invoke_main + 0x34) [0xf37e4]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (__scrt_common_main_seh + 0x127) [0xf3687]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (__scrt_common_main + 0xe) [0xf354e]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\simpleCUBLAS.exe (mainCRTStartup + 0x9) [0xf3809]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x181f4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a251]

========= ERROR SUMMARY: 3 errors


matrixMul.exe
[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
CUDA error at D:/Downloads/VatesSetup/cuda-samples-master/Samples/matrixMul/matrixMul.cu:226 code=77(cudaErrorIllegalAddress) “cudaEventSynchronize(stop)”

********************************************** (Adding cuda-memcheck)

cuda-memcheck matrixMul.exe
========= CUDA-MEMCHECK
[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “GeForce GTX 1070 Ti” with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
CUDA error at D:/Downloads/VatesSetup/cuda-samples-master/Samples/matrixMul/matrixMul.cu:226 code=4(cudaErrorLaunchFailure) “cudaEventSynchronize(stop)”
========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaEventSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D9UnmapVertexBuffer + 0x2e2c85) [0x2f105b]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (cudaEventSynchronize + 0x103) [0xf843]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (MatrixMultiply + 0x638) [0x65798]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (main + 0x283) [0x65e53]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (invoke_main + 0x34) [0x6a374]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (__scrt_common_main_seh + 0x127) [0x6a237]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (__scrt_common_main + 0xe) [0x6a0fe]
========= Host Frame:D:\Downloads\VatesSetup\cuda-samples-master\bin\win64\Debug\matrixMul.exe (mainCRTStartup + 0x9) [0x6a399]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x181f4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a251]

========= ERROR SUMMARY: 1 error


I have also increased the WDDM TDR Delay to 10, as suggested in some posts. This is the output of nvidia-smi.exe:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 417.35       Driver Version: 417.35       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107…   WDDM  | 00000000:01:00.0  On |                  N/A |
| 41%   40C    P8    13W / 180W |    281MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I think it may still be a WDDM TDR issue. cuda-memcheck slows down kernel execution, so kernels are more likely to run into the timeout.
I understand you say you’ve increased it, but my best guess is that either you did not do it correctly or you need to increase it more.

Here is what I did to change the WDDM TDR:

  1. Open Nsight Options as administrator
  2. Select General
  3. Under Microsoft Display Driver, set WDDM TDR Delay to 10 and WDDM TDR Enabled to True
  4. Restart the PC
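
One way to confirm that the steps above actually took effect is to read the TDR values back from the registry after the restart. The sketch below is only an illustration. It assumes the Nsight options write to the standard Windows TDR values (TdrDelay, TdrDdiDelay, TdrLevel) under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, which this thread does not confirm. The function name read_tdr_settings is hypothetical.

```python
import sys

# Registry location Microsoft documents for TDR settings.
# Assumption: Nsight's "WDDM TDR Delay" / "WDDM TDR Enabled" options map to these values.
GRAPHICS_DRIVERS_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_VALUE_NAMES = ("TdrDelay", "TdrDdiDelay", "TdrLevel")


def read_tdr_settings():
    """Return a dict of TDR registry values, or None when not on Windows.

    A value that is absent from the registry maps to None, meaning
    Windows uses its built-in default for that setting.
    """
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    settings = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, GRAPHICS_DRIVERS_KEY) as key:
        for name in TDR_VALUE_NAMES:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                value = None  # value not set; Windows default applies
            settings[name] = value
    return settings


if __name__ == "__main__":
    print(read_tdr_settings())
```

If TdrDelay comes back as None or as a small number after the restart, the change probably was not applied. The delay is expressed in seconds.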

The CUDA samples were downloaded from GitHub and compiled on the GTX 1070 Ti machine. The same tests passed on another PC running Windows 10 with a GTX 1070, driver version 417.35, and CUDA 10.0.