Hello Yanxu, thanks for responding. I’ve just noticed I didn’t mention in my original post that I’ve actually already run cuda-memcheck.
- The output of cuda-memcheck is not fully reproducible: I have seen three different outcomes under exactly the same circumstances. The output is pretty long, so I’ve shortened it a bit:
1.1:
cudnnGetVersion() : 7104 , CUDNN_VERSION from cudnn.h : 7104 (7.1.4)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28 Capabilities 6.1, SmClock 1620.0 Mhz, MemSize (Mb) 11177, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:558
Aborting...
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
1.2:
cudnnGetVersion() : 7104 , CUDNN_VERSION from cudnn.h : 7104 (7.1.4)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28 Capabilities 6.1, SmClock 1620.0 Mhz, MemSize (Mb) 11177, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Testing single precision
Loading image data/one_28x28.pgm
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 8
========= at 0x00000098 in void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
========= by thread (7,5,0) in block (0,0,99)
========= Address 0x7fe6cf645518 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134d952]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134db47]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x137c8d5]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99abc]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99b99]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9acfc]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9a6cb]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe7345b]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe6abce]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac2be]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac948]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb210c]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb3921]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x780fa3]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x842c7]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x846e6]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2cc) [0x854ec]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x89368]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x8e993]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnFindConvolutionForwardAlgorithm + 0x248) [0x7fa78]
========= Host Frame:mnistCUDNN [0x189bb]
========= Host Frame:mnistCUDNN [0x10d67]
========= Host Frame:mnistCUDNN [0xe23b]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
========= Host Frame:mnistCUDNN [0x74d9]
=========
### six more errors like the one above omitted ###
=========
========= Invalid __global__ read of size 8
========= at 0x00000098 in void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
========= by thread (1,5,0) in block (0,0,15)
========= Address 0x7fe6cf645278 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134d952]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134db47]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x137c8d5]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99abc]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99b99]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9acfc]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9a6cb]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe7345b]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe6abce]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac2be]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac948]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb210c]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb3921]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x780fa3]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x842c7]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x846e6]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2cc) [0x854ec]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x89368]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x8e993]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnFindConvolutionForwardAlgorithm + 0x248) [0x7fa78]
========= Host Frame:mnistCUDNN [0x189bb]
========= Host Frame:mnistCUDNN [0x10d67]
========= Host Frame:mnistCUDNN [0xe23b]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:558
Aborting...
========= Host Frame:mnistCUDNN [0x74d9]
=========
========= ERROR SUMMARY: 8 errors
1.3: as above, but with 4 errors of the same type.
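To put numbers on the run-to-run variation, the error totals could be tallied over repeated runs. The following is only a rough sketch: the helper names `error_count` and `tally_runs` are made up for illustration, and it assumes every run ends with an `ERROR SUMMARY` line like the ones in the logs above.

```python
import re
import subprocess
from collections import Counter

# Matches the final line cuda-memcheck prints, e.g. "========= ERROR SUMMARY: 8 errors".
SUMMARY_RE = re.compile(r"ERROR SUMMARY: (\d+) errors?")


def error_count(output):
    """Return the total from the last ERROR SUMMARY line, or None if there is none."""
    matches = SUMMARY_RE.findall(output)
    return int(matches[-1]) if matches else None


def tally_runs(cmd, runs=10):
    """Run cmd repeatedly and count how often each error total shows up."""
    counts = Counter()
    for _ in range(runs):
        # Merge stderr into stdout so the summary is captured wherever it is written.
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        counts[error_count(proc.stdout)] += 1
    return counts
```

Calling `tally_runs(["cuda-memcheck", "./mnistCUDNN"], runs=20)` (the command line here is an assumption, adjust it to the local setup) would return a `Counter` that maps each error total to the number of runs that produced it, which makes it easy to see how often the 0, 4 and 8 error cases occur.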
- Thanks, I ran memtest86 as advised; it passed with no errors.
- I’ve just tested it with cuDNN v7.1.4 + CUDA 9.0 and with cuDNN v7.1.4 + CUDA 9.2. The problem persists in both cases.
If you have any ideas, please let me know. If it’s a hardware issue, I can submit a warranty claim to the distributor. It’s very important for me to solve this as soon as possible. Thanks for your help!