oukore
March 11, 2020, 11:16am
Hi folks,
I was getting strange cuFFT-related errors when I ran my program under cuda-memcheck. The program's results were correct, and cuda-gdb detected no errors. nvprof also worked fine, with no privilege-related errors. I then decided to test
NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release/simpleCUFFT
from the CUDA samples (NVIDIA_CUDA-10.1_Samples/7_CUDALibraries/simpleCUFFT) and received the same errors (attached at the end).
I've tried a number of solutions, none of which helped.
My laptop is running Ubuntu 18.04, "GeForce RTX 2070 with Max-Q Design" with compute capability 7.5 and driver 435.21. CUDA / cuda-memcheck version:
$ cuda-memcheck --version
CUDA-MEMCHECK version 10.1.243 ID:(46)
Any suggestions?
Thank you.
$ cuda-memcheck ./simpleCUFFT
========= CUDA-MEMCHECK
[simpleCUFFT] is starting...
GPU Device 0: "GeForce RTX 2070 with Max-Q Design" with compute capability 7.5
========= Internal Memcheck Error: Initialization failed
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x13ba7c]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3d7e4a]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3caf70]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3d719a]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3dae9f]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3db60a]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3cec3c]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3bed7e]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3f022c]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x379a2]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x37fa6]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x39af2]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftXtMakePlanMany + 0x63a) [0x4d0ca]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftMakePlanMany64 + 0xfd) [0x4e02d]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftMakePlanMany + 0x193) [0x4aaf3]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftPlanMany + 0xd2) [0x4b082]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftPlan1d + 0x48) [0x4b1a8]
========= Host Frame:./simpleCUFFT [0x7358]
========= Host Frame:./simpleCUFFT [0x711e]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:./simpleCUFFT [0x6efa]
tobiw
May 26, 2020, 7:44am
Hi,
I currently face the same problem, did you solve the issue?
Thanks,
Tobi
As much as I’d like to have it solved, I still don’t have a solution.
I am also seeing this, CentOS 8, CUDA 10.2 on a 2080Ti with driver 440.100.
In fact, I get a similar error to yours from cuda-memcheck with just this:
// bug.cu
#include <cufft.h>

int main() {
    cufftHandle plan;
    cufftPlan1d(&plan, 1, CUFFT_C2C, 1);
    cufftDestroy(plan);
    return 0;
}
Running cuda-memcheck on this results in:
$ nvcc bug.cu -lcufft && cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib64/libcuda.so.1 [0x1403fc]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3d887a]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3cb9a0]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3d7bca]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3db8cf]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3dc03a]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3cf66c]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3bf16e]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x3f138c]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x37b82]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x38186]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 [0x39cd2]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftXtMakePlanMany + 0x63a) [0x4d2aa]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftMakePlanMany64 + 0xfd) [0x4e20d]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftMakePlanMany + 0x193) [0x4acd3]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftPlanMany + 0xd2) [0x4b262]
========= Host Frame:/usr/local/cuda/lib64/libcufft.so.10 (cufftPlan1d + 0x48) [0x4b388]
========= Host Frame:./a.out [0x33c5]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf3) [0x236a3]
========= Host Frame:./a.out [0x32be]
=========
========= ERROR SUMMARY: 1 error
This makes it virtually impossible to debug any CUDA code containing a cuFFT call…
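For what it's worth, here is a variant of the minimal reproducer above that also checks the cuFFT status codes (a sketch only; it needs a CUDA toolkit and GPU to build and run). On a working setup both calls should return CUFFT_SUCCESS, which suggests the failure is internal to cuda-memcheck's initialization rather than an actual cuFFT error:

```cpp
// check_bug.cu -- same minimal reproducer, but checking cuFFT return codes.
// Sketch: requires the CUDA toolkit; build with `nvcc check_bug.cu -lcufft`.
#include <cufft.h>
#include <cstdio>

int main() {
    cufftHandle plan;
    cufftResult r = cufftPlan1d(&plan, 1, CUFFT_C2C, 1);
    if (r != CUFFT_SUCCESS) {
        std::fprintf(stderr, "cufftPlan1d failed with status %d\n", (int)r);
        return 1;
    }
    r = cufftDestroy(plan);
    if (r != CUFFT_SUCCESS) {
        std::fprintf(stderr, "cufftDestroy failed with status %d\n", (int)r);
        return 1;
    }
    return 0;
}
```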
As a last resort, running this example via cuda-gdb does work without error:
$ cuda-gdb -q ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
(cuda-gdb) set cuda memcheck on
(cuda-gdb) r
Starting program: ./a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
[New Thread 0x7fffe63f4700 (LWP 44592)]
[New Thread 0x7fffe5bf3700 (LWP 44593)]
[New Thread 0x7fffe5371700 (LWP 44594)]
[Thread 0x7fffe5371700 (LWP 44594) exited]
[Thread 0x7fffe5bf3700 (LWP 44593) exited]
[Thread 0x7fffe63f4700 (LWP 44592) exited]
[Inferior 1 (process 44577) exited normally]
(cuda-gdb)
But this seems to fail with other examples with no obvious pattern…
Seems I should have searched further: a similar issue on dual RTX 2070s is reported here: Trivial cuFFT causes cuda-memcheck errors on RTX 2070 SUPER
@RaulPPelaez, the link to nvidia-bug #3050187 doesn't appear to be publicly accessible: can you post the relevant details/workaround?
Yeah, sorry. I am able to run cuda-memcheck on cuFFT code using this environment variable:
CUDA_MEMCHECK_PATCH_MODULE=1
According to the bug page, this is a known issue with this release and is fixed in CUDA 11.
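Putting it together, the workaround is just to set that variable when invoking the tool (shown here against the minimal reproducer above; this of course requires a CUDA-capable machine):

```shell
# Workaround for the cuFFT "Internal Memcheck Error" on CUDA 10.x:
# force cuda-memcheck's patch-based interception for this run.
CUDA_MEMCHECK_PATCH_MODULE=1 cuda-memcheck ./a.out
```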
Thanks for the info: I will try that. Yep, that worked: thanks!
On my dual RTX 2070 SUPERs I upgraded to the CUDA 11 RC SDK, rebuilt, and tried with the CUDA 11 driver and runtime. I see the same issue without the patch-module flag above, so it looks like this is still broken in 11.
Starting from CUDA 11.0, the compute-sanitizer tool should be used as a replacement for cuda-memcheck for most use cases.
See the known issues section for more details: CUDA-MEMCHECK :: CUDA Toolkit Documentation
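On CUDA 11.0 and later, the equivalent invocation would look like this (memcheck is compute-sanitizer's default tool, so the first form is the drop-in replacement; again, this needs a CUDA-capable machine):

```shell
# compute-sanitizer replaces cuda-memcheck from CUDA 11.0 onward:
compute-sanitizer ./a.out
# or with the tool named explicitly:
compute-sanitizer --tool memcheck ./a.out
```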