Hello there guys,
I am developing my master’s final project, and everything was going fine until I came across an error related to the launch of cuBLAS kernels. I have wrapped cuBLAS calls inside C++ objects to make the project easier to work with (in some ways inspired by Eigen and OpenCV), and when performing unit tests with gtest everything went fine. But when I started integrating the functions, cuBLAS began to crash without reporting any error.
I then started checking the return status of every cuBLAS and CUDA runtime API call and printing any errors, but none appeared, even with cudaGetLastError(). When I run cuda-memcheck, it says it is a host-side problem and tells me to use cuda-gdb to find it. OK, I also used Nsight to debug, and the only thing it says is that the problem is in a function called cuEGLInit(), about which I found nothing on the web.
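To make "checking every call" concrete, here is a minimal sketch of that kind of status check. It is not the exact code from the project: checkStatus and CHECK_STATUS are hypothetical names, not part of CUDA or cuBLAS, and the helper is written generically so that it compiles without the CUDA toolkit.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CUDA or cuBLAS): compares a returned
// status against the library's success value and throws an exception that
// names the failing expression and its call site.
template <typename Status>
void checkStatus(Status status, Status success, const char* expr,
                 const char* file, int line) {
    if (status != success) {
        std::ostringstream msg;
        msg << file << ":" << line << ": " << expr
            << " failed with status " << static_cast<long long>(status);
        throw std::runtime_error(msg.str());
    }
}

// Wraps a call so the expression text, file and line end up in the message.
#define CHECK_STATUS(call, success) \
    checkStatus((call), (success), #call, __FILE__, __LINE__)

// Assumed usage when the CUDA headers are available (not compiled here):
//   CHECK_STATUS(cudaMalloc(&ptr, bytes), cudaSuccess);
//   CHECK_STATUS(cublasCreate(&handle), CUBLAS_STATUS_SUCCESS);
```

A check like this only sees errors that the API returns as status codes. A segmentation fault on the host, like the one in the cuda-gdb output further down, kills the process before any status is returned, which would be consistent with no error ever being printed.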
Here is the output of cuda-memcheck and cuda-gdb, both run from the command line, followed by what Nsight shows.
spades@pedro-BhtmLnx:~/kinetic_ws$ rosrun --prefix "cuda-memcheck" cgmapping basic_matrix_tests --gtest_filter=camera_motion_check.camera_motion_test
========= CUDA-MEMCHECK
Note: Google Test filter = camera_motion_check.camera_motion_test
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from camera_motion_check
[ RUN ] camera_motion_check.camera_motion_test
========= Error: process didn't terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= Internal error (20)
========= No CUDA-MEMCHECK results found
spades@pedro-BhtmLnx:~/kinetic_ws$ rosrun --prefix "cuda-gdb --args" cgmapping basic_matrix_tests --gtest_filter=camera_motion_check.camera_motion_test
NVIDIA (R) CUDA Debugger
8.0 release
Portions Copyright (C) 2007-2016 NVIDIA Corporation
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/spades/kinetic_ws/devel/lib/cgmapping/basic_matrix_tests...done.
(cuda-gdb) run
Starting program: /home/spades/kinetic_ws/devel/lib/cgmapping/basic_matrix_tests --gtest_filter=camera_motion_check.camera_motion_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffae22d700 (LWP 13056)]
[New Thread 0x7fffada2c700 (LWP 13057)]
[New Thread 0x7fffad22b700 (LWP 13058)]
Note: Google Test filter = camera_motion_check.camera_motion_test
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from camera_motion_check
[ RUN ] camera_motion_check.camera_motion_test

Program received signal SIGSEGV, Segmentation fault.
0x00007fffcdcd1c40 in cuEGLInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) continue
Continuing.
[Thread 0x7fffada2c700 (LWP 13057) exited]
[Thread 0x7fffad22b700 (LWP 13058) exited]
[Thread 0x7fffae22d700 (LWP 13056) exited]
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_UNKNOWN(0x1).
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 13041] will be killed.
Quit anyway? (y or n) n
Not confirmed.
(cuda-gdb) continue
Continuing.
Cannot execute this command while the selected thread is running.
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 13041] will be killed.
Quit anyway? (y or n) y
Here is the image when the crash happens.
And here is the image after I pressed to continue after crashing.
Does anyone have any clue what this could be? I suspected it could be caused by bad synchronization between streams, but I put cudaDeviceSynchronize() after every kernel call and the error persists.
Also, all the kernels I wrote myself and OpenCV’s kernels have worked like a charm so far.