------------below is gdb output:
Thread 2 "StlTextureO_rea" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb3a181b0 (LWP 17273)]
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000007fb25d7f3c in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#2  0x0000007fb26807c4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3  0x0000007fb2680910 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4  0x0000007fb267ebc0 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5  0x0000007fb267ee00 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6  0x0000007fb25c6b84 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#7  0x0000007fb25c8658 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#8  0x0000007fb238b518 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#9  0x0000007fb263ee34 in cuDevicePrimaryCtxRetain () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#10 0x00000000004828dc in cudart::contextStateManager::initPrimaryContext(cudart::device*) ()
#11 0x0000000000482b44 in cudart::contextStateManager::initDriverContext() ()
#12 0x0000000000483598 in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#13 0x0000000000478674 in cudart::doLazyInitContextState() ()
#14 0x000000000045fd18 in cudart::cudaApiMalloc(void**, unsigned long) ()
#15 0x000000000048fe38 in cudaMalloc ()
#16 0x000000000043d3a4 in update_buffer::update_buffer (this=0x7fac001010, Id=0) at …/updatebuffer.cpp:21
#17 0x00000000004104bc in CaptureGroup::GetPanoCaptureGroup () at …/CaptureGroup.cpp:168
#18 0x0000000000410770 in CaptureGroup::GetExtCaptureGroup () at …/CaptureGroup.cpp:198
#19 0x000000000043cda8 in thread_scanner () at …/scanner.cpp:109
#20 0x0000007fb716afb4 in start_thread (arg=0x43ccdc <thread_scanner(void*)>) at pthread_create.c:335
#21 0x0000007fb6e76390 in thread_start () at …/sysdeps/unix/sysv/linux/aarch64/clone.S:89
One possible cause is running out of available memory (on the GPU side, no swap is available).
If that is the case, you may see some traces of it in the dmesg output.
Be sure to free memory as soon as it is no longer needed. You can also check the available memory before allocating to see whether enough bytes are free.
I also notice that your code doesn't check the result of malloc. If malloc fails, it returns a NULL pointer.
Trying to access a buffer through a NULL address causes a segmentation fault.
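A minimal sketch of the two suggestions above (check free memory first, then check the allocation result), assuming the CUDA runtime API. The names `fits_in_free_memory` and `checked_cuda_malloc` and the `reserve` headroom parameter are hypothetical and not taken from the original code. The CUDA part is guarded so that the arithmetic helper still compiles on a host without the toolkit.

```cpp
#include <cstddef>
#include <cstdio>

// Pure helper: true if `requested` bytes fit into `free_bytes` while
// leaving `reserve` bytes untouched. Ordered to avoid unsigned underflow.
inline bool fits_in_free_memory(std::size_t free_bytes,
                                std::size_t requested,
                                std::size_t reserve) {
    return free_bytes >= reserve && requested <= free_bytes - reserve;
}

#if defined(__has_include)
#if __has_include(<cuda_runtime.h>)
#include <cuda_runtime.h>

// Hypothetical wrapper (an assumption, not the poster's code): queries the
// free device memory, then allocates and checks the status that cudaMalloc
// returns. Yields nullptr on any failure so the caller can test the pointer.
inline void* checked_cuda_malloc(std::size_t bytes, std::size_t reserve) {
    std::size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                     cudaGetErrorString(err));
        return nullptr;
    }
    if (!fits_in_free_memory(free_bytes, bytes, reserve)) {
        std::fprintf(stderr, "not enough memory: %zu free, %zu requested\n",
                     free_bytes, bytes);
        return nullptr;
    }
    void* ptr = nullptr;
    err = cudaMalloc(&ptr, bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n",
                     cudaGetErrorString(err));
        return nullptr;
    }
    return ptr;
}
#endif
#endif
```

A caller would then write something like `void* buf = checked_cuda_malloc(size, 64u << 20);` and bail out if `buf` is null. Note that `cudaMemGetInfo` also triggers the lazy context creation seen in the backtrace, so if the fault lies in context initialization it would likely surface at that call instead of at `cudaMalloc`.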
There’s a NULL pointer dereference once kernel code is reached. Other than that I couldn’t tell you anything specific (@Honey_Patouceul mentioned a malloc which did not get a return value check…possibly this is the source of the NULL pointer).
In the gdb backtrace you gave, the top-most frame which is still controlled by your application has this:
…I'd have to guess that the device argument (or a member of the device, if the device itself is not NULL) is not valid. The error shows up in a kernel message (instead of your gdb backtrace) because the NULL pointer dereference was not in the user application. The dereference took place in the kernel after going through libcuda.so.1.1, which did not try to dereference the pointer itself but passed it on. Make sure cudart::device* is non-NULL, and that any member of cudart::device that needs to be initialized is non-NULL.
It is hard to tell much with so little information about the context.
Can you tell us:
How many previous calls succeeded before the failure?
How many threads are running in this application?
Do they share buffers, and if so, what are the locking mechanisms?
The same question applies if you have several processes sharing memory.
Do you know how much memory is available before launching your app?
Do you know the maximum amount your app could allocate or use?
Are you using recursive calls?
One possibility is a stack that was trashed by another thread, or that failed to grow correctly.
Maybe the gcc flags -fstack-check and -fstack-protector can help to detect that.
Could you isolate a skeleton of your code that triggers this fault and share it?
Just a note that I recently struggled with a very similar SIGSEGV when calling cudaMallocHost() on a Jetson TX1, with an essentially identical stack trace. I eventually traced it to a call to cudaGLSetGLDevice() in my initialization code (or at least the crash went away when I removed the call).
I now recognize that cudaGLSetGLDevice() is a deprecated interface, but I was reusing an older class and the call returned no error. The cudaMallocHost() call was also some distance away in the code, so it took me a while to isolate.
This problem was observed on a Jetson TX1 running L4T 24.2 and CUDA version 8.0.34.
Perhaps this will help someone else down the road.
Just want to confirm:
Although you hit the error at cudaMallocHost(), the real cause is that some function needs to be called before calling cudaGLSetGLDevice(). Is that right?