Hi all - I just solved an error where a kernel was operating on a shared memory space that was not fully initialized.
Some pseudocode of erroneous code:
static device forceinline void processing (float *shared_array)
shared float array_copy [array_size];if (threadIdx.x < array_size / 2 )
array_copy[threadIdx.x] = shared_array[threadIdx.x];
__syncthreads();if (threadIdx.x == 0)
find_median(array_copy);__syncthreads();
When running this code the find_median function (which is recursive and operates on the bounds of array_size) would hang after several recursive iterations while operating on array_copy without fail. (I’m aware this small snippet of code lacks context and thus is not fully reproducible. )
The first thing you might notice is that not all of array_copy is being initialized (this was the bug I fixed that solved the issue).
So while I fixed the issue but I am interested in finding out why the previous version would hang. With the older version it would hang unless I included the -G -g tags, in which case it would return
Unspecified launch failure - cudaError 719
when running the kernel.
Using cuda-gdb I would get
Thread 1 “test_controller” received signal SIGTRAP, Trace/breakpoint trap
during the find_median function.
if continued:
CUDA Exception: Warp Out-of-range Address
I am thinking the shared memory is somehow deallocated during runtime which leads to the recursive median function hitting a out-of-range address after a few loops (17 to be exact). This seems to be the case since it does happen to execute successfully if I reduce the memory load (removing other kernels, increasing time between kernels using sleep(), etc) .
Is this “runtime shared memory deallocation” even possible?