Problem: termination of CUDA kernels is stuck until the end of the program.

Here is an illustration of our problem.

We run a simple kernel repeatedly:
#include <sys/types.h>  // u_int8_t, uint

__global__
void sa1(u_int8_t *in, u_int8_t *digest, const uint sesCount)
{
#if EXISTS
    // … some big code …
#endif
}
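For anyone who wants to reproduce this, here is a hypothetical host-side sketch. It is not our actual program: the buffer size, the launch configuration, the placeholder body used for EXISTS = 1, and the helper names still_running and run_repro are all assumptions. Judging by the log, our program polls with cudaStreamQuery and cudaEventQuery, so the sketch does the same. As far as we know, 0x22 (34) was the value of cudaErrorNotReady in older CUDA toolkits, which would mean those "Cuda API error detected" warnings only report that the work had not finished yet.

```cuda
#include <cstdio>
#include <sys/types.h>
#include <cuda_runtime.h>

// EXISTS = 1: kernel has a body; EXISTS = 0: kernel is empty.
#ifndef EXISTS
#define EXISTS 0
#endif

__global__ void sa1(u_int8_t *in, u_int8_t *digest, const uint sesCount)
{
#if EXISTS
    // Hypothetical stand-in for "… some big code …".
    uint i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sesCount) digest[i] = in[i] ^ 0x5a;
#endif
}

// cudaErrorNotReady is a status, not a failure: queued work is still in flight.
bool still_running(cudaError_t st) { return st == cudaErrorNotReady; }

// Hypothetical host loop: launch sa1 `launches` times on one stream and,
// after each launch, poll with cudaStreamQuery until the stream is idle.
cudaError_t run_repro(int launches, uint n)
{
    u_int8_t *in = nullptr, *digest = nullptr;
    cudaError_t st = cudaMalloc(&in, n);
    if (st == cudaSuccess) st = cudaMalloc(&digest, n);
    if (st == cudaSuccess) st = cudaMemset(in, 0, n);

    cudaStream_t stream = nullptr;
    if (st == cudaSuccess) st = cudaStreamCreate(&stream);

    for (int k = 0; k < launches && st == cudaSuccess; ++k) {
        sa1<<<(n + 255) / 256, 256, 0, stream>>>(in, digest, n);
        while (still_running(st = cudaStreamQuery(stream))) {
            // Busy-wait; a real host thread would do other work here.
        }
        if (st == cudaSuccess) st = cudaGetLastError();  // catch launch errors
        std::printf("launch %d: %s\n", k, cudaGetErrorString(st));
    }

    if (stream) cudaStreamDestroy(stream);
    cudaFree(in);      // cudaFree(nullptr) is a no-op
    cudaFree(digest);
    return st;
}
```

Calling run_repro(3, 1024) under cuda-gdb, once built with -DEXISTS=1 and once with -DEXISTS=0, would show whether the "Termination of CUDA Kernel" messages move to program exit in the same way as in our program.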

With EXISTS = 1 the kernel contains the big block of code, and everything works as expected: each kernel terminates at the right place, right after its launch. The cuda-gdb output is:

[Thread debugging using libthread_db enabled]
testing CUDA platform
Callback version
init … ok
devices count cpu : 0; gpu : 1; all : 1
Error devNumber - return -1203
Valid devNumber 1 - return 0
Device INFO:
max_alloc_mem_size : 1610153984
mem_size : 1610153984
nr_cores : 16
wavefront_size : 32
id : 0

[New Thread 0x7ffff5b6e700 (LWP 5538)]
QUEUE Ok
[New Thread 0x7ffff536d700 (LWP 5542)]
[Context Create of context 0x7674e0 on Device 0]
warning: Cuda API error detected: cudaEventQuery returned (0x22)

warning: Cuda API error detected: cudaEventQuery returned (0x22)

[Launch of CUDA Kernel 0 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

warning: Cuda API error detected: cudaStreamQuery returned (0x22)

warning: Cuda API error detected: cudaStreamQuery returned (0x22)

[Termination of CUDA Kernel 0 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

num buffers=3
warning: Cuda API error detected: cudaEventQuery returned (0x22)

warning: Cuda API error detected: cudaEventQuery returned (0x22)

ctx->stream=-267423360
[Launch of CUDA Kernel 1 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

warning: Cuda API error detected: cudaStreamQuery returned (0x22)

[Termination of CUDA Kernel 1 (sha1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

num buffers=3
warning: Cuda API error detected: cudaEventQuery returned (0x22)

warning: Cuda API error detected: cudaEventQuery returned (0x22)

[Launch of CUDA Kernel 2 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

warning: Cuda API error detected: cudaStreamQuery returned (0x22)

warning: Cuda API error detected: cudaStreamQuery returned (0x22)

[Termination of CUDA Kernel 2 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

Time on 0KB = 0.001000mls 0.000000Mbit
num buffers=3
test end!
[Thread 0x7ffff5b6e700 (LWP 5538) exited]
[Thread 0x7ffff536d700 (LWP 5542) exited]

Program exited normally.
(cuda-gdb)

With EXISTS = 0 the kernel body is empty (the smallest possible code), and the kernel terminations are reported only when the program exits. The kernels appear to be stuck: the driver does not release their resources while the program is running.
The cuda-gdb output is:

[Thread debugging using libthread_db enabled]
testing CUDA platform
Callback version
init … ok
devices count cpu : 0; gpu : 1; all : 1
Error devNumber - return -1203
Valid devNumber 1 - return 0
Device INFO:
max_alloc_mem_size : 1610153984
mem_size : 1610153984
nr_cores : 16
wavefront_size : 32
id : 0

[New Thread 0x7ffff5b6e700 (LWP 4663)]
QUEUE Ok
[New Thread 0x7ffff536d700 (LWP 4665)]
[Context Create of context 0x7614e0 on Device 0]
warning: Cuda API error detected: cudaEventQuery returned (0x22)

warning: Cuda API error detected: cudaEventQuery returned (0x22)

[Launch of CUDA Kernel 0 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

num buffers=3
warning: Cuda API error detected: cudaEventQuery returned (0x22)

warning: Cuda API error detected: cudaEventQuery returned (0x22)

[Launch of CUDA Kernel 1 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

num buffers=3
warning: Cuda API error detected: cudaEventQuery returned (0x22)

warning: Cuda API error detected: cudaEventQuery returned (0x22)

[Launch of CUDA Kernel 2 (sa1) on Device 0]
warning: Cuda API error detected: cudaStreamQuery returned (0x22)

Time on 0KB = 0.000000mls -nanMbit
num buffers=3
test end!
[Thread 0x7ffff5b6e700 (LWP 4663) exited]
[Thread 0x7ffff536d700 (LWP 4665) exited]

Program exited normally.
[Termination of CUDA Kernel 2 (sa1) on Device 0]
[Termination of CUDA Kernel 1 (sa1) on Device 0]
[Termination of CUDA Kernel 0 (sa1) on Device 0]
(cuda-gdb)

This behavior looks incorrect.

What causes this effect?

It is important for us to understand it.
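For reference, a sketch of one check that could separate the two possibilities (the kernels are really still running, or cuda-gdb only reports their termination late): wait for each launch with a blocking cudaEventSynchronize instead of polling, and measure the elapsed GPU time. The names sa1_empty, timed_launch and mbit_per_s are hypothetical. The zero-time guard in mbit_per_s is our assumption about the "-nanMbit" in the second log, which looks like 0 bytes divided by 0 ms.

```cuda
#include <cstdio>
#include <sys/types.h>
#include <cuda_runtime.h>

// Empty kernel, equivalent to sa1 built with EXISTS = 0.
__global__ void sa1_empty(u_int8_t *in, u_int8_t *digest, const uint sesCount) {}

// Hypothetical helper: bytes and milliseconds to Mbit/s,
// returning 0 instead of NaN when no time was measured.
double mbit_per_s(size_t bytes, float ms)
{
    if (ms <= 0.0f) return 0.0;
    return bytes * 8.0 / (ms * 1000.0);
}

// Launch the kernel, block until it has finished, and return the elapsed
// GPU time in milliseconds (-1 on error).
float timed_launch(cudaStream_t stream, u_int8_t *in, u_int8_t *digest, uint n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    sa1_empty<<<1, 32, 0, stream>>>(in, digest, n);
    cudaEventRecord(stop, stream);

    // Blocks the host until the kernel and the stop event have completed,
    // instead of polling with cudaEventQuery / cudaStreamQuery.
    cudaError_t st = cudaEventSynchronize(stop);
    if (st == cudaSuccess) st = cudaGetLastError();

    float ms = -1.0f;
    if (st == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    else std::printf("launch failed: %s\n", cudaGetErrorString(st));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling timed_launch three times and then cudaDeviceSynchronize() before exit would show whether the host really waits for each kernel. We have not verified how this changes the order of the "Termination" messages in cuda-gdb.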