When to use cudaThreadExit()

I have two kernels to run in sequence. After the first is finished, should I use cudaThreadExit() and then start over with the other kernel?

If I don’t use cudaThreadExit, I have this error message:

Cuda error: cudaGetDeviceProperties: setting the device when a process is active is not allowed.

How to check if a process is still active?

If I use cudaThreadExit, my screen keeps flashing at each call.

If I use cudaThreadExit at the end of each kernel, not only does my screen flash, but the memory usage grows until a segfault.

Both kernels are working fine when alone.
Am I missing something, or is this a mode I should not operate in?

The short answer is: don’t run cudaThreadExit()

You should only use cudaThreadExit() when you decide you are done with the thread.

If you need to run several kernels, don’t run cudaThreadExit() in between.

If you are not sure what you are doing, just don’t call cudaThreadExit() anywhere in your code.

Is there any other way to stop the process when I want to use another kernel with another set of flags to the GPU? The only way I found was using cudaThreadExit().

You should use cudaThreadSynchronize() to wait for the first kernel to finish before you start the next one.

You don’t need to use cudaThreadSynchronize() because kernels submitted to the same stream are by default ordered and synchronized.
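A minimal sketch of that ordering guarantee (the kernels k1/k2 here are hypothetical, not the poster’s code): both launches go into the default stream, so k2 cannot start until k1 has finished and its writes are visible; no cudaThreadExit() is needed between them.

```cuda
// Two kernel launches in the default stream execute in order.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void k1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (float)i;
}
__global__ void k2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    k1<<<(n + 255) / 256, 256>>>(d, n);  // returns to the CPU immediately (async)
    k2<<<(n + 255) / 256, 256>>>(d, n);  // queued behind k1 in the same stream
    cudaThreadSynchronize();             // only needed before the host reads results
    float h;
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h);
    cudaFree(d);
    return 0;
}
```

(In CUDA 4.0 and later, cudaDeviceSynchronize() replaces the deprecated cudaThreadSynchronize(); the ordering behavior is the same.)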

I think the original poster wants to abort a kernel in progress, which I do not believe is possible currently.

Edit: To clarify the first statement, I mean that the second kernel will not start until the first one is done and all writes flushed to global memory. Of course the second kernel submission is asynchronous and will return to the CPU immediately just like the first one.

Here is what I want to do :

…using cudaHostAlloc cudaHostAllocMapped cudaHostAllocWriteCombined
<<< …>>>
----------------------------------------< cudaThreadExit() ?
…using cudaMemcpyToSymbol
----------------------------------------< cudaThreadExit() ?


repeat for each iteration

The only way I can make this work for a while is to insert those cudaThreadExit() calls, but then the memory usage keeps increasing until it crashes.
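Assuming the structure sketched above, a version without any cudaThreadExit() would allocate the mapped buffer once before the loop and only update the symbol per iteration (all names, sizes, and the kernel body here are placeholders, not the poster’s actual code):

```cuda
// Hypothetical sketch of the loop: allocate once, launch per iteration,
// free once at the end -- no cudaThreadExit() in between.
#include <cuda_runtime.h>

__constant__ float params[16];            // target of cudaMemcpyToSymbol
__global__ void compute(float *buf) { /* placeholder kernel body */ }

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede any context creation
    float *h_buf, *d_buf;
    cudaHostAlloc((void **)&h_buf, 4096,
                  cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    for (int iter = 0; iter < 10; ++iter) {
        float h_params[16] = { /* per-iteration values */ };
        cudaMemcpyToSymbol(params, h_params, sizeof(h_params));
        compute<<<64, 256>>>(d_buf);
        cudaThreadSynchronize();            // wait before the host touches h_buf
    }

    cudaFreeHost(h_buf);                    // free once, after the loop
    return 0;
}
```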


Once the device flags are set (either implicitly or explicitly, as with cudaSetDevice etc…), they cannot be reset… You need to spawn a separate thread if you want to use different flags…
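A sketch of the separate-thread approach, assuming the pre-4.0 one-context-per-host-thread behavior this thread discusses (the worker functions are hypothetical): each host thread gets its own context, so each can set its own flags before its first CUDA call.

```cuda
// Each host thread owns its own CUDA context, so each may set
// different device flags before the context is created.
#include <cuda_runtime.h>
#include <thread>

void worker_mapped() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // legal: no context exists in this thread yet
    // ... cudaHostAlloc(..., cudaHostAllocMapped), kernel launches ...
    cudaThreadExit();                       // tear down this thread's context on exit
}

void worker_default() {
    // default flags; a fresh context is created on the first CUDA call
    // ... kernel launches ...
    cudaThreadExit();
}

int main() {
    std::thread t1(worker_mapped);
    t1.join();                              // first context fully torn down here
    std::thread t2(worker_default);
    t2.join();
    return 0;
}
```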

And btw, cudaThreadExit() tears down the CUDA context maintained by the driver… It happens automatically when the thread exits, so you don’t need to call it… but then, why have such an API? – Only god and a few NV guys know…

It has been known to help in cases where the profiler writes out empty CSVs. Explicitly adding a cudaThreadExit() seems to get the profiler buffers flushed. But that is the only time I have seen it used.

Then why does my program, as shown, need cudaThreadExit() to continue? Maybe it is because I need to open some new graphics windows after the calculation, and I am using the same graphics card for both display and computation.

Can I set more than one device flag at the same time?

Maybe a buggy driver…

Even if I remove the graphics window operation between kernel calls, I still need to call cudaThreadExit() to start the next call.
The last problem I have is that the device memory does not get freed when I call cudaFree() / free() for one of the kernels. That one uses
cudaMemcpyToSymbol() and multidimensional allocation.
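One possible cause of that leak, sketched here under the assumption that the “multidimensional allocation” copies device pointers into a symbol (the names ROWS/d_rows/h_rows are hypothetical): the device memory behind a symbol full of pointers can only be freed through host-side copies of those pointers; calling free() on them, or cudaFree() on the symbol itself, frees nothing.

```cuda
// Keep host copies of the device pointers you store in a symbol,
// and cudaFree() each of them -- that is what frees the device memory.
#include <cuda_runtime.h>

#define ROWS 8
__device__ float *d_rows[ROWS];          // symbol holding device pointers

int main() {
    float *h_rows[ROWS];                 // host copies -- keep these to free later
    for (int i = 0; i < ROWS; ++i)
        cudaMalloc((void **)&h_rows[i], 256 * sizeof(float));
    cudaMemcpyToSymbol(d_rows, h_rows, sizeof(h_rows));

    // ... launch kernels that index through d_rows ...

    for (int i = 0; i < ROWS; ++i)
        cudaFree(h_rows[i]);             // frees the actual device allocations
    return 0;
}
```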