Using CUDA from a nested thread: is it safe or not?

Say I have a thread that creates a second thread, which in turn creates a third.

In that third thread I call cudaSetDevice and launch my kernel. Are there any limitations or necessary tricks when using nested threads?

Yes. Each thread has its own associated CUDA context. You cannot share device memory pointers or cudaArrays between contexts. Additionally, bound textures, constant memory contents, etc. are all per-context.

So you are fine as long as you keep all of your CUDA stuff in the same thread.
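A minimal sketch of that pattern, assuming device 0 exists and a C++11-capable nvcc. The kernel add_one and the helpers gpu_work and run_from_nested_thread are hypothetical names made up for illustration. Device selection, allocation, the launch, the copy back and the cleanup all happen inside the innermost thread, so nothing CUDA-related ever crosses a thread boundary; only a plain host int is handed back.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Runs entirely in the innermost thread. Writes the number of wrong
// elements to *mismatches, or leaves it at -1 if a CUDA call failed.
static void gpu_work(int *mismatches) {
    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    std::vector<int> host(n, 41);
    int *dev = nullptr;
    *mismatches = -1;

    if (cudaSetDevice(0) != cudaSuccess) return;  // assumption: device 0
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return;

    bool ok = cudaMemcpy(dev, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        add_one<<<(n + 255) / 256, 256>>>(dev, n);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    if (!ok) return;

    int bad = 0;
    for (int v : host) {
        if (v != 42) ++bad;
    }
    *mismatches = bad;
}

// The calling thread creates a second thread, which creates a third;
// only the third one touches CUDA.
int run_from_nested_thread() {
    int mismatches = -1;
    std::thread second([&mismatches] {
        std::thread third(gpu_work, &mismatches);
        third.join();
    });
    second.join();
    return mismatches;
}
```

Calling run_from_nested_thread() from main should give 0 when everything worked and -1 when a CUDA call failed. How deeply the thread is nested makes no difference here; what matters is that the device pointer dev never leaves the thread that allocated it.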

Alternatively, work with the driver API and pass the context from one thread to another.
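A rough sketch of the driver API route, under a few assumptions: CUDA 4.0 or later (for cuCtxSetCurrent), device 0, and linking against libcuda (-lcuda). It retains the device's primary context once and makes that same context current in two different threads; the poster may equally have had cuCtxPopCurrent / cuCtxPushCurrent in mind, which hands a context over rather than sharing it. The function name share_context_between_threads is hypothetical.

```cuda
#include <cuda.h>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the number of wrong elements read back, or -1 on a driver error.
int share_context_between_threads() {
    const size_t n = 256;
    const size_t bytes = n * sizeof(int);
    std::vector<int> in(n, 7), out(n, 0);
    CUdevice dev;
    CUcontext ctx = nullptr;
    CUdeviceptr buf = 0;
    bool ok_a = false, ok_b = false;

    if (cuInit(0) != CUDA_SUCCESS) return -1;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return -1;  // assumption: device 0
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS) return -1;

    // Thread A: bind the context, allocate device memory, upload data.
    std::thread a([&] {
        ok_a = cuCtxSetCurrent(ctx) == CUDA_SUCCESS &&
               cuMemAlloc(&buf, bytes) == CUDA_SUCCESS &&
               cuMemcpyHtoD(buf, in.data(), bytes) == CUDA_SUCCESS;
    });
    a.join();  // join orders thread A's writes before thread B starts

    // Thread B: bind the SAME context, so the pointer from thread A is valid.
    std::thread b([&] {
        if (cuCtxSetCurrent(ctx) != CUDA_SUCCESS) return;
        ok_b = ok_a &&
               cuMemcpyDtoH(out.data(), buf, bytes) == CUDA_SUCCESS;
        if (buf != 0) cuMemFree(buf);
    });
    b.join();

    cuDevicePrimaryCtxRelease(dev);
    if (!ok_b) return -1;

    int bad = 0;
    for (int v : out) {
        if (v != 7) ++bad;
    }
    return bad;
}
```

The device pointer buf is only meaningful in thread B because both threads bound the same context; the joins stand in for whatever synchronization a real program would use between the two threads.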