Reusing GPU threads created by a CUDA kernel

Hi all. I am running an image processing function as a CUDA kernel, and I have to process multiple frames.
My code looks something like this. The arguments are the input and output frames and the frame dimensions.

for(int i = 0; i < nFrames; i++){
  ...
  kernel1<<<nBlocks, nThreads>>>(args...);
  ...
}

As I understand it, threads are created for every kernel launch and destroyed after it finishes executing. Does that add any significant overhead?
Can I create the threads once at the start and destroy them at the end of the program?
I was expecting something like this:

createGPUThreads();

for(int i = 0; i < nFrames; i++){
  launchKernel(); // With already created threads
}

destroyGPUThreads();

I’m not entirely sure I understood the context of your questions, because you asked several quite different things. I will give some pieces of information and some reading suggestions, and you can link them together in the way that makes sense for your specific problem:

  • Your snippet #1 is correct in the sense that it will launch “kernel1” nFrames times, each launch processing a different input passed in args (in your case, the frames).
  • Regarding how threads are created/destroyed, search for thread scheduling in CUDA to see various explanations.
  • Regarding the overhead, search for CUDA kernel launch overhead/time. I think there is a fixed time taken to launch a kernel, but I don’t recall the figure from memory.
  • Regarding your snippet #2: no. It looks like it is trying to launch GPU/device threads from the CPU and then decide, from within that CPU function, what the threads will do. That is not how it works. Think of a kernel function as the way to tell the GPU what to do and how to do it. The launch parameters ask the device to create that many blocks, with that many threads each, to run the computation defined in the kernel. Simple as that.
  • Read some of the CUDA blogs; they give concrete examples of real-world tasks suitable for GPU computing. That is where I got started myself.

I am sorry about the confusion in snippet #2.

I just want to know whether I can launch a kernel with threads that were already created, so that I can reduce the overhead. It was not meant to be a CPU call that launches the threads.

Kernel launches take ~5 µs, but thread creation is extremely cheap and not worth bothering about.

If you can process all frames in parallel, it is worth launching only a single kernel that uses a larger number of blocks. This provides the GPU with more parallelism as well as avoiding the launch overhead for additional kernels.
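As a rough sketch of the single-launch idea, assuming the frames are independent, 8-bit grayscale, and stored back to back in one buffer: the frame index is mapped to blockIdx.z, so one launch covers every frame. The names invertFrames and processAllFrames and the per-pixel inversion are made up for illustration; substitute your own image operation.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel operation (inversion) standing in for the real filter.
// Frames are assumed to be packed contiguously: frame f starts at f * width * height.
__global__ void invertFrames(const unsigned char* in, unsigned char* out,
                             int width, int height, int nFrames)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int f = blockIdx.z;  // one layer of the grid per frame
    if (x < width && y < height && f < nFrames) {
        size_t idx = (size_t)f * width * height + (size_t)y * width + x;
        out[idx] = (unsigned char)(255 - in[idx]);
    }
}

// Copies all frames to the device, processes them with ONE kernel launch,
// and copies the results back. Returns the first CUDA error encountered.
cudaError_t processAllFrames(const unsigned char* hIn, unsigned char* hOut,
                             int width, int height, int nFrames)
{
    size_t bytes = (size_t)width * height * nFrames;
    unsigned char* dIn = nullptr;
    unsigned char* dOut = nullptr;

    cudaError_t err = cudaMalloc(&dIn, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dOut, bytes);
    if (err != cudaSuccess) { cudaFree(dIn); return err; }

    err = cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        // gridDim.z is limited to 65535, so this assumes nFrames <= 65535.
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y,
                  nFrames);
        invertFrames<<<grid, block>>>(dIn, dOut, width, height, nFrames);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

This only helps if all frames are available up front and fit in device memory together. For a live stream where frames arrive one at a time, the per-frame loop is still the natural structure.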

==== EDIT

PS: listen to @saulocpp’s advice - it is important that you understand the difference between CPU threads and GPU threads that he is pointing out.

No, each kernel is the container of its own threads. Once the kernel has finished running, the threads are gone, along with kernel-scoped data such as shared memory. Search for the scope of threads and memory in CUDA; you will find really useful information.
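What does persist between launches is global memory, so the closest thing to "create at the start, destroy at the end" is allocating device buffers once outside the loop. A minimal sketch, with hypothetical names addOne and launchRepeatedly: every launch gets fresh threads, yet each one sees what the previous launch wrote.

```cuda
#include <cuda_runtime.h>

// Threads exist only for the duration of one launch,
// but the buffer they write to lives in global memory and outlasts them.
__global__ void addOne(int* counter, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counter[i] += 1;
}

// Allocates one device buffer, launches addOne nLaunches times on it,
// and returns the final value of element 0 (or -1 on a CUDA error).
int launchRepeatedly(int n, int nLaunches)
{
    int* dCounter = nullptr;
    if (cudaMalloc(&dCounter, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(dCounter, 0, n * sizeof(int)) != cudaSuccess) {
        cudaFree(dCounter);
        return -1;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Launches on the same stream run in order, so the increments accumulate.
    for (int i = 0; i < nLaunches; ++i)
        addOne<<<blocks, threads>>>(dCounter, n);

    int result = -1;
    cudaError_t err = cudaMemcpy(&result, dCounter, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dCounter);
    return (err == cudaSuccess) ? result : -1;
}
```

Calling launchRepeatedly(1000, 5) should return 5, since the five launches each add one to the same buffer.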

Regarding the overhead, check this: https://stackoverflow.com/questions/27038162/how-bad-is-it-to-launch-many-small-kernels-in-cuda

Check for the time needed to launch a kernel and compare it to your kernel run time. You can also find good information here: https://stackoverflow.com/questions/35628624/cuda-thread-scheduling-latency-hiding

If you are planning to launch “kernel1” multiple times, I suggest you first check with the profiler how long it takes to run, and then compare that with the time taken to launch a kernel and the number of times you plan to launch it.
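If you want a quick number without the profiler, something like the following gives a rough estimate of the per-launch cost by timing many back-to-back launches of an empty kernel with CUDA events. The names emptyKernel and averageLaunchMicroseconds are made up for this sketch, and the result is only an approximation that varies with GPU, driver and OS.

```cuda
#include <cuda_runtime.h>

// A kernel that does no work, so the measured time is dominated by launch cost.
__global__ void emptyKernel() {}

// Returns the average time per launch in microseconds for nLaunches
// back-to-back launches, or a negative value if a CUDA call fails.
float averageLaunchMicroseconds(int nLaunches)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Warm-up launch: the first launch includes one-time setup costs.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < nLaunches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until every launch has completed

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaError_t launchErr = cudaGetLastError();
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || launchErr != cudaSuccess) return -1.0f;

    return ms * 1000.0f / nLaunches;  // milliseconds -> microseconds per launch
}
```

Timing “kernel1” the same way (record an event before and after a single launch) gives the run time to compare against. If the kernel runs for milliseconds per frame, a few microseconds of launch cost will not matter.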

==== EDIT

PS: listen to @tera’s advice