Avoid thread launch overhead?

The program I am currently working on basically boils down to two sets of kernels that are called thousands of times each. I must pause between each iteration of each set to let the CPU do some work. Is there any way to avoid launching a new thread each time?

Each iteration of the kernel could be on the order of 2e-4 s. I don’t want thread management to swamp my speedup…

I’m thinking I probably should try to demonstrate what I mean…

I tend to think about threading as:

for ( i=0; i<2000; ++i )
{
	LaunchThreadsForFirstBatchOfCUDA();
	WaitForThreads();
	CloseAllThreads();
	LetCPUDoSomething();
	LaunchThreadsForSecondBatchOfCUDA();
	WaitForThreads();
	CloseAllThreads();
}

In the above layout, a lot of threads are launched and closed/joined during execution. Is there a way to approach it more as:

LaunchOneThreadPerDevice();
for ( i=0; i<2000; ++i )
{
	TellThreadsToDoFirstBatchOfWork();
	WaitForThreads();
	LetCPUDoSomething();
	TellThreadsToDoSecondBatchOfWork();
	WaitForThreads();
}
CloseAllThreads();

Please assume that there is no reasonable way to bypass returning to the CPU for now.

Thanks!

You can do exactly that, though there are many ways to go about it. On this forum it’s often referred to as a thread pool, where one thread manages the data flow to and from the GPU threads. If you want a suggestion on how to do it, take a look at mutexes, which are used for thread synchronization. A colleague at work did that because we found the thread launch overhead was hurting our real-time system. I don’t know the best way to handle a thread pool with respect to GPUs, though, so you might need to look around.

The latter case is pretty much the design of MisterAnderson42’s GPUWorker class. It creates the threads, and performs the appropriate synchronization to start and stop them (without destroying them) as needed:

http://forums.nvidia.com/lofiversion/index.php?t66598.html
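For anyone who wants to see the general shape of the idea, here is a minimal sketch of a persistent worker thread in plain C++. To be clear, this is not the GPUWorker code from the linked thread; the names Worker, submit, wait and run_iterations are made up for illustration, and the lambdas stand in for the real CUDA calls. It assumes one such worker per device and at most one job in flight per worker. In a real program each worker would presumably select its device once when its thread starts (for example with cudaSetDevice) before running any jobs.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical worker: one long-lived thread that runs jobs handed to it.
class Worker {
public:
    Worker() : thread_([this] { loop(); }) {}

    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_one();
        thread_.join();  // the thread is joined once, at shutdown
    }

    Worker(const Worker&) = delete;
    Worker& operator=(const Worker&) = delete;

    // Hand one job to the worker thread; returns immediately.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!busy_ && "call wait() before submitting another job");
            job_ = std::move(job);
            busy_ = true;
        }
        wake_.notify_one();
    }

    // Block the caller until the submitted job has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return !busy_; });
    }

private:
    void loop() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // Sleep until there is a job to run or shutdown is requested.
            wake_.wait(lock, [this] { return quit_ || job_; });
            if (!job_) return;  // shutdown requested and nothing pending
            std::function<void()> job = std::move(job_);
            job_ = nullptr;
            lock.unlock();
            job();              // run the work without holding the lock
            lock.lock();
            busy_ = false;
            done_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable done_;
    std::function<void()> job_;
    bool busy_ = false;
    bool quit_ = false;
    std::thread thread_;  // declared last so the other members exist first
};

// Mirrors the second loop in the question, with counters standing in
// for the two batches of GPU work. Returns the number of jobs executed.
int run_iterations(int iterations) {
    Worker worker;  // thread is launched once, before the loop
    int first = 0, second = 0;
    for (int i = 0; i < iterations; ++i) {
        worker.submit([&] { ++first; });   // first batch of work
        worker.wait();
        /* LetCPUDoSomething() would go here */
        worker.submit([&] { ++second; });  // second batch of work
        worker.wait();
    }
    return first + second;
}
```

The thread is created once in the constructor and joined once in the destructor, so each iteration only pays for a mutex lock and a condition variable wake-up rather than a thread launch. With several devices you would keep one Worker per device, call submit on all of them, then wait on all of them.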

Doesn’t the poster mean that he doesn’t want to launch the kernels a ton of times, e.g. let the kernel sit idle on the device until it is needed again? If so, then no, I don’t think that’s possible. If not, and you’re talking about CPU threads, then I’m wrong and the posts above mine are the ones to read.

Actually, the worker thread concept is what I meant. I need to avoid launching pthreads/winthreads over and over again. I will take a look at the worker thread topic.

Thanks!