If I understand things right, the thread scheduler is more or less a busy-wait loop running on the CPU. If I am designing an application that prepares GPU data for kernel launch N+1 during the execution of kernel N, do I then need twice as many CPU cores as CUDA-capable cards? E.g., can a quad-core machine only keep up with two CUDA cards?
It’s not necessarily busy waiting. The CPU does not go into a busy wait after the kernel call. It busy-waits only if you call cudaThreadSynchronize(), or if you request a synchronous memory copy, which calls cudaThreadSynchronize() implicitly.
The kernel and prepare_kernel_data() will run in parallel from the second pass onward (a pipeline, if you will). However, a “normal” synchronous memory copy will send the CPU into busy waiting.
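The overlap described above might look roughly like this. This is only a sketch: my_kernel, prepare_kernel_data(), the double-buffered d_input, and the size/launch parameters are illustrative placeholders, not code from this thread.

```cuda
// Pipeline sketch: overlap kernel N on the GPU with preparation of
// data for kernel N+1 on the CPU.
for (int n = 0; n < num_passes; ++n) {
    // Kernel launches are asynchronous: control returns to the CPU
    // immediately, so the CPU is free to do other work.
    my_kernel<<<grid, block>>>(d_input[n % 2], d_output);

    // While kernel N runs on the GPU, prepare the host data for N+1.
    prepare_kernel_data(h_input, n + 1);

    // This synchronous copy implicitly waits for the running kernel
    // to finish (busy-waiting on pre-1.1 runtimes) before copying.
    cudaMemcpy(d_input[(n + 1) % 2], h_input, bytes,
               cudaMemcpyHostToDevice);
}
```

The point is that the launch itself costs the CPU almost nothing; only the synchronous copy at the end of each iteration blocks.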
With the new devices and CUDA 1.1 there is asynchronous memory copy functionality (cudaMemcpyAsync).
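A minimal sketch of that async path, assuming a device that supports it; note that cudaMemcpyAsync requires page-locked (pinned) host memory allocated with cudaMallocHost, and do_other_cpu_work() is a hypothetical placeholder:

```cuda
// Asynchronous host-to-device copy overlapped with CPU work.
float *h_buf;
cudaMallocHost((void **)&h_buf, bytes);   // pinned host allocation

cudaStream_t stream;
cudaStreamCreate(&stream);

// Returns immediately; the copy proceeds while the CPU keeps working.
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
do_other_cpu_work();

// Block only at the point where the copied data is actually needed.
cudaStreamSynchronize(stream);
```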
So the bottom line is: depending on the ratio of kernel runtime to preparation runtime, one CPU core per GPU should suffice (just what AndreiB said).
GPU scheduling is free (or almost free) for the CPU. It’s the synchronization functions that keep the CPU busy.
That is a different situation. You have to maintain as many CPU threads as there are GPUs installed in the system, and if each thread spends most of its time inside synchronization calls (which is typical), you will feel the lack of CPU power as soon as the number of threads exceeds the number of physical cores.
Yes, the busy looping is terrible. Luckily, the various async calls in CUDA 1.1 make this somewhat less of an issue: you can do other things on the CPU and poll the GPU for completion only once in a while.
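Polling instead of blocking can be sketched like this. cudaStreamQuery returns cudaSuccess once all work queued in the stream has completed and cudaErrorNotReady while it is still running; my_kernel and do_some_cpu_work() are assumed names, not part of the API:

```cuda
// Launch into a stream, then poll rather than block.
my_kernel<<<grid, block, 0, stream>>>(d_data);

while (cudaStreamQuery(stream) == cudaErrorNotReady) {
    do_some_cpu_work();   // hypothetical; keeps the core productive
}
// Stream is done here; results on the GPU are ready to use.
```

Compared with cudaThreadSynchronize(), this keeps the core doing useful work between checks instead of spinning inside the driver.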