Number of CPU cores vs CUDA cards Do I need twice the number of cores?

If I understand things right, the thread scheduler is more or less a busy-wait loop running on the CPU. If I am designing an application that prepares GPU data for kernel launch N+1 during the execution of kernel N, then I need twice the number of CPU cores compared to CUDA-capable cards, right? E.g. a quad-core machine can only keep up with two CUDA cards?

– Kuisma

If data preparation takes less time than executing the kernel, you do not need twice as many CPUs:

  1. Launch kernel N
  2. Prepare data for kernel N+1
  3. cudaThreadSynchronize()

It’s not necessarily busy waiting. The CPU will not go into a busy wait after the kernel call. It will busy-wait only if you call cudaThreadSynchronize(), or if you request a synchronous memory copy, which calls cudaThreadSynchronize() implicitly.

So if you do something like this:

```cuda
prepare_kernel_data();
copy_data_to_gpu();
launch_kernel();

prepare_kernel_data();
copy_data_from_gpu();
copy_data_to_gpu();
launch_kernel();
```

the kernel and prepare_kernel_data() will run in parallel in the second pass (a pipeline, if you will). However, a “normal” synchronous memory copy will send the CPU into busy waiting.
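The sequence above can be fleshed out into compilable CUDA host code. This is only a sketch: the kernel `scale`, the buffer size, and the `prepare_kernel_data` helper are placeholders for illustration, not anything from this thread.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

#define N (1 << 20)

// Placeholder kernel; stands in for launch_kernel() above.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Placeholder for whatever CPU-side preparation the application does.
void prepare_kernel_data(float *h, int n) {
    for (int i = 0; i < n; ++i) h[i] = (float)i;
}

int main(void) {
    float *h = (float *)malloc(N * sizeof(float));
    float *d;
    cudaMalloc((void **)&d, N * sizeof(float));

    prepare_kernel_data(h, N);                            // pass 0 prep
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<N / 256, 256>>>(d, N);                        // returns immediately

    prepare_kernel_data(h, N);                            // overlaps the kernel
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost); // implicit sync

    cudaFree(d);
    free(h);
    return 0;
}
```

The key point is that the kernel launch returns immediately, so the second `prepare_kernel_data()` runs on the CPU while the GPU is still working; only the synchronous device-to-host copy blocks.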

With the new devices and CUDA 1.1 there’s asynchronous memory copy functionality.
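For reference, the asynchronous path looks roughly like this. It needs page-locked host memory and a stream; `my_kernel` and the sizes are illustrative, not from the thread.

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *d, int n) { /* placeholder */ }

int main(void) {
    const int n = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, n * sizeof(float)); // pinned host memory
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both calls below return immediately; the CPU is free to do other work.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    my_kernel<<<n / 256, 256, 0, stream>>>(d_buf, n);

    // ... do CPU work here ...

    cudaStreamSynchronize(stream); // block only when the result is needed

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

Note that `cudaMemcpyAsync` only overlaps with host execution if the host buffer is pinned (`cudaMallocHost`); with ordinary pageable memory the copy degrades to a synchronous one.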

So the bottom line is: depending on the ratio of kernel runtime to data-preparation runtime, one CPU core per GPU should suffice (just what AndreiB said).

So GPU thread scheduling is for free, from the CPU point of view?

I recall someone took a severe CUDA performance hit when adding CUDA card #3 to a dual-core machine, but it was solved by replacing the dual core with a quad core.

GPU scheduling is free (or almost free) for the CPU. It’s the synchronization functions that keep the CPU busy.

This is a different situation. You have to maintain as many CPU threads as there are GPUs installed in the system. And if each thread spends most of its time inside synchronization calls (which is typical), then you will feel the lack of CPU power as soon as the number of threads exceeds the number of physical cores.

Yes, the busy looping is terrible. Luckily, the various async calls in CUDA 1.1 make this somewhat less of an issue, as you can do other things on the CPU and poll the GPU for completion only once in a while.
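One way to poll rather than block is with events, sketched below. This fragment assumes buffers and launch configuration set up earlier; `my_kernel` and `do_other_cpu_work` are hypothetical names.

```cuda
cudaEvent_t done;
cudaEventCreate(&done);

my_kernel<<<grid, block>>>(d_buf, n);
cudaEventRecord(done, 0);          // marks the point after the kernel

// Instead of spinning inside cudaThreadSynchronize(), test the event
// and do useful host-side work between checks.
while (cudaEventQuery(done) == cudaErrorNotReady) {
    do_other_cpu_work();           // hypothetical host-side work
}

cudaEventDestroy(done);
```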