Multi kernels


How many kernels can I execute at the same time?

Only one kernel can execute on the machine at a time, see the FAQ Q29:

Then what does “Asynchronous Launches” mean? In fact, kernels can’t run in parallel, and CPU usage during kernel execution is 100%.

Why can’t anybody from the Nvidia developers reply to this simple question?

Sorry, I didn’t quite understand your question.

Yes, kernel launches are asynchronous in CUDA 1.0 - control is returned to the application as soon as the kernel is launched.

CPU usage should not be 100% during kernel execution.

As I mentioned earlier, multiple kernels cannot execute in parallel.


All my tests show 100% CPU usage. Can you give an example of a kernel (as a working program) that frees up the host? Maybe I am doing something wrong?

You’re not using CUDA 0.9 or later, are you? If you are, you should add cudaThreadSynchronize() before and after the kernel-launching line. Please refer to the SDK samples.

cudaMemcpy(d_buffinD, h_buffinH, DATASIZE, cudaMemcpyHostToDevice);
for (int i = 0; i < 100; i++)
    ProcessDeviceSuccess<<<BLOCK_N, THREAD_N, ELEMENT_N*ALIGN*THREAD_N>>>(d_buffoutD, d_buffinD);
cudaMemcpy(h_buffD, d_buffoutD, DATASIZE, cudaMemcpyDeviceToHost);

Thanks, yk_cadcg.

I already tried using cudaThreadSynchronize(), but it has no effect on CPU usage. Can you modify my test (or provide a simple test of your own) so that it frees up the CPU?

Kernel launches are asynchronous, but memory copies are not asynchronous. So in your example, you launch the kernel, followed by a memory copy which is blocking and causes the high CPU load you noticed. As mentioned in other topics, there is no way to see if a kernel launch is finished. The only thing you can do is sleep your process for a specified time duration to unload the CPU if you have a rough idea of the execution time of the kernel…
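For what it's worth, the "sleep instead of block" idea above can be sketched as follows. This is only a minimal sketch and it assumes a CUDA toolkit newer than the one discussed in this thread, one that provides events (cudaEventRecord / cudaEventQuery), so the host can check for completion between sleeps rather than guess the kernel's duration. The names spinKernel and runAndSleepPoll, the launch configuration, and the sleep interval are hypothetical and chosen only for illustration.

```cuda
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical long-running kernel, used only to keep the GPU busy.
__global__ void spinKernel(float *out, int iters)
{
    float x = 0.0f;
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 1.0f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Launches the kernel, then sleeps between completion checks instead of
// sitting in a blocking call. Returns the number of sleeps performed,
// or -1 if a CUDA call failed.
int runAndSleepPoll(int iters, int sleepMs)
{
    const int blocks = 16, threads = 128;   // assumed launch configuration
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return -1;

    cudaEvent_t done;
    if (cudaEventCreate(&done) != cudaSuccess) {
        cudaFree(d_out);
        return -1;
    }

    spinKernel<<<blocks, threads>>>(d_out, iters);  // control returns to the host right away
    cudaEventRecord(done, 0);                       // event completes once the kernel has finished

    int sleeps = 0;
    cudaError_t status;
    // cudaEventQuery does not block; it reports cudaErrorNotReady while work is pending.
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) {
        std::this_thread::sleep_for(std::chrono::milliseconds(sleepMs));
        ++sleeps;
    }

    cudaEventDestroy(done);
    cudaFree(d_out);
    return status == cudaSuccess ? sleeps : -1;
}
```

Calling something like runAndSleepPoll(1000000, 5) from main() and watching the process in a system monitor should show whether the host really idles on a given driver; the sleep interval trades CPU load against how late the host notices completion.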


To be clearer:

MyKernel<<<Dg, Db, Ns>>>(args1);

DoOtherThings(args2); // As long as args2 doesn't depend on the output of MyKernel, CUDA 0.9's asynchronous kernel launch feature lets the CPU run DoOtherThings immediately, without stalling to wait for MyKernel's output. As we know, CUDA 0.8 and older versions block the CPU until MyKernel finishes.

Yes. But in my example, “TempFunc()” takes all the processor time (it processes only one variable in a register). I tried commenting out all the memory copy functions but… CPU usage is still 100%. Can you give a simple example that frees up the CPU? I have only heard that it is possible. I would very much like to see a program that does this. :) (Attach the code, please.)

Many thanks.

Your code calls many kernels one right after the other. I’m pretty sure that calling another kernel will block until the first one returns, apparently at 100% CPU usage.

It seems asynchronous kernel launching is quite confusing. IMO the same result can be achieved neatly with a simple synchronous kernel launch, done from a separate host thread. The operating system won’t consume 100% CPU time while the threads wait on each other. Furthermore, you have all the usual coding options, such as polling.
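A rough sketch of that host-thread idea, under some assumptions: the worker thread makes every CUDA call itself (allocation, copies, launch), and the main thread only waits on an ordinary OS-level future. The names scaleKernel, gpuJob, and scaleOnWorkerThread are hypothetical, and std::async stands in for whatever threading API you prefer. Note that whether the worker thread itself spins inside the blocking copy depends on the driver's wait policy; only the main thread is guaranteed to wait cheaply here.

```cuda
#include <cstddef>
#include <future>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a factor.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Runs the whole GPU job synchronously. Intended to be called on a worker
// thread, so only that thread is held up by the blocking copy.
static bool gpuJob(float *host, int n, float factor)
{
    float *d = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess)
        return false;

    bool ok = cudaMemcpy(d, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scaleKernel<<<(n + 255) / 256, 256>>>(d, factor, n);
        // Blocking copy: waits for the kernel, then copies the result back.
        ok = cudaMemcpy(host, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}

// Main-thread side: start the job on another host thread and wait for it.
bool scaleOnWorkerThread(float *host, int n, float factor)
{
    std::future<bool> result =
        std::async(std::launch::async, gpuJob, host, n, factor);

    // The main thread is free to do unrelated work here.

    return result.get();  // waits on an OS primitive rather than a busy loop
}
```

For example, filling a small float array, passing it to scaleOnWorkerThread(values, count, 2.0f), and checking the returned flag before reading the array back is enough to try it out.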

The kernel execution time is more than 200 ms. That is long enough to free up the CPU between kernel launches (if that is possible).

Who can give an exact answer to this question: is it really possible to free up the CPU while a kernel is running (not synchronized)? (Yes/No.) If yes, it would be interesting to see the code. I suppose the answer is NO! But I can’t understand my colleagues on this forum: there is no exact answer here, only suggestions.