what happens if I send two kernels at the same time?

what will happen if I do this:
suppose foo1 and foo2 are two kernels.

now, I launch foo1 and then foo2:
foo1<<<blocks, threads>>>(…)
foo2<<<blocks1, threads1>>>(…)

is foo1 still running when I launch foo2?
how can I make foo1 and foo2 cooperate?
I mean, some multiprocessors executing foo1 while others are executing foo2 at the same time.

and, if I do not cudaFree() the device memory that I cudaMalloc()ed in my application, will it stay on the device forever? I think device memory does not get protection from the operating system.

Yes, foo1 could potentially still be running on the device when you launch foo2 from the host. Launching foo2 will add it to a queue on the device: after foo1 completes, foo2 will then be executed.
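A minimal sketch of the behavior described above; the kernel bodies, launch configurations, and buffer size are placeholders:

```cuda
#include <cstdio>

__global__ void foo1(int *data) { /* placeholder body */ }
__global__ void foo2(int *data) { /* placeholder body */ }

int main()
{
    int *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(int));

    // Both launches return to the host almost immediately.
    foo1<<<32, 128>>>(d_data);
    foo2<<<32, 128>>>(d_data);  // queued: starts only after foo1 finishes

    // Block the host until everything in the queue has completed.
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}
```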

You cannot.

The device has the same kinds of memory protection that the host does. In particular, one CUDA context cannot read/write another CUDA context's memory, and all allocated memory is freed when the program exits and/or crashes. It is still good programming practice to free memory yourself; memory leaks in a long-running program can be difficult to find.
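A hedged sketch of the allocate/check/free pattern being recommended; the size and names are illustrative:

```cuda
#include <cstdio>

int main()
{
    float *d_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... use d_buf in kernels ... */

    // Free explicitly, even though context teardown would reclaim it anyway.
    cudaFree(d_buf);
    return 0;
}
```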

Hmm, is this always true? What if foo1 has completed all but 2 blocks, which are still cooking? Those two blocks are using 2 SPs. Couldn't the GPU start running foo2 on the remaining unused SPs?

Or does the GPU (or firmware) simplify its scheduling by making all tasks truly exclusive? That fits the observed behavior, so it's probably true. There could also be some hardware limit that forces all SPs to use identical kernel code.

Thinking about it, it’d be COOL to allow allocated SP use for deliberate GPU coprocessing. The kernels would be independent and uncommunicating, but that’s still useful. I’m thinking of stuff like “Run this kernel, but use no more than 2 SPs, so other jobs can get done simultaneously”. That could even solve problems like using a display card for CUDA. “Run my kernel, but leave a couple SPs free for the OS graphics.” Or a game may have physics run on 1 SP, a particle system on 1 SP, collision dynamics on 2 SPs, leaving all remaining SPs free for OpenGL.

A good one for the wishlist thread probably.

I believe you will already find it there; this has been an often-requested feature going back to the first public beta, CUDA 0.8. To my knowledge, NVIDIA has made no comment on whether it could be a possibility or not.

I see. Thank you for your reply.

You could do this by having 2 GPUs. It might not be worth it for what you are doing, but if you have the money and a critical application that has 2 kernels that can run at once, it might be. You probably already know this, but the two kernels would not be able to effectively communicate in a setup like this.
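A rough sketch of how the two-GPU setup could look, assuming the one-host-thread-per-device model and using pthreads; the device IDs, kernels, and sizes are placeholders:

```cuda
#include <pthread.h>

__global__ void foo1(int *data) { /* placeholder body */ }
__global__ void foo2(int *data) { /* placeholder body */ }

static void *run_on_device(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);          // bind this host thread to one GPU

    int *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(int));

    if (dev == 0)
        foo1<<<32, 128>>>(d_data);  // runs on GPU 0
    else
        foo2<<<32, 128>>>(d_data);  // runs on GPU 1, truly in parallel

    cudaThreadSynchronize();
    cudaFree(d_data);
    return NULL;
}

int main()
{
    pthread_t t0, t1;
    int dev0 = 0, dev1 = 1;
    pthread_create(&t0, NULL, run_on_device, &dev0);
    pthread_create(&t1, NULL, run_on_device, &dev1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```

Note the two contexts own separate device memory, which is why the kernels cannot share data directly.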


This would require, among other things, a tiny operating system to execute on the GPU, with its own drivers and kernel. It's not a simple thing to do.


I'm pretty sure this is not true unless you explicitly pass a stream along with the grid and block dimensions. If you don't do this, then execution of the second kernel won't start until the first one has completed.

?? Isn’t that what I said? (except the bit about streams)

To make it more clear:

Kernel calls are asynchronous. I can call a kernel many times in a row and it will take ~0 time on the host. Up to 16 calls (even more on compute 1.1 hardware) are queued up with no need for using streams. As I said before (and you reiterated), on the device the next kernel in the queue will not even start until the previous one completes.
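A sketch of that queueing behavior; the kernel and its launch configuration are placeholders. The loop of launches returns almost instantly on the host, while the device works through the queue one kernel at a time:

```cuda
__global__ void foo(int *data) { /* placeholder body */ }

int main()
{
    int *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(int));

    // Each of these returns to the host in ~0 time; the driver queues them.
    for (int i = 0; i < 16; ++i)
        foo<<<32, 128>>>(d_data);  // on the device, each waits for the previous

    cudaThreadSynchronize();       // host blocks here until the queue drains
    cudaFree(d_data);
    return 0;
}
```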

Using streams still doesn’t allow more than one kernel to execute simultaneously on the device. The device still serializes kernel executions.
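A sketch of the point being made about streams; the kernels and configurations are illustrative. Even launched into separate streams, the two kernels execute one after the other on this generation of hardware:

```cuda
__global__ void foo1(int *data) { /* placeholder body */ }
__global__ void foo2(int *data) { /* placeholder body */ }

int main()
{
    int *d_a, *d_b;
    cudaMalloc((void **)&d_a, 256 * sizeof(int));
    cudaMalloc((void **)&d_b, 256 * sizeof(int));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    foo1<<<32, 128, 0, s0>>>(d_a);
    foo2<<<32, 128, 0, s1>>>(d_b);  // still serialized after foo1 on the device

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```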

Yes, but there already is a tiny operating system in the card firmware, handling streams, async memory copying, queued kernels, etc. We don't know how versatile that controller is, but the existing programming model fits the possibility of having multiple simultaneous kernel executions, where some multiprocessors are evaluating blocks from one kernel and other multiprocessors are evaluating blocks from another kernel.

There is a tiny bit of support for this idea in the programming guide, which says: "When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity."

Ok, that’s a tenuous hint, but it still fits. They could have just written “each multiprocessor is assigned one or more blocks from the grid.”