Parallel execution of multiple kernels possible?

Hi,

this is a newbie question, but I found no definite answer in the CUDA Programming Guide. Is it possible to execute multiple different kernels in parallel by using a different stream for each kernel? The guide says that memcpy and kernel execution in different streams can run concurrently, which seems to imply that only one kernel can be active at a time.

Is this an architectural limitation, and is it possible that this will change with the next GPU generations?

Streams do not currently allow two kernels to execute at the same time. People have requested this feature in the past, but there has been no word on whether it is feasible with current hardware.
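
For reference, here is a minimal sketch of what the question describes: two different kernels, each launched into its own stream. The kernel names (scaleKernel, offsetKernel), the buffer size, and the launch configuration are made up for illustration. The code is valid either way, but putting kernels in separate streams only permits overlap; it does not guarantee it. On hardware that cannot run kernels concurrently, the two launches simply execute one after the other. The concurrentKernels field of cudaDeviceProp is an assumption about your toolkit version: it exists in later CUDA releases, so drop that check if your headers do not have it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel 1: multiply every element by a factor.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical kernel 2: add an offset to every element.
__global__ void offsetKernel(float *data, int n, float offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offset;
}

int main()
{
    const int n = 1 << 20;               // illustrative size
    const size_t bytes = n * sizeof(float);

    // Assumption: the toolkit exposes cudaDeviceProp::concurrentKernels.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device reports concurrent kernel support: %d\n",
           prop.concurrentKernels);

    // Two independent buffers so the kernels have no data dependency.
    float *a = NULL, *b = NULL;
    cudaMalloc((void **)&a, bytes);
    cudaMalloc((void **)&b, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The fourth launch parameter selects the stream. Work in different
    // streams may overlap if the hardware allows it; otherwise the
    // kernels run back to back.
    scaleKernel<<<blocks, threads, 0, s1>>>(a, n, 2.0f);
    offsetKernel<<<blocks, threads, 0, s2>>>(b, n, 1.0f);

    // Wait for both streams to finish and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```

To see whether the two kernels actually overlapped on a given card, look at the timeline in a profiler rather than relying on the program output, since the results are the same in both cases.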

Certainly, I think it will become desirable for certain use cases to partition the card as the number of multiprocessors increases.