The programming guide clearly states that all threads in a block have to execute the same code (not necessarily simultaneously). Is it possible to run multiple kernels simultaneously, one on each multiprocessor? The guide is not very specific about this.
For example, if I issue two kernels of one block each across two different streams, can they run on two multiprocessors simultaneously, or is it that the second one will stall until the first one is done, even though each kernel requires just one multiprocessor?
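One way to check this empirically is to time one single-block kernel, then two of them issued in separate streams. This is only a minimal sketch: the `spin` kernel, the `time_launches` helper, and the cycle count are made up for illustration, and `clock64()` assumes a device of compute capability 2.0 or later.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: each thread spins for about `cycles` clock ticks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy wait */ }
}

// Launch `count` single-block kernels, each in its own stream,
// and return the host wall-clock time in milliseconds.
double time_launches(int count, long long cycles)
{
    std::vector<cudaStream_t> streams(count);
    for (int i = 0; i < count; ++i) cudaStreamCreate(&streams[i]);

    cudaDeviceSynchronize();                       // start from an idle device
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i)
        spin<<<1, 32, 0, streams[i]>>>(cycles);    // one block per kernel
    cudaDeviceSynchronize();                       // wait for every stream
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < count; ++i) cudaStreamDestroy(streams[i]);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const long long cycles = 100000000LL;          // arbitrary; long enough to dwarf launch overhead
    time_launches(1, cycles);                      // warm-up, result discarded
    double one = time_launches(1, cycles);
    double two = time_launches(2, cycles);
    printf("one kernel: %.1f ms, two kernels in two streams: %.1f ms\n", one, two);
    return 0;
}
```

If the two-kernel time is close to the one-kernel time, the kernels probably overlapped; if it is roughly double, the second one waited for the first.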
Ok. Section 4.5.2.4 made it seem like it might be possible to issue multiple kernels. Is it just that a kernel from one stream can be overlapped with memory transfers from other streams?
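For reference, the kernel/transfer overlap case would look roughly like the sketch below. The `scale` kernel, the `run_overlap` helper, and the buffer names are invented for the example. Note that the host buffer is allocated with `cudaMallocHost`, since asynchronous copies need page-locked host memory to overlap with anything. Whether the copy and the kernel actually overlap still depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element by `factor`.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Issue a kernel in one stream and a host-to-device copy in another.
// The two operations touch different device buffers, so they are independent.
cudaError_t run_overlap(int n)
{
    const size_t bytes = n * sizeof(float);
    float *h_in = NULL, *d_work = NULL, *d_next = NULL;
    cudaMallocHost(&h_in, bytes);                  // page-locked host memory
    cudaMalloc(&d_work, bytes);
    cudaMalloc(&d_next, bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    cudaMemset(d_work, 0, bytes);

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    scale<<<(n + 255) / 256, 256, 0, compute>>>(d_work, n, 2.0f);
    cudaMemcpyAsync(d_next, h_in, bytes, cudaMemcpyHostToDevice, copy);

    cudaError_t status = cudaDeviceSynchronize();  // wait for both streams

    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFree(d_work);
    cudaFree(d_next);
    cudaFreeHost(h_in);
    return status;
}

int main()
{
    cudaError_t status = run_overlap(1 << 20);
    printf("status: %s\n", cudaGetErrorString(status));
    return 0;
}
```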
I’ll try doing some tests. Do you know if any of the newer chips might support this?
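As a starting point for those tests, the runtime can be asked what the installed device reports. This sketch assumes a CUDA runtime recent enough that `cudaDeviceProp` has the `concurrentKernels` and `asyncEngineCount` fields; older toolkits may not have them. The `count_concurrent_devices` helper is a made-up name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print what each device reports and return how many of them
// claim support for concurrent kernel execution.
int count_concurrent_devices()
{
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess) return 0;

    int supported = 0;
    for (int dev = 0; dev < device_count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("%s (compute %d.%d, %d multiprocessors)\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        printf("  concurrentKernels: %d\n", prop.concurrentKernels);
        printf("  asyncEngineCount:  %d\n", prop.asyncEngineCount);
        if (prop.concurrentKernels) ++supported;
    }
    return supported;
}

int main()
{
    printf("devices reporting concurrent kernels: %d\n", count_concurrent_devices());
    return 0;
}
```

A nonzero `concurrentKernels` only says the device can run kernels from different streams at the same time; it does not guarantee that any particular pair will overlap.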