Async memcpy/kernel: overlapping operations from different contexts

Hi,

I want to write a multi-threaded program where each host thread is attached to a different CUDA context. In each context I want to use one or more streams.
The question is whether CUDA overlaps kernel execution in stream1 (which exists in context1) with async memory copies in stream2 (context2).

So, for example, if I have to synchronize stream2 and block its host thread until the memory copy has completed, would the other host thread be able to start a kernel in stream1? And would I gain additional performance from async operations, or does this only work for streams within the same CUDA context?

I am using CUDA 2.0 on a GTX 280.

Thanks in advance.

That’s a good question. Anyone have an answer?

CUDA streams execute concurrently with each other, so long as no CUDA operations are being executed on stream 0 (i.e. operations not associated with a stream).

So in your case, assuming the memory copy is between page-locked HOST memory and DEVICE memory - then yes, your kernel on stream A should execute concurrently with your memory copy on stream B - as long as you don’t have another thread/context somewhere executing something on stream 0, and stream A != stream B.

Note: your memory copy on stream B will have to finish before your kernel on stream A can begin processing (or vice versa) IF your memory copy is a device<->device or pageable<->device transfer.

Some older cards (compute capability 1.0) don’t support concurrent page-locked->device copying while executing a kernel; however, as far as I’m aware, all compute 1.1+ cards support this feature.

Edit: I misread the programming guide - and updated my statement above…
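To make the concurrency conditions above concrete, here’s a minimal single-context sketch (runtime API): the host buffer is allocated page-locked with cudaMallocHost so the copy on stream B is a true DMA transfer, and nothing is queued on stream 0. The kernel name and sizes are placeholders, not from this thread.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel - stands in for whatever work runs on stream A.
__global__ void myKernel(float *out, int n) { /* ... */ }

int main() {
    const int N = 1 << 20;
    float *h_buf, *d_in, *d_out;
    cudaMallocHost((void**)&h_buf, N * sizeof(float)); // page-locked: required for async copy
    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // On capable hardware the copy on stream B and the kernel on stream A
    // can overlap, provided nothing runs on stream 0 in the meantime.
    cudaMemcpyAsync(d_in, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, streamB);
    myKernel<<<N / 256, 256, 0, streamA>>>(d_out, N);

    cudaStreamSynchronize(streamB); // blocks only until the copy finishes
    cudaStreamSynchronize(streamA);
    return 0;
}
```

If h_buf were allocated with plain malloc (pageable memory), the copy would serialize with the kernel instead of overlapping it.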

You misread the question

I don’t see how you got that impression.

He said, "if Cuda overlaps the kernel executions from stream1 (which should exist in context1) with async. memory copies in stream2 (context2)", and then asked whether, if he synchronized stream2 (thus waiting for the memory copy), the other thread would be able to start a kernel on stream1 without having to wait for stream2’s memory copy to complete.

To which my previous reply answers in full. If he’s doing a page-locked<->device memory transfer (eg: DMA transfer), yes, he can start executing a kernel on stream1 before stream2’s memory copy completes.

The fact that he’s got one context on each host thread is irrelevant, so long as a) he’s doing a DMA transfer, b) his card supports concurrent execution and DMA transfers, and c) the kernel he’s about to execute doesn’t rely on the memory being copied on stream2 (and even in that case the kernel would run, but would likely have issues) - then:

“would the other host thread be able to start a kernel in stream1?” -> Yes

&

“And would I gain some additional performance from async. operations, or does this only work for streams within the same CUDA context?” -> Yes (you’re essentially amortizing the time it takes to do the DMA transfer, with little/no performance impact on your executing kernel).

Edit: Removed stupid smileys

Second Edit: Refer to section 4.5.1.5 of the programming guide; there are additional limitations (e.g. this doesn’t work for DMA transfers to CUDA arrays or to aligned device memory, e.g. cuMemAllocPitch).
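For reference, the setup being discussed - one context per host thread, each with its own stream - looks roughly like this with the driver API. The pthread usage and all names are illustrative; error checking is omitted.

```cuda
#include <cuda.h>
#include <pthread.h>

static CUdevice dev;

void *worker(void *arg) {
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // context is current to this thread only

    CUstream stream;
    cuStreamCreate(&stream, 0);  // streams belong to the current context

    // ... queue kernel launches / cuMemcpyHtoDAsync on `stream` here.
    // cuStreamSynchronize(stream) blocks only THIS host thread, so the
    // other thread can still launch a kernel in its own context.
    cuStreamSynchronize(stream);

    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}

int main() {
    cuInit(0);
    cuDeviceGet(&dev, 0);
    pthread_t t1, t2;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```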

You can have overlap between different contexts.

Good to know.

Of course one would expect this behavior, but I’ve learned not to assume anything with CUDA. There are a million reasons why cross-context overlap might not have been implemented.

Another question: Do both contexts have to use streams explicitly, or will stream 0 from context 0 overlap stream 0 from context 1?

Thank you all for your detailed answers.

Good question.

Since streams are per-context, there would be a default stream 0 in each context, and thus async operations (with 0 passed as the stream argument?) should overlap.

But this is just a guess. Maybe someone else has a more well-founded answer?

Considering the programming guide explicitly says operations will not overlap while any operation is running on stream 0 - and doesn’t say this is on a per-context basis (hence it implies cross-context streams may overlap) - I’d assume stream 0 is somewhat ‘global’ (i.e. there’s not a stream 0 for EACH context, but rather the same stream 0 for all contexts - which would also help enforce the synchronous behavior implied by using stream 0).

Although as Alex said, assumptions aren’t the best things to put your money on - so all we can really do is wait for nVidia to answer this one.

Edit: Of course, you could just write up a test case to prove/disprove it without nVidia’s help…
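A test case along those lines could look roughly like this (runtime API - in CUDA 2.x each host thread that touches the runtime implicitly gets its own context, so both threads below use stream 0 of different contexts). Kernel body, sizes, and timing are left as placeholders; if the combined wall-clock time is close to max(kernel time, copy time) rather than their sum, the two stream 0s overlapped.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

// Placeholder: a long-running kernel to give the copy time to overlap.
__global__ void busyKernel(float *p) { /* ... */ }

void *launcher(void *arg) {
    float *d;
    cudaMalloc((void**)&d, 1 << 20);
    busyKernel<<<64, 256>>>(d);   // stream 0 of THIS thread's context
    cudaThreadSynchronize();
    return 0;
}

void *copier(void *arg) {
    float *h, *d;
    cudaMallocHost((void**)&h, 1 << 24); // page-locked, so the copy can be DMA
    cudaMalloc((void**)&d, 1 << 24);
    cudaMemcpy(d, h, 1 << 24, cudaMemcpyHostToDevice); // stream 0, other context
    return 0;
}

int main() {
    // Wrap this section with your favorite wall-clock timer and compare
    // against running launcher() and copier() back-to-back.
    pthread_t t1, t2;
    pthread_create(&t1, 0, launcher, 0);
    pthread_create(&t2, 0, copier, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```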

Stream 0 is per-context, or so I’m told.