Async memcpy/kernel: overlapping operations from different contexts

Hi,

I want to write a multi-threaded program where each host thread is attached to a different CUDA context. In each context I want to use one or more streams.
The question is whether CUDA overlaps kernel execution in stream1 (which exists in context1) with async memory copies in stream2 (context2).

So, for example, if I have to synchronize stream2 and block its host thread until the memory copy has completed, would the other host thread be able to start a kernel in stream1? And would I gain additional performance from async operations, or does this only work for streams within the same CUDA context?

I am using CUDA 2.0 on a GTX 280.

Thanks in advance.

That’s a good question. Anyone have an answer?

CUDA streams execute concurrently with each other, so long as no CUDA operations are being executed on stream 0 (i.e. operations not associated with a stream).

So in your case, assuming the memory copy is between page-locked HOST memory and DEVICE memory - then yes, your kernel on stream A should execute concurrently with your memory copy on stream B - as long as you don’t have another thread/context somewhere executing something on stream 0, and stream A != stream B.

Note: your memory copy on stream B will have to finish before your kernel on stream A can begin processing (or vice versa) IF your memory copy is a device<->device or pageable<->device transfer.

Some older cards (compute capability 1.0) don’t support concurrent page-locked->device copying while executing a kernel; however, as far as I’m aware, all compute 1.1+ cards support this feature.

Edit: I misread the programming guide - and updated my statement above…
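To make the concurrency conditions above concrete, here’s a minimal single-context sketch (runtime API): the host buffer is allocated page-locked with cudaMallocHost so the copy on stream B is a true DMA transfer, and nothing is queued on stream 0. The kernel name and sizes are placeholders, not from this thread.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel - stands in for whatever work runs on stream A.
__global__ void myKernel(float *out, int n) { /* ... */ }

int main() {
    const int N = 1 << 20;
    float *h_buf, *d_in, *d_out;
    cudaMallocHost((void**)&h_buf, N * sizeof(float)); // page-locked: required for async copy
    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // On capable hardware the copy on stream B and the kernel on stream A
    // can overlap, provided nothing runs on stream 0 in the meantime.
    cudaMemcpyAsync(d_in, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, streamB);
    myKernel<<<N / 256, 256, 0, streamA>>>(d_out, N);

    cudaStreamSynchronize(streamB); // blocks only until the copy finishes
    cudaStreamSynchronize(streamA);
    return 0;
}
```

If h_buf were allocated with plain malloc (pageable memory), the copy would serialize with the kernel instead of overlapping it.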

You misread the question

I don’t see how you got that impression.

He said, "if Cuda overlaps the kernel executions from stream1 (which should exist in context1) with async. memory copies in stream2 (context2)", and then asked whether, if he synchronized stream2 (thus waiting for the memory copy), the other thread would be able to start a kernel on stream1 without having to wait for stream2’s memory copy to complete.

To which my previous reply answers in full. If he’s doing a page-locked<->device memory transfer (eg: DMA transfer), yes, he can start executing a kernel on stream1 before stream2’s memory copy completes.

The fact that he’s got one context on each host thread is irrelevant, so long as a) he’s doing a DMA transfer, b) his card supports concurrent execution and DMA transfers, and c) the kernel he’s about to execute doesn’t rely on the memory being copied on stream2 (and even in that case the kernel would run, but would likely have issues) - then:

“would the other host thread be able to start a kernel in stream1?” -> Yes

&

“And would I gain some additional performance from async. operations, or does this only work for streams within the same CUDA context?” -> Yes (you’re essentially amortizing the time it takes to do the DMA transfer, with little/no performance impact on your executing kernel).

Edit: Removed stupid smileys

Second Edit: Refer to section 4.5.1.5 of the programming guide; there are additional limitations (e.g. this doesn’t work for DMA transfers to CUDA arrays or to aligned device memory, e.g. cuMemAllocPitch).
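For reference, the setup being discussed - one context per host thread, each with its own stream - looks roughly like this with the driver API. The pthread usage and all names are illustrative; error checking is omitted.

```cuda
#include <cuda.h>
#include <pthread.h>

static CUdevice dev;

void *worker(void *arg) {
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // context is current to this thread only

    CUstream stream;
    cuStreamCreate(&stream, 0);  // streams belong to the current context

    // ... queue kernel launches / cuMemcpyHtoDAsync on `stream` here.
    // cuStreamSynchronize(stream) blocks only THIS host thread, so the
    // other thread can still launch a kernel in its own context.
    cuStreamSynchronize(stream);

    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}

int main() {
    cuInit(0);
    cuDeviceGet(&dev, 0);
    pthread_t t1, t2;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```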

You can have overlap between different contexts.

Good to know.

Of course one would expect this behavior, but I’ve learned not to assume anything with CUDA. There are a million reasons why cross-context overlap might not have been implemented.

Another question: Do both contexts have to use streams explicitly, or will stream 0 from context 0 overlap stream 0 from context 1?

Thank you all for your detailed answers.

Good question.

Since streams are per-context, there would be a default stream 0 in each context, and thus async operations (with 0 passed as the stream argument?) should overlap.

But this is just a guess. Maybe someone else has a more well-founded answer?

Considering the programming guide explicitly says operations will not overlap while any operation is running on stream 0 - and doesn’t say this is on a per-context basis (hence it implies cross-context streams may overlap) - I’d assume stream 0 is somewhat ‘global’ (i.e. there’s not a stream 0 for EACH context, but rather the same stream 0 for all contexts - which would also help enforce the synchronous behavior implied by using stream 0).

Although as Alex said, assumptions aren’t the best things to put your money on - so all we can really do is wait for nVidia to answer this one.

Edit: Of course, you could just write up a test case to prove/disprove it without nVidia’s help…
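A test case along those lines could look roughly like this (runtime API - in CUDA 2.x each host thread that touches the runtime implicitly gets its own context, so both threads below use stream 0 of different contexts). Kernel body, sizes, and timing are left as placeholders; if the combined wall-clock time is close to max(kernel time, copy time) rather than their sum, the two stream 0s overlapped.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

// Placeholder: a long-running kernel to give the copy time to overlap.
__global__ void busyKernel(float *p) { /* ... */ }

void *launcher(void *arg) {
    float *d;
    cudaMalloc((void**)&d, 1 << 20);
    busyKernel<<<64, 256>>>(d);   // stream 0 of THIS thread's context
    cudaThreadSynchronize();
    return 0;
}

void *copier(void *arg) {
    float *h, *d;
    cudaMallocHost((void**)&h, 1 << 24); // page-locked, so the copy can be DMA
    cudaMalloc((void**)&d, 1 << 24);
    cudaMemcpy(d, h, 1 << 24, cudaMemcpyHostToDevice); // stream 0, other context
    return 0;
}

int main() {
    // Wrap this section with your favorite wall-clock timer and compare
    // against running launcher() and copier() back-to-back.
    pthread_t t1, t2;
    pthread_create(&t1, 0, launcher, 0);
    pthread_create(&t2, 0, copier, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```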

Stream 0 is per-context, or so I’m told.