Kernel Queueing

How fancy is the CUDA queueing mechanism that decides how to execute copies and kernels asynchronously?

For instance, do I have to submit my copies and kernels interleaved like this:

  1. Copy on stream 0
  2. Kernel on stream 1
  3. Copy on stream 2
  4. Kernel on stream 0
  5. Copy on stream 1

Where all neighboring operations are independent? Or is there a lookahead mechanism that lets me do:

  1. Copy on stream 0
  2. Copy on stream 0
  3. Copy on stream 0
  4. Kernel on stream 0
  5. Kernel on stream 1

And have copies 1, 2, and 3 overlap with kernel 5 wherever they can?

If that is not possible, would the following work, or would kernel 4 only overlap with copy 3 (instead of with copies 1, 2, and 3)? See the sketch after the list.

  1. Copy on stream 0
  2. Copy on stream 0
  3. Copy on stream 0
  4. Kernel on stream 1
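
For concreteness, here is roughly what that last pattern looks like in actual API calls. This is just a sketch; the buffer names, sizes, and the kernel are placeholders, and the host buffers are assumed to be page-locked:

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);  // 1
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s0);  // 2
cudaMemcpyAsync(d_c, h_c, bytes, cudaMemcpyHostToDevice, s0);  // 3
myKernel<<<grid, block, 0, s1>>>(d_other);                     // 4: different stream, no dependency on 1-3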

Sorry if this is in a manual somewhere; I really didn’t look too hard :). I’m playing with simpleStreams now, but this seems like the kind of thing someone here would know.

Ben

I’m just bumping this because I’m still looking for this information.

Originally I was issuing the calls like this (the number after each call is its stream):

Async Copy 0
Async Copy 0
Async Copy 0
Async Copy 1
Async Copy 1
Async Copy 1

SGEMM 0
SGEMM 1

NOTE: SGEMM included a small async copy on the same stream right before it. I realized that may have been interfering with asynchronous execution, so I removed it in favor of having the SGEMM kernel access that data through mapped/zero-copy memory.
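
In case it helps, the mapped/zero-copy setup looks roughly like this (a sketch; the names are placeholders, and the device flag has to be set before the CUDA context is created):

cudaSetDeviceFlags(cudaDeviceMapHost);                      // before any call that creates the context

float *h_in, *d_in;
cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped);   // pinned + mapped host buffer
cudaHostGetDevicePointer((void**)&d_in, h_in, 0);           // device-visible alias of h_in

// the kernel reads h_in over PCIe through d_in, so the small pre-copy goes away
sgemm_kernel<<<grid, block, 0, stream>>>(d_in);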

After that I clumped the copies into big blocks (one per stream), so that I was basically running my code as given in the simple examples:

for (i = 0; i < STREAMS; i++)
    COPY(stream[i]);      // i.e. cudaMemcpyAsync(..., stream[i])

for (i = 0; i < STREAMS; i++)
    COMPUTE(stream[i]);   // i.e. kernel<<<grid, block, 0, stream[i]>>>(...)

So I get:

COPY 0
COPY 1
COPY 2
COPY 3
COMPUTE 0
COMPUTE 1
COMPUTE 2
COMPUTE 3

I’m still not getting much overlap, though some does exist. Any thoughts? Suggestions?

It looks to me like only the copy and compute calls that are adjacent to each other actually overlap.
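
For reference, one way to quantify the overlap is to bracket the whole batch with events in the default stream and compare against the time the copies and kernels take when run serially (a rough sketch; the buffers, streams, and kernel are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                                  // stream 0 = the default stream
for (int i = 0; i < STREAMS; i++)
    cudaMemcpyAsync(d_buf[i], h_buf[i], bytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < STREAMS; i++)
    compute<<<grid, block, 0, stream[i]>>>(d_buf[i]);
cudaEventRecord(stop, 0);                                   // won't complete until all streams drain
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);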

Ben

Are you using page-locked host memory (cudaMallocHost)?
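
Async copies can only overlap with kernels when the host side is page-locked; with pageable memory they effectively run synchronously. A minimal sketch (names are placeholders):

float *h_buf;
cudaMallocHost((void**)&h_buf, bytes);                      // page-locked (pinned) host allocation
// ... fill h_buf ...
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// ... later, after the stream has been synchronized ...
cudaFreeHost(h_buf);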

N.

How much memory do you copy per call? Streams have some overhead, so you have to transfer a certain minimum amount of data before you see a performance improvement. On my platform (GTX 285 on an nForce 780i) it’s about 32 KB with 2 streams…
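
A quick way to find that break-even point on your own system is to sweep the per-stream chunk size and time each run, assuming the same event/stream/buffer setup as in the timing sketch above (a rough sketch; the sizes are arbitrary):

for (size_t chunk = 4 << 10; chunk <= 4 << 20; chunk *= 2) {   // 4 KB up to 4 MB per stream
    cudaEventRecord(start, 0);
    for (int i = 0; i < STREAMS; i++)
        cudaMemcpyAsync(d_buf[i], h_buf[i], chunk, cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < STREAMS; i++)
        compute<<<grid, block, 0, stream[i]>>>(d_buf[i]);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%6zu KB per stream: %.3f ms\n", chunk >> 10, ms);
}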

Nico: Yes, and everything appears to be working fine there (the copies didn’t work at all until I used cudaMallocHost).

Tobi_w: I’m approaching a megabyte for even the small calls. I’ll probably try increasing this some more just to see what happens.

Edit: Thanks for the responses. I appreciate them.

Ben

What type of asynchronous copies are you using? Are they all cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost, or are there also cudaMemcpyDeviceToDevice copies?

N.

They’re all host->device (though the device->host copies will be added soon enough).

I increased my problem sizes and I’m getting better overlap now.

It appears that running in any of these configurations produces similar overlap (thankfully) [first column is the name, second column is the stream]:
COPY 0
COPY 1
COPY 2

COMPUTE 0
COMPUTE 1
COMPUTE 2


COPY 0
COMPUTE 0
COPY 1
COMPUTE 1
COPY 2
COMPUTE 2


Small COPY 0
Small COPY 0
Small COPY 0
Small COPY 1
Small COPY 1
Small COPY 1
Small COPY 2
Small COPY 2
Small COPY 2
COMPUTE 0
COMPUTE 1
COMPUTE 2

Thanks for the help all.

Ben

That looks about right. The reason I asked about the memcpy type was that async cudaMemcpyDeviceToDevice calls cannot overlap with kernel execution.

N.

Ah, good to know.

Thanks,
Ben