Kernel Queueing

How fancy is the CUDA queueing mechanism that decides how to execute copies and kernels asynchronously?

For instance, do I have to submit my copies and kernels interleaved like this:

  1. Copy on stream 0
  2. Kernel on stream 1
  3. Copy on stream 2
  4. Kernel on stream 0
  5. Copy on stream 1

Where all neighboring operations are independent? Or is there a lookahead mechanism that lets me do:

  1. Copy on stream 0
  2. Copy on stream 0
  3. Copy on stream 0
  4. Kernel on stream 0
  5. Kernel on stream 1

And have copies 1, 2, and 3 overlap with kernel 5 wherever they can?

If that is not possible, would the following work, or would kernel 4 only overlap with copy 3 (instead of with copies 1, 2, and 3)? See the sketch after the list.

  1. Copy on stream 0
  2. Copy on stream 0
  3. Copy on stream 0
  4. Kernel on stream 1
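
For concreteness, here is roughly what that last pattern looks like in actual API calls. This is just a sketch; the buffer names, sizes, and the kernel are placeholders, and the host buffers are assumed to be page-locked:

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);  // 1
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s0);  // 2
cudaMemcpyAsync(d_c, h_c, bytes, cudaMemcpyHostToDevice, s0);  // 3
myKernel<<<grid, block, 0, s1>>>(d_other);                     // 4: different stream, no dependency on 1-3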

Sorry if this is in a manual somewhere; I really didn’t look too hard :). I’m playing with simpleStreams now, but this seems like the kind of thing someone here would know.

Ben

I’m just bumping this because I’m still looking for this information.

Originally I was issuing the calls like this (the number after each call is its stream):

Async Copy 0
Async Copy 0
Async Copy 0
Async Copy 1
Async Copy 1
Async Copy 1

SGEMM 0
SGEMM 1

NOTE: SGEMM included a small async copy on the same stream right before it. I realized that may have been interfering with asynchronous execution, so I removed it in favor of having the SGEMM kernel access that data through mapped/zero-copy memory.
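
In case it helps, the mapped/zero-copy setup looks roughly like this (a sketch; the names are placeholders, and the device flag has to be set before the CUDA context is created):

cudaSetDeviceFlags(cudaDeviceMapHost);                      // before any call that creates the context

float *h_in, *d_in;
cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped);   // pinned + mapped host buffer
cudaHostGetDevicePointer((void**)&d_in, h_in, 0);           // device-visible alias of h_in

// the kernel reads h_in over PCIe through d_in, so the small pre-copy goes away
sgemm_kernel<<<grid, block, 0, stream>>>(d_in);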

After that I clumped the copies into big blocks (one per stream), so that I was basically running my code as given in the simple examples:

for (i = 0; i < STREAMS; i++)
    COPY(stream[i]);      // i.e. cudaMemcpyAsync(..., stream[i])

for (i = 0; i < STREAMS; i++)
    COMPUTE(stream[i]);   // i.e. kernel<<<grid, block, 0, stream[i]>>>(...)

So I get:

COPY 0
COPY 1
COPY 2
COPY 3
COMPUTE 0
COMPUTE 1
COMPUTE 2
COMPUTE 3

I’m still not getting much overlap, though some does exist. Any thoughts? Suggestions?

It looks to me like only the copy and compute calls that are adjacent to each other actually overlap.
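
For reference, one way to quantify the overlap is to bracket the whole batch with events in the default stream and compare against the time the copies and kernels take when run serially (a rough sketch; the buffers, streams, and kernel are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                                  // stream 0 = the default stream
for (int i = 0; i < STREAMS; i++)
    cudaMemcpyAsync(d_buf[i], h_buf[i], bytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < STREAMS; i++)
    compute<<<grid, block, 0, stream[i]>>>(d_buf[i]);
cudaEventRecord(stop, 0);                                   // won't complete until all streams drain
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);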

Ben

Are you using page-locked host memory (cudaMallocHost)?
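
Async copies can only overlap with kernels when the host side is page-locked; with pageable memory they effectively run synchronously. A minimal sketch (names are placeholders):

float *h_buf;
cudaMallocHost((void**)&h_buf, bytes);                      // page-locked (pinned) host allocation
// ... fill h_buf ...
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// ... later, after the stream has been synchronized ...
cudaFreeHost(h_buf);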

N.

How much memory do you copy per call? Streams have some overhead, so you have to transfer a certain minimum amount of data before you see a performance improvement. On my platform (GTX 285 on an nForce 780i) it’s about 32 KB with 2 streams…
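
A quick way to find that break-even point on your own system is to sweep the per-stream chunk size and time each run, assuming the same event/stream/buffer setup as in the timing sketch above (a rough sketch; the sizes are arbitrary):

for (size_t chunk = 4 << 10; chunk <= 4 << 20; chunk *= 2) {   // 4 KB up to 4 MB per stream
    cudaEventRecord(start, 0);
    for (int i = 0; i < STREAMS; i++)
        cudaMemcpyAsync(d_buf[i], h_buf[i], chunk, cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < STREAMS; i++)
        compute<<<grid, block, 0, stream[i]>>>(d_buf[i]);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%6zu KB per stream: %.3f ms\n", chunk >> 10, ms);
}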

Nico: Yes, and everything appears to be working fine there (the copies didn’t work at all until I used cudaMallocHost).

Tobi_w: I’m approaching a megabyte for even the small calls. I’ll probably try increasing this some more just to see what happens.

Edit: Thanks for the responses. I appreciate them.

Ben

What type of asynchronous copies are you using? Are they all cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost, or are there also cudaMemcpyDeviceToDevice copies?

N.

They’re all host->device (though the device->host copies will be added soon enough).

I increased my problem sizes and I’m getting better overlap now.

It appears that running in any of these configurations produces similar overlap (thankfully) [first column is the name, second column is the stream]:
COPY 0
COPY 1
COPY 2

COMPUTE 0
COMPUTE 1
COMPUTE 2


COPY 0
COMPUTE 0
COPY 1
COMPUTE 1
COPY 2
COMPUTE 2


Small COPY 0
Small COPY 0
Small COPY 0
Small COPY 1
Small COPY 1
Small COPY 1
Small COPY 2
Small COPY 2
Small COPY 2
COMPUTE 0
COMPUTE 1
COMPUTE 2

Thanks for the help all.

Ben

That looks about right. The reason I asked about the memcpy type was that async cudaMemcpyDeviceToDevice calls cannot overlap with kernel execution.

N.

Ah, good to know.

Thanks,
Ben