How fancy is the CUDA queueing mechanism that decides how to execute copies and kernels asynchronously?
For instance, do I have to submit my kernels like:
1. Copy on stream 0
2. Kernel on stream 1
3. Copy on stream 2
4. Kernel on stream 0
5. Copy on stream 1
Where all neighboring operations are independent? Or is there a lookahead mechanism that lets me do:
1. Copy on stream 0
2. Copy on stream 0
3. Copy on stream 0
4. Kernel on stream 0
5. Kernel on stream 1
And have 1, 2, and 3 overlap with 5 where they can?
If that is not possible, would this work? Or would 4 only overlap with 3 (instead of with 1, 2, and 3)?
1. Copy on stream 0
2. Copy on stream 0
3. Copy on stream 0
4. Kernel on stream 1
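In runtime-API terms, that last ordering would look something like this (the kernel, buffer names, and sizes below are just placeholders, not my actual code):

```cuda
// Sketch of the issue order above: three async copies on stream 0,
// followed by an independent kernel on stream 1.
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n)        // placeholder for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int    n     = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory is required for cudaMemcpyAsync to actually run asynchronously.
    float *h_a, *h_b, *h_c;
    cudaMallocHost((void **)&h_a, bytes);
    cudaMallocHost((void **)&h_b, bytes);
    cudaMallocHost((void **)&h_c, bytes);

    float *d_a, *d_b, *d_c, *d_other;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMalloc((void **)&d_other, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);   // 1: copy, stream 0
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s0);   // 2: copy, stream 0
    cudaMemcpyAsync(d_c, h_c, bytes, cudaMemcpyHostToDevice, s0);   // 3: copy, stream 0
    dummyKernel<<<(n + 255) / 256, 256, 0, s1>>>(d_other, n);       // 4: kernel, stream 1

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);  cudaFree(d_other);
    cudaFreeHost(h_a);  cudaFreeHost(h_b);  cudaFreeHost(h_c);
    return 0;
}
```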
Sorry if this is in a manual somewhere; I really didn’t look too hard :). I’m playing with simpleStreams now, but this seems like something someone would know.
NOTE: SGEMM included a small async copy on the same stream right before it. I realized that may have been screwing up asynchronous execution, so I dropped it in favor of having the SGEMM kernel read through mapped/zero-copy memory.
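For reference, the mapped/zero-copy setup is along these lines (the kernel and buffers below are placeholders, not my actual SGEMM code):

```cuda
// Sketch of mapped/zero-copy memory: the kernel reads pinned host memory
// directly over PCIe instead of going through a separate cudaMemcpyAsync.
#include <cuda_runtime.h>

__global__ void readThroughKernel(const float *in, float *out, int n)  // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];              // 'in' is a device-side alias of host memory
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must be set before the context is created

    const int    n     = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in, *d_in, *d_out;
    cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer((void **)&d_in, h_in, 0);          // device alias of h_in
    cudaMalloc((void **)&d_out, bytes);

    readThroughKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n); // no explicit H2D copy
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_in);
    return 0;
}
```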
After that I clumped the copies into big blocks, so that I was basically running my code as given in the simple examples.
How much memory do you copy per call? Streams have some overhead, so you have to transfer a certain minimum amount of data per copy to see a performance improvement. On my platform (GTX 285 on an nForce 780i) it’s about 32 KB with 2 streams…
What type of asynchronous copies are you using? Are they all cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost, or are there also cudaMemcpyDeviceToDevice copies?
They’re all host->device (though the device->host copies will be added soon enough).
I increased my problem sizes and I’m getting better overlap now.
It appears that running in any of these configurations produces similar overlap, thankfully (first column is the operation, second column is the stream index):
COPY 0
COPY 1
COPY 2
…
COMPUTE 0
COMPUTE 1
COMPUTE 2
…
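In code, that issue order is basically the following (again, the kernel, buffer names, and sizes below are placeholders rather than my actual code):

```cuda
// Sketch of the issue order above: all copies first (one per stream),
// then all kernels (one per stream).
#include <cuda_runtime.h>

#define NSTREAMS 4

__global__ void computeKernel(float *d, int n)       // placeholder for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int    n     = 1 << 20;                    // elements per stream
    const size_t bytes = n * sizeof(float);

    cudaStream_t streams[NSTREAMS];
    float *h_buf[NSTREAMS], *d_buf[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMallocHost((void **)&h_buf[i], bytes);   // pinned host buffer per stream
        cudaMalloc((void **)&d_buf[i], bytes);
    }

    // COPY 0, COPY 1, COPY 2, ...
    for (int i = 0; i < NSTREAMS; ++i)
        cudaMemcpyAsync(d_buf[i], h_buf[i], bytes,
                        cudaMemcpyHostToDevice, streams[i]);

    // COMPUTE 0, COMPUTE 1, COMPUTE 2, ...
    for (int i = 0; i < NSTREAMS; ++i)
        computeKernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], n);

    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFreeHost(h_buf[i]);
        cudaFree(d_buf[i]);
    }
    return 0;
}
```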