I have been reading the documentation and the simpleStream example files, but they seem to have left me with more questions than answers. My first question: the simpleStream example said that if I had a card with compute capability 1.1, it would be X amount faster. Is that only true for the memory copying in this example, or does all streaming on a 1.0 card run serially? My next question is about getting a kernel and a memory copy operation to happen at the same time. The documentation briefly says you can do this, but doesn’t give many specifics. Is this something that only 1.1 cards can do as well, and is it something that must be done with the driver API?
Yes, overlapping kernel execution with a memory copy is possible only on devices with compute capability 1.1 (or higher). On 1.0 devices such streams will be serialized. For the async memory functions, check the Programming Manual; the function is something like cudaMemcpyAsync (I don’t have the Programming Manual with me right now).
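For anyone looking for specifics, here is a minimal runtime API sketch of the overlap pattern described above. It is not taken from simpleStream or the manual; the kernel addValue, the sizes, and the two-stream split are made up for illustration. It assumes a reasonably current CUDA toolkit (the cudaDevAttrAsyncEngineCount query is newer than the toolkit discussed in this thread) and that device 0 is the one you want. The host buffer is page-locked (pinned) because cudaMemcpyAsync needs that to actually run asynchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds a constant to each element.
__global__ void addValue(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

int main() {
    const int nStreams = 2;            // assumption: two streams are enough to show the idea
    const int chunk = 1 << 20;         // elements handled by each stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(int);

    // Ask whether the device can copy and compute at the same time.
    // A value of 0 means copies and kernels will be serialized.
    int engines = 0;
    cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0);
    printf("async copy engines on device 0: %d\n", engines);

    int *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, n * sizeof(int));  // pinned host memory
    cudaMalloc((void **)&d, n * sizeof(int));
    for (int i = 0; i < n; ++i) h[i] = i;

    cudaStream_t streams[nStreams];
    for (int k = 0; k < nStreams; ++k) cudaStreamCreate(&streams[k]);

    // Each stream copies its chunk in, runs the kernel on it, and copies it back.
    // Work in different streams may overlap if the hardware supports it.
    for (int k = 0; k < nStreams; ++k) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[k]);
        addValue<<<(chunk + 255) / 256, 256, 0, streams[k]>>>(d + off, chunk, 1);
        cudaMemcpyAsync(h + off, d + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[k]);
    }

    // Wait for all streams before touching the results on the host.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should have been incremented by exactly one.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h[i] != i + 1) { ok = false; break; }
    }
    printf("%s\n", ok ? "results match" : "results mismatch");

    for (int k = 0; k < nStreams; ++k) cudaStreamDestroy(streams[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? 0 : 1;
}
```

Note that none of this requires the driver API. Whether the copies and the kernel really overlap depends on the device, so the code gives the same results either way and only the timing differs.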
I have been reading about the async options and have one more question. The documentation only mentions asynchronous execution, where control returns to the host processor, in the cu section. Is there a way to do this with the cuda commands that just isn't listed, or is it only possible in the driver API?
Kernel executions are always async when using the runtime API.
Oh, I didn’t realize that.
And with the driver API too, by the way.
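To make the point about asynchronous launches concrete, here is a small runtime API sketch. The kernel square and the buffer size are hypothetical, and it assumes a current toolkit where the blocking call is named cudaDeviceSynchronize (older toolkits called it cudaThreadSynchronize). The launch returns control to the host right away, and the host only waits when it explicitly synchronizes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: squares each element in place.
__global__ void square(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // The launch only queues the kernel; it does not wait for it to finish.
    square<<<(n + 255) / 256, 256>>>(d, n);

    // The host is free to do unrelated work here while the GPU runs.
    printf("launch returned, host continues\n");

    // Block until the kernel has finished, and pick up any execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel finished: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

A plain cudaMemcpy issued after the launch would also wait for the kernel to finish, which is why the asynchronous behaviour is easy to miss in simple programs.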