I want to write a program that receives several rather small data packages, where the same kernel should be executed on each data item.
Now the question is whether I should copy the data to the device and then execute the kernel consecutively, i.e.
copy data1
execute kernel on data1
copy data2
execute kernel on data2
…
or if it might be more efficient to copy a bunch of data to the device (perhaps in parallel?) and then invoke the kernel on all data items, i.e.
copy data1
copy data2
…
execute kernel on data1
execute kernel on data2
…
The memory copies should be performed asynchronously in both cases.
IMHO the latter case could save the time spent loading the same kernel onto the device in each run (at least in theory). Does this increase performance in practice? Or is CUDA smart enough to reorder the first approach in an efficient way?
Run the first approach with each separate set of dependent calls in its own stream. CUDA is then free to reorder and interleave them (assuming you use pinned host memory and cudaMemcpy*Async()).
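Not from the original reply, but a minimal sketch of what this looks like with the runtime API. The kernel process, the package count N, and the sizes are placeholder assumptions, not anything specified in the thread:

#include <cuda_runtime.h>

#define N     4                    // number of data packages (assumption)
#define ELEMS (1 << 18)            // elements per package (assumption)

__global__ void process(float *data)    // placeholder kernel (assumption)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    float *h_data[N], *d_data[N];
    cudaStream_t stream[N];

    for (int i = 0; i < N; ++i) {
        cudaMallocHost(&h_data[i], ELEMS * sizeof(float));  // pinned host memory, needed for true async copies
        cudaMalloc(&d_data[i], ELEMS * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    // Each dependent copy/kernel/copy sequence goes into its own stream,
    // so the hardware is free to overlap, e.g., the copy of data2 with
    // the kernel running on data1.
    for (int i = 0; i < N; ++i) {
        cudaMemcpyAsync(d_data[i], h_data[i], ELEMS * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        process<<<ELEMS / 256, 256, 0, stream[i]>>>(d_data[i]);
        cudaMemcpyAsync(h_data[i], d_data[i], ELEMS * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    cudaDeviceSynchronize();       // wait until every stream has drained
    return 0;
}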
But could you give some reference on how CUDA reorders these commands?
Because there is something I still do not understand concerning streams.
The programming guide states that all operations of one stream are executed in order.
So for example the following operations are in stream1:
1: cuMemcpyHtoDAsync(data1, stream1)
2: execute kernel on data1 (stream1)
3: cuMemcpyDtoHAsync(data1, stream1)
But I want the cuMemcpyHtoDAsync(data2, stream1) in step 4 to run concurrently with the kernel execution in step 2.
Can I rely on CUDA to do this automatically (because I am using the *Async operations), or do I have to assign the memory copies and the kernel executions to different streams so that they can overlap (since operations in one stream are executed in order)?
Of course, for the second case with two separate streams I somehow have to notify the kernel when the data is ready to be processed, and on the other hand be notified when the kernel is finished and the data can be copied back to the host. Could this be done using events?
Or is the second approach complete nonsense, and should I just put dependent operations in one stream and let CUDA do the optimizations?
CUDA will not reorder any commands that are in the same stream, so you won't get any optimization there. To get what you want, you will need to put steps 1, 2, 3 in one stream and the corresponding steps 4, 5, 6 for data2 (copy, kernel, copy back) in another stream.
As you already said, all commands within each independent stream are processed in order, so there is no need for events to signal when a kernel may execute. As long as steps 4, 5, 6 are all in the same stream, CUDA guarantees that they will be executed in order, one after another.
You only need to use events and/or other synchronization to query when the last step (the device->host async memcpy) is complete so that you know the data is good to read on the host.
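For illustration, a hedged sketch of the two-stream version with one event per stream; again process, the buffer names, and the sizes are placeholders, not anything from the thread:

#include <cuda_runtime.h>

__global__ void process(float *data) { /* placeholder kernel (assumption) */ }

int main()
{
    const size_t bytes = 1 << 20;       // package size (assumption)
    float *h_data1, *h_data2, *d_data1, *d_data2;
    cudaMallocHost(&h_data1, bytes);    // pinned host buffers
    cudaMallocHost(&h_data2, bytes);
    cudaMalloc(&d_data1, bytes);
    cudaMalloc(&d_data2, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t done1, done2;
    cudaEventCreate(&done1);
    cudaEventCreate(&done2);

    // Steps 1,2,3 in stream s1 ...
    cudaMemcpyAsync(d_data1, h_data1, bytes, cudaMemcpyHostToDevice, s1);
    process<<<256, 256, 0, s1>>>(d_data1);
    cudaMemcpyAsync(h_data1, d_data1, bytes, cudaMemcpyDeviceToHost, s1);
    cudaEventRecord(done1, s1);         // recorded after the copy-back in s1

    // ... and steps 4,5,6 in stream s2, free to overlap with s1.
    cudaMemcpyAsync(d_data2, h_data2, bytes, cudaMemcpyHostToDevice, s2);
    process<<<256, 256, 0, s2>>>(d_data2);
    cudaMemcpyAsync(h_data2, d_data2, bytes, cudaMemcpyDeviceToHost, s2);
    cudaEventRecord(done2, s2);

    cudaEventSynchronize(done1);        // h_data1 is now safe to read on the host
    cudaEventSynchronize(done2);        // likewise for h_data2
    return 0;
}

cudaStreamSynchronize() on each stream would work just as well here if you do not need the events for anything else.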