An idea about streams: CPU-side streams!


Being lucky enough to have access to a new-generation GTX 200, I decided to try to overlap computation and communication on my machine, as well as some CPU code.

The problem is that streams, while very nice for independent executions, seem quite limited when there are dependencies involved.

Here is an idea that I feel could help a bit programmers: “CPU-side streams”.

Let’s consider the following program:

1 - CPU-side pack

2 - memcpy from host to device

3 - GPU kernel computation

4 - memcpy from device to host

5 - CPU-side unpack
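In isolation, one block's five steps could be sketched like this in plain C++ (all helper names are mine, and the two memcpys and the kernel are modeled as trivial host functions, just to make the structure concrete):

```cpp
#include <vector>

// Stubs for the five per-block steps; stand-ins for the real CUDA calls.
std::vector<int> pack(int blockId) { return std::vector<int>(4, blockId); }  // step 1
std::vector<int> memcpy_h2d(const std::vector<int>& h) { return h; }         // step 2
int kernel(const std::vector<int>& d) {                                      // step 3
    int s = 0;
    for (int v : d) s += v;
    return s;
}
int memcpy_d2h(int d) { return d; }                                          // step 4
int unpack(int h) { return h; }                                              // step 5

// One block, fully sequential: no overlap yet.
int process_block(int blockId) {
    std::vector<int> host = pack(blockId);
    std::vector<int> dev  = memcpy_h2d(host);
    int result            = kernel(dev);
    return unpack(memcpy_d2h(result));
}
```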

The idea is that this code has to be launched on consecutive blocks, with the following dependency: a block can be computed only once it has received its local data (of course) and once the previous block has finished step 3: the computed data has to stay on the GPU for future blocks to use.

You can see that there is a lot of inherent parallelism: while one block is being computed, we can copy the data from the previous one back and unpack it, while packing the data for the next block and sending it.
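To make that overlap concrete, here is a minimal C++ sketch of the schedule (the `Device` struct, `compute_on_device`, and `run_pipeline` are all stand-ins I made up; the real GPU transfers and kernel are stubbed as host code). Each block's "device" work is chained behind the previous block's with a single future, while the CPU packs the next block in the meantime:

```cpp
#include <future>
#include <vector>

// Stand-in for data that stays resident on the GPU between blocks.
struct Device { long long state = 0; };

// Stub for step 3: the "kernel" reads the previous block's result
// (dev.state), adds this block's packed data, and leaves the result
// on the device for the next block.
long long compute_on_device(Device& dev, const std::vector<int>& packed) {
    long long s = dev.state;
    for (int v : packed) s += v;
    dev.state = s;
    return s;
}

std::vector<long long> run_pipeline(int numBlocks) {
    Device dev;
    std::vector<long long> results(numBlocks);
    std::future<void> prevBlock;  // handle on the previous block's steps 2-5

    for (int b = 0; b < numBlocks; ++b) {
        // Step 1 (CPU pack) runs here, overlapping the previous block's
        // in-flight transfer/compute.
        std::vector<int> packed(100, b + 1);

        // Steps 2-5 launched asynchronously, ordered after block b-1.
        prevBlock = std::async(std::launch::async,
            [&dev, &results, b, packed = std::move(packed),
             prev = std::move(prevBlock)]() {
                if (prev.valid()) prev.wait();  // the serial dependency
                // Steps 2 and 4 (the memcpys) are folded into the stub.
                results[b] = compute_on_device(dev, packed);  // step 3
                // Step 5 (CPU unpack) would run here, host side.
            });
    }
    prevBlock.wait();
    return results;
}
```

Block b contributes 100*(b+1) on top of the running device state, so `run_pipeline(4)` yields 100, 300, 600, 1000. The point is that the only synchronization anywhere is the single `prev.wait()`: exactly the one dependency described above.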

With streams and Async, that proves very hard to achieve.

What if we had host-side streams?

Let’s imagine the following syntax:

    cudaHostCode(streamId) {
        … code, including kernel calls / communication calls
    }

Then we could simply write something like:

    for (blockId = 0; blockId < numBlocks; ++blockId) {
        cudaHostCode(streamId[blockId]) {
            pack(blockId);                                            // step 1, CPU side
            cudaMemcpyAsyncHostToDevice(blockId, streamId[blockId]);  // step 2
            kernel<<<grid, block, 0, streamId[blockId]>>>(blockId);   // step 3
            cudaMemcpyAsyncDeviceToHost(blockId, streamId[blockId]);  // step 4
            unpack(blockId);                                          // step 5, CPU side
        }
    }
Of course that is only pseudo-code, but I think it would provide a much easier interface than the one currently available, which in this case would require some polling/thread management.
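For comparison, here is roughly what that thread management looks like when done by hand today: one host thread per stream, executing enqueued closures in FIFO order. This `HostStream` class is my own sketch, not an existing API, and the pack/memcpy/launch/unpack calls are stubbed with plain integer work:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// A hand-rolled host-side stream: closures run in order on one thread.
class HostStream {
public:
    HostStream() : worker_(&HostStream::run, this) {}
    ~HostStream() {  // drains remaining work, then joins
        { std::lock_guard<std::mutex> l(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void enqueue(std::function<void()> f) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(f)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> f;
            {
                std::unique_lock<std::mutex> l(m_);
                cv_.wait(l, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;  // done and fully drained
                f = std::move(q_.front());
                q_.pop();
            }
            f();  // here: pack, cudaMemcpyAsync, kernel launch, unpack...
        }
    }
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;  // declared last: starts after the members above
};

// Demo: enqueue three stub "blocks"; sum is only touched by the worker.
int host_stream_demo() {
    int sum = 0;
    {
        HostStream s;
        for (int b = 1; b <= 3; ++b)
            s.enqueue([&sum, b] { sum += b; });
    }  // destructor drains the queue and joins before sum is read
    return sum;
}
```

With one such object per `streamId[blockId]`, the body of the proposed `cudaHostCode { … }` block is just the closure passed to `enqueue`; that queue-and-worker boilerplate is precisely what the proposal would make unnecessary.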

This should be possible - and I agree, something easier to wrap your head around is probably worth it.

This is a “me too” post - can’t commit to working on it right now … :)