host streams

I wonder if host threads could be wrapped into an API similar to CUDA device streams, with compatible events, to make programming easier, especially when there are multiple host threads doing their own work?

“similar to CUDA device streams”

in what way…?

Similar API, i.e. both device and host streams being able to register and wait for events, including device streams waiting for host events.

“both device and host streams being able to register and wait for events, including device streams waiting for host events.”

to date, i have had little need for the device to wait for host events per se; however, i have many times felt that the device being able to wait on device events - the device being able to wait on itself - would be as much fun as a female santa

this is handy when you throw/ juggle work in different streams, but with dependencies between the work

thus, from my point of view, your suggestion is well received

to date, my work-around is to use more host threads, and have these act as intermediaries
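in sketch form (kernel, stream, and event names are merely placeholders - assume event e was recorded after the prerequisite work in s1):

// intermediary host thread: it blocks on the device event itself,
// leaving the main thread free to issue other work
std::thread intermediary([&] {
    cudaEventSynchronize(e);            // wait for the work behind e
    b<<<grid, block, 0, s2>>>(x, y, z); // then issue the dependent work
    c<<<grid, block, 0, s3>>>(x, y, z);
});
// ... main thread continues issuing unrelated work ...
intermediary.join();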

Isn’t this how cudaEvent_t is typically used? I’m pretty sure streams can wait for events registered on streams from the same or other devices. (Perhaps I’m misunderstanding what you wrote)

Also, the host can wait for device events using cudaEventSynchronize, of course. However, to make the device wait for the host, the host simply wouldn’t launch a kernel until its prerequisites have finished. To me, this seems OK when there is only one host “stream”. However, this gets unwieldy when there are multiple computations (“streams”) happening on the host as well.
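For example (a sketch; the kernel and variable names are placeholders):

cudaEventRecord(e, s1);        // after some device work queued in s1
cudaEventSynchronize(e);       // host blocks until that work is done
hostComputation();             // device -> host dependency satisfied
k<<<1024, 256, 512, s1>>>(x);  // host -> device: simply launch afterwards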

If there were a consistent host/device streaming API, I think programming this kind of parallelism would get much easier.

So I was wondering if such an API could be implemented on top of the existing CUDA and pthreads/Boost::Thread API. This seems like a fairly common-sense idea, so I suspect that other people may have thought about this already.

“I’m pretty sure streams can wait for events registered on streams from the same or other devices”

from the host of course

consider 4 tasks - A, B, C, D; B, C, D can run concurrently, but are dependent on A

now, you have 2 options:
launch a, b, c, d in the same stream - this ensures the dependency is met, and the host can issue all the work at once, but b, c, d then serialize on the device instead of running concurrently in separate streams

launch a, record an event, wait for the event, then launch b, c, d - the dependency is met, and b, c, d run concurrently in separate streams, but the host can not issue all the work at once; it must first block on the event
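in code, option 2 amounts to something as such (a sketch):

a<<<grid, block, 0, s1>>>(x, y, z);
cudaEventRecord(e, s1);
cudaEventSynchronize(e);            // the host stalls here…
b<<<grid, block, 0, s1>>>(x, y, z); // …before it may issue b, c, d
c<<<grid, block, 0, s2>>>(x, y, z);
d<<<grid, block, 0, s3>>>(x, y, z);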

the ideal would be to be able to instruct the device that a task in a stream is conditional on an event - i.e. have the device wait on events, instead of the host
in the example, you would then issue a, b, c, d all at once, each in a different stream, but instruct the device to wait on a, before commencing with b, c, d
i suppose i want my cake, eat it, and also give some to my friends

“If there were a consistent host/device streaming API,”

so, what would you require of the api - list some of its functions/ tasks

I don’t see a problem here. For clarity, I’ll assume that a, b, c, d are kernels.

cudaStream_t s1, s2, s3;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaStreamCreate(&s3);

cudaEvent_t e;
cudaEventCreate(&e);


// describe the dependencies and launch everything:

a<<<1024, 256, 512, s1>>>(x, y, z);
cudaEventRecord(e, s1);             // e fires once a has finished in s1

b<<<1024, 256, 512, s1>>>(x, y, z); // same stream as a, so b is ordered after a

cudaStreamWaitEvent(s2, e, 0);      // s2 stalls until e has fired…
c<<<1024, 256, 512, s2>>>(x, y, z);

cudaStreamWaitEvent(s3, e, 0);      // …and so does s3
d<<<1024, 256, 512, s3>>>(x, y, z);

// wait for completion of all tasks:

cudaDeviceSynchronize();

APIs require careful design, and I haven’t thought about this much, but here’s one possibility:

For every cudaXYZ function or type, add a counterpart called, say, hostXYZ that operates on host events and streams. Additionally, let cudaStreamWaitEvent accept hostEvent_t via overloading, and hostStreamWaitEvent accept cudaEvent_t. We would also need a function like hostStreamLaunch that accepts a function (thunk).
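To make that concrete, here is a rough sketch of the host half, built on C++11 threads (hostStream_t, hostEvent_t and the functions below are invented names for illustration, not an existing API; a real design would need far more care):

#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <cuda_runtime.h>

// a host "stream": a worker thread draining a FIFO queue of tasks
struct hostStream_t {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::thread worker;

    hostStream_t() : worker([this] {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !q.empty(); });
                if (q.empty()) return;        // shut down once drained
                task = std::move(q.front());
                q.pop();
            }
            task();                           // tasks run strictly in order
        }
    }) {}

    ~hostStream_t() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        worker.join();
    }

    void enqueue(std::function<void()> f) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(f)); }
        cv.notify_one();
    }
};

// a host "event": becomes ready when the stream reaches the record point
using hostEvent_t = std::shared_future<void>;

hostEvent_t hostEventRecord(hostStream_t& s) {
    auto p = std::make_shared<std::promise<void>>();
    s.enqueue([p] { p->set_value(); });
    return p->get_future().share();
}

// host stream waits on a host event
void hostStreamWaitEvent(hostStream_t& s, hostEvent_t e) {
    s.enqueue([e] { e.wait(); });
}

// host stream waits on a device event
void hostStreamWaitEvent(hostStream_t& s, cudaEvent_t e) {
    s.enqueue([e] { cudaEventSynchronize(e); });
}

// run an arbitrary thunk in the host stream
void hostStreamLaunch(hostStream_t& s, std::function<void()> f) {
    s.enqueue(std::move(f));
}

The missing direction - a cudaStreamWaitEvent overload that makes a device stream wait on a hostEvent_t - is the hard part: the runtime offers no way for a device stream to block on a host-side condition, so under the hood such an overload would still need an intermediary host thread that withholds the dependent device work until the host event fires - essentially the work-around described earlier, hidden behind the API.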

“I don’t see a problem here.”

you, sir, are of course right; i was so so so fixated on cudaEventSynchronize() that i completely overlooked cudaStreamWaitEvent()

epiphanies like this make one want to drown oneself in something - female santas holding jugs filled with punch come to mind…

“For every cudaXYZ function or type, add another called, say, hostXYZ that describes host events and streams. Additionally, let cudaStreamWaitEvent accept hostEvent_t via overloading, and hostStreamWaitEvent accept cudaEvent_t. We would also need a function like hostStreamLaunch that accepts a function (thunk)”

you actually want to create ’ host streams’
[well, that is indeed the label of the topic; but only now is it starting to sink in…]

my first reaction was: “are you drinking…?”

my second reaction: “warning: brain overload”

my third reaction: “perhaps i need to sit down and mull this over”

this would be an entirely different beast - a lot would need to happen under the hood as well; for one, i suppose host streams would need to be ‘aware’ of each other/ each other’s states as well…

i think i am going to need one or two use cases to make full sense of ‘host streams’

I hope the discovery of cudaStreamWaitEvent will simplify your future code!

Here is a motivating example for host streams:

I’ll use

A < B

to denote that B should run only after A finished. Consider a 2D array of tasks that must run on the device:

D[i, j]

and a 2D array of tasks that must run on the host:

H[i, j]

The dependencies are as follows (for any i and j):

// the host task must run after the device task:

D[i, j] < H[i, j]

// "next" device tasks must run after the host task:

H[i, j] < D[i+1, j]
H[i, j] < D[i, j+1]

Note that despite the dependencies, there are still opportunities for parallelism. For example, H[0, 2], H[1, 1] and D[2, 0] can all run in parallel.

Managing all this parallelism using device streams and host threads seems to be a challenge, but with combined host/device streams, specifying the dependencies and launching all tasks would be just a double loop.
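With such an API - reusing the hypothetical names from above, plus the hypothetical cudaStreamWaitEvent overload that accepts a hostEvent_t - the whole grid might be issued like this (a sketch, not compilable against today’s CUDA; deviceTask and hostTask are placeholders):

cudaStream_t dstream[N][M];  // one device stream per task, for simplicity
hostStream_t hstream[N][M];  // one (hypothetical) host stream per task
cudaEvent_t  dDone[N][M];    // fires when D[i, j] has finished
hostEvent_t  hDone[N][M];    // fires when H[i, j] has finished
// (N, M are compile-time constants; stream/event creation omitted)

for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j) {
        // D[i, j] runs only after H[i-1, j] and H[i, j-1]:
        if (i > 0) cudaStreamWaitEvent(dstream[i][j], hDone[i-1][j], 0); // hypothetical overload
        if (j > 0) cudaStreamWaitEvent(dstream[i][j], hDone[i][j-1], 0);
        deviceTask<<<1024, 256, 0, dstream[i][j]>>>(i, j);
        cudaEventRecord(dDone[i][j], dstream[i][j]);

        // H[i, j] runs only after D[i, j]:
        hostStreamWaitEvent(hstream[i][j], dDone[i][j]);
        hostStreamLaunch(hstream[i][j], [=] { hostTask(i, j); });
        hDone[i][j] = hostEventRecord(hstream[i][j]);
    }

All the tasks are issued up front, and the recorded dependencies guarantee, for example, that H[0, 2], H[1, 1] and D[2, 0] may indeed run concurrently.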

i forecast that i would likely go bald in a few days, as i can not stop scratching my head

it sorrows me, but i must admit that i am simply not yet at the evidently advanced level of 2D host/ device task arrays
i regularly use multiple host threads and numerous device streams, but generally manage some variable such that one of the array dimensions never gets that deep; thus, i hardly surpass 1.5D task arrays
i pause from time to time; it seems that you hardly pause at all!

perhaps it is that cursed cudaStreamWaitEvent yielding you unfathomed powers

if i may ask: what are your typical 2D task array dimensions - how many host/ device tasks?

I don’t think that there is anything conceptually difficult about 2D arrays of tasks like this, but they illustrate the “accidental complexity” involved in juggling host threads and device streams at the same time (Unless I’m overlooking an elegant solution here, in which case, please let me know)

“Unless I’m overlooking an elegant solution here”

i doubt it. either you have significant redundancy in your code - implying inefficient code, and the possibility of reducing the number of streams once that redundancy is removed - or you have minimal/ no redundancy - implying efficient code, with the number of streams simply reflecting that efficiency; i.e. writing more efficient code is what increased the number of streams in the first place

whenever i rework algorithms, the code size generally increases (a proxy for complexity), the number of kernels generally increases (task specialization), kernels tend to become more sophisticated, and the number of streams tends to increase

it is perhaps a common phenomenon of hpc - getting more hardware to ‘collaborate’ may improve performance, but at the cost of brain-numbing overhead

2 things i have picked up by now, perhaps pertinent to the feasibility of your ‘host streams’
a) as the flow becomes increasingly asynchronous, debugging changes - it becomes slightly more challenging; for one, the point in the flow at which breakpoints trigger changes
[thus: how would ‘host streams’ impact debugging?]
b) the more asynchronous the flow, the more i run into lower-level host threads and/ or mechanisms - for example, when debugging, i now encounter mutex locks from time to time as the asynchronous api calls increase