Signalling between CUDA thread and host thread (CUDA-thread / host-thread communication)

Hello,

I wonder if the following is possible:

  1. A CUDA thread/kernel is done with processing and sends a signal to the host, for example by queueing itself onto some kind of event queue.

(The other CUDA threads/kernels of the warp/thread blocks are still busy.)

A few others might also be done and also queue themselves onto the queue.

The threads that are done processing and have sent a signal to the host now wait for a signal from the host to continue with new data.

  2. The host/CPU is listening on the queue, for example via a blocking event or asynchronously. As the queue gets filled with signals from the CUDA threads, the host/CPU
    becomes active. There could be another queue in global memory shared by host and GPU (unified addressing, or else something like a memcopy) which could give more information
    about which thread was queued; for example the thread idx, block idx, etc. are copied into that memory.

So the host knows that those threads are done processing, and it can supply new data to be processed by placing it into device memory.

The CUDA threads are now waiting for a signal from the host to continue, so it should be safe for the host to copy new data into device memory.

The host now signals the CUDA threads to continue running.

This idea would give more individual control over each CUDA thread.

This might be better than the alternative, where the host has to wait until all CUDA threads are done running. Some threads might finish very quickly, while others might take a long time…

Then those fast threads have nothing left to do, since all data available so far has been processed; there is only a limited amount of data in device memory, so they are done.

But the host can generate new data, so it would be interesting if it could immediately feed that new data to the finished CUDA threads as soon as they are done, so they can immediately start running again.
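To make the idea a bit more concrete, here is a heavily hedged sketch of the data layout such a handshake would need. All names are hypothetical and nothing here is a documented or guaranteed CUDA mechanism:

```cpp
// Hypothetical layout for the proposed handshake -- a sketch only.
// Nothing here is a documented or guaranteed CUDA mechanism.
struct FinishedRecord {
    unsigned int blockIndex;    // which block reported "done"
    unsigned int threadIndex;   // which thread within that block
};

struct HandshakeQueue {
    volatile unsigned int count;          // finished workers queued so far (device writes, host reads)
    FinishedRecord        entries[1024];  // filled by the device, read by the host
    volatile unsigned int resume[1024];   // set by the host to release worker i with new data
};
```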

Now the big question:

Is this possible somehow? Does CUDA have some kind of signalling facility which works inside the CUDA kernels and also outside, on the host/CPU?

I am still a newbie though, so I haven’t looked deeply into this yet, but any tips/ideas/tricks are welcome.

I looked into it in the programming guide and chances are slim… all atomic functions only work inside the device; the host is unaware of them.

The same can be said for host/device memory overlapping and such… there is no documented way to safely read/write the same memory from host and device at the same time, so race conditions will probably occur.

The event system seems to be for after kernels are done or so… it’s a bit vague how the event system works… that could use some more documentation… perhaps I will look to see if there is an example… but it’s probably meant for coarse scheduling and not the fine-grained signalling I am looking for.

Therefore I can probably safely say that this idea/feature/request could be interesting for CUDA 5.0.

Unless somebody comes up with a really smart trick, like having the host spin on a memory address and having the thread spin on the same memory address as well, and then hoping for the best… but such solutions are probably a bit unreliable and would make me nervous.

Bye,
Skybuck.

I think that sentence best summarizes the problem. It’s certainly not impossible to do these things, but there are few guarantees in host<->device communication, so everything in that direction is deep in undocumented territory. It might work on one mainboard and not on another.

However, there is a straightforward and documented solution to the problem: launch separate kernels for the different things you might want to wait on, each kernel in its own stream. Kernel launches are hardly slower than any other form of host<->device communication (with maybe the exception of Windows when not running the TCC driver). And if you have a compute capability 2.x device, multiple kernels from different streams can even run in parallel.
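A minimal sketch of how that might look, with a hypothetical worker kernel and one buffer per stream (illustrative only, not copy-paste-ready for your problem):

```cpp
#include <cuda_runtime.h>

__global__ void worker(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder work
}

int main()
{
    const int numStreams = 8;    // could be far more in practice
    const int n = 1 << 16;
    cudaStream_t streams[numStreams];
    float *d_in[numStreams], *d_out[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_in[s],  n * sizeof(float));
        cudaMalloc(&d_out[s], n * sizeof(float));
    }

    // Each stream gets its own kernel; on compute capability 2.x devices
    // kernels from different streams may overlap on the GPU.
    for (int s = 0; s < numStreams; ++s)
        worker<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);

    // The host can wait on (or poll) each stream independently and
    // refill/relaunch as soon as that stream's kernel has finished.
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        // ... copy results back, upload new data, launch the next kernel ...
    }

    for (int s = 0; s < numStreams; ++s) {
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```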

Cool, an interesting possibility/solution. It would be a lot of kernels/streams, like 1000 to 2000 or so; of course some of them could be put into the same stream to cut back on the kernel-launch/stream overhead if that turns out to be a problem.

I didn’t think of this solution yet, so I will definitely check it out.

However, my instinct tells me it will probably not be worth it because of lots of overhead and lost opportunities for stalled warps to execute other blocks, especially if every stream had just one block. Of course this could be increased to two blocks… but that wouldn’t be much better.

Unless warps/execution units can also execute blocks from other streams?

So what happens if a block/warp in a stream stalls? What will the GPU do? Will it continue executing other streams? If so, that would be excellent and not a problem (as long as the stream context switch is not too bad)… otherwise it would probably be bad.

Also, using streams might make programming a little bit easier… only 32 threads (one warp) per stream, or perhaps 64… and no more huge arrays necessary. (?)

So each stream’s kernel launch probably restarts the block indexes and thread indexes at zero?

Also, what happens on the host side… will a host thread per stream be necessary, to be woken up by stream/kernel events… like a “kernel done” event?

Perhaps a single host thread can listen for multiple kernels to complete and know which ones completed…

Perhaps knowing which ones completed is not important anymore… but I think it still is a little bit important, since the host must also know which memory region is complete, so it can store/use some of that as a result… then that memory can be replaced with new inputs for new kernels to be launched.

So, lots of things to look into concerning streams.

I think it would be best to perform a benchmark with multiple “almost empty/simple” streams, to see how its overhead compares to one giant kernel launch.
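Something like this rough micro-benchmark is what I have in mind (a sketch with a trivial empty kernel; the real numbers will depend heavily on driver, OS and hardware):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main()
{
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up so first-launch overhead is not measured.
    empty_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();

    // Many tiny launches (all in the default stream here, for simplicity).
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<1, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float manyMs = 0.0f;
    cudaEventElapsedTime(&manyMs, start, stop);

    // One "giant" launch with the same total number of threads.
    cudaEventRecord(start);
    empty_kernel<<<launches, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float oneMs = 0.0f;
    cudaEventElapsedTime(&oneMs, start, stop);

    printf("%d small launches: %.3f ms, one big launch: %.3f ms\n",
           launches, manyMs, oneMs);
    return 0;
}
```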

Thanks,
Bye,
Skybuck.

You can allocate an unsigned int in host memory. Let’s call it signal.
Then the GPU does something like: if (signal != 0) /* the CPU wants me to do something */
And the CPU writes to signal THROUGH A VOLATILE POINTER.
You can also do atomic operations on host memory.
I must tell you that this kind of stuff is highly undocumented and experimental. Use it at your own risk.
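A quick sketch of what I mean, using mapped (zero-copy) pinned memory. Again: experimental, undocumented, and it can hang the GPU if the kernel never sees the write, so treat it as an illustration only:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel spins until the host writes a non-zero value into the mapped
// host allocation, then does some placeholder "work". Use at your own risk.
__global__ void wait_for_host(volatile unsigned int *signal, int *result)
{
    while (*signal == 0)
        ;                       // busy-wait on the host-written flag
    *result = 42;               // placeholder work once released
}

int main()
{
    // Must be set before the CUDA context is created to allow mapped memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned int *h_signal, *d_signal;
    cudaHostAlloc((void **)&h_signal, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_signal, h_signal, 0);
    *h_signal = 0;

    int *d_result;
    cudaMalloc(&d_result, sizeof(int));

    // Launch asynchronously; the kernel now spins on *d_signal.
    wait_for_host<<<1, 1>>>(d_signal, d_result);

    // Host "releases" the kernel by writing through a volatile pointer.
    // There is no formal guarantee of when the device observes this.
    *(volatile unsigned int *)h_signal = 1;

    cudaDeviceSynchronize();    // the kernel can now actually finish

    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", h_result);

    cudaFree(d_result);
    cudaFreeHost(h_signal);
    return 0;
}
```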

To answer some of my own questions about stream and event usage, here is an idea which might work, but I am not sure about its performance, which would probably not be too good:

Each kernel could be launched in a stream, and after the launch an event is recorded on the same stream.

So each kernel could get its own event object.

The host thread could then spin on the event objects and check if they are done in a polling mechanism.

If the host thread sees that a particular event has completed, it knows that that particular kernel has finished, and it can then launch a new kernel and record another event.

The drawback/disadvantage of this is that the host thread needs to spin a lot and consumes CPU… this could be reduced slightly by introducing a sleep(10) or so after each check… it’s far from ideal, but nonetheless not too bad.
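A rough sketch of that polling loop (hypothetical worker kernel, one event and one buffer per stream; the relaunch step is only indicated in comments):

```cpp
#include <cuda_runtime.h>

__global__ void worker(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;        // placeholder work
}

int main()
{
    const int numStreams = 4;
    const int n = 1 << 16;
    cudaStream_t streams[numStreams];
    cudaEvent_t  done[numStreams];
    float       *d_data[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaEventCreateWithFlags(&done[s], cudaEventDisableTiming);
        cudaMalloc(&d_data[s], n * sizeof(float));

        // Launch a kernel and record an event right behind it.
        worker<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);
        cudaEventRecord(done[s], streams[s]);
    }

    int finished = 0;
    while (finished < numStreams) {
        for (int s = 0; s < numStreams; ++s) {
            if (done[s] && cudaEventQuery(done[s]) == cudaSuccess) {
                // This stream's kernel has finished: collect results,
                // upload new input, relaunch, and record a new event here.
                ++finished;
                cudaEventDestroy(done[s]);
                done[s] = 0;    // mark as handled in this simple sketch
            }
        }
        // Optionally sleep briefly here to avoid burning a full CPU core.
    }

    for (int s = 0; s < numStreams; ++s) {
        cudaFree(d_data[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```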