I wonder if the following is possible:
- A CUDA thread/kernel is done with processing and sends a signal to the host, for example by queueing itself onto some kind of event queue.
(The other CUDA threads/kernels of the warp/thread block are still busy.)
A few others might also be done and queue themselves onto the queue as well.
The threads which are done processing and have signalled the host now wait for a signal from the host before continuing with new data.
- The host/CPU is listening on the queue, for example via a blocking call or an asynchronous callback. As the queue fills with signals from the CUDA threads, the host becomes active. There could be another queue in global memory shared by host and GPU (unified addressing, or else something like a memcopy) which gives more information about which thread was queued; for example, the thread index, block index etc. are copied into that memory.
So then the host knows that those threads are done processing, and it can supply new data to be processed by placing it into the device memory.
The CUDA threads are now waiting for a signal from the host to continue, so it should be safe for the host to copy new data into the device memory.
The host then signals the CUDA threads to continue running.
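For what it's worth, here is roughly what I imagine such a scheme could look like, sketched with mapped page-locked memory (cudaHostAlloc with cudaHostAllocMapped) as the shared flag area. All the names are made up, the signalling is per block rather than per thread, and I honestly don't know whether the spinning is safe on every GPU; it also glosses over memory fences beyond __threadfence_system(), timeouts, and occupancy (if not all blocks are resident at once, this deadlocks). So: a sketch, not a recommendation.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// One "done" and one "go" flag per block, living in mapped host memory so
// both the host and the device can read/write them.
__global__ void persistentKernel(volatile int *done, volatile int *go,
                                 float *data, int rounds)
{
    for (int r = 0; r < rounds; ++r) {
        // ... process data[blockIdx.x * blockDim.x + threadIdx.x] here ...

        __syncthreads();
        if (threadIdx.x == 0) {
            __threadfence_system();      // try to make results visible to the host
            done[blockIdx.x] = r + 1;    // "this block finished round r"
            while (go[blockIdx.x] <= r)  // spin until the host says continue
                ;                        // (busy-wait: burns a multiprocessor)
        }
        __syncthreads();                 // the rest of the block waits here
    }
}

int main()
{
    const int blocks = 4, threadsPerBlock = 64, rounds = 2;
    int *done, *go;
    float *data;

    cudaSetDeviceFlags(cudaDeviceMapHost);                       // before first CUDA call
    cudaHostAlloc(&done, blocks * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&go,   blocks * sizeof(int), cudaHostAllocMapped);
    memset(done, 0, blocks * sizeof(int));
    memset(go,   0, blocks * sizeof(int));
    cudaMalloc(&data, blocks * threadsPerBlock * sizeof(float));

    int *dDone, *dGo;                                            // device views of the flags
    cudaHostGetDevicePointer((void **)&dDone, done, 0);
    cudaHostGetDevicePointer((void **)&dGo,   go,   0);

    persistentKernel<<<blocks, threadsPerBlock>>>(dDone, dGo, data, rounds);

    for (int r = 1; r <= rounds; ++r) {
        for (int b = 0; b < blocks; ++b) {
            while (((volatile int *)done)[b] < r)                // host polls the "queue"
                ;
            // ... copy fresh data for block b into device memory here ...
            ((volatile int *)go)[b] = r;                         // signal block b to resume
        }
        printf("round %d: all blocks resumed\n", r);
    }
    cudaDeviceSynchronize();
    cudaFreeHost(done); cudaFreeHost(go); cudaFree(data);
    return 0;
}
```

This is exactly the "spin on shared memory and hope for the best" category of solution, with all the reliability worries that come with it.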
This idea would give more individual control over each CUDA thread.
This might be better than the alternative, where the host has to wait until all CUDA threads are done running. Some threads might finish really fast, while others might take a long time…
So then perhaps those fast threads have nothing left to do, since all data so far was processed; there is only a limited amount of data available in the device memory, so they're done.
But the host can generate new data, so it would be interesting if it could feed that new data to the finished CUDA threads the moment they finish, so they can immediately start running again.
Now the big question:
Is this possible somehow? Does CUDA have some kind of signalling facility which works inside the kernels and also outside, on the host/CPU side?
I am still a newbie though, so I haven't looked deeply into this yet, but any tips/ideas/tricks are welcome.
I looked into it in the programming guide and chances are slim… all atomic functions only work inside the device; the host is unaware of them.
The same can be said for host/device memory overlapping and such… there is no way to safely read/write the same memory from host and device at the same time, so race conditions will probably occur.
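The device-side half of the "done queue" does at least seem buildable with those atomics: each finished block could append its id to a list in global memory with atomicAdd. What's missing is only the host-visible signalling part. A hypothetical fragment (names made up):

```cuda
#include <cuda_runtime.h>

#define MAX_ITEMS 1024

// Global-memory queue; the host can only see it via cudaMemcpy after the fact.
__device__ int queueTail = 0;           // next free slot
__device__ int queueItems[MAX_ITEMS];   // ids of blocks that finished

__global__ void worker(/* ...data... */)
{
    // ... per-thread processing ...
    __syncthreads();
    if (threadIdx.x == 0) {
        // atomicAdd hands each finished block a unique slot, device-wide
        int slot = atomicAdd(&queueTail, 1);
        queueItems[slot] = blockIdx.x;
    }
    // But there is no way from here to wake the host up; it would have to
    // poll this queue, which brings back exactly the race-condition worry.
}
```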
The event system seems to be for after kernels are done or so… it's a bit vague how the event system works… that could use some more documentation… perhaps I will look to see if there is an example… but it's probably meant for rough scheduling and not the fine-grained event system I am looking for.
Therefore I can probably safely say that this idea/feature/request could be interesting for CUDA 5.0.
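From what I can tell, this is the granularity events give you: cudaEventRecord marks a point in a stream, and the host can query or wait on that point, so it learns "everything before this point is done" but never "thread X is done". A small sketch of that coarse-grained use:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

int main()
{
    int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t evt;
    cudaEventCreate(&evt);

    work<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(evt, 0);   // marks "kernel finished" in stream 0

    // The host can poll without blocking...
    while (cudaEventQuery(evt) == cudaErrorNotReady) {
        // ...and do other work here, but the only news it will ever get
        // is whole-kernel completion, not per-thread completion.
    }
    printf("kernel done\n");

    cudaEventDestroy(evt);
    cudaFree(x);
    return 0;
}
```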
Unless somebody comes up with a really smart trick, like having the host spin on a memory address and the thread spin on the same memory address, and then hoping for the best… but such solutions are probably a bit unreliable and would make me nervous.