I am trying to find a way to make the CPU block when all of the streams on my GPUs are in use.
I have arrays of streams and mutexes, and I also separate the data and results out into other arrays, again dimensioned by [MAX_GPU_COUNT][STREAMS_MAX].
I want to use the streams to launch multiple kernels on my Titan cards under Windows 7, locking the appropriate mutex in the mutexes array for each launch. I then want to use cudaStreamAddCallback to process the results and unlock the mutex, marking the stream as available again for another kernel launch.
Then I discovered that boost/thread.hpp does not compile using nvcc. I tried the advice to separate device and host code but was not able to do that because I am using Thrust as well.
I wonder if anyone knows of an elegant way to do the locking and unlocking?
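To make the question concrete, here is a rough host-only sketch of the kind of blocking I am after. It assumes the host compiler accepts the C++11 headers <mutex> and <condition_variable>, which I have not verified under nvcc. StreamSlotPool, acquire, release and run_demo are hypothetical names, not part of CUDA or Thrust. Plain threads stand in for the kernel launch and the stream callback, so nothing here touches the GPU.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical pool of stream slots for one GPU (not a CUDA type).
class StreamSlotPool {
public:
    explicit StreamSlotPool(int slots) : free_(slots, true), free_count_(slots) {}

    // Blocks the calling CPU thread until a slot is free, then returns its index.
    int acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return free_count_ > 0; });
        for (int i = 0; i < static_cast<int>(free_.size()); ++i) {
            if (free_[i]) {
                free_[i] = false;
                --free_count_;
                return i;
            }
        }
        return -1;  // not reached: the wait guarantees a free slot
    }

    // Marks a slot as free again and wakes one waiting thread.
    // Assumption: in real code this would be called from the stream callback.
    void release(int slot) {
        {
            std::lock_guard<std::mutex> lock(m_);
            free_[slot] = true;
            ++free_count_;
        }
        cv_.notify_one();
    }

    int free_count() const {
        std::lock_guard<std::mutex> lock(m_);
        return free_count_;
    }

private:
    mutable std::mutex m_;
    std::condition_variable cv_;
    std::vector<bool> free_;
    int free_count_;
};

// Simulates launching `jobs` kernels over `slots` streams.
// Each worker thread stands in for "kernel finished, callback fired".
int run_demo(int slots = 2, int jobs = 8) {
    StreamSlotPool pool(slots);
    std::vector<std::thread> workers;
    int launched = 0;
    for (int j = 0; j < jobs; ++j) {
        int slot = pool.acquire();  // blocks here while every slot is busy
        ++launched;
        workers.emplace_back([&pool, slot] { pool.release(slot); });
    }
    for (auto& w : workers) w.join();
    assert(pool.free_count() == slots);  // every slot was handed back
    return launched;
}
```

In the real program there would be one pool per GPU, the index returned by acquire() would select the stream and the data/results entries, and release() would be called from the function registered with cudaStreamAddCallback, with the pool and slot passed through its userData pointer. As far as I understand, that callback must not make CUDA API calls itself, so it would only do the host-side bookkeeping.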
Hope that makes sense,