Hi,
I’m compiling with the --default-stream per-thread nvcc flag, launching work from multiple host threads, and calling cudaStreamSynchronize(cudaStreamPerThread) in each thread to synchronize that thread’s stream.
The code hangs. Attaching gdb shows that all the threads are blocked inside cudaStreamSynchronize.
Does anyone have an idea? Is this a bug in this configuration?
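For reference, this is roughly the pattern I’m using (simplified sketch — the kernel, sizes, and thread count are placeholders, and error checking is omitted; compiled with nvcc --default-stream per-thread):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder kernel; the actual work is not relevant to the hang.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void threadFunc(int n) {
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // With --default-stream per-thread, this launch goes to this
    // host thread's own default stream.
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    // Each host thread synchronizes only its own per-thread default stream.
    cudaStreamSynchronize(cudaStreamPerThread);
    cudaFree(d);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(threadFunc, 1 << 20);
    for (auto &t : threads)
        t.join();
    printf("done\n");
    return 0;
}
```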
Hi Robert,
Thanks for the answer. I was indeed unable to reproduce the issue with the code you sent.
However, I think I’ve found the root cause. Inside the thread function I was calling the nvtxXXX functions that NVIDIA provides, following the usage shown in the documentation and here:
It seems the issue was that the nvtxEventAttributes_t parameter, which is passed to nvtxRangePushEx by reference, went out of scope too soon.
I was not yet able to reproduce it with the code you sent — I’ll try later this week. However, creating the nvtxEventAttributes_t variable passed to nvtxRangePushEx on the stack, and making sure it does not go out of scope until both the pop operation and the cudaStreamSynchronize(cudaStreamPerThread) call have completed, cleared the deadlock.
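For anyone hitting the same symptom, this is roughly the pattern that cleared it for me (illustrative sketch — the range name, color, and payload are placeholders): the attributes struct lives on the thread function’s stack and stays in scope until after both nvtxRangePop and cudaStreamSynchronize have returned.

```cuda
#include <cuda_runtime.h>
#include <nvToolsExt.h>
#include <cstring>

void threadFunc() {
    // Keep the attributes struct alive for the whole push/pop + sync window.
    nvtxEventAttributes_t attr;
    memset(&attr, 0, sizeof(attr));
    attr.version       = NVTX_VERSION;
    attr.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    attr.colorType     = NVTX_COLOR_ARGB;
    attr.color         = 0xFF00FF00;                 // placeholder color
    attr.messageType   = NVTX_MESSAGE_TYPE_ASCII;
    attr.message.ascii = "worker range";             // placeholder name

    nvtxRangePushEx(&attr);
    // ... kernel launches on this thread's per-thread default stream ...
    cudaStreamSynchronize(cudaStreamPerThread);
    nvtxRangePop();
    // attr goes out of scope only here, after the pop and the sync
    // have both completed.
}
```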
So I guess I’m now wondering about two things:
Could it be that corruption of the nvtxEventAttributes_t parameter causes cudaStreamSynchronize to lock up? If so, why?
Why do all the profiling functions take their parameters (for example, the nvtxEventAttributes_t passed to nvtxRangePushEx) by reference rather than by value?