Like if I’m using for_each or transform or sort to do some device-side stuff, I don’t need to call cudaDeviceSynchronize, do I?
I only ask because I was printing some output today, and it seemed to print in a different order depending on whether I sync’d or not.
Some Thrust calls are asynchronous, some are synchronous (i.e. blocking).
In general the behavior is derived from CUDA, so notions of streams and default stream behavior should be instructive here.
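To make the stream point concrete, here is a minimal sketch (not a statement about what any particular Thrust version does internally). It runs the algorithms on an explicit stream via thrust::cuda::par.on(stream) and then synchronizes that stream before the host touches the result, so the code is correct whether or not the Thrust calls themselves block. The helper name negate_on_stream and the size 8 are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical helper: fill 0..n-1 on the device, negate it on a user-created
// stream, and return the result in host memory.
std::vector<int> negate_on_stream(int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    thrust::device_vector<int> d(n);

    // Both algorithms are issued into `stream`, so they run in order
    // relative to each other.
    thrust::sequence(thrust::cuda::par.on(stream), d.begin(), d.end());
    thrust::transform(thrust::cuda::par.on(stream),
                      d.begin(), d.end(), d.begin(),
                      thrust::negate<int>());

    // The copy is queued on the same stream, after the transform.
    std::vector<int> h(n);
    cudaMemcpyAsync(h.data(), thrust::raw_pointer_cast(d.data()),
                    n * sizeof(int), cudaMemcpyDeviceToHost, stream);

    // Explicit sync point: after this returns, h is safe to read on the host.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return h;
}

int main() {
    std::vector<int> h = negate_on_stream(8);
    std::printf("%d %d %d\n", h[0], h[1], h[7]);
    return 0;
}
```

Error checking on the CUDA calls is omitted to keep the sketch short. If you only ever use the default stream and read results through Thrust containers or return values, you may never need the explicit synchronize.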
I can’t give a comprehensive answer. In general, Thrust is designed to work in a straightforward fashion for straightforward use cases. For example, many (perhaps most) CUDA programs never need to call cudaDeviceSynchronize() at all. But you can certainly defeat that behavior, just as you can with ordinary CUDA, if you work hard enough at it.
Thrust is open source, so by reading the code (with some effort) or, more quickly, by using a profiler, you can find out for yourself what the underlying CUDA API call sequence is for a given sequence of Thrust calls, and from that determine whether it is blocking or non-blocking.
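If you want a quick empirical hint before opening a profiler, a rough sketch like the one below times the host side of a Thrust call and the cudaDeviceSynchronize() that follows it. This is only a heuristic and not a substitute for looking at the actual API trace. The Timing struct, the time_transform helper, and the problem size are all made up for illustration.

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// Hypothetical result type: host-side time spent in the Thrust call itself,
// and in the cudaDeviceSynchronize() issued right after it.
struct Timing {
    double call_ms;
    double sync_ms;
};

Timing time_transform(std::size_t n) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    thrust::device_vector<float> d(n, 1.0f);

    // Warm-up call so one-time startup costs do not distort the measurement.
    thrust::transform(d.begin(), d.end(), d.begin(), thrust::negate<float>());
    cudaDeviceSynchronize();  // start the measurement from an idle device

    auto t0 = clock::now();
    thrust::transform(d.begin(), d.end(), d.begin(), thrust::negate<float>());
    auto t1 = clock::now();
    cudaDeviceSynchronize();
    auto t2 = clock::now();

    return Timing{ms(t1 - t0).count(), ms(t2 - t1).count()};
}

int main() {
    Timing t = time_transform(std::size_t(1) << 24);
    std::printf("call: %.3f ms, sync afterwards: %.3f ms\n",
                t.call_ms, t.sync_ms);
    return 0;
}
```

If nearly all the time lands in the call and the sync afterwards is close to free, the call most likely waited for the device. If the call returns quickly and the sync absorbs the time, it was probably asynchronous. Results can vary with the Thrust version, the execution policy, and the algorithm, so treat this as a hint only.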
Oh God, this is what I was afraid of… Okay, you’re right, I’ll probably have to dig into the source a bit to confirm.
At the same time, I suppose it would also help if I did actually think about potential sync points.