Are CUDA barrier-synchronization latencies predictable? For example, in an iterative algorithm,
if I call cudaDeviceSynchronize(), __threadfence(), or __syncthreads() after every iteration, can I estimate the total cost as "number of iterations" multiplied by the time of a single "barrier synchronize"?
Are the "barrier synchronize" latencies properly documented? Where can I find them?
How can I write good CUDA code to estimate the "barrier synchronize" latencies myself?
Only one way to find out. Write some test code and time it.
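For the host-side case, a minimal sketch might look like the following. It times a loop of empty-kernel launches each followed by cudaDeviceSynchronize(); note that this measures the launch overhead and the synchronize together, and the iteration count and launch configuration here are arbitrary choices, not anything prescribed by CUDA.

```cuda
#include <cstdio>
#include <chrono>

// Empty kernel: with no work to do, the launch + synchronize pair
// measures mostly overhead rather than kernel execution time.
__global__ void empty_kernel() {}

int main() {
    const int iters = 1000;

    // Warm up: the first launch includes one-time context setup costs.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // host blocks until the GPU is idle
    }
    auto stop = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count()
                / iters;
    printf("avg launch + cudaDeviceSynchronize: %.2f us\n", us);
    return 0;
}
```

Compile with nvcc and run it on each GPU you care about; the average will vary with driver version, GPU, and system load.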
Last I measured it, the minimum time for cudaDeviceSynchronize() was about 15 usec (micro-seconds). Obviously, if the GPU has not finished its work yet, cudaDeviceSynchronize() will have to wait until it has, and that time could be arbitrarily long.
Each of the operations you mentioned does something different, so you will probably get a different answer for each: cudaDeviceSynchronize() is a host-side call that blocks until all GPU work completes, __syncthreads() is a barrier for the threads of a single block, and __threadfence() is not a barrier at all but a memory fence that orders memory accesses.
This number will also probably be different for different GPUs, depending on microarchitecture.
Finally, these numbers are closely related to the memory access latencies of various components in the system, and there can be a very big difference in latency between a chip that is idle and one that is fully loaded.
So be specific about your use-case, and measure it.
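For the device-side case, a rough sketch is to time a loop of __syncthreads() calls with clock64() inside the kernel. This is an assumption-laden microbenchmark: the result is in SM clock cycles (divide by the SM clock rate for time), the compiler and clock64() overhead perturb small loops, and the cost changes with block size and occupancy.

```cuda
#include <cstdio>

// Each block times a loop of __syncthreads() calls using the
// per-SM cycle counter clock64(); thread 0 writes the average.
__global__ void sync_cost(long long *cycles, int iters) {
    long long start = clock64();
    for (int i = 0; i < iters; ++i) {
        __syncthreads();
    }
    long long stop = clock64();
    if (threadIdx.x == 0) {
        *cycles = (stop - start) / iters;  // avg cycles per barrier
    }
}

int main() {
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_cycles, sizeof(long long));

    // One block of 256 threads is an arbitrary example configuration;
    // repeat with your real block size to get a relevant number.
    sync_cost<<<1, 256>>>(d_cycles, 10000);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long),
               cudaMemcpyDeviceToHost);

    printf("approx cycles per __syncthreads (1 block, 256 threads): %lld\n",
           h_cycles);
    cudaFree(d_cycles);
    return 0;
}
```

Treat the output as an order-of-magnitude estimate under these specific conditions, not a documented latency figure.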