I have a question about the difference between reading from and writing to global memory. The situation is as follows: I am implementing a kernel and have two options. The first option is to make the threads in a block diverge by having a few of them read more data from global memory than the others. The second option is to replicate parts of the data and make the threads diverge as well, but this time by having some of them write more data to global memory than the others (to update the replicated copies of the data). In both cases the extra reads or writes do not require synchronization afterwards. Also, in the first option the divergence occurs at the beginning of the thread, whereas in the second option it occurs at the end. The choice between the two options therefore depends on whether the latencies of reads and writes differ (and if so, which is slower), and on whether divergence at the beginning behaves differently from divergence at the end.
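To make the two options concrete, here is a minimal sketch of what I mean (the kernel names, the "first four threads" condition, and the data layout are placeholders, not my actual code):

// Option 1: a few threads per block read extra data at the start of the thread.
__global__ void option1_extra_reads(const float* __restrict__ in,
                                    float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];

    // Divergence at the beginning: only the first few threads of the block
    // perform an additional read from global memory.
    float extra = 0.0f;
    if (threadIdx.x < 4 && i + blockDim.x < n)
        extra = in[i + blockDim.x];

    out[i] = v + extra;
}

// Option 2: the data is replicated; a few threads write extra copies at the end.
__global__ void option2_extra_writes(const float* __restrict__ in,
                                     float* __restrict__ out,
                                     float* __restrict__ replica, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i] * 2.0f;
    out[i] = v;

    // Divergence at the end: only the first few threads of the block
    // also update the replicated copy in global memory.
    if (threadIdx.x < 4)
        replica[i] = v;
}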
This is probably something one could write a tiny test kernel to evaluate pretty quickly. If nobody chimes in with a hard answer, I'd just write a small test and find out by experimentation. I've found writing many small test kernels to be very helpful in my own CUDA work thus far.
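For example (just a sketch, assuming the two placeholder kernels from the first post; the problem size, block size, and repeat count are arbitrary), a CUDA-event timing harness along these lines would let you compare the two variants directly:

#include <cstdio>
#include <cuda_runtime.h>

// Declarations matching the hypothetical kernels sketched above.
__global__ void option1_extra_reads(const float*, float*, int);
__global__ void option2_extra_writes(const float*, float*, float*, int);

int main()
{
    const int n = 1 << 24;                          // arbitrary problem size
    const int block = 256, grid = (n + block - 1) / block;
    const int reps = 100;

    float *in, *out, *replica;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&replica, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launches so first-launch overhead doesn't skew the timing.
    option1_extra_reads<<<grid, block>>>(in, out, n);
    option2_extra_writes<<<grid, block>>>(in, out, replica, n);
    cudaDeviceSynchronize();

    float ms = 0.0f;

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        option1_extra_reads<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("option 1 (extra reads):  %.3f ms per launch\n", ms / reps);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        option2_extra_writes<<<grid, block>>>(in, out, replica, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("option 2 (extra writes): %.3f ms per launch\n", ms / reps);

    cudaFree(in); cudaFree(out); cudaFree(replica);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}

Comparing the two averages (and ideally a profiler run as well) should show whether the extra reads or the extra writes cost more for your particular access pattern.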
Hm, my experience is that tracking down latency issues is very tricky, as it is quite hard to keep the latency from being hidden or amplified by other effects. If someone from NVIDIA could shed a little light on whether kernels wait for write instructions to complete before finishing themselves, that would be very handy. If you have evidence either way, however, I would be very interested in learning about it. Thanks.