Since a cudaBindTexture call is not specific to a stream, I am wondering whether the bind and unbind calls imply device-wide synchronization. The CUDA C Programming Guide does not mention them in the implicit synchronization section on streams, so I am a bit confused. For instance, it seems you could not bind a texture, launch a kernel in one stream, then bind that same texture again (to a different piece of memory) and launch another kernel in another stream without at least some kind of stream synchronization (or device-wide synchronization). There must be some documentation on this. Can someone point me in the right direction? Thank you in advance for your help.
For anyone who might be interested in the answer, I ran a characterization test and found the following.
For the case of two kernels in the same stream:
bind a texture->call a kernel in one stream that uses the texture->unbind the texture->call a kernel in the same stream that uses the same texture.
Both kernels correctly access the texture memory; the unbind call appears to do nothing (which may be consistent with the fact that the API reference lists cudaSuccess as the only return value of cudaUnbindTexture).
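A minimal sketch of how the same-stream test could be reproduced is below. It assumes CUDA 11.x or earlier, because texture references (`texture<>`, `cudaBindTexture`, `cudaUnbindTexture`) were deprecated and then removed in CUDA 12.0. The kernel `copyFromTex` and the helpers `countMismatches` and `runSameStreamCase` are hypothetical names for illustration. Note that the second kernel reads through a texture that is no longer bound, which is outside what the programming guide documents, so whatever it prints is an observation about the hardware and driver, not a guarantee.

```cuda
// Assumption: CUDA 11.x or earlier (texture references were removed in CUDA 12.0).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One file-scope texture reference shared by every launch (legacy API).
texture<float, cudaTextureType1D, cudaReadModeElementType> texRef;

// Copies texRef[i] into out[i] so the host can see what the kernel read.
__global__ void copyFromTex(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texRef, i);
}

// Hypothetical helper: number of positions where got differs from want,
// or -1 if the sizes differ.
int countMismatches(const std::vector<float>& got, const std::vector<float>& want) {
    if (got.size() != want.size()) return -1;
    int bad = 0;
    for (size_t i = 0; i < got.size(); ++i)
        if (got[i] != want[i]) ++bad;
    return bad;
}

// Hypothetical driver: bind -> launch in s -> unbind -> launch in s.
void runSameStreamCase(int n) {
    size_t bytes = n * sizeof(float);
    std::vector<float> expected(n), r1(n), r2(n);
    for (int i = 0; i < n; ++i) expected[i] = static_cast<float>(i);

    float *src, *out1, *out2;
    cudaMalloc(&src, bytes);
    cudaMalloc(&out1, bytes);
    cudaMalloc(&out2, bytes);
    cudaMemcpy(src, expected.data(), bytes, cudaMemcpyHostToDevice);

    cudaStream_t s;
    cudaStreamCreate(&s);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Pointers from cudaMalloc are aligned, so the offset argument may be NULL.
    cudaBindTexture(nullptr, texRef, src, bytes);
    copyFromTex<<<blocks, threads, 0, s>>>(out1, n);
    cudaUnbindTexture(texRef);                        // unbind between the two launches
    copyFromTex<<<blocks, threads, 0, s>>>(out2, n);  // reads through texRef again
    cudaStreamSynchronize(s);

    cudaMemcpy(r1.data(), out1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r2.data(), out2, bytes, cudaMemcpyDeviceToHost);
    std::printf("kernel 1 mismatches: %d\n", countMismatches(r1, expected));
    std::printf("kernel 2 mismatches: %d\n", countMismatches(r2, expected));
    std::printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s);
    cudaFree(src);
    cudaFree(out1);
    cudaFree(out2);
}
```

Calling `runSameStreamCase(1 << 20)` from `main` prints how many elements each kernel read incorrectly; both counts being zero would match the result reported above.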
For the case of two kernels in separate streams (that have been verified to be running concurrently):
bind a texture->call a kernel in one stream that uses the texture->bind the texture to a separate location in global memory->call a kernel in a different stream that uses the same texture (as the first)
Both kernels correctly accessed the two different regions of global memory that the single texture had been bound to (a surprising result to me).
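The two-stream case could be sketched as follows, again assuming CUDA 11.x or earlier. To give the host a chance to rebind the texture while the first kernel is still running, the hypothetical kernel `delayedCopyFromTex` busy-waits on `clock64()` before reading; the two source buffers hold different constants so each kernel's result shows which binding it saw. `countNotEqual` and `runTwoStreamCase` are hypothetical helpers. The sketch does not by itself prove the kernels overlapped; that still has to be confirmed, for example on a profiler timeline.

```cuda
// Assumption: CUDA 11.x or earlier (texture references were removed in CUDA 12.0).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

texture<float, cudaTextureType1D, cudaReadModeElementType> texRef;

// Spins for about `spin` clock cycles, then copies texRef into out, so the
// host has time to rebind texRef while this kernel may still be running.
__global__ void delayedCopyFromTex(float* out, int n, long long spin) {
    long long start = clock64();
    while (clock64() - start < spin) { }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texRef, i);
}

// Hypothetical helper: count elements of v that are not equal to value.
int countNotEqual(const std::vector<float>& v, float value) {
    int bad = 0;
    for (float x : v)
        if (x != value) ++bad;
    return bad;
}

// Hypothetical driver: bind A -> launch in s1 -> rebind to B -> launch in s2.
void runTwoStreamCase(int n, long long spin) {
    size_t bytes = n * sizeof(float);
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), r1(n), r2(n);

    float *srcA, *srcB, *out1, *out2;
    cudaMalloc(&srcA, bytes);
    cudaMalloc(&srcB, bytes);
    cudaMalloc(&out1, bytes);
    cudaMalloc(&out2, bytes);
    cudaMemcpy(srcA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(srcB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    cudaBindTexture(nullptr, texRef, srcA, bytes);
    delayedCopyFromTex<<<blocks, threads, 0, s1>>>(out1, n, spin);
    cudaBindTexture(nullptr, texRef, srcB, bytes);  // supersedes the previous binding
    delayedCopyFromTex<<<blocks, threads, 0, s2>>>(out2, n, 0);
    cudaDeviceSynchronize();

    cudaMemcpy(r1.data(), out1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r2.data(), out2, bytes, cudaMemcpyDeviceToHost);
    std::printf("kernel 1 (expects 1.0) mismatches: %d\n", countNotEqual(r1, 1.0f));
    std::printf("kernel 2 (expects 2.0) mismatches: %d\n", countNotEqual(r2, 2.0f));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(srcA);
    cudaFree(srcB);
    cudaFree(out1);
    cudaFree(out2);
}
```

A small problem size such as `runTwoStreamCase(1024, 1LL << 28)` keeps each grid to a few blocks, which makes it more likely that both kernels can be resident at the same time.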
I will mention again that, yes, I am sure the two kernels were operating concurrently when this test was executed.
Take-away: regardless of streams, a kernel that uses a texture correctly accesses the global memory the texture was bound to before that kernel was launched, no matter what the host does to the texture while the kernel is running. Also, the cudaUnbindTexture call appears to do absolutely nothing to a previously bound texture (at least from a kernel's point of view).
We're currently facing this issue: we use one common texture reference on multiple streams, with different global memory bound to it for each stream. Your post seems to confirm that this doesn't cause a correctness problem.
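A different technique that sidesteps the shared global binding altogether is the texture object API (CUDA 5.0+, compute capability 3.0+), where the texture is passed to each kernel as an argument, so each stream's launch carries its own binding and nothing is rebound between launches. This is a swapped-in alternative, not what the tests above measured. The sketch assumes linear device memory of `float`; `makeLinearTex`, `gridFor`, and `runTwoStreamsWithTexObjects` are hypothetical helpers.

```cuda
// Assumption: CUDA 5.0+ and a device of compute capability 3.0 or higher.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The texture is a kernel argument, so each launch carries its own binding.
__global__ void copyFromTexObj(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Hypothetical helper: number of blocks needed to cover n elements.
int gridFor(int n, int threads) { return (n + threads - 1) / threads; }

// Hypothetical helper: wrap a linear device buffer of n floats in a texture object.
cudaTextureObject_t makeLinearTex(float* devPtr, int n) {
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = devPtr;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}

// Hypothetical driver: one texture object per stream, no rebinding.
void runTwoStreamsWithTexObjects(int n) {
    size_t bytes = n * sizeof(float);
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), r1(n), r2(n);

    float *srcA, *srcB, *out1, *out2;
    cudaMalloc(&srcA, bytes);
    cudaMalloc(&srcB, bytes);
    cudaMalloc(&out1, bytes);
    cudaMalloc(&out2, bytes);
    cudaMemcpy(srcA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(srcB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cudaTextureObject_t texA = makeLinearTex(srcA, n);
    cudaTextureObject_t texB = makeLinearTex(srcB, n);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const int threads = 256, blocks = gridFor(n, threads);
    copyFromTexObj<<<blocks, threads, 0, s1>>>(texA, out1, n);
    copyFromTexObj<<<blocks, threads, 0, s2>>>(texB, out2, n);
    cudaDeviceSynchronize();

    cudaMemcpy(r1.data(), out1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r2.data(), out2, bytes, cudaMemcpyDeviceToHost);
    std::printf("kernel 1 read %.1f, kernel 2 read %.1f\n", r1[0], r2[0]);

    // Destroy texture objects only after the kernels using them have finished.
    cudaDestroyTextureObject(texA);
    cudaDestroyTextureObject(texB);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(srcA);
    cudaFree(srcB);
    cudaFree(out1);
    cudaFree(out2);
}
```

Because the binding is part of each launch's arguments, there is no host-side state that a later bind could overwrite, which is the property the shared-reference approach only appears to provide empirically.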
However, this Stack Overflow question indicates that binding and unbinding textures forces some synchronization with respect to asynchronous memcpy operations, which may impede performance: