Suppose that one kernel (kernel1) writes an array in global memory and another kernel (kernel2) reads the same array using the tex1Dfetch() function.
Suppose the two kernels are inside a loop body and are executed multiple times, roughly as in the sketch below.
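A minimal sketch of the pattern in question (the kernel bodies, variable names, and launch parameters are placeholders I made up for illustration; error checking is omitted):

```
#include <cstdio>
#include <cuda_runtime.h>

#define N       (1 << 20)
#define THREADS 256

// Texture reference bound to the array that kernel1 writes (CUDA 4.x-era API).
texture<float, 1, cudaReadModeElementType> tex;

__global__ void kernel1(float *data, int n, int iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = (float)(i + iter);     // write to global memory
}

__global__ void kernel2(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);     // read the same array through the texture
}

int main(void)
{
    int blocks = (N + THREADS - 1) / THREADS;
    float *d_data, *d_out;
    cudaMalloc((void **)&d_data, N * sizeof(float));
    cudaMalloc((void **)&d_out,  N * sizeof(float));
    cudaBindTexture(0, tex, d_data, N * sizeof(float));

    for (int iter = 0; iter < 10; ++iter) {
        kernel1<<<blocks, THREADS>>>(d_data, N, iter);   // producer
        kernel2<<<blocks, THREADS>>>(d_out, N);          // consumer: must see kernel1's writes
    }
    cudaThreadSynchronize();

    cudaUnbindTexture(tex);
    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```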
If the underlying GPU supports “concurrent kernel execution” (e.g., Tesla C2050/C2070), can these two kernels (kernel1 and kernel2) be executed concurrently?
If so, is the texture cache coherent with respect to the global memory writes?
The CUDA manual (NVIDIA CUDA C Programming Guide Version 4.0, Section 3.2.10.4) says that the texture cache is coherent with respect to writes by “previous kernel calls”, but it says nothing about writes by “concurrent kernels”:
“A thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.”
I think you could write a small sample program to test this and share the results here. I wouldn’t expect correct results with concurrent kernels reading from and writing to memory bound to textures…
I posted this question because the behavior is not easy to verify. The program below is the one causing trouble: it runs correctly on GPUs such as the ION and the Quadro 5600, but gives wrong results on Tesla C2050/C2070 GPUs. Interestingly, if I disable at least one of the texture accesses in the main_kernel0() function, the program produces correct output on the Tesla GPU; compiling with the -G option also produces correct output.
I can’t find any apparent error in the source code, so I suspect it may be a bug related to the Tesla C2050 executing compute capability 2.0 code. One distinctive feature of the Tesla C2050 is its support for concurrent kernel execution. Of course, I tried to disable that behavior using cudaThreadSynchronize(), but I’m not sure whether it actually prevents concurrent kernel execution.
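For reference, the attempt looked roughly like this (reusing the placeholder names from the sketch above; cudaThreadSynchronize() is the CUDA 4.x name for what is now cudaDeviceSynchronize()):

```
for (int iter = 0; iter < 10; ++iter) {
    kernel1<<<blocks, THREADS>>>(d_data, N, iter);
    cudaThreadSynchronize();   // block the host until kernel1 has finished
    kernel2<<<blocks, THREADS>>>(d_out, N);
    cudaThreadSynchronize();   // block the host until kernel2 has finished
}
```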
(1) The texture cache is guaranteed to be coherent with respect to writes by a previous kernel in the same stream.
(2) CUDA operations not explicitly assigned to a stream by the programmer are placed in the null (default) stream; thus every CUDA operation is part of some stream.
(3) For two kernels to execute concurrently, they must be in different streams.
(4) If there is a data dependency between kernels in different streams (regardless of whether textures are involved), explicit inter-stream synchronization (e.g. cudaStreamWaitEvent) must be used; otherwise a race condition exists. See the sketch after this list.
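To illustrate point (4), here is a minimal sketch of inter-stream synchronization using cudaStreamWaitEvent (kernel names and launch parameters are placeholders carried over from the earlier sketches):

```
cudaStream_t stream1, stream2;
cudaEvent_t  k1_done;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&k1_done);

kernel1<<<blocks, THREADS, 0, stream1>>>(d_data, N, 0);  // producer in stream1
cudaEventRecord(k1_done, stream1);         // event completes when kernel1 finishes
cudaStreamWaitEvent(stream2, k1_done, 0);  // stream2 waits for the event
kernel2<<<blocks, THREADS, 0, stream2>>>(d_out, N);      // consumer in stream2
```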
Thank you for the clarification. In that case, concurrent execution is at least not the reason for the incorrect output in the program above, since both kernels are launched into the same default stream.