Issue with Array Index Swapping at the End of CUDA Kernel Functions

Hello CUDA experts,
I’m encountering a strange numerical instability in my CUDA code that I hope you can help me understand.
I have two kernel functions that are called within a large iteration loop. Both kernels perform similar operations but swap different array indices at the very end of each kernel function. Specifically:


• The first kernel swaps date_end(ix, iy, 2) with date_end(ix, iy, 3) at the end of its execution
• The second kernel swaps date_start(ix, iy, 2) with date_start(ix, iy, 3) at the end of its execution
The problem is that when these two kernels are used together within my main iteration loop, numerical errors start appearing in the second iteration, and after about 10,000 iterations, the values become NaN.
Interestingly, when I extract just these swap operations into a separate standalone kernel function, the calculation remains stable and correct.
Could anyone explain why combining these seemingly simple swap operations at the end of separate kernels within an iteration loop would cause numerical instability? Are there any known issues with memory access patterns or race conditions that might explain this behavior?
Any insights would be greatly appreciated. I’d be happy to provide more code details if needed.
Thank you in advance for your time and expertise!

From the description, it sounds like there is some kind of dependency between the two sections, where some threads move on to the second section before the data they need has been assigned in the first. However, presumably the thread-to-index mapping (ix, iy) is the same in both, so I’m not sure.

Can you please provide a minimal reproducing example, or at least show the full kernel? There might be something else going on that’s not shown in the code snippet.

-Mat

My computer is in the laboratory. In the code, before the first kernel function, we use date_in to calculate values such as FM11, FM22, etc., and these are all global variables. The second kernel function uses date_out to calculate FM11 and so on. These two kernel functions perform calculations in a loop.

I’m wondering if this issue could be related to cache coherence. If not, could you please explain the relevant reasons?

Possibly, but more likely it’s due to timing. CUDA blocks run independently of each other and have different lifetimes. Some blocks may not even start executing before others have retired. Hence, if you rely on an array being completely filled by all blocks before it is used, you could see this issue.

I have no idea if this is the case here since you don’t show enough code. The code that is shown only uses a single element in each section, so it is unlikely to be the issue. Though if you have other arrays (you mention “date_in” and “date_out”) that access elements which must first be assigned by threads in other blocks, then this could be the problem.

Threads within a block can be synchronized to prevent some threads from continuing before other threads have finished an earlier section. However, synchronization between blocks is typically only done across kernel calls, the exception being if you are using cooperative groups.
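To illustrate the point about kernel calls acting as the synchronization point between blocks, here is a minimal CUDA C++ sketch. It is not the poster's code: the kernels `produce` and `consume`, the array names `a` and `b`, and the helper `run_pipeline` are all hypothetical. The assumption is that both kernels are launched into the same (default) stream, so the second kernel does not begin until every block of the first has finished.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stage 1: each thread writes only its own element of a.
__global__ void produce(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 2.0f * i;
}

// Hypothetical stage 2: each thread reads its left neighbor, which may have
// been written by a different block. This is safe only because produce()
// has completely finished before consume() starts.
__global__ void consume(const float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i] + (i > 0 ? a[i - 1] : 0.0f);
}

// Runs both stages back to back and returns true if every b[i] has the
// value expected when stage 1 fully precedes stage 2.
bool run_pipeline(int n) {
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&b, bytes) != cudaSuccess) { cudaFree(a); return false; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Same stream: the boundary between these two launches is the
    // grid-wide synchronization point.
    produce<<<blocks, threads>>>(a, n);
    consume<<<blocks, threads>>>(a, b, n);

    std::vector<float> h(n);
    cudaError_t err = cudaMemcpy(h.data(), b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        float expected = 2.0f * i + (i > 0 ? 2.0f * (i - 1) : 0.0f);
        if (h[i] != expected) return false;
    }
    return true;
}
```

Calling `run_pipeline(1 << 16)` from `main` exercises many blocks. If the two stages were fused into one kernel with only a `__syncthreads()` between them, the read of `a[i - 1]` at a block boundary would no longer be guaranteed to see the value written by the neighboring block.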

This is the beginning and end of the first kernel function, i.e. the part that touches global memory; the middle part is scalar register computation. Swapping the positions of these two 3D arrays causes an error when reading date_start(ix, iy, :) after executing these two kernel functions.


What I’m looking for are cases where the code assigns a value to an array element but also reads from different elements in the same array.

Here the “FM” arrays are assigned at (ix,iy), but are then read at (ix-1,iy). This is a backwards dependency, meaning the neighboring value of FM needs to be set before it can be read here. However, the order in which blocks execute is non-deterministic, so you can have cases where the value being read has not yet been set by another block, leading to incorrect results.
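A small CUDA C++ sketch of this kind of dependency, and one common way to avoid it, is shown here. It is only an illustration under stated assumptions: the one-dimensional array, the averaging formula, and the names `update_in_place`, `update_two_buffers` and `step_two_buffers` are hypothetical stand-ins, not the actual FM computation from the post.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Racy pattern (shown for contrast, never launched here): each thread
// overwrites fm[i] while another thread may be reading or writing
// fm[i - 1], so the result depends on execution order.
__global__ void update_in_place(float *fm, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n) fm[i] = 0.5f * (fm[i] + fm[i - 1]);
}

// Order-independent pattern: read only from fm_old, write only to fm_new.
__global__ void update_two_buffers(const float *fm_old, float *fm_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        fm_new[i] = (i > 0) ? 0.5f * (fm_old[i] + fm_old[i - 1]) : fm_old[i];
}

// Hypothetical host helper: performs one update step with separate input
// and output buffers and returns the new values.
std::vector<float> step_two_buffers(const std::vector<float> &in) {
    int n = static_cast<int>(in.size());
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    std::vector<float> out(n);

    float *d_old = nullptr, *d_new = nullptr;
    cudaMalloc(&d_old, bytes);
    cudaMalloc(&d_new, bytes);
    cudaMemcpy(d_old, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    update_two_buffers<<<blocks, threads>>>(d_old, d_new, n);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_new, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_old);
    cudaFree(d_new);
    return out;
}
```

In an iteration loop, the two device pointers could be swapped on the host between steps instead of exchanging array slices inside the kernel. Whether that maps onto the swap in the original code is an assumption, since the full kernels are not shown.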

Note from the original post, “date_start” appears to be getting set in a different routine, “DXEE_y”. You do read from “IX+1” in this routine, “DXEE_x”, but since presumably the elements are assigned in DXEE_y and only read in DXEE_x, then it’s fine.

Now, since I can only see a very small portion of the code, I can’t be certain this is the issue or whether there are other issues as well. Also, in the future, if you can copy the code into the post as text rather than as an image, and then put it in a code block (i.e. highlight the code and click on the “pre-formatted text” </> icon), that would be appreciated.

-Mat