In our project we use cooperative groups with a tiled_partition<32> (each tile is equal to a warp).
After we moved from CUDA Toolkit version 11.4.4 to 12.5.0, a simple “__shfl_sync” instruction started causing an “illegal instruction” error.
More specifically, this code segment is what causes the problem:
const int tidl = g.thread_rank(); // Group lane.
int flag = 0xaaaaaaaa; // 10101010101010101010101010101010
row_prev = __shfl_sync(flag, row, tidl-1);
This instruction stores in “row_prev” the “row” value of the previous thread, but only for the 16 of the 32 threads of the warp that the mask selects (the odd-numbered lanes).
Does anyone have any idea why this happens? After inspecting the 12.5 release notes, as well as the CUDA Programming Guide for versions 11 and 12, I could not find any change regarding the shfl_X_sync instructions.
(The same error occurs with all of the other shfl_X_sync instructions.)
Thank you.
Replacing the more generic __shfl_sync with g.shfl, the call “specialized” for cooperative groups, fixed the problem.
What is curious is why this would cause an illegal instruction while, in another function, a __shfl_xor_sync (with a 0xffffffff mask) was called without creating any problem. Anyway, I replaced the __shfl_xor_sync with a g.shfl_xor too.
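For anyone landing here later, a minimal sketch of what the g.shfl version can look like. The kernel name, the buffers, and the host harness are hypothetical and not taken from our project. The sketch assumes a launch with exactly one full warp, that every lane of the tile calls g.shfl, and that the "odd lanes only" selection is applied after the shuffle.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Hypothetical kernel: every lane fetches the `row` value of the lane before it.
__global__ void prev_row_tile(const int* rows, int* rows_prev)
{
    cg::thread_block_tile<32> g = cg::tiled_partition<32>(cg::this_thread_block());

    const int tidl = g.thread_rank();   // Group lane.
    const int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int row  = rows[gid];

    // All 32 lanes of the tile call shfl. Lane 0 wraps around to lane 31
    // explicitly, so no negative source rank is ever passed.
    const int src       = (tidl + 31) % 32;
    const int from_prev = g.shfl(row, src);

    // Only the odd lanes keep the shuffled value; even lanes keep their own.
    rows_prev[gid] = (tidl & 1) ? from_prev : row;
}

int main()
{
    const int n = 32;                   // one full warp
    int h_rows[n], h_prev[n];
    for (int i = 0; i < n; ++i) h_rows[i] = 100 + i;

    int *d_rows = nullptr, *d_prev = nullptr;
    cudaMalloc(&d_rows, n * sizeof(int));
    cudaMalloc(&d_prev, n * sizeof(int));
    cudaMemcpy(d_rows, h_rows, n * sizeof(int), cudaMemcpyHostToDevice);

    prev_row_tile<<<1, n>>>(d_rows, d_prev);
    const cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_prev, d_prev, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Host-side check: odd lanes must hold the previous lane's value.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const int expected = (i & 1) ? h_rows[i - 1] : h_rows[i];
        if (h_prev[i] != expected) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_rows);
    cudaFree(d_prev);
    return mismatches != 0;
}
```

The tile object takes care of the participation mask, so no hand-written mask constant appears in the kernel.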
Thank you again for your assistance.
I have not looked at the specific code involved here, but the usual scenario is that some particular code “happened to work” (as opposed to “worked by design”).
Erroneous code that invokes undefined behavior may under some circumstances happen to work as intended by the programmer, but as the behavior is in fact undefined, it can change at any time and in particular when switching from one version of a compiler to the next. In other words, in such instances changing the compiler version exposes a latent bug in the code.
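As an illustration of the kind of latent bug meant here: the CUDA Programming Guide states that each thread calling a __shfl_*_sync intrinsic must have its own bit set in the mask, and that reading from a lane that does not participate yields an undefined value. With the mask 0xaaaaaaaa only the odd lanes are named, while the source lane tidl-1 of every odd lane is an even lane. Whether this is what triggered the error in the original code is an assumption, since the surrounding code was not posted. A minimal sketch that stays within the documented rules, with a hypothetical kernel name and host harness and a launch of exactly one full warp, could look like this:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, assumed to be launched with exactly 32 threads.
__global__ void prev_row_full_mask(const int* rows, int* rows_prev)
{
    const int lane = threadIdx.x & 31;
    const int row  = rows[threadIdx.x];

    // All 32 lanes execute the shuffle, so all 32 bits are set in the mask.
    // With delta 1 each lane reads from lane - 1; lane 0 gets its own value back.
    const int from_prev = __shfl_up_sync(0xffffffffu, row, 1);

    // The "odd lanes only" selection happens after the shuffle, not in the mask.
    rows_prev[threadIdx.x] = (lane & 1) ? from_prev : row;
}

int main()
{
    const int n = 32;
    int h_rows[n], h_prev[n];
    for (int i = 0; i < n; ++i) h_rows[i] = 100 + i;

    int *d_rows = nullptr, *d_prev = nullptr;
    cudaMalloc(&d_rows, n * sizeof(int));
    cudaMalloc(&d_prev, n * sizeof(int));
    cudaMemcpy(d_rows, h_rows, n * sizeof(int), cudaMemcpyHostToDevice);

    prev_row_full_mask<<<1, n>>>(d_rows, d_prev);
    const cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_prev, d_prev, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Host-side check: odd lanes must hold the previous lane's value.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const int expected = (i & 1) ? h_rows[i - 1] : h_rows[i];
        if (h_prev[i] != expected) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_rows);
    cudaFree(d_prev);
    return mismatches != 0;
}
```

The design choice is to let every lane take part in the exchange and discard the result where it is not needed, rather than encoding the selection in the mask.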