The attached code has a single kernel that contains three nested loops. The loops enclose an update to a location in memory. The address of that location is computed from the loop indices and a set of other variables. To optimize the kernel, I replaced the direct computation of the memory address with a set of simpler arithmetic operations placed around the loop iterations. The two computations should yield the same final address for the update. The two addresses are identical in emulation mode, but they are not identical in real (non-emulated) mode.
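To make the comparison concrete, here is a minimal host-side sketch in plain C of the kind of check described above. It is not the attached kernel: the `Layout` struct, the stride names, and both function names are hypothetical, and the address formula is assumed to be a simple base plus index-times-stride sum, which may be simpler than the real one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address layout; none of these names come from the attached kernel. */
typedef struct {
    size_t base;      /* offset of the first element */
    size_t stride_i;  /* step per outer-loop iteration */
    size_t stride_j;  /* step per middle-loop iteration */
    size_t stride_k;  /* step per inner-loop iteration */
} Layout;

/* Direct form: recompute the full address from the three loop indices. */
static size_t direct_address(const Layout *l, size_t i, size_t j, size_t k)
{
    assert(l != NULL);
    return l->base + i * l->stride_i + j * l->stride_j + k * l->stride_k;
}

/* Incremental form: carry a running address per loop level and add one
 * stride per iteration. Returns how many times it disagrees with the
 * direct form over the whole ni x nj x nk iteration space. */
static size_t count_mismatches(const Layout *l, size_t ni, size_t nj, size_t nk)
{
    assert(l != NULL);
    size_t mismatches = 0;
    size_t addr_i = l->base;
    for (size_t i = 0; i < ni; ++i) {
        size_t addr_j = addr_i;
        for (size_t j = 0; j < nj; ++j) {
            size_t addr_k = addr_j;
            for (size_t k = 0; k < nk; ++k) {
                if (addr_k != direct_address(l, i, j, k)) {
                    ++mismatches;
                }
                addr_k += l->stride_k;  /* advance along the inner loop */
            }
            addr_j += l->stride_j;      /* advance along the middle loop */
        }
        addr_i += l->stride_i;          /* advance along the outer loop */
    }
    return mismatches;
}
```

With integer arithmetic like this, `count_mismatches` should return 0 for any layout on the host, which corresponds to the behavior seen in emulation mode. The same comparison done inside the kernel is where the two addresses diverge.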
bug_report.tar.gz (128 KB)