Processing an image with a CUDA kernel gives me a different result than a seemingly equivalent CPU function

Yes, I’m quite interested in comparing them. I’ll post it once I have access to the Jetson again (later this week).

Hard-coding 0xA0 and right-shifting it produced no difference between CPU and GPU, regardless of the cast.

Assigning the result of the binary addition to a uint32_t (I also checked int32_t) gave 160, regardless of the cast.

I found no difference between the PTX/SASS code of the two versions, and both seem to match the code posted by @njuffa.
original_saas.txt (11.6 KB)
uint_16_saas.txt (11.7 KB)
original_ptx.txt (1.7 KB)
uint_16_ptx.txt (1.7 KB)

I’m linking this thread over to the Jetson Orin NX part of the forum. Here is the code for anyone who’s willing to replicate the error.

code_feb_18_B.zip (2.6 MB)

Output I’m getting:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Diff at: 7378 of 80
Diff at: 7379 of 240
Diff at: 7382 of 48
Difference between buffers: 368

The issue has been further investigated in:
CUDA kernel behaving strangely for no apparent reason, not replicatable on a typical x86_x64 system - Jetson & Embedded Systems / Jetson Orin NX - NVIDIA Developer Forums

The -G flag in the Makefile caused the issue.

Perhaps one should say, the bug only appears with the -G flag activated. The original cause of the bug may be different.

There was an original cause of the bug, and it should be fixed in CUDA 11.6 or later. That is, when compiling with CUDA 11.6, with or without -G, you should not observe this issue in the test case provided by the OP above.