Processing an image with a CUDA kernel gives me a different result than a seemingly equivalent CPU function

Yes, I’m quite interested in comparing them. I’ll post it once I have access to the Jetson again (later this week).

Directly putting in 0xA0 and right-shifting it made no difference between CPU and GPU, no matter the cast.

Assigning the result of the binary addition to a uint32_t (or an int32_t; I checked both) gave 160, no matter the cast.
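For anyone repeating this experiment, it may help to remember that in both C++ and CUDA C++ operands narrower than int are promoted to int before `+` or `>>`, so only the destination type decides whether the upper bits survive. Below is a minimal host-side sketch of that rule. The helper names and the operand values (0x50 + 0x50, picked only because they sum to 0xA0 = 160) are hypothetical; the actual expression from the kernel is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not taken from the attached code.

// uint8_t operands are promoted to int before the addition, so the sum is
// computed without wrap-around and stored unchanged in a 32-bit result.
constexpr uint32_t add_wide(uint8_t a, uint8_t b) {
    return static_cast<uint32_t>(a + b);
}

// Same addition, but the result is narrowed back to 8 bits (modulo 256).
constexpr uint8_t add_narrow(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(a + b);
}

// Right shift of an 8-bit value: promoted to int first, then shifted.
constexpr uint8_t shift_right(uint8_t v, int n) {
    return static_cast<uint8_t>(v >> n);
}

// Compile-time checks of the promotion rules described above.
static_assert(add_wide(0x50, 0x50) == 160, "0x50 + 0x50 fits in 8 bits");
static_assert(add_wide(0xA0, 0xA0) == 320, "int arithmetic keeps the carry");
static_assert(add_narrow(0xA0, 0xA0) == 64, "320 modulo 256");
static_assert(shift_right(0xA0, 4) == 0x0A, "0xA0 >> 4");
```

Since these rules are the same on host and device, a check like this should behave identically in a `__device__` function, which is why a CPU/GPU mismatch here would be surprising.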

I found no difference between them in the PTX/SASS code, and both seem to be the same as the one from @njuffa.
original_saas.txt (11.6 KB)
uint_16_saas.txt (11.7 KB)
original_ptx.txt (1.7 KB)
uint_16_ptx.txt (1.7 KB)

I’m linking this thread over to the Jetson Orin NX part of the forum. Here is the code for anyone who’s willing to replicate the error.

code_feb_18_B.zip (2.6 MB)

Output I’m getting:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Diff at: 7378 of 80
Diff at: 7379 of 240
Diff at: 7382 of 48
Difference between buffers: 368
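
For readers without the archive at hand, here is a rough sketch of how a log like this could be produced. It is a guess modeled on the output, not the code from the zip: the function name, the 8-bit element type, and the reading of "Diff at: i of d" as "index i, absolute difference d" are all assumptions. The numbers are at least consistent with that reading, since 544 × 256 = 139264 and 80 + 240 + 48 = 368.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical comparison helper; the name, element type, and output format
// are assumptions modeled on the log, not code from the attached archive.
// Prints every mismatching index with its absolute difference and returns
// the sum of all absolute differences.
long compare_buffers(const std::vector<uint8_t>& cpu,
                     const std::vector<uint8_t>& gpu) {
    long total = 0;
    // Compare only the common prefix if the sizes ever disagree.
    const std::size_t n = cpu.size() < gpu.size() ? cpu.size() : gpu.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Widen to int first so the subtraction cannot wrap around.
        const int d = std::abs(static_cast<int>(cpu[i]) - static_cast<int>(gpu[i]));
        if (d != 0) {
            std::printf("Diff at: %zu of %d\n", i, d);
            total += d;
        }
    }
    return total;
}
```

Calling `compare_buffers(cpu_result, gpu_result)` on the host after copying the GPU buffer back would then give the per-element lines plus the total.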