Processing image with a CUDA kernel gives me different result than a seemingly equivalent CPU function

xlucny · February 12, 2024, 9:49pm

Yes, I’m quite interested in comparing them. I’ll post it once I have access to the Jetson again (later this week).

xlucny · February 18, 2024, 4:20pm

Putting 0xA0 in directly and right-shifting it made no CPU-GPU difference, no matter the cast.

Assigning the result of the binary addition to a uint32_t (or int32_t; I checked both) gave 160, no matter the cast.
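For anyone repeating this check on the CPU side, here is a minimal sketch of the two experiments. The thread does not show the original expression, so the function names and the operand split (0x80 + 0x20) are hypothetical; only the values 0xA0 and 160 come from the post above.

```cpp
#include <cstdint>

// Hypothetical helper: adds two 8-bit values and stores the result in 32 bits.
// Both operands are promoted to int before the addition, so the sum is
// computed at full width regardless of how the result is cast afterwards.
std::uint32_t add_as_u32(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint32_t>(a + b);
}

// Hypothetical helper: right-shifts an 8-bit value and narrows the result.
// The operand is promoted to int first; for values in 0..255 the shift
// behaves the same as an unsigned shift.
std::uint8_t shift_right_u8(std::uint8_t v, int n) {
    return static_cast<std::uint8_t>(v >> n);
}
```

With these assumptions, add_as_u32(0x80, 0x20) yields 160 (0xA0) and shift_right_u8(0xA0, 4) yields 0x0A on a conforming host compiler, which gives a reference value to compare against the kernel output.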

xlucny · February 18, 2024, 4:23pm

I found no difference in the PTX/SASS code between them, and both seem to be the same as the one from @njuffa.
original_saas.txt (11.6 KB)
uint_16_saas.txt (11.7 KB)
original_ptx.txt (1.7 KB)
uint_16_ptx.txt (1.7 KB)

xlucny · February 18, 2024, 4:27pm

I’m linking this thread over to the Jetson Orin NX part of the forum. Here is the code for anyone who’s willing to replicate the error.

code_feb_18_B.zip (2.6 MB)

Output I’m getting:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Diff at: 7378 of 80
Diff at: 7379 of 240
Diff at: 7382 of 48
Difference between buffers: 368
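The output above is consistent with a host-side check that prints each mismatching index with its absolute difference and then reports the sum (80 + 240 + 48 = 368). The attached archive is not reproduced here, so the following is only a sketch of such a check; the function name and signature are assumptions, not the code from the zip file.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: compares a CPU-computed buffer with one copied back
// from the GPU. Prints every mismatching index together with the absolute
// difference and returns the sum of all absolute differences.
long compare_buffers(const std::uint8_t* cpu, const std::uint8_t* gpu,
                     std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Widen to int before subtracting so the difference cannot wrap.
        const int d = std::abs(static_cast<int>(cpu[i]) - static_cast<int>(gpu[i]));
        if (d != 0) {
            std::printf("Diff at: %zu of %d\n", i, d);
            total += d;
        }
    }
    return total;
}
```

A return value of 0 means the two buffers match byte for byte. Any other value, together with the printed indices, narrows down which pixels the kernel computed differently.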

xlucny · April 11, 2024, 7:49am

The issue has been further investigated in:
CUDA kernel behaving strangely for no apparent reason, not replicatable on a typical x86_x64 system - Jetson & Embedded Systems / Jetson Orin NX - NVIDIA Developer Forums

The -G flag in the Makefile caused the issue.

Curefab · April 12, 2024, 6:37pm

Perhaps one should say, the bug only appears with the -G flag activated. The original cause of the bug may be different.

system · April 26, 2024, 6:37pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Robert_Crovella · May 8, 2024, 3:06pm

There was an original cause of the bug. It should be fixed in CUDA 11.6 or later. That is, when compiling with CUDA 11.6, whether or not you use -G, you should not observe this issue in the test case the OP provided above.
