CUDA kernel behaving strangely for no apparent reason, not reproducible on a typical x86_64 system

xlucny · February 18, 2024, 4:32pm

Hello,

I’m experiencing issues with CUDA kernels that have no apparent cause and make no sense to me. I originally posted this in the CUDA programming part of this forum; however, people were unable to reproduce the issue there (using typical x86_64 systems). Would anyone from NVIDIA please try to reproduce it on their Orin NX?

System:
Jetson Orin NX (16 GB RAM), JetPack 5.1.2, L4T 35.4.1, CUDA 11.4.315, code compiled as C++17

AastaLLL · February 20, 2024, 3:24am

Hi,

We tried to reproduce this issue (full_code.zip) in our environment but ran into some errors related to OpenCV.

Do you need a custom OpenCV with CUDA enabled?
If yes, could you also share the installation steps or script with us?

Thanks.

xlucny · February 20, 2024, 11:15am

Hi, thank you for the response.

The original script used OpenCV compiled with CUDA and OpenGL; however, it’s not needed to reproduce the issue.

code_feb_18_B.zip (2.6 MB)
This is a scaled-down version that does not use OpenCV. The output it gives on the Orin NX is:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Diff at: 7378 of 80
Diff at: 7379 of 240
Diff at: 7382 of 48
Difference between buffers: 368

However, when I run it on my x86_64 PC (edited; originally written as "x86"), after adjusting the architectures in the Makefile from the Orin’s sm_87 to my PC’s sm_86, the output is:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Difference between buffers: 0
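
The three reported differences (80, 240 and 48) add up to the final total of 368, which suggests that the test sums absolute per-byte differences between the CPU and GPU result buffers. The attached code is not shown in the thread, so the following C++ helper is only a sketch of such a comparison; the name compare_buffers and the exact output format are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not taken from code_feb_18_B.zip): compares a CPU
// reference buffer with the buffer copied back from the GPU, prints every
// mismatching index, and returns the sum of absolute per-byte differences.
long compare_buffers(const std::vector<std::uint8_t>& cpu,
                     const std::vector<std::uint8_t>& gpu)
{
    assert(cpu.size() == gpu.size());  // both buffers must cover the same data

    long total = 0;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const int diff = std::abs(int(cpu[i]) - int(gpu[i]));
        if (diff != 0) {
            std::printf("Diff at: %zu of %d\n", i, diff);
            total += diff;
        }
    }
    std::printf("Difference between buffers: %ld\n", total);
    return total;
}
```

A return value of 0 means the two buffers are byte-for-byte identical, as in the sm_86 run above.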

Curefab · February 20, 2024, 1:47pm

You probably mean an x64 PC with compute capability 8.6? x86 typically means 32-bit.

xlucny · February 20, 2024, 1:58pm

Yes, it’s x86_64 with CUDA 12 and an RTX 3060, which corresponds to sm_86.

AastaLLL · February 21, 2024, 8:42am

Hi,

Thanks for sharing.

We confirmed that we can reproduce similar behavior in our environment.
Here is a closer look at one of the failing cases:

out = ( ( src1 & 0b11110000 ) >> 4 ) | ( ( src2 & 0b00001111 ) << 4 )

cpu = ( ( 170 & 0b11110000 ) >> 4 ) | ( ( 170 & 0b00001111 ) << 4 )
    = ( 160 >> 4 ) | ( 10 << 4 )
    = 10 | 160 = 170

gpu = ( ( 170 & 0b11110000 ) >> 4 ) | ( ( 170 & 0b00001111 ) << 4 )
    = ( 160 >> 4 ) | ( 10 << 4 )
    = 250 | 160 = 250

We are not sure why 160 >> 4 returns 250 on Jetson.
We are checking with our internal team and will share an update later.

Thanks.
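
One observation that may help when reading the numbers above: 250 (0b11111010) is what an 8-bit arithmetic right shift of 160 (0b10100000) produces, that is, a shift that copies the top bit into the vacated positions instead of filling them with zeros. The thread does not confirm that this is what the affected build does; the C++ sketch below only checks the arithmetic, and both function names are made up for illustration.

```cpp
#include <cstdint>

// Logical shift: what (src & 0b11110000) >> 4 should yield for a uint8_t.
constexpr std::uint8_t logical_shr4(std::uint8_t v)
{
    return static_cast<std::uint8_t>(v >> 4);
}

// Emulated 8-bit arithmetic shift (an assumption used for illustration):
// the four vacated high bits are filled with copies of bit 7.
constexpr std::uint8_t arithmetic_shr4(std::uint8_t v)
{
    return static_cast<std::uint8_t>((v >> 4) | ((v & 0x80) ? 0xF0 : 0x00));
}

// Expected result, as computed on the CPU.
static_assert(logical_shr4(160) == 10, "160 >> 4 is 10 for an unsigned byte");
// Value reported on the Jetson for the same sub-expression.
static_assert(arithmetic_shr4(160) == 250, "sign-filling shift gives 250");
// Combined with the second operand ( 10 << 4 ) this reproduces the GPU output.
static_assert((arithmetic_shr4(160) | (10 << 4)) == 250, "matches gpu = 250");
```

The same emulation applied to 0b11110000 gives 255 instead of 15, a difference of 240, which is one of the magnitudes in the output posted earlier.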

AastaLLL · February 22, 2024, 2:03am

Hi,

Based on the issue, a temporary workaround is to reorder the shift and the AND operations to avoid the problem.

For example:

diff --git a/kernel.cu b/kernel.cu
index 4213d23..e498c8b 100644
--- a/kernel.cu
+++ b/kernel.cu
@@ -12,8 +12,8 @@ void depacking_kernel(uint8_t *src, uint8_t *dst)

   dst[dst_it]=src[src_it];
   dst[dst_it+1]=src[src_it+1] & 0b00001111;
-  dst[dst_it+2]=((src[src_it+1] & 0b11110000) >> 4) | ((src[src_it+2] & 0b00001111) << 4);
-  dst[dst_it+3]=((src[src_it+2] & 0b11110000) >> 4);
+  dst[dst_it+2]=((src[src_it+1]>>4) & 0b00001111) | ((src[src_it+2] & 0b00001111) << 4);
+  dst[dst_it+3]=((src[src_it+2]>>4) & 0b00001111);

 }

Hope this unblocks your work in the meantime.
Thanks.
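
For unsigned 8-bit values the two forms in the diff are equivalent, so the workaround should not change results on a correctly compiled build. A small host-side C++ check, sketched below with hypothetical function names, compares both expressions for every pair of input bytes.

```cpp
#include <cassert>
#include <cstdint>

// Original expression from kernel.cu: mask the high nibble, then shift it down.
constexpr std::uint8_t mask_then_shift(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(((a & 0b11110000) >> 4) | ((b & 0b00001111) << 4));
}

// Workaround from the diff: shift first, then mask the low nibble.
constexpr std::uint8_t shift_then_mask(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(((a >> 4) & 0b00001111) | ((b & 0b00001111) << 4));
}

// Hypothetical host-side check: asserts that both forms agree for all
// 256 x 256 combinations of input bytes.
inline void check_all_pairs()
{
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            const auto x = static_cast<std::uint8_t>(a);
            const auto y = static_cast<std::uint8_t>(b);
            assert(mask_then_shift(x, y) == shift_then_mask(x, y));
        }
    }
}
```

Calling check_all_pairs() on the host passes silently when the two expressions agree. Because the mask is applied last, the shift-then-mask form also discards any unexpected bits shifted in from the left, which may be why it sidesteps the problem on the affected build.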

xlucny · February 22, 2024, 10:34am

Thank you for the workaround. I’ve temporarily shifted development to an x86_64 system, given that the cause of this is currently unknown. I’m not sure whether this is an isolated case that only affects byte shifts or whether it will pop up again in other, seemingly unrelated scenarios.

I’m looking forward to further updates from you guys.

AastaLLL · February 26, 2024, 7:15am

Thanks for the update.

Our compiler team is working on this.
We will let you know when we have a further update.

AastaLLL · March 21, 2024, 2:50am

Hi,

Thanks a lot for your patience. Here are some updates.

The difference is related to the -G flag.
After removing the flag, we confirmed that the sample works as expected.

Edit the Makefile:

NVCXXFLAGS= -g -pg -arch=compute_87 -code=sm_87

Test result:

$ make
g++ -g -pg -Wall -std=c++17 -c  kernel_test.cpp  -I /usr/include/ -I /usr/local/include  -I /usr/local/cuda/include
nvcc -g -pg -arch=compute_87 -code=sm_87 -std=c++17 -x cu -c kernel.cu -I /usr/include/ -I /usr/local/include  -I /usr/local/cuda/include
g++ -std=c++17 kernel_test.o kernel.o  -o kernel_test   -lcudart   -L/usr/lib -L/usr/local/lib -L/usr/local/cuda/lib64

$ ./kernel_test
BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Difference between buffers: 0

Our internal team is still working on the case with the -G flag.
Is there a particular reason you need this flag?

Thanks.

xlucny · March 30, 2024, 8:42pm

Wow, thank you for the update! I never thought it would be something like this.

The way I understand it, the -G flag allows for debugging the GPU code, right? Or is it possible to debug it without it?

By now nearly all of my GPU code is already done, so it’s wonderful news either way.

AastaLLL · April 11, 2024, 8:06am

Hi,

Yes, -G produces a debug build of the device (kernel) code.
The error with -G is fixed after CUDA 11.6, so you won’t hit the issue with JetPack 6.
Thanks.

system · April 25, 2024, 8:06am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
