CUDA kernel behaving strangely for no apparent reason, not reproducible on a typical x86_64 system

Hello,

I’m seeing CUDA kernel behavior that makes no sense to me. I originally posted this in the CUDA programming section of this forum, but nobody there could reproduce the issue (they were using typical x86_64 systems). Could someone from NVIDIA please try to reproduce it on an Orin NX?

System:
Jetson Orin NX (16 GB RAM), JetPack 5.1.2, L4T 35.4.1, CUDA 11.4.315, code compiled as C++17

Hi,

We tried to reproduce this issue (full_code.zip) in our environment but ran into some OpenCV errors.

Do you need a custom OpenCV with CUDA enabled?
If yes, could you also share the installation steps or script with us?

Thanks.

Hi, thank you for the response.

The original code used OpenCV compiled with CUDA and OpenGL; however, it’s not needed to reproduce the issue.

code_feb_18_B.zip (2.6 MB)
This is a scaled-down version (that does not use OpenCV). The output it gives on the Orin NX is:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Diff at: 7378 of 80
Diff at: 7379 of 240
Diff at: 7382 of 48
Difference between buffers: 368

However, when I run it on my x86_64 PC (after changing the architecture in the Makefile from the Orin’s sm_87 to my PC’s sm_86), the output is:

BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Difference between buffers: 0

You probably mean an x64 PC with compute capability 8.6? x86 typically means 32-bit.

Yes, it’s x86_64 with CUDA 12 and an RTX 3060, which corresponds to sm_86.

Hi,

Thanks for sharing.

We have confirmed that we can reproduce similar behavior in our environment.
A closer look at one of the failing cases:

out = ( ( src1 & 0b11110000 ) >> 4 ) | ( ( src2 & 0b00001111 ) << 4 )

cpu = ( ( 170 & 0b11110000 ) >> 4 ) | ( ( 170 & 0b00001111 ) << 4 )
    = ( 160 >> 4 ) | ( 10 << 4 )
    = 10 | 160 = 170
gpu = ( ( 170 & 0b11110000 ) >> 4 ) | ( ( 170 & 0b00001111 ) << 4 )
    = ( 160 >> 4 ) | ( 10 << 4 )
    = 250 | 160 = 250

We are not sure why 160 >> 4 returns 250 on Jetson.
We are checking with our internal team and will share an update later.

Thanks.

Hi,

Based on the issue, a temporary workaround is to swap the order of the shift and the AND so that the problematic pattern is avoided.

For example:

diff --git a/kernel.cu b/kernel.cu
index 4213d23..e498c8b 100644
--- a/kernel.cu
+++ b/kernel.cu
@@ -12,8 +12,8 @@ void depacking_kernel(uint8_t *src, uint8_t *dst)

   dst[dst_it]=src[src_it];
   dst[dst_it+1]=src[src_it+1] & 0b00001111;
-  dst[dst_it+2]=((src[src_it+1] & 0b11110000) >> 4) | ((src[src_it+2] & 0b00001111) << 4);
-  dst[dst_it+3]=((src[src_it+2] & 0b11110000) >> 4);
+  dst[dst_it+2]=((src[src_it+1]>>4) & 0b00001111) | ((src[src_it+2] & 0b00001111) << 4);
+  dst[dst_it+3]=((src[src_it+2]>>4) & 0b00001111);

 }

Hope this unblocks your work in the meantime.
Thanks.

Thank you for the workaround. I’ve temporarily moved development to an x86_64 system, since the cause is currently unknown. I can’t tell whether this is an isolated case that only affects byte shifts or whether it will show up again in other, seemingly unrelated scenarios.

I’m looking forward to further updates from you guys.

Thanks for the update.

Our compiler team is working on this.
Will let you know if we get a further update.

Hi,

Thanks a lot for your patience. Here are some updates.

The difference is related to the -G flag.
With the flag removed, we have confirmed that the sample works as expected.

Edit the Makefile:

NVCXXFLAGS= -g -pg -arch=compute_87 -code=sm_87

Test result:

$ make
g++ -g -pg -Wall -std=c++17 -c  kernel_test.cpp  -I /usr/include/ -I /usr/local/include  -I /usr/local/cuda/include
nvcc -g -pg -arch=compute_87 -code=sm_87 -std=c++17 -x cu -c kernel.cu -I /usr/include/ -I /usr/local/include  -I /usr/local/cuda/include
g++ -std=c++17 kernel_test.o kernel.o  -o kernel_test   -lcudart   -L/usr/lib -L/usr/local/lib -L/usr/local/cuda/lib64
$ ./kernel_test
BLOCKS: 544 THREADS 256 RUNS TOTAL: 139264 SHOULD BE 139264
Difference between buffers: 0

Our internal team is still working on the case with the -G flag.
Is there any particular reason you added this flag?

Thanks.

Wow, thank you for the update! I never thought it would be something like this.

The way I understand it, the -G flag allows debugging the GPU code, right? Or is it possible to debug without it?

By now nearly all of my GPU code is already done, so it’s wonderful news either way.

Hi,

Yes, -G produces a debug build of the kernel.
The error with -G is fixed after CUDA 11.6, so you won’t hit the issue with JetPack 6.
Thanks.