Strange behavior

I’m seeing some strange behavior in my CUDA program. It looks like a bug in CUDA to me, but maybe I’m misunderstanding something. I’ve attached a highly simplified program that exhibits the problem.

When compiled using ‘nvcc -c cuda_bug.cpp -o cuda_bug.o -gencode arch=compute_20,code=sm_20’ (incidentally: is this the correct command line to generate the optimal code for the 470?) and launched on my PC (Ubuntu 10.04 x64, CUDA 3.2 RC1 - latest from the web site, NVIDIA GTX 470), it produces the following output:

toy_cuda()
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
cudaGetLastError 4, aborting

The culprit seems to be line 9, where I read a value from an array in global memory. It almost looks like a memory access violation, except (1) the array is clearly big enough, and (2) uncommenting the if() in line 8 makes the crash go away - even though the printf in line 10 is never executed!
cuda_bug.cpp (1.27 KB)

The following code does not work:

    int j;
    if (0 == blockIdx.x) {
        j = 1;
    } else {
        j = 0;
    }
    v = d_p[j];

but the following code works:

    unsigned int j;
    if (0 == blockIdx.x) {
        j = 1;
    } else {
        j = 0;
    }
    v = d_p[j];

I don’t know why, but if we look at the PTX code for the two versions, they differ in only two lines (marked with ^^^^).

PTX of the first (int j):

    mov.u32         %r4, %ctaid.x;
    mov.u32         %r5, 0;
    set.eq.u32.u32  %r6, %r4, %r5;
    neg.s32         %r7, %r6;
    cvt.s64.s32     %rd6, %r7;
           ^^^^
    mul.wide.s32    %rd7, %r7, 8;
            ^^^^
    add.u64         %rd8, %rd5, %rd7;
    ldu.global.u64  %rd9, [%rd8+0];

PTX of the second (unsigned int j):

    mov.u32         %r4, %ctaid.x;
    mov.u32         %r5, 0;
    set.eq.u32.u32  %r6, %r4, %r5;
    neg.s32         %r7, %r6;
    cvt.s64.u32     %rd6, %r7;
           ^^^^
    mul.wide.u32    %rd7, %r7, 8;
            ^^^^
    add.u64         %rd8, %rd5, %rd7;
    ldu.global.u64  %rd9, [%rd8+0];

I cannot explain.
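
For what it’s worth, here is a small host-side sketch that emulates the arithmetic of the two PTX sequences in plain C++. It relies on the PTX rule that set.eq with an integer destination writes 0xFFFFFFFF for true and 0 for false, so neg.s32 turns that into 1 or 0. The helper names (set_eq_u32, neg_s32, mul_wide_s32, mul_wide_u32, offset_signed, offset_unsigned) are made up for illustration and are not part of any CUDA API. Under these assumptions both variants produce the same byte offset, which suggests (but does not prove) that the PTX itself is fine and the difference arises in a later compilation stage.

```cpp
#include <cstdint>

// Emulates set.eq.u32.u32: PTX writes 0xFFFFFFFF for true, 0 for false.
constexpr std::uint32_t set_eq_u32(std::uint32_t a, std::uint32_t b) {
    return a == b ? 0xFFFFFFFFu : 0u;
}

// Emulates neg.s32 as two's-complement negation on the raw 32-bit pattern
// (done in unsigned arithmetic so it cannot overflow).
constexpr std::uint32_t neg_s32(std::uint32_t a) {
    return 0u - a;
}

// Emulates mul.wide.s32: sign-extend the 32-bit operand, then multiply in 64 bits.
constexpr std::int64_t mul_wide_s32(std::uint32_t a, std::int32_t b) {
    return static_cast<std::int64_t>(static_cast<std::int32_t>(a)) * b;
}

// Emulates mul.wide.u32: zero-extend the 32-bit operand, then multiply in 64 bits.
constexpr std::uint64_t mul_wide_u32(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint64_t>(a) * b;
}

// Byte offset added to the base pointer in the "int j" version.
constexpr std::int64_t offset_signed(std::uint32_t ctaid_x) {
    return mul_wide_s32(neg_s32(set_eq_u32(ctaid_x, 0u)), 8);
}

// Byte offset added to the base pointer in the "unsigned int j" version.
constexpr std::uint64_t offset_unsigned(std::uint32_t ctaid_x) {
    return mul_wide_u32(neg_s32(set_eq_u32(ctaid_x, 0u)), 8u);
}

// Block 0 reads element 1 (8 bytes in); every other block reads element 0.
static_assert(offset_signed(0u) == 8, "signed path, block 0");
static_assert(offset_unsigned(0u) == 8u, "unsigned path, block 0");
```

The signed and unsigned widening only differ when the 32-bit index has its top bit set, and here the index is only ever 0 or 1, so both offsets agree for every block index.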

Yeah, me neither.

I tried to debug this toy program with cuda-gdb. It outputs

toy_cuda()
Launching toy kernel with nBlocks=100, s=256
[Launch of CUDA Kernel 0 (do_toy_kernel) on Device 0]
[Termination of CUDA Kernel 0 (do_toy_kernel) on Device 0]
Launching toy kernel with nBlocks=100, s=256
[Launch of CUDA Kernel 1 (do_toy_kernel) on Device 0]

and then appears to hang. I’ve been waiting for 10 minutes since the last line was output, and nothing is happening.

I’d try NSight, and I even have the requisite hardware (two systems with NVIDIA cards), but it appears to be Windows-only. If I can’t resolve this puzzle by Monday, I’ll try to rebuild on Windows.
