I’m seeing some strange behavior in my CUDA program. It looks like a bug in CUDA to me, but maybe I’m misunderstanding something. I’ve attached a highly simplified program that exhibits the bug.
When compiled using ‘nvcc -c cuda_bug.cpp -o cuda_bug.o -gencode arch=compute_20,code=sm_20’ (incidentally: is this the correct command line to generate the optimal code for the 470?) and launched on my PC (Ubuntu 10.04 x64, CUDA 3.2 RC1 - latest from the web site, NVIDIA GTX 470), it produces the following output:
toy_cuda()
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
Launching toy kernel with nBlocks=100, s=256
cudaGetLastError 4, aborting
The culprit seems to be line 9, where I read a value from an array in global memory. It almost looks like a memory access violation, except (1) the array is clearly big enough, and (2) uncommenting the if() in line 8 makes the crash go away - even though the printf in line 10 is never executed!
cuda_bug.cpp (1.27 KB)
The following code does not work
int j ;
if (0 == blockIdx.x){
j = 1 ;
}else{
j = 0 ;
}
v = d_p[j] ;
but the following code works.
unsigned int j ;
if (0 == blockIdx.x){
j = 1 ;
}else{
j = 0 ;
}
v = d_p[j] ;
I don’t know why, but if we look at the PTX code, the two versions differ in only two lines.
PTX of the first version (int j):
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, 0;
set.eq.u32.u32 %r6, %r4, %r5;
neg.s32 %r7, %r6;
cvt.s64.s32 %rd6, %r7;
^^^^
mul.wide.s32 %rd7, %r7, 8;
^^^^
add.u64 %rd8, %rd5, %rd7;
ldu.global.u64 %rd9, [%rd8+0];
PTX of the second version (unsigned int j):
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, 0;
set.eq.u32.u32 %r6, %r4, %r5;
neg.s32 %r7, %r6;
cvt.s64.u32 %rd6, %r7;
^^^^
mul.wide.u32 %rd7, %r7, 8;
^^^^
add.u64 %rd8, %rd5, %rd7;
ldu.global.u64 %rd9, [%rd8+0];
I cannot explain.
Yeah, me neither.
Tried to debug this toy program with cuda-gdb, it outputs
toy_cuda()
Launching toy kernel with nBlocks=100, s=256
[Launch of CUDA Kernel 0 (do_toy_kernel) on Device 0]
[Termination of CUDA Kernel 0 (do_toy_kernel) on Device 0]
Launching toy kernel with nBlocks=100, s=256
[Launch of CUDA Kernel 1 (do_toy_kernel) on Device 0]
and then appears to hang. It has been 10 minutes since the last line was output, and nothing is happening.
I’d try NSight, I even have the requisite hardware (two systems with NVIDIA cards), but it appears to be Windows only. If I can’t resolve this puzzle by Monday, I’ll try to rebuild in Windows.