I’m trying to debug an “unspecified launch failure” I encounter on two different machines, each with a Tesla C2050, using CUDA 4.0 and driver 270.41.19.
However, the failure doesn’t occur when the code is compiled with -G, which makes it harder to pinpoint exactly where the error comes from using cuda-gdb. Any hints?
Moreover, the exact same binary runs smoothly on an older computer with a Tesla C1060 (CUDA 4.0 and driver 270.41.19).
Using cuda-memcheck I get an invalid read:
========= Invalid global read of size 8
========= at 0x00000800 in p4a_wrapper_main
========= by thread (0,0,0) in block (2,0,0)
========= Address 0x320a80800 is out of bounds
========= ERROR SUMMARY: 1 error
OK, then I tried to debug my kernel using device printf, launching the kernel with only 1 thread (so that I avoid any memory race issue):
What you are seeing is almost certainly due to out of bounds shared memory access. Fermi has much better hardware protection for shared memory than older designs did (after all, out of bounds shared memory access potentially means L2 cache and global memory corruption in Fermi). The reason why it goes away in debugging builds is because all shared memory gets spilled to local memory by the compiler so that the debugger can have access to it during kernel execution.
As for the specifics of why, without a repro case it is impossible to say more.
I don’t see your point with shared memory and L2, since I thought shared memory is closer to the SM than L2, and shared memory accesses are not cached in L2.
Moreover, my kernel doesn’t make use of shared memory. I can try to extract the kernel from the framework to provide a reproducible test case; in the meantime here is the kernel (run on one thread).
L1 cache, sorry, that was a typo. Shared memory and L1 cache are physically the same thing in Fermi. There is 64 KB per SM, configured either as 16 KB shared memory and 48 KB L1 cache, or as 48 KB shared memory and 16 KB L1 cache.
Ok now I get it :-)
Anyway, is the compiler able to automatically use part of the shared memory for some purpose? Because, as stated before, I don’t make any explicit use of shared memory.
And it would explain the issue I reported using printf.
Please find attached the source code to reproduce the bug.
You can compile it with :
“nvcc bug_pointer.cu -o bug_pointer -arch=sm_20”
In this version, the bug also occurs with -G, and in cuda-gdb (with memcheck) I get CUDA_EXCEPTION_1:
“Lane Illegal Address” bug_pointer.cu (3.94 KB)
Oops, sorry, apparently I was still in 32-bit mode (ignoring the difference in the upper word…). That does indeed look like a bug. I’d guess you could work around it by compiling for 32 bit (nvcc -m32), although it’s certainly something that should get fixed.
I can see the out of bounds error with -arch=sm_13, but not with -arch=sm_20.
If I compile with “-arch=sm_13 --ptxas-options=-v”, the compiler reports that the kernel uses “152+16 bytes smem” (though it’s not clear to me why it would need shared memory here). With “-arch=sm_20 --ptxas-options=-v”, no shared memory is used.
It could be a compiler bug. Maybe, with too many registers in use, the compiler tries to spill some registers into shared memory(?), but is unaware that on Fermi the pointer size is 40 bits, not 32.
Anyway, that doesn’t help me much with my issue. I will probably file a bug report, but I’m unsure it’s related to nvcc, since the code works on a C1060; might it instead be related to the JIT compiler in the driver?
Ah, I see. It still looks like a compiler bug, though: any tweaking of the code that reduces the number of registers significantly below 60 makes the problem go away.
Here’s a reduced version of the same kernel that still shows the out-of-bounds read:
Here is the same code, updated with a second version of the same kernel that uses non-linearized array accesses. Interestingly, compiling with sm_20 I get 63 registers for each of the kernels, but with sm_13 the linearized version uses only 62 registers while the non-linearized one uses 65.
Anyway, the non-linearized version runs smoothly on both the C1060 and the C2050 (with 65 registers), while the linearized version breaks only on the C2050 (compiled with either sm_13 or sm_20). bug_pointer.cu (7.36 KB)