I’m trying to debug an “unspecified launch failure” that I encounter on two different machines, each with a Tesla C2050, both running CUDA 4.0 and driver 270.41.19.
However, the failure doesn’t occur when the code is compiled with -G, which makes it harder to find out exactly where the error comes from using cuda-gdb. Any hints?
Moreover, the exact same binary runs smoothly on an older computer with a Tesla C1060 (CUDA 4.0 and driver 270.41.19).
Using cuda-memcheck I get an invalid read:
========= Invalid global read of size 8
========= at 0x00000800 in p4a_wrapper_main
========= by thread (0,0,0) in block (2,0,0)
========= Address 0x320a80800 is out of bounds
========= ERROR SUMMARY: 1 error
OK, so I then tried to debug my kernel using device printf, launching the kernel with only 1 thread (to rule out any memory race issue):
printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n", cxph, Cxm, (cxph+Cxm), &cxph[Cxm]);
Which outputs:
cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x320c81200 - &cxph[Cxm] : 0x320c81200
Hmm… it seems that Fermi has a hard time doing simple pointer arithmetic?
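To make the discrepancy concrete, here is a minimal host-side sketch of the arithmetic one would expect from the printf output above. It assumes that cxph points to 8-byte elements (for example double), which matches the size-8 read reported by cuda-memcheck but is an assumption about the declaration. The helper and constant names are hypothetical and only serve the illustration.

```cpp
#include <cstdint>

// Hypothetical helper: address of &base[index] for elements of elem_size bytes.
constexpr std::uint64_t expected_address(std::uint64_t base,
                                         std::uint64_t index,
                                         std::uint64_t elem_size) {
    return base + index * elem_size;
}

// Hypothetical helper: bits that differ between two addresses.
constexpr std::uint64_t differing_bits(std::uint64_t expected,
                                       std::uint64_t observed) {
    return expected ^ observed;
}

// Values copied from the printf output.
constexpr std::uint64_t kBase     = 0x220c80a00ULL;  // cxph
constexpr std::uint64_t kIndex    = 256;             // Cxm
constexpr std::uint64_t kObserved = 0x320c81200ULL;  // (cxph+Cxm) as printed
constexpr std::uint64_t kElemSize = 8;               // ASSUMPTION: sizeof(*cxph) == 8

// 0x220c80a00 + 256 * 8 = 0x220c80a00 + 0x800 = 0x220c81200
static_assert(expected_address(kBase, kIndex, kElemSize) == 0x220c81200ULL,
              "expected address under the 8-byte element assumption");

// The low 32 bits match the expectation; only bit 32 differs.
static_assert(differing_bits(expected_address(kBase, kIndex, kElemSize),
                             kObserved) == 0x100000000ULL,
              "observed address differs from the expected one in bit 32 only");
```

Under that assumption, the low 32 bits of the printed result are what they should be, and only the upper word is off by one (0x3 instead of 0x2), which would point at the 64-bit address computation rather than at the index itself.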
Am I missing something? I might have to take a look at the PTX, but since it works (without printf) on a C1060, the issue may be in the driver’s JIT compiler…
Any hints on this issue would be welcome :-)