I found a bug, probably in the compiler. First, the kernel works only(!) with the GPU debug info option enabled. No other setting makes a difference. With that option it works correctly. Without GPU debug info the kernel returns cuda_unknown_error and does not write any results. I will try to isolate a test case, but I am not sure that will be possible.
The kernel has a loop:
int index=a[…]; // a is a device array
for (x=x1; x<x2; x++)
  for (y=y1; y<y2; y++)
    for (z=z1; z<z2; z++)
    {
      b[index]=x*stride2+y*stride+z; // b is a device array
      …
      index++;
    }
The problem seems to be in this loop, because the fix is to perform a dummy operation on index at the start of the loop body:
for (x=x1; x<x2; x++)
  for (y=y1; y<y2; y++)
    for (z=z1; z<z2; z++)
    {
      index=index+(z>>20); // z>>20 == 0, but the compiler does not know that
      b[index]=x*stride2+y*stride+z; // b is a device array
      …
      index++;
    }
This version runs correctly without GPU debug info enabled.
So I think it is a bug in either the compiler or the PTX generator.
I cannot try it on other CUDA versions right now. I use Windows 7, driver 260.99 and MSVS 8.0.
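To double-check that the extra line really cannot change the result, a host-side reference like the sketch below could be used. It is plain C++, not the actual kernel: the function name `fill`, the `workaround` flag, and the way `b` is sized are made up for illustration, the elided part of the loop body is left out, and it assumes the stored value is x*stride2 + y*stride + z with 0 <= z < 2^20.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference for the loop in the post. The name, the
// parameters and the sizing of b are assumptions, not the original code.
std::vector<int> fill(int start, int x1, int x2, int y1, int y2,
                      int z1, int z2, int stride, int stride2,
                      bool workaround)
{
    // The dummy operation is a no-op only while 0 <= z < 2^20.
    assert(z1 >= 0 && z2 <= (1 << 20));

    std::vector<int> b(start + (x2 - x1) * (y2 - y1) * (z2 - z1), 0);
    int index = start;  // in the kernel this value is read from a[...]
    for (int x = x1; x < x2; x++)
        for (int y = y1; y < y2; y++)
            for (int z = z1; z < z2; z++) {
                if (workaround)
                    index = index + (z >> 20);  // adds 0 for every z in range
                b[index] = x * stride2 + y * stride + z;
                index++;
            }
    return b;
}
```

Calling `fill` twice with the same arguments, once with `workaround` set to false and once to true, should give identical vectors. If it does, any difference seen on the GPU would come from code generation rather than from the algorithm.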
I will work on a short test case later; there are no suspicious operations in the kernel. With the fix the program runs for a long time without errors, while without the fix the kernel does not write any results and returns unknown_error.
I am pretty sure it is a compiler or PTX generator bug, because there are no shared variables in the kernel, and if it were a bug in my code the kernel would produce wrong results. Instead it returns cuda_unknown_error and either writes nothing to b or writes zeroes.
I found a compiler bug in another kernel, a huge one with a lot of control flow. The compiler just died with an external exception while reading from a small address. I rearranged the branches and compilation then completed. However, if the program takes a certain branch, the kernel returns cudaUnknownError. I suspect it may be a bug in the PTX-to-machine-code translation in the driver or runtime. I use a GeForce 465 and driver 260.99. I am now uninstalling CUDA 3.2 and installing CUDA 3.0, because I need emulation mode to check the program's algorithm. By the way, if I generate GPU debug info, the program behaves quite differently: it does not return cudaUnknownError but just hangs the driver. The compiler also reports different register usage for some reason. It is all pointless.