Precise definition of cudaBoundaryModeTrap

Hi there,

I ran into a problem with cudaBoundaryModeTrap and surface writes when moving from toolkit version 4.2 to 5.0:

I run a program that always writes the same data to a surface. This worked perfectly before, but since the update my program crashes in the standard VS10 debug mode with an undefined error in the kernel that writes to the surface.

Normally I would understand this: as described in the programming guide, the kernel run fails with an undefined error if the coordinates at which I access the surface are out of bounds and the boundary mode is cudaBoundaryModeTrap (the default).

Consistent with that, my kernel run succeeds when I change the boundary mode to *Zero or *Clamp. (I'm still running the standard VS10 debugger.)

So I assumed that if I add an if condition that is entered only when the coordinates are out of bounds, the NSight debugger would have to step into that scope when I start the program with NSight instead of the standard VS10 debugger.

To say this in code:

if (nX < 0 || nX >= sizeVolume[0] ||
    nY < 0 || nY >= sizeVolume[1] ||
    nZ < 0 || nZ >= sizeVolume[2])
  surf3Dwrite(make_uchar2(127,255),   // put a breakpoint for nsight here
              sizeof(uchar2)*1, 1, 1, // choose a coordinate that MUST be correct
              sizeof(uchar2)*nX, nY, nZ,

But NSight does not enter the breakpoint as expected. Not even once.
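
For reference, the bounds rule I am testing can be written as a small host-side predicate. This is only a sketch under my reading of the programming guide: the x coordinate passed to a surface function is a byte offset, while y and z are element indices. The helper name surfaceWriteInBounds and its parameters are hypothetical and not part of the CUDA API, and requiring the whole element to fit inside the row is my own conservative assumption.

```cpp
#include <cstddef>

// Hypothetical helper, not a CUDA API function.
// Returns true when a surf3Dwrite-style access of one element of elemSize
// bytes at (xBytes, y, z) lies inside a volume of sizeVolume[0..2] elements.
// Assumption: x is given in bytes, y and z in elements.
bool surfaceWriteInBounds(long long xBytes, long long y, long long z,
                          const unsigned int sizeVolume[3],
                          std::size_t elemSize)
{
    // Negative coordinates are always out of bounds.
    if (xBytes < 0 || y < 0 || z < 0) return false;

    // Row width in bytes, since x is a byte offset.
    const unsigned long long widthBytes =
        static_cast<unsigned long long>(sizeVolume[0]) * elemSize;

    // Conservative assumption: the whole element must fit inside the row.
    return static_cast<unsigned long long>(xBytes) + elemSize <= widthBytes &&
           static_cast<unsigned long long>(y) < sizeVolume[1] &&
           static_cast<unsigned long long>(z) < sizeVolume[2];
}
```

With a 4 x 3 x 2 volume of uchar2 elements, the "must be correct" coordinate sizeof(uchar2)*1, 1, 1 passes this check, while x = sizeof(uchar2)*4 does not.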

Could a surface write also fail with cudaBoundaryModeTrap if the pixel or alpha value exceeds [0,255]? (I assume not, because I already checked that with another if-clause.)
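
As a sanity check on that assumption: as far as I can tell, a value outside [0,255] cannot even reach the surface through a uchar2, because each component is an unsigned char and an int argument is converted (wrapping modulo 256) before the write happens. A minimal host-side sketch of that conversion, assuming 8-bit unsigned char; UChar2 and makeUChar2 are hypothetical stand-ins for CUDA's uchar2 and make_uchar2 so that it compiles without the toolkit:

```cpp
// Hypothetical stand-ins for CUDA's uchar2 / make_uchar2, used only so the
// sketch compiles without the CUDA headers.
struct UChar2 { unsigned char x, y; };

// Takes int on purpose to make the narrowing visible; the real make_uchar2
// takes unsigned char parameters, so the same conversion happens at the call.
UChar2 makeUChar2(int x, int y)
{
    UChar2 v;
    v.x = static_cast<unsigned char>(x); // wraps modulo 256 (8-bit assumption)
    v.y = static_cast<unsigned char>(y);
    return v;
}
```

So makeUChar2(127, 255) keeps both values, and an out-of-range input is silently wrapped rather than passed on, which is why I do not think the data value itself can trigger the trap.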

I am really confused now, because all my surf3Dwrite calls work properly when I simply start with the NSight debugger, while the kernel fails with an “undefined error” when using the standard VS10 debugger.

OK, I just found out that NSight was still using the old 4.2 SDK. I updated NSight to version 3.0 and now I get the same behavior with both NSight and the VS10 debugger.

What I observed while debugging with NSight is that all my constant memory variables contain invalid / random data. Has the constant memory initialization process changed between 4.2 and 5.0?

I use the CUDA Driver API and set up constant memory variables like this:

CUdeviceptr c_pSizeVolume;
cuModuleGetGlobal(&c_pSizeVolume, NULL, m_CudaModule, "c_pSizeVolume"); // no allocation needed, yes?

cuMemcpyHtoD(c_pSizeVolume, &pSizeVolume[0], sizeof(unsigned int) * 3);

Inside my .cu file the constant variable is declared like this:

__constant__ unsigned int c_pSizeVolume[3];

Can you please change the title of this thread? I think it is not that precise anymore…