reading from unmalloc'ed memory


I am having a whale of a time with the following piece of code (a banded matrix-vector multiplication). All device arrays are cudaMalloc’ed with size “n”; “m” is sqrt(n), due to the finite element space and underlying mesh this is based on. The simplest version of the code, which does not coalesce reads via SMEM, is as follows:

// simple, uncoalesced version
__global__ void smv_row1(float* y, float* x,
                         float* dd,
                         float* ll, float* ld, float* lu,
                         float* dl, float* du,
                         float* ul, float* ud, float* uu,
                         int n, int m)
{
  int idx = blockDim.x*blockIdx.x + threadIdx.x;

#ifdef BOUNDS_CHECK  // macro name is a reconstruction; the forum software ate the original directives

  y[idx] = dd[idx]*x[idx];
  if (idx-m-1 >= 0)
    y[idx] += ll[idx]*x[idx-m-1];
  if (idx-m >= 0)
    y[idx] += ld[idx]*x[idx-m];
  if (idx-m+1 >= 0)
    y[idx] += lu[idx]*x[idx-m+1];
  if (idx-1 >= 0)
    y[idx] += dl[idx]*x[idx-1];
  if (idx+1 < n)
    y[idx] += du[idx]*x[idx+1];
  if (idx+m-1 < n)
    y[idx] += ul[idx]*x[idx+m-1];
  if (idx+m < n)
    y[idx] += ud[idx]*x[idx+m];
  if (idx+m+1 < n)
    y[idx] += uu[idx]*x[idx+m+1];

#else

  y[idx] = dd[idx]*x[idx] + (dl[idx]*x[idx-1] + du[idx]*x[idx+1] +
                             ld[idx]*x[idx-m] + ll[idx]*x[idx-m-1] + lu[idx]*x[idx-m+1] +
                             ud[idx]*x[idx+m] + ul[idx]*x[idx+m-1] + uu[idx]*x[idx+m+1]);

#endif
}



As you can see, I am performing “evil” reads from unmalloc’ed device memory (the x array) in the #else version of the code. The matrix bands ll, ld, …, ud, uu are properly padded with zeros, and on the CPU and in GL I can rely on “crap (from x) times zero (from the appropriate band) equals zero”, so the code runs correctly.

It also does in CUDA on the T10P, at least in some simple test codes. But if I do some funky stuff that my application mandates on some other cudaMalloc’ed arrays prior to calling this kernel, I get NaNs, and ONLY if “n” exceeds 1024*1024.

This all seems like a bug to me, but for the heck of it I can’t see why my code would be responsible, as it runs fine for small “n”. I tried valgrind in device emulation mode, but since the problem only occurs for large values of n, that froze my machine.

Here’s the code to launch the kernel, the important point is that I use one thread per value in the result vector y:

    dim3 grid;
    dim3 block;
    block.x = 128;
    grid.x = (int)ceil(N/(double)(block.x));
    // printf("configuration: %d * %d = %d\n", grid.x, block.x, grid.x*block.x);
    smv_row1<<<grid,block>>>(y, x, A->sDD_GPU, A->sLL_GPU, A->sLD_GPU, A->sLU_GPU,
                             A->sDL_GPU, A->sDU_GPU, A->sUL_GPU, A->sUD_GPU,
                             A->sUU_GPU, N, M);

(A->something is a cudaMalloc’ed array).

I know this post sounds weird, but I am simply at a loss here. If you need further information, please let me know. I even thought about redoing the whole thing in C instead of C++, and I thought about padding all my vectors with zeros, but both are a hell of a lot of work on the application side if performance is to be maintained (padding needs to be a multiple of the warp size, etc.).



I probably should say that I’m using OpenSuSE 10.2, but the default gcc in that distro is from the same SVN branch as RHEL5’s gcc, so I doubt the issue is with nvcc. After all, everything works well except in the application :(

Hey Dominik,

No, I can’t see what you mean. :) There’s not enough information here to know what the problem is. Is the problem a performance problem or a correctness problem?

If you can package up a repro and file the bug through your registered cuda developer account, this will get looked into much more quickly.




After some more digging into this, I can now confirm that the issue is not a CUDA or hardware bug. It’s also not necessarily an application bug; probably just “bad luck”.

In summary: if for some reason you have to perform “out of bounds” reads from a double-precision array, make sure that the arrays allocated next to it are not single precision (and vice versa).

Sorry for the hassle, now to think of a way to improve performance…