Hi,
I am having a whale of a time with the following piece of code (a banded matrix-vector multiplication). All device arrays are cudaMalloc’ed with size “n”; “m” is sqrt(n), due to the finite element space and underlying mesh this is based on. The simplest version, which does not coalesce reads via shared memory (SMEM), is as follows:
// simple, uncoalesced version
__global__ void smv_row1(float* y, float* x,
                         float* dd,
                         float* ll, float* ld, float* lu,
                         float* dl, float* du,
                         float* ul, float* ud, float* uu,
                         int n, int m)
{
  int idx = blockDim.x*blockIdx.x+threadIdx.x;
#ifdef MVQ1_GPU_SLOW_BUT_CORRECT
  y[idx] = dd[idx]*x[idx];
  if (idx-m-1 >= 0)
    y[idx] += ll[idx]*x[idx-m-1];
  if (idx-m >= 0)
    y[idx] += ld[idx]*x[idx-m];
  if (idx-m+1 >= 0)
    y[idx] += lu[idx]*x[idx-m+1];
  if (idx-1 >= 0)
    y[idx] += dl[idx]*x[idx-1];
  if (idx+1 < n)
    y[idx] += du[idx]*x[idx+1];
  if (idx+m-1 < n)
    y[idx] += ul[idx]*x[idx+m-1];
  if (idx+m < n)
    y[idx] += ud[idx]*x[idx+m];
  if (idx+m+1 < n)
    y[idx] += uu[idx]*x[idx+m+1];
#else
  y[idx] = dd[idx]*x[idx] + (dl[idx]*x[idx-1] + du[idx]*x[idx+1] +
           ld[idx]*x[idx-m] + ll[idx]*x[idx-m-1] + lu[idx]*x[idx-m+1] +
           ud[idx]*x[idx+m] + ul[idx]*x[idx+m-1] + uu[idx]*x[idx+m+1]);
#endif
}
As you can see, the #else version performs “evil” reads from unmalloc’ed device memory (through the x array). The matrix bands ll, ld, …, ud, uu are properly padded with zeros, so on the CPU and in GL I can rely on “crap (from x) times zero (from the appropriate band) equals zero”, and the code runs correctly.
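Just to make the idea explicit, here is a minimal sketch (not the kernel I actually run; the clampIdx helper is made up for this post) of a branchless variant that keeps the zero-padding trick but clamps every neighbour index into [0, n), so no read of x ever leaves the allocation:

__device__ int clampIdx(int i, int n)
{
  // clamp a neighbour index into [0, n-1]
  return min(max(i, 0), n - 1);
}

__global__ void smv_row1_clamped(float* y, float* x,
                                 float* dd,
                                 float* ll, float* ld, float* lu,
                                 float* dl, float* du,
                                 float* ul, float* ud, float* uu,
                                 int n, int m)
{
  int idx = blockDim.x*blockIdx.x+threadIdx.x;
  if (idx >= n)   // the grid is rounded up (see launch code below), skip surplus threads
    return;
  // same arithmetic as the #else branch, but every x read stays in bounds;
  // the zero-padded bands cancel the clamped (wrong-neighbour) terms
  y[idx] = dd[idx]*x[idx] +
           (dl[idx]*x[clampIdx(idx-1,   n)] + du[idx]*x[clampIdx(idx+1,   n)] +
            ld[idx]*x[clampIdx(idx-m,   n)] + ll[idx]*x[clampIdx(idx-m-1, n)] +
            lu[idx]*x[clampIdx(idx-m+1, n)] +
            ud[idx]*x[clampIdx(idx+m,   n)] + ul[idx]*x[clampIdx(idx+m-1, n)] +
            uu[idx]*x[clampIdx(idx+m+1, n)]);
}

The clamped reads pull in a wrong-but-valid neighbour of x, and the corresponding zero band kills that term, so the result should match the guarded #ifdef version.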
It also runs correctly in CUDA on the T10P, at least in some simple test codes. But if I do some funky stuff that my application mandates on some other cudaMalloc’ed arrays prior to calling this kernel, I get NaNs, and ONLY if “n” exceeds 1024*1024.
This all looks like a bug to me, but for the life of me I don’t see why my code would be responsible, as it runs fine for small “n”. I tried valgrind in device emulation mode, but since the problem only occurs for large values of n, that froze my machine.
Here’s the code that launches the kernel; the important point is that I use one thread per entry of the result vector y:
dim3 grid;
dim3 block;
block.x = 128;
grid.x = (int)ceil(N/(double)(block.x));
// printf("configuration: %d * %d = %d\n", grid.x, block.x, grid.x*block.x);
smv_row1<<<grid,block>>>(y, x, A->sDD_GPU,
                         A->sLL_GPU, A->sLD_GPU, A->sLU_GPU,
                         A->sDL_GPU, A->sDU_GPU,
                         A->sUL_GPU, A->sUD_GPU, A->sUU_GPU,
                         N, M);
(each of the A->s…_GPU members is a cudaMalloc’ed array).
I know this post sounds weird, but I am simply at a loss here. If you need further information, please let me know. I even thought about redoing the whole thing in C instead of C++, and I thought about padding all my vectors with zeros, but both are a hell of a lot of work on the application side if performance is to be maintained (the padding needs to be a multiple of the warp size etc.).
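To make concrete what I mean by the padding route, this is roughly what I had in mind (just a sketch; the helper name and the exact pad size are invented): give x at least m+1 zeros on each side, rounded up to a warp multiple so the shifted pointer keeps its alignment for coalescing.

#include <cuda_runtime.h>

// sketch only: allocate x with enough zero padding that x[idx-m-1] .. x[idx+m+1]
// is a valid read for every 0 <= idx < n
static float* allocPaddedVector(int n, int m)
{
  const int warp = 32;
  int pad = ((m + 1 + warp - 1) / warp) * warp;  // front/back pad, rounded to a warp multiple

  float* raw = 0;
  cudaMalloc((void**)&raw, (n + 2*pad) * sizeof(float));
  cudaMemset(raw, 0, (n + 2*pad) * sizeof(float));

  // callers work with the shifted pointer; cudaFree() must get (ptr - pad) back
  return raw + pad;
}

Even with something like this, I would still have to carry the shifted pointer and the padded length through the whole application, which is exactly the work I was hoping to avoid.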
Thanks,
–dom