Bell & Garland SPMV kernel: unspecified launch failure for num_rows >= 64

Dear everyone,

I tried using the Bell & Garland SPMV kernel, compiled with CUDA SDK 2.1, on a 9800GTX running openSUSE 11.0 64-bit. It works fine for matrix dimensions num_rows < 64, but when num_rows >= 64 I get an ‘unspecified launch failure’ at runtime and the kernel will not launch. This occurs irrespective of whether the matrix is dense (containing 64 nonzeros per row) or sparse (containing < 32 nonzeros per row). The kernel launches and works correctly for matrices of dimension 0-63; anything >= 64 fails. Weird. Has anyone encountered something similar and could suggest possible solutions? Any insight and help would be much appreciated.

Some questions:

  • Does the SPMV kernel need to be compiled for compute capability 1.3 and run on Tesla for it to work?

  • The error seems to come from the line “sdata[threadIdx.x] += Ax[jj] * x[ Aj[jj] ];”
    in the kernel, within the loop:
    __global__ void spmv_csr_kernel (const GPU_IT num_rows, const GPU_IT *Ap, const GPU_IT *Aj, const GPU_DT *Ax,
                                     GPU_DT *y, GPU_DT *x) {
        … snip …

        for(GPU_IT jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE) {
            sdata[threadIdx.x] += Ax[jj] * x[ Aj[jj] ];
        }
        … snip …
    }
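For comparison, here is a minimal host-side sketch of what that loop computes. It is not the Bell & Garland code, only a plain CPU reference that visits the entries of each row in the same lane-strided order (lane 0 takes entries 0, 32, 64, …; lane 1 takes 1, 33, …). The typedefs index_t and value_t are hypothetical stand-ins for GPU_IT and GPU_DT, and WARP_SIZE is assumed to be 32.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the GPU_IT / GPU_DT typedefs used in the post.
typedef int   index_t;
typedef float value_t;

const index_t WARP_SIZE = 32;  // assumption: one 32-thread warp per row

// Host reference for y = A * x with A stored in CSR form (Ap, Aj, Ax).
// Each row is traversed lane by lane, mirroring the kernel loop
//   for (jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
std::vector<value_t> spmv_csr_reference(index_t num_rows,
                                        const std::vector<index_t>& Ap,
                                        const std::vector<index_t>& Aj,
                                        const std::vector<value_t>& Ax,
                                        const std::vector<value_t>& x)
{
    // Ap needs one row pointer per row plus a final end marker.
    assert(Ap.size() == static_cast<std::size_t>(num_rows) + 1);

    std::vector<value_t> y(num_rows, 0);
    for (index_t row = 0; row < num_rows; ++row) {
        const index_t row_start = Ap[row];
        const index_t row_end   = Ap[row + 1];
        value_t sum = 0;
        for (index_t lane = 0; lane < WARP_SIZE; ++lane) {
            for (index_t jj = row_start + lane; jj < row_end; jj += WARP_SIZE) {
                // Aj[jj] is used directly as a zero-based index into x.
                sum += Ax[jj] * x[Aj[jj]];
            }
        }
        y[row] = sum;
    }
    return y;
}
```

Comparing the output of such a reference with the GPU result for num_rows = 63 and num_rows = 64 would at least show whether the host-side arrays themselves are consistent.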

When I comment out that line, the kernel launches without problems (of course it then does nothing). I have also replaced the line with sdata[threadIdx.x] = threadIdx.x, or with a constant such as 12, and it launches and works OK. I’m guessing that the culprit is Ax and Aj (passed as below). They look OK, but I’m sure I’m overlooking something. Any help or pointers in the right direction would be most appreciated.
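Since the suspicion falls on the index arrays, one thing that can be checked on the host before the copy to the device is whether the CSR structure is self-consistent. The following is only a sketch under assumptions: check_csr and num_cols are hypothetical names (the kernel itself only receives num_rows), index_t stands in for GPU_IT, and the indices are assumed to be zero-based, as the expression x[ Aj[jj] ] implies. One-based index arrays, for example, would fail these checks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef int index_t;  // hypothetical stand-in for GPU_IT

// Returns an empty string if the CSR row pointers (Ap) and column indices
// (Aj) are consistent, otherwise a description of the first problem found.
std::string check_csr(index_t num_rows, index_t num_cols,
                      const std::vector<index_t>& Ap,
                      const std::vector<index_t>& Aj)
{
    if (num_rows < 0 || Ap.size() != static_cast<std::size_t>(num_rows) + 1)
        return "Ap must have num_rows + 1 entries";
    if (Ap[0] != 0)
        return "Ap[0] must be 0 (zero-based row pointers)";
    for (index_t row = 0; row < num_rows; ++row)
        if (Ap[row + 1] < Ap[row])
            return "Ap must be non-decreasing";
    // Ap[num_rows] is non-negative here, so the cast below is safe.
    if (static_cast<std::size_t>(Ap[num_rows]) != Aj.size())
        return "Ap[num_rows] must equal the number of stored entries";
    for (std::size_t jj = 0; jj < Aj.size(); ++jj)
        if (Aj[jj] < 0 || Aj[jj] >= num_cols)
            return "Aj contains a column index outside [0, num_cols)";
    return "";
}
```

An out-of-range entry in Aj would make x[ Aj[jj] ] read outside the vector, which is one common way for a kernel to end in an unspecified launch failure; a clean result from this check would not rule out other causes.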

Each pgpu_xyz array was allocated from host code using cudaMalloc((void**) &pgpu_xyz, …), and the arrays were copied from the host using cudaMemcpyAsync(pgpu_xyz, phost_xyz, …).

The kernel is launched from the host using:
spmv_csr_kernel<<< D_grid, D_block >>>
(m, pgpu_A_CSR_pntr, pgpu_A_CSR_indx, pgpu_A_CSR_val, pgpu_gmres_V[gmres_itr + 1], pgpu_gmres_V[gmres_itr] );

My compile options are:
nvcc -c -I/opt/matlab/extern/include -DMATLAB_MEX_FILE -v -D_GNU_SOURCE -I/usr/local/cuda/include -I/opt/intel/mkl/10.1.2.024/include -Xcompiler "-ansi -fexceptions -fPIC -fno-omit-frame-pointer -pthread" -O -DNDEBUG "GMRES_MGRS_DGPU.cu"

Thank you,
seb