Does anyone have any advice as to the most likely cause of this fairly generic error (returned by cudaGetLastError())?

[codebox] unsigned int num_threads = 256;

unsigned int blocks = (len/num_threads) + 1;	

//printf("block: %d\r\n",blocks);

dim3 grid(blocks, 1);

dim3 threads(num_threads, 1);    

//dim3 grid(1, 1);

//dim3 threads(1, 1);  

actFuncDouble<<< grid, threads >>>(f);

cutilCheckMsg("Kernel execution failed");[/codebox]

[codebox]__global__ void actFuncDouble( double* d_data )
{

// write data to global memory

const unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;

//double data = d_data[tid];

d_data[tid] = 1/( 1+exp(-d_data[tid]) );

}[/codebox]


In most of my code I am calling CUBLAS functions, but I also needed to add a few basic kernels of my own (listed above). It seems simple enough. I am wondering if a CUBLAS error is being thrown that the regular driver/runtime CUDA API doesn’t identify, even though I synchronize after each CUBLAS command completes. Any ideas?
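On the CUBLAS question: the legacy CUBLAS API keeps its own error state, separate from the runtime’s, so cudaGetLastError() will not report it. You can poll it explicitly after each call; a sketch, assuming the v1 cublas.h API:

[codebox]cublasStatus status = cublasGetError();

if (status != CUBLAS_STATUS_SUCCESS)

    fprintf(stderr, "CUBLAS error %d\n", status);[/codebox]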

    Check whether the size of array f is a multiple of num_threads. If it is not, the last block may overwrite some other memory.

    Also, blocks should be ((len-1)/num_threads) + 1. With your formula, when len is a multiple of num_threads the last block has nothing to do - all its tids point beyond the end of the array.

That’s all I could think of from this short code.
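A sketch combining both fixes (the len parameter is an addition to the original kernel signature):

[codebox]unsigned int blocks = (len + num_threads - 1) / num_threads; // round up, no spare block

__global__ void actFuncDouble( double* d_data, unsigned int len )
{
    const unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;

    if (tid < len)  // guard: threads past the end of the array do nothing
        d_data[tid] = 1.0/( 1.0 + exp(-d_data[tid]) );
}[/codebox]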

Also note that you can pass a plain int, not necessarily a dim3 structure, for the kernel launch configuration. Ints are implicitly converted to a dim3 in the x direction, with the remaining dimensions set to 1.
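For instance, the launch above could be written as:

[codebox]actFuncDouble<<< blocks, num_threads >>>(f);[/codebox]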

  1. Don’t use CUTIL at all; it was developed only for use in the SDK samples and is not stable enough for production code. Check out the Dr. Dobbs article on CUDA error handling for the correct way to check errors.
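The pattern from that article boils down to a small wrapper macro around every runtime call; a sketch (the macro name cudaCheck and the variables dst, src, bytes are my own placeholders):

[codebox]#define cudaCheck(call)                                           \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// example use:
cudaCheck( cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost) );[/codebox]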

  2. It’s just an idea, but try adding a

[codebox]cudaThreadSynchronize();[/codebox]

between the kernel launch and the error checking. Since kernel launches are asynchronous, the kernel may not be done by the time you query the error message.

When I call cublasAlloc I allocate at least 512 more bytes than necessary, so the code I wrote shouldn’t run off the end, provided I understand how CUBLAS allocates memory.



[codebox]cublasDgemm( 'n', 'n', dataPnts, lyr2Cols, lyr2Rows,
             1.0, ((double*)devMem_tMatrix2Inputs), dataPnts,
             ((double*)devMem_tMatrix2Wghts), lyr2Rows,
             0.0, ((double*)devMem_tMatrix2Output), dataPnts );[/codebox]
So I am assuming the output is allocated just as if I called cudaMalloc and all the elements of the matrix are stored sequentially in memory. Is this correct?

Is the CUBLAS source code available? I didn’t see it in the SDK anywhere.

Maybe you are NOT using the correct DRIVER version.

Are other CUDA apps working fine? Check for driver compatibility.