Problem launching kernel with cuLaunchGrid

Hi All,

I am new to CUDA and I am porting some existing code from the runtime API to the driver API. I am running into a strange problem when launching kernels via cuLaunchGrid. The same kernel works when launched like this:

[codebox]

dim3 block(16, 16);
dim3 grid(width/block.x, height/block.y);
// Launch configuration order is <<<grid, block>>>.
kernel<<<grid, block>>>(parameters);

[/codebox]

yet it simply returns CUDA_ERROR_UNKNOWN with the (theoretically) equivalent call in the driver API:

[codebox]

dim3 block(16, 16);
dim3 grid(width/block.x, height/block.y);
cuFuncSetBlockShape(kernel, block.x, block.y, 1);
// …parameter passing via cuParamSet*
cuLaunchGrid(kernel, grid.x, grid.y);

[/codebox]

The error is returned on the next call (in my case, cuCtxSynchronize()). However, if I limit the launch to one dimension, like below:

[codebox]

cuFuncSetBlockShape(kernel, block.x, 1, 1);
cuLaunchGrid(kernel, grid.x, 1);

[/codebox]

it executes my kernel, although the result then only covers block.x * grid.x elements, of course. I've been stuck for two days and I am sure it must be something trivial. Any comments are greatly appreciated!
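A side note on the grid size used in the snippets above: `width/block.x` is integer division, so it drops the last partial block whenever the image size is not a multiple of 16. A minimal sketch of the usual round-up, assuming a helper named `blocksFor` (a made-up name, not part of either CUDA API):

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover `extent` elements
// with `block` threads per block, rounding up so that a partial block at
// the edge is still launched.
inline unsigned int blocksFor(unsigned int extent, unsigned int block) {
    assert(block > 0);  // a zero-sized block makes no sense
    return (extent + block - 1) / block;
}
```

The grid would then be built from `blocksFor(width, block.x)` and `blocksFor(height, block.y)`. The extra threads in the edge blocks are harmless as long as the kernel checks its coordinates against the image size, which the kernel further down does.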

Below is my kernel, a simple float-to-byte buffer conversion…

[codebox]

__global__ void Cvtkernel(int w, int h, float *dFloat, unsigned char *dByte)
{
    // Global pixel coordinates of this thread.
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int iy = blockDim.y * blockIdx.y + threadIdx.y;

    // Skip threads that fall outside the w x h image.
    if (ix < w && iy < h)
        dByte[ix + iy * w] = (unsigned char) dFloat[ix + iy * w];
}

[/codebox]

With the last line commented out, cuLaunchGrid() does not complain; but as soon as the kernel writes anything, I get CUDA_ERROR_UNKNOWN. For example, changing the last line to

[codebox]

dByte[ix + iy * w] = 0;

[/codebox]

brings down the whole thing. Am I doing the index calculation wrong?
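One way to check the index question without a GPU is to replay the same arithmetic on the host. This is only a sketch: plain C++ loops stand in for the grid and block dimensions, and `replayIndexMath` and `coveredOnce` are made-up helper names, not CUDA calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side replay of the kernel's index math. Visits every (block, thread)
// pair of a 2D launch and counts how often each output element is written.
// Returns false if any computed index falls outside the w*h buffer.
bool replayIndexMath(int w, int h, int blockX, int blockY,
                     int gridX, int gridY, std::vector<int>& writes) {
    assert(w > 0 && h > 0);
    writes.assign(static_cast<std::size_t>(w) * h, 0);
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx) {
                    int ix = blockX * bx + tx;   // same formula as the kernel
                    int iy = blockY * by + ty;
                    if (ix < w && iy < h) {      // same guard as the kernel
                        int idx = ix + iy * w;
                        if (idx < 0 || idx >= w * h) return false;
                        ++writes[idx];
                    }
                }
    return true;
}

// True when every element was written exactly once.
bool coveredOnce(const std::vector<int>& writes) {
    for (int n : writes)
        if (n != 1) return false;
    return true;
}
```

For a 16x16 block the replay never leaves the buffer, so the index math itself is not what triggers the error. What it does show is that a grid computed with truncating division leaves the right and bottom edges unwritten when the size is not a multiple of the block size.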

I found my problem: it was caused by an earlier call to cuMemcpyHtoD(), where I accidentally cast the host memory pointer to (void**) instead of (void*). After changing it to (void*), my problems are gone.
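The faulty line is not shown above, so the following is only a guess at one way a `(void**)` cast can hide this kind of bug: the cast silences the compiler when the address of the pointer variable is passed instead of the pointer itself. In this sketch `sourceAddressSeen` is a hypothetical stand-in that, like cuMemcpyHtoD, takes its host source as `const void*`. It makes no CUDA calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a copy routine that takes its host source as
// `const void*`. It only reports the address it was handed.
inline const void* sourceAddressSeen(const void* srcHost) {
    return srcHost;
}

// Correct call: the routine sees the start of the host data.
inline const void* passData(const float* hostData) {
    assert(hostData != nullptr);
    return sourceAddressSeen(hostData);
}

// Assumed failure mode (not confirmed by the post): the address of the
// pointer variable is passed, and the (void**) cast keeps it compiling.
// The routine now sees where the pointer lives, not where the data lives.
inline const void* passPointerVariable(const float* const& hostData) {
    return sourceAddressSeen((void**)&hostData);
}
```

If a real copy reads from that wrong address, the device buffer ends up holding unrelated bytes or the copy reads past the pointer variable, and the damage can surface much later, for example at a kernel launch or a synchronize call.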

Is there a debugger for device API on Windows?