Simple kernel problem: a question about debugging a simple kernel


I have a question about a simple kernel I am trying to write; I do not understand why it is failing. I am new to CUDA and this is my first time writing a kernel (I have only used the FFT and CUBLAS libraries thus far).

Is there a way to debug a kernel?

Basically, I want to build a cuComplex matrix from a float matrix by assigning the values of the float matrix to the .x (real) part of the cuComplex matrix.

Currently, I copy the vector back to host memory and do the conversion on the host. This takes about 20 ms (on a 1024*1024 array), which is way too long for my application. I understand that CUBLAS uses column-major order, but my code should still at least work and not just crash.

Here is what I have been trying to do (on a 1024 x 1024 array, so ROWS = COLUMNS):


// Thread block size
#define BLOCK_SIZE 16

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(COLUMNS / threads.x, ROWS / threads.y);

// execute the kernel
realToComplex<<< grid, threads >>>(d_image_buff, d_image_complex_buff, COLUMNS, ROWS);

__global__ void
realToComplex( float* A, cuComplex* B, int Width, int Height)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Calculating the position of the element that will be converted
    int pos = BLOCK_SIZE * bx +
        BLOCK_SIZE * BLOCK_SIZE * Width * by +
        tx +
        BLOCK_SIZE * Width * ty;

    B[pos].x = A[pos];
    B[pos].y = 0.0f;
}


There is no descriptive error coming back. This is what I receive:

cudaThreadSynchronize error: Kernel execution failed in file <c:/C Projects/matrixMul/>, line 235 : unspecified launch failure.

Any help debugging this would be appreciated.

Thanks for your help!

I think this is an invalid memory access in your kernel code, caused by the index calculation:

int pos = BLOCK_SIZE *  bx + BLOCK_SIZE * BLOCK_SIZE * Width * by + tx + BLOCK_SIZE * Width * ty;

The global index of ((bx,by),(tx,ty)) is (bx*BLOCK_SIZE + tx, by*BLOCK_SIZE + ty).

Its row-major mapping is (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx).

so try

int pos = (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx);
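To make the out-of-range access concrete, here is a small sketch that evaluates both index formulas for the last thread of the last block in the 1024 x 1024 case from the question. The helper names originalPos and rowMajorPos are made up for illustration and are not part of the original code; the check runs on the host only.

```cuda
#include <cstdio>

#define BLOCK_SIZE 16

// Index formula as posted in the question (hypothetical helper name).
__host__ __device__ inline long long originalPos(int bx, int by, int tx, int ty, int Width)
{
    return (long long)BLOCK_SIZE * bx
         + (long long)BLOCK_SIZE * BLOCK_SIZE * Width * by
         + tx
         + (long long)BLOCK_SIZE * Width * ty;
}

// Row-major formula suggested above (hypothetical helper name).
__host__ __device__ inline long long rowMajorPos(int bx, int by, int tx, int ty, int Width)
{
    return (long long)(by * BLOCK_SIZE + ty) * Width + (bx * BLOCK_SIZE + tx);
}

int main()
{
    const int Width = 1024, Height = 1024;
    const int lastBx = Width / BLOCK_SIZE - 1;   // last block in x
    const int lastBy = Height / BLOCK_SIZE - 1;  // last block in y
    const int lastT  = BLOCK_SIZE - 1;           // last thread in either direction

    printf("elements:                %d\n", Width * Height);
    printf("largest original index:  %lld\n",
           originalPos(lastBx, lastBy, lastT, lastT, Width));
    printf("largest row-major index: %lld\n",
           rowMajorPos(lastBx, lastBy, lastT, lastT, Width));
    return 0;
}
```

A 1024 x 1024 array has 1048576 elements, so valid indices run from 0 to 1048575. The posted formula reaches 16761855 for the last thread, far past the end of both buffers, while the row-major formula tops out at 1048575.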

Moreover, if a dimension is not a multiple of BLOCK_SIZE, you need to impose a boundary condition so that threads past the edge of the matrix do not read or write out of range.
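As a sketch of that boundary condition, the kernel below computes the row and column first and returns early for threads that fall outside the matrix. It assumes a row-major layout. The 1000 x 700 test size, the CUDA_CHECK macro, and the buffer names are hypothetical, chosen only to make the example self-contained. cudaDeviceSynchronize is used in place of the older cudaThreadSynchronize that appears in the error message above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>

#define BLOCK_SIZE 16

// Hypothetical helper: print the CUDA error with its location and abort.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

__global__ void realToComplex(const float* A, cuComplex* B, int Width, int Height)
{
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;

    // Boundary condition: threads past the matrix edge do nothing.
    if (col >= Width || row >= Height) return;

    int pos = row * Width + col;  // row-major position
    B[pos].x = A[pos];
    B[pos].y = 0.0f;
}

int main()
{
    const int Width = 1000, Height = 700;  // deliberately not multiples of 16
    const size_t n = (size_t)Width * Height;

    std::vector<float> h_in(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 251);

    float* d_in = nullptr;
    cuComplex* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(cuComplex)));
    CUDA_CHECK(cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Round the grid up so the partial blocks at the edges are covered.
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((Width + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (Height + BLOCK_SIZE - 1) / BLOCK_SIZE);
    realToComplex<<<grid, threads>>>(d_in, d_out, Width, Height);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    std::vector<cuComplex> h_out(n);
    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, n * sizeof(cuComplex),
                          cudaMemcpyDeviceToHost));

    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (h_out[i].x != h_in[i] || h_out[i].y != 0.0f) ++mismatches;
    printf("mismatches: %zu\n", mismatches);

    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return mismatches == 0 ? 0 : 1;
}
```

Checking cudaGetLastError right after the launch and then the result of the synchronize call separates a bad launch configuration from a fault inside the kernel, which helps narrow down an "unspecified launch failure". Running the program under a memory checker such as compute-sanitizer should also point at the offending access.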

Thanks!! It works! I think I had an extra BLOCK_SIZE in Y. Thanks for the clarification. My code now takes about 0.6 ms according to the CUDA timer!

Thanks again