Hi!
I have a question about a simple kernel I am trying to write: I do not understand why it is failing. I am new to CUDA and this is my first time writing a kernel (I have only used the FFT and CUBLAS libraries so far).
Is there a way to debug a kernel?
Basically, I want to build a cuComplex matrix from a float matrix by assigning each value of the float matrix to the .x (real) part of the corresponding cuComplex element.
Currently, I send the vector back to host memory and do the conversion on the host, which takes about 20 ms (on a 1024*1024 array); that is far too long for my application. I understand that CUBLAS uses column-major order, but my code should still at least run rather than crash.
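To make the layout point concrete, here is a minimal host-side sketch in plain C++ of how the two conventions map a (row, column) pair to a linear offset. The function names `rowMajorIndex` and `colMajorIndex` are made up for illustration and are not part of CUBLAS.

```cpp
// Linear offset of element (row, col) in a row-major array with `width` columns.
constexpr long long rowMajorIndex(int row, int col, int width) {
    return static_cast<long long>(row) * width + col;
}

// Linear offset of the same element in a column-major array with `height` rows.
constexpr long long colMajorIndex(int row, int col, int height) {
    return static_cast<long long>(col) * height + row;
}
```

Both mappings cover exactly the offsets 0 to rows*columns - 1, so an element-wise copy in which source and destination share one layout should stay in bounds under either convention.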
Here is what I have been trying to do (on a 1024 x 1024 array, so ROWS = COLUMNS):
[codebox]
#include <cuComplex.h>

// Thread block size
#define BLOCK_SIZE 16

// Host side (excerpt from the calling function): set up execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(COLUMNS / threads.x, ROWS / threads.y);

// Execute the kernel
realToComplex<<< grid, threads >>>(d_image_buff, d_image_complex_buff, COLUMNS, ROWS);

// Device side: copy each float in A into the real part of B
__global__ void
realToComplex(float* A, cuComplex* B, int Width, int Height)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Calculating the position of the element that will be converted
    int pos = BLOCK_SIZE * bx +
              BLOCK_SIZE * BLOCK_SIZE * Width * by +
              tx +
              BLOCK_SIZE * Width * ty;

    B[pos].x = A[pos];
    B[pos].y = 0.0f;
}
[/codebox]
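One way to narrow this down without a device debugger is to evaluate the same index formula on the host for the last thread of the last block and compare the result with the number of elements. The sketch below does that in plain C++. `posAsPosted` copies the formula from the kernel above; `posRowMajor` is an assumed row-major alternative (row * Width + column), not something confirmed to be the intended layout. Both function names are made up for this check.

```cpp
// Host-side check of the index arithmetic; plain C++, no GPU needed.
constexpr int BLOCK_SIZE = 16;  // same value as in the kernel above

// Index exactly as written in the posted kernel.
constexpr long long posAsPosted(int bx, int by, int tx, int ty, int width) {
    return static_cast<long long>(BLOCK_SIZE) * bx
         + static_cast<long long>(BLOCK_SIZE) * BLOCK_SIZE * width * by
         + tx
         + static_cast<long long>(BLOCK_SIZE) * width * ty;
}

// Assumed alternative: conventional row-major index, row * width + col,
// with row = BLOCK_SIZE * by + ty and col = BLOCK_SIZE * bx + tx.
constexpr long long posRowMajor(int bx, int by, int tx, int ty, int width) {
    return static_cast<long long>(BLOCK_SIZE * by + ty) * width
         + (BLOCK_SIZE * bx + tx);
}
```

For the 1024 x 1024 case the grid is 64 x 64 blocks, so the last thread has bx = by = 63 and tx = ty = 15. The posted formula gives index 16761855 for that thread, while the array only holds 1048576 elements; the row-major form gives 1048575, the last valid index. Reads and writes that far out of bounds would be consistent with an unspecified launch failure.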
No descriptive error comes back. This is all I receive:
cudaThreadSynchronize error: Kernel execution failed in file <c:/C Projects/matrixMul/matrixMul.cu>, line 235 : unspecified launch failure.
Any help debugging this would be appreciated.
Thanks for your help!