I am a beginner in CUDA and have been trying to write a kernel that computes the absolute value of every element of a complex array.
I have a 1024x1024 array and, based on the matrix multiplication examples and CUDA videos, I am using a 64x64 grid of blocks with a BLOCK_SIZE of 16 (16x16 threads per block).
The kernel takes about 28 ms, which is rather long compared to other operations on complex data (for example, a complex matrix multiplication of two 1024x1024 matrices takes about 1 ms).
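As a sanity check on the launch configuration described above, here is a small host-side sketch of the arithmetic. It is plain C++ and needs no GPU. The `LaunchGeometry` struct and the `launch_geometry` function are hypothetical names used only for this illustration. The sizes are the ones quoted in this post.

```cpp
#include <cassert>

// Hypothetical helper type, used only in this illustration.
struct LaunchGeometry {
    int grid_x;             // number of blocks along the columns
    int grid_y;             // number of blocks along the rows
    int threads_per_block;  // block_size * block_size
    long total_threads;     // one thread per array element
};

// Mirrors the host-side setup: grid = (COLUMNS / BLOCK_SIZE, ROWS / BLOCK_SIZE).
// The integer division silently drops any remainder, so the sketch insists
// that both dimensions are exact multiples of the block size.
LaunchGeometry launch_geometry(int rows, int columns, int block_size) {
    assert(block_size > 0);
    assert(rows % block_size == 0 && columns % block_size == 0);

    LaunchGeometry g;
    g.grid_x = columns / block_size;
    g.grid_y = rows / block_size;
    g.threads_per_block = block_size * block_size;
    g.total_threads = static_cast<long>(g.grid_x) * g.grid_y * g.threads_per_block;
    return g;
}
```

Calling `launch_geometry(1024, 1024, 16)` gives a 64x64 grid with 256 threads per block, that is, one thread for each of the 1,048,576 elements.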
This is the kernel I have so far:
[codebox]// setup execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);
// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);

#ifndef REAL_TO_COMPLEX_KERNEL_H
#define REAL_TO_COMPLEX_KERNEL_H
// A: output magnitudes, B: complex input, Width: number of columns
__global__ void abs_complex( float* A, cuComplex* B, int Width)
{
    // Block index
    int bx = blockIdx.x; int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Calculating the position of the element that will be converted
    // Using row-major ordering: (row * Width + column)
    int pos = (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx);
    A[pos] = sqrtf(B[pos].x*B[pos].x + B[pos].y*B[pos].y);
}
#endif // #ifndef REAL_TO_COMPLEX_KERNEL_H[/codebox]
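To check that the kernel output is correct while experimenting with its speed, a CPU reference can be compared against the result copied back from the device. The following is only a minimal sketch in plain C++. `Complex2f` is a hypothetical stand-in for `cuComplex` (which stores the real part in `.x` and the imaginary part in `.y`) so that the code compiles without the CUDA headers, and `abs_complex_reference` and `max_abs_diff` are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cuComplex: x = real part, y = imaginary part.
struct Complex2f { float x; float y; };

// CPU reference: A[pos] = |B[pos]|, using the same (row * width + column)
// indexing as the kernel in this post.
void abs_complex_reference(std::vector<float>& A,
                           const std::vector<Complex2f>& B,
                           int width, int height) {
    const std::size_t n = static_cast<std::size_t>(width) * height;
    assert(A.size() == n && B.size() == n);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            const std::size_t pos = static_cast<std::size_t>(row) * width + col;
            A[pos] = std::sqrt(B[pos].x * B[pos].x + B[pos].y * B[pos].y);
        }
    }
}

// Largest absolute difference between two result buffers of equal length,
// e.g. the CPU reference and the data copied back from the GPU.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

A small tolerance on `max_abs_diff` is probably more appropriate than an exact comparison, since the device and host square roots are not guaranteed to round identically.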
I do not know how to do it faster. Any hint/advice/idea would be greatly appreciated.
Thanks a lot in advance!