Kernel optimization question: implementing MATLAB's abs() for complex numbers

Hi All!

I am a beginner in CUDA and have been trying to write a kernel that computes the absolute value of a complex number.

I have a 1024x1024 array and, based on the matrix multiplication examples and the CUDA videos, I am using a 64x64 grid of blocks with a BLOCK_SIZE of 16.

This code takes about 28 ms, which is rather long compared to other complex operations (a complex matrix multiplication of two 1024x1024 matrices takes about 1 ms).

This is the kernel I have so far

[codebox]// setup execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);

// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);

__global__ void
abs_complex(float* A, cuComplex* B, int Width)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Calculate the position of the element that will be converted,
    // using row-major ordering
    int pos = (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx);

    A[pos] = sqrtf(B[pos].x*B[pos].x + B[pos].y*B[pos].y);
}
#endif // #ifndef REAL_TO_COMPLEX_KERNEL_H[/codebox]

I do not know how to do it faster. Any hint/advice/idea would be greatly appreciated.

Thanks a lot in advance!

First off - I’d do it all in 1D. Fewer additions are required on the kernel side. This means you want a 1D block (dim3 threads2(256)) and a 1D grid (dim3 grid2(4096)); pos is then simply bx*256 + tx. It doesn’t save much computation, but it seems a lot more understandable to me.

(I should probably point out here that I’m not entirely certain what type cuComplex is under the hood. I’d guess it’s a float2. Having said that, it shouldn’t really matter for the rest of this post.)

Now… the rest. I’m a bit confused as to why it’s slow. You have two coalesced reads and one coalesced store. Have you been messing with your input pointers at all (that is, are they still the same pointers returned by cudaMalloc)? If so, you could have messed up alignment, and that would slow you down, especially on older cards.

Failing that, it may be the compiler being foolish. I would have thought it would have to be really quite foolish, though. Might be an idea to do it a bit more explicitly:

cuComplex value = B[pos];
A[pos] = sqrt(value.x*value.x + value.y*value.y);

That should definitely be coalesced.

Thanks a lot for your response!

I am allocating both buffers using the cublasAlloc function in the following way:

status = cublasAlloc(lSize, sizeof(d_image_buff[0]), (void**)&d_image_buff);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_MEM_ALLOCATION;}

status = cublasAlloc(lSize, sizeof(d_result_buff[0]), (void**)&d_result_buff);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_MEM_ALLOCATION;}

where lSize = 1024*1024

I have started also using shared memory in the following way:

// Calculating the position of the element that will be converted,
// using row-major ordering
int pos = (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx);

__shared__ float realpart;
__shared__ float imgpart;

realpart = B[pos].x;
imgpart = B[pos].y;

A[pos] = sqrtf(realpart*realpart + imgpart*imgpart);

but I do not see any improvement. I will try the 1D addressing and will let you know

Thanks for your help again!

Regards,

Shared memory won’t help at all - in fact, it should give you the wrong results. Shared memory is only useful for intra-block communication. What you’ve done is otherwise ok; just ditch the __shared__ specifier. Given you saw no improvement, I think the compiler probably wasn’t being stupid.

Next question - how are you doing your timings? Are you sure they’re correct?

Thanks for the ideas!

It is taking about 0.05 ms!

I was doing the time measurement wrong. I was doing something like:

[codebox]unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));

// setup execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);

// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);

// Getting the result back to create the complex image matrix
status = cublasGetVector(lSize, sizeof(d_image_buff[0]), d_image_buff, 1, h_image_buff, 1);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_CUBLAS;}

// stop and destroy timer
cutilCheckError(cutStopTimer(timer));
printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
cutilCheckError(cutDeleteTimer(timer));[/codebox]

Sorry about the silly mistake… but this is interesting anyway: is cublasGetVector() supposed to take this long? (28 ms!!!)
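One thing worth keeping in mind when reading that number: the cut* timers measure wall-clock time on the host, and kernel launches are asynchronous, so the timed region above covers the kernel execution plus the blocking device-to-host copy together. A sketch of isolating the kernel with CUDA events (cudaEvent* calls are part of the CUDA runtime API; buffer and grid names are taken from the posts above):

```cuda
// Time just the kernel with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);            // wait until the kernel has finished

float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, start, stop);
printf("Kernel time: %f (ms)\n", kernel_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Timing the cublasGetVector() call the same way would show how much of the 28 ms is the device-to-host copy itself.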

Thanks for your help!