Kernel optimization question: implementing MATLAB's abs() for complex numbers

Hi All!

I am a beginner in CUDA and have been trying to write a kernel that computes the absolute value of a complex number.

I have a 1024x1024 array and, based on the matrix multiplication examples and the CUDA videos, I am using a 64x64 grid of blocks with a BLOCK_SIZE of 16.

This code takes about 28 ms, which is rather long compared to other complex operations (a complex matrix multiplication of two 1024x1024 matrices takes about 1 ms).

This is the kernel I have so far

[codebox]// setup execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);

// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);

__global__ void
abs_complex(float* A, cuComplex* B, int Width)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Calculate the position of the element that will be converted,
    // using row-major ordering
    int pos = (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx);

    A[pos] = sqrtf(B[pos].x*B[pos].x + B[pos].y*B[pos].y);
}
#endif // #ifndef REAL_TO_COMPLEX_KERNEL_H[/codebox]

I do not know how to do it faster. Any hint/advice/idea would be greatly appreciated.

Thanks a lot in advance!

First off - I’d do it all in 1D. Fewer additions are required on the kernel side. This means you want a 1D block (dim3 threads2(256)) and a 1D grid (dim3 grid2(4096)); pos is then simply bx*256 + tx. It doesn’t save much computation, but it seems a lot more understandable to me.

(I should probably point out here that I’m not entirely certain what type cuComplex is under the hood. I’d guess it’s a float2. Having said that, it shouldn’t really matter for the rest of this post.)

Now… the rest. I’m a bit confused as to why it’s slow. You have two coalesced reads and one coalesced store. Have you been messing with your input pointers at all (that is, are they still the same pointers returned by cudaMalloc)? If so, you could have messed up alignment, and that would slow you down, especially on older cards.

Failing that, it may be the compiler being foolish. I would have thought it would have to be really quite foolish, though. Might be an idea to do it a bit more explicitly:

cuComplex value = B[pos];
A[pos] = sqrt(value.x*value.x + value.y*value.y);

That should definitely be coalesced.

Thanks a lot for your response!

I am allocating both buffers using the cublasAlloc function in the following way:

status = cublasAlloc(lSize, sizeof(d_image_buff[0]), (void**)&d_image_buff);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_MEM_ALLOCATION;}

status = cublasAlloc(lSize, sizeof(d_result_buff[0]), (void**)&d_result_buff);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_MEM_ALLOCATION;}

where lSize = 1024*1024

I have started also using shared memory in the following way:

// Calculating the position of the element that will be converted,
// using row-major ordering
int pos = (by*BLOCK_SIZE + ty)*Width + (bx*BLOCK_SIZE + tx);

__shared__ float realpart;
__shared__ float imgpart;

realpart = B[pos].x;
imgpart = B[pos].y;

A[pos] = sqrtf(realpart*realpart + imgpart*imgpart);

but I do not see any improvement. I will try the 1D addressing and will let you know

Thanks for your help again!

Regards,

Shared memory won’t help at all - in fact, it should give you the wrong results. Shared memory is only useful for intra-block communication. What you’ve done is otherwise ok; just ditch the __shared__ specifier. Given you saw no improvement, I think the compiler probably wasn’t being stupid.

Next question - how are you doing your timings? Are you sure they’re correct?

Thanks for the ideas!

It is taking about 0.05 ms!

I was doing the time measurement wrong. I was doing something like:

[codebox]unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));

// setup execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);

// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);

// Getting the result back to create the complex image matrix
status = cublasGetVector(lSize, sizeof(d_image_buff[0]), d_image_buff, 1, h_image_buff, 1);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_CUBLAS;}

// stop and destroy timer
cutilCheckError(cutStopTimer(timer));
printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
cutilCheckError(cutDeleteTimer(timer));[/codebox]

Sorry about the silly mistake… but this is interesting anyway: is cublasGetVector() supposed to take this long? (28 ms!!!)
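One thing worth keeping in mind when reading that number: the cut* timers measure wall-clock time on the host, and kernel launches are asynchronous, so the timed region above covers the kernel execution plus the blocking device-to-host copy together. A sketch of isolating the kernel with CUDA events (cudaEvent* calls are part of the CUDA runtime API; buffer and grid names are taken from the posts above):

```cuda
// Time just the kernel with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);            // wait until the kernel has finished

float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, start, stop);
printf("Kernel time: %f (ms)\n", kernel_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Timing the cublasGetVector() call the same way would show how much of the 28 ms is the device-to-host copy itself.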

Thanks for your help!