Multiplication takes a long time?

Manjunath_Gudisi · May 21, 2009, 9:43am

Hi All,

#define d_coeff 1024

__global__ void foo( unsigned char* array_1, unsigned char* array_2 )
{
	long k = blockDim.x * blockIdx.x + threadIdx.x;

	if(k < 4320000 )
	{
		int ind = 0;
		// some operations here. ind is updated here.
		ind = array_2[k];

		unsigned short int coeff = 0;
		coeff = array_2[ind << 8];
		for( int i = 0; i < 3; i++ )
		{
			int p_value = ( __mul24( array_1[i], coeff ) + d_coeff ) >> 8;
			array_1[i] = (unsigned char)p_value;
		}
	}
}

This kernel is called as:

foo<<<8438, 512>>>(array_1, array_2);

and array_1 and array_2 are of size 1728000.

The whole execution time for this function is 14 ms. If I comment out array_1[i] * coeff in the inner for loop, the execution time is 4 ms.

My questions are:

(1) Does multiplication take so much time?

(2) If yes, how can I reduce the execution time of this multiplication?

I have used __mul24(array_1[i], coeff) instead of array_1[i] * coeff, but did not get any improvement in execution time.

x248 · May 21, 2009, 10:01am

You are using an array.

See http://forums.nvidia.com/index.php?showtopic=97222&st=20

With an array you are in local memory, not in registers.

Try without the array, and you will see the performance…

Jamie_K · May 21, 2009, 2:03pm

You also have an obvious race condition on array_1. It won’t hurt performance but your program will not work correctly. Your main problem is you are in the wrong line of work.
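
To illustrate the race condition: in the posted kernel every thread reads and writes array_1[0], array_1[1] and array_1[2], so the final values depend on the order in which threads happen to run. Below is a minimal sketch of one way to avoid that, assuming each thread is meant to own three consecutive bytes (for example one RGB pixel) and that there is one coefficient per pixel. The names scale_pixels, run_scale, pixels and coeffs are hypothetical and not from the original post, and the simplified coefficient lookup replaces the original array_2[ind << 8] indexing only for the sake of a short example.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define d_coeff 1024  // same constant as in the original post

// Hypothetical race-free variant: thread k touches only pixels[3*k .. 3*k+2],
// so no two threads ever read or write the same byte.
__global__ void scale_pixels(unsigned char* pixels,
                             const unsigned char* coeffs,
                             int n_pixels)
{
    int k = blockDim.x * blockIdx.x + threadIdx.x;
    if (k < n_pixels)
    {
        int coeff = coeffs[k];  // assumption: one coefficient per pixel
        for (int i = 0; i < 3; i++)
        {
            int idx = 3 * k + i;
            int p_value = (__mul24(pixels[idx], coeff) + d_coeff) >> 8;
            pixels[idx] = (unsigned char)p_value;
        }
    }
}

// Host helper (hypothetical): copies the data to the GPU, runs the kernel,
// and returns the result. Expects pixels.size() == 3 * coeffs.size().
// Error checking is omitted to keep the sketch short.
std::vector<unsigned char> run_scale(std::vector<unsigned char> pixels,
                                     const std::vector<unsigned char>& coeffs)
{
    int n_pixels = (int)coeffs.size();
    if (n_pixels == 0) return pixels;

    unsigned char* d_pixels = 0;
    unsigned char* d_coeffs = 0;
    cudaMalloc(&d_pixels, pixels.size());
    cudaMalloc(&d_coeffs, coeffs.size());
    cudaMemcpy(d_pixels, pixels.data(), pixels.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_coeffs, coeffs.data(), coeffs.size(), cudaMemcpyHostToDevice);

    int threads = 512;
    int blocks = (n_pixels + threads - 1) / threads;  // round up
    scale_pixels<<<blocks, threads>>>(d_pixels, d_coeffs, n_pixels);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(pixels.data(), d_pixels, pixels.size(), cudaMemcpyDeviceToHost);
    cudaFree(d_pixels);
    cudaFree(d_coeffs);
    return pixels;
}
```

With this layout the result no longer depends on thread scheduling, because each output byte is written by exactly one thread. Whether it matches the intended algorithm depends on how array_1 is really organised, which the original post does not say.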