#define d_coeff 1024

__global__ void foo( unsigned char* array_1, unsigned char* array_2 )
{
	long k = blockDim.x * blockIdx.x + threadIdx.x;

	if( k < 4320000 )
	{
		int ind = 0;

		// some operations here. ind is updated here.
		ind = array_2[k];

		unsigned short int coeff = 0;
		coeff = array_2[ind << 8];

		for( int i = 0; i < 3; i++ )
		{
			int p_value = ( __mul24( array_1[i], coeff ) + d_coeff ) >> 8;
			array_1[i] = (unsigned char)p_value;
		}
	}
}

This kernel is called as:

foo<<<8438, 512>>>(array_1, array_2);

Both array_1 and array_2 are of size 1728000.

The whole execution time for this function is 14 ms. If I comment out array_1[i] * coeff in the inner for loop, the execution time drops to 4 ms.
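For anyone who wants to reproduce timings like these, here is a minimal sketch of timing a single kernel launch with CUDA events. It is not how the numbers above were taken (the post does not say); the kernel touch and the helper time_launch_ms are hypothetical names used only for illustration, and touch stands in for the kernel being measured.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel being measured.
__global__ void touch(unsigned char* data, int n)
{
    int k = blockDim.x * blockIdx.x + threadIdx.x;
    if (k < n)
        data[k] = (unsigned char)(data[k] + 1);
}

// Returns the GPU time of one launch in milliseconds, measured with events.
float time_launch_ms(unsigned char* d_data, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 512;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    touch<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    assert(cudaGetLastError() == cudaSuccess);

    // Wait until the kernel and the stop event have completed.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

One caveat when comparing two variants of a kernel this way: if an expression is commented out, the compiler may also remove the memory loads that only fed that expression, so the difference between the two timings may include memory traffic and not just the arithmetic.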

My questions are:

(1) Does multiplication take so much time?

(2) If yes, how can I reduce the execution time of this multiplication?

I have used <b>__mul24(array_1[i],coeff)</b> instead of <b>array_1[i]* coeff</b>, but did not get any improvement in execution time.

You are using an array.

With an array you are in local memory and not in registers.

Try without the array, and you will see the performance…

You also have an obvious race condition on array_1: every thread reads and writes the same three elements, array_1[0] through array_1[2]. It won’t hurt performance but your program will not work correctly.
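A rough sketch of what a kernel without the shared writes could look like is below. It rests on assumptions the thread does not confirm: that each thread is meant to scale its own three consecutive bytes (for example one RGB pixel), and that there is one coefficient per pixel. The names scale_pixels, run_scale, coeffs and n_pixels are made up for illustration. Each thread works on scalars that can stay in registers, and no two threads touch the same element.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define D_COEFF 1024  // same constant as d_coeff in the question

// Hypothetical variant of foo: thread k owns bytes 3*k .. 3*k+2 of data,
// so no element is read or written by more than one thread.
__global__ void scale_pixels(unsigned char* data,
                             const unsigned char* coeffs,
                             int n_pixels)
{
    int k = blockDim.x * blockIdx.x + threadIdx.x;
    if (k < n_pixels)
    {
        int coeff = coeffs[k];            // assumed: one coefficient per pixel
        unsigned char* p = data + 3 * k;  // this thread's three bytes

        // Scalars instead of an indexed loop; these can stay in registers.
        int c0 = (p[0] * coeff + D_COEFF) >> 8;
        int c1 = (p[1] * coeff + D_COEFF) >> 8;
        int c2 = (p[2] * coeff + D_COEFF) >> 8;

        p[0] = (unsigned char)c0;
        p[1] = (unsigned char)c1;
        p[2] = (unsigned char)c2;
    }
}

// Host helper: copies the inputs to the GPU, runs the kernel, copies back.
void run_scale(unsigned char* h_data, const unsigned char* h_coeffs, int n_pixels)
{
    unsigned char* d_data = nullptr;
    unsigned char* d_coeffs = nullptr;
    size_t data_bytes = 3 * (size_t)n_pixels;

    cudaMalloc((void**)&d_data, data_bytes);
    cudaMalloc((void**)&d_coeffs, n_pixels);
    cudaMemcpy(d_data, h_data, data_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_coeffs, h_coeffs, n_pixels, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n_pixels + threads - 1) / threads;
    scale_pixels<<<blocks, threads>>>(d_data, d_coeffs, n_pixels);
    assert(cudaGetLastError() == cudaSuccess);

    cudaMemcpy(h_data, d_data, data_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_coeffs);
}
```

With three bytes per thread the global memory accesses are strided, so they are probably not ideally coalesced; whether that matters for this workload would have to be measured.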