Hi All,

```
#define d_coeff 1024

__global__ void foo( unsigned char* array_1, unsigned char* array_2 )
{
    // Global thread index
    long k = blockDim.x * blockIdx.x + threadIdx.x;
    if( k < 4320000 )
    {
        int ind = 0;
        // some operations here. ind is updated here.
        ind = array_2[k];
        unsigned short int coeff = 0;
        coeff = array_2[ind << 8];
        for( int i = 0; i < 3; i++ )
        {
            int p_value = ( __mul24( array_1[i], coeff ) + d_coeff ) >> 8;
            array_1[i] = (unsigned char)p_value;
        }
    }
}
```

This kernel is called as:

foo<<<8438, 512>>>(array_1, array_2);

Both array_1 and array_2 are of size 1728000.

The whole execution time for this function is **14 ms**. If I comment out **array_1[i] * coeff** in the inner **for** loop, the execution time drops to **4 ms**.
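For context, one way to take a measurement like this is with CUDA events. The sketch below is only an assumption about the setup, not necessarily how the numbers above were obtained. The kernel body is the one posted above. The buffer size `N_ELEMS` and the fill value are hypothetical: the buffers are sized to the kernel's bound of 4320000 so that every index the kernel reads stays in range.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define d_coeff 1024
#define N_ELEMS 4320000  // assumption: buffers sized to the kernel's index bound

// Kernel body as posted above.
__global__ void foo(unsigned char* array_1, unsigned char* array_2)
{
    long k = blockDim.x * blockIdx.x + threadIdx.x;
    if (k < N_ELEMS)
    {
        int ind = array_2[k];
        unsigned short int coeff = array_2[ind << 8];
        for (int i = 0; i < 3; i++)
        {
            int p_value = (__mul24(array_1[i], coeff) + d_coeff) >> 8;
            array_1[i] = (unsigned char)p_value;
        }
    }
}

int main()
{
    unsigned char *array_1 = nullptr, *array_2 = nullptr;
    cudaMalloc((void**)&array_1, N_ELEMS);
    cudaMalloc((void**)&array_2, N_ELEMS);
    cudaMemset(array_1, 1, N_ELEMS);  // hypothetical fill value
    cudaMemset(array_2, 1, N_ELEMS);  // hypothetical fill value

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not included in the timing.
    foo<<<8438, 512>>>(array_1, array_2);
    cudaDeviceSynchronize();

    // Timed launch: events are recorded on the same stream as the kernel.
    cudaEventRecord(start);
    foo<<<8438, 512>>>(array_1, array_2);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("foo: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(array_1);
    cudaFree(array_2);
    return 0;
}
```

Timing only the kernel this way keeps host-side allocation and copy time out of the reported number.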

My questions are :

(1) Does multiplication take so much time?

(2) If yes, how can I reduce the execution time of this multiplication?

I have used **__mul24(array_1[i], coeff)** instead of **array_1[i] * coeff**, but did not get any improvement in execution time.