Hi All,
#define d_coeff 1024
__global__ void foo( unsigned char* array_1, unsigned char* array_2 )
{
    long k = blockDim.x * blockIdx.x + threadIdx.x;
    if( k < 4320000 )
    {
        int ind = 0;
        // some operations here. ind is updated here.
        ind = array_2[k];
        unsigned short int coeff = 0;
        coeff = array_2[ind << 8];
        for( int i = 0; i < 3; i++ )
        {
            int p_value = ( __mul24( array_1[i], coeff ) + d_coeff ) >> 8;
            array_1[i] = (unsigned char)p_value;
        }
    }
}
This kernel is called as:
foo<<<8438, 512>>>(array_1, array_2);
and array_1 and array_2 are of size 1728000.
The whole execution time for this function is 14 ms. If I comment out array_1[i] * coeff in
the inner for loop, the execution time drops to 4 ms.
My questions are:
(1) Does multiplication take so much time?
(2) If yes, how can I reduce the execution time of this multiplication?
I have used __mul24(array_1[i], coeff) instead of array_1[i] * coeff, but
did not get any improvement in execution time.
x248
May 21, 2009, 10:01am
2
You are using an array.
See http://forums.nvidia.com/index.php?showtopic=97222&st=20
With an array you are in local memory, not in registers.
Try it without the array and you will see the difference in performance.
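A minimal sketch of the "without array" suggestion, assuming it means holding the three bytes in named scalar variables instead of indexing them in a loop. The names Triple, scale_byte and scale_triple are hypothetical, and plain multiplication stands in for __mul24 so the helpers can also be compiled for the host. Note that array_1 and array_2 in the posted kernel are pointer arguments, which normally refer to global memory rather than local memory, so this change may or may not account for the timing difference.

```cuda
#define D_COEFF 1024  // same value as d_coeff in the post

// Hypothetical container for the three bytes one thread works on.
struct Triple { unsigned char a, b, c; };

// Same arithmetic as one loop iteration in foo:
// multiply, add the constant, shift right by 8, truncate to a byte.
__host__ __device__ inline unsigned char scale_byte(unsigned char v,
                                                    unsigned short coeff)
{
    int p_value = ((int)v * (int)coeff + D_COEFF) >> 8;
    return (unsigned char)p_value;
}

// The loop written out by hand: every value lives in a named scalar,
// so there is no per-thread array for the compiler to index.
__host__ __device__ inline Triple scale_triple(Triple t, unsigned short coeff)
{
    t.a = scale_byte(t.a, coeff);
    t.b = scale_byte(t.b, coeff);
    t.c = scale_byte(t.c, coeff);
    return t;
}
```

Inside a kernel, a thread would load its three bytes into a Triple, call scale_triple once, and store the result back.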
You also have an obvious race condition on array_1: every thread reads and writes array_1[0] through array_1[2]. It won’t hurt performance, but your program will not work correctly.
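A minimal sketch of one way to avoid the race, assuming each thread is meant to scale its own group of three consecutive bytes rather than array_1[0] to array_1[2]. The kernel name foo_per_thread, the parameter n_groups and the byte layout are hypothetical; the lookup through array_2 is kept as in the post.

```cuda
#define D_COEFF 1024  // same value as d_coeff in the post

// Hypothetical variant of foo: thread k updates bytes 3*k .. 3*k+2 only,
// so no two threads read or write the same element of array_1.
// Assumes array_1 holds at least 3 * n_groups bytes and array_2 holds at
// least max(n_groups, (255 << 8) + 1) bytes.
__global__ void foo_per_thread(unsigned char* array_1,
                               const unsigned char* array_2,
                               int n_groups)
{
    int k = blockDim.x * blockIdx.x + threadIdx.x;
    if (k < n_groups)
    {
        int ind = array_2[k];                      // per-thread table index
        unsigned short coeff = array_2[ind << 8];  // coefficient lookup as in the post
        int base = 3 * k;                          // first byte owned by this thread
        for (int i = 0; i < 3; i++)
        {
            int p_value = (__mul24(array_1[base + i], coeff) + D_COEFF) >> 8;
            array_1[base + i] = (unsigned char)p_value;
        }
    }
}
```

A launch such as foo_per_thread<<<(n_groups + 511) / 512, 512>>>(array_1, array_2, n_groups) would cover all groups, with the bounds check discarding the surplus threads in the last block.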