Basic operations take a long time?

Hi everyone,

I am implementing a kernel that calls a device function performing a simple operation:

__device__ void deviceFun(short int *inArray, short int *outArray, int idx, int index)
{
    // textArray is a texture reference declared elsewhere (not shown here)
    short int a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    a0 = tex1Dfetch(textArray, idx);
    a1 = tex1Dfetch(textArray, idx + 1);
    a2 = tex1Dfetch(textArray, idx + 2);
    a3 = tex1Dfetch(textArray, idx + 3);

    // Weighted sum of four consecutive texels (weights read from outArray[0..3]), shifted right by 12 bits
    outArray[index] = (__mul24(a0, outArray[0]) + __mul24(a1, outArray[1]) + __mul24(a2, outArray[2]) + __mul24(a3, outArray[3])) >> 12;
}

This device function takes 1.02 ms, which seems very long for such a simple function (I am using a Quadro CX).
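In case the measurement method matters, here is a minimal sketch of one way to time a kernel with CUDA events. It is not my actual code. The names fourTap and fourTapKernel, the array size, and the launch configuration are hypothetical, and the stand-in reads from global memory with a separate coefficient array instead of using the texture. A warm-up launch is included so that one-time startup cost is not counted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the device function: four-tap weighted sum
// read from global memory (assumption: separate coefficient array).
__device__ void fourTap(const short *in, const short *coef, short *out,
                        int idx, int index)
{
    int acc = in[idx]     * coef[0] + in[idx + 1] * coef[1]
            + in[idx + 2] * coef[2] + in[idx + 3] * coef[3];
    out[index] = (short)(acc >> 12);
}

// Hypothetical wrapper kernel: one output element per thread.
__global__ void fourTapKernel(const short *in, const short *coef,
                              short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        fourTap(in, coef, out, i, i);
}

int main()
{
    const int n = 1 << 20;            // assumed problem size
    const int threads = 256;          // assumed block size
    const int blocks = (n + threads - 1) / threads;

    short *in, *coef, *out;
    cudaMalloc(&in, (n + 3) * sizeof(short));   // +3 so idx+3 stays in bounds
    cudaMalloc(&coef, 4 * sizeof(short));
    cudaMalloc(&out, n * sizeof(short));
    cudaMemset(in, 0, (n + 3) * sizeof(short));
    cudaMemset(coef, 0, 4 * sizeof(short));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, excluded from the measurement.
    fourTapKernel<<<blocks, threads>>>(in, coef, out, n);
    cudaDeviceSynchronize();

    // Timed launch: events are recorded on the same stream as the kernel.
    cudaEventRecord(start);
    fourTapKernel<<<blocks, threads>>>(in, coef, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(coef);
    cudaFree(out);
    return 0;
}
```

The time reported this way covers the whole kernel launch over all threads, not a single call of the device function, so it depends on the grid size as well as on the function body.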

Is something missing here that makes it take so long?