Hi,

I’m trying to implement a real-time image reconstruction algorithm in CUDA. My initial attempt produces correct results when compared with a serial implementation, but the performance is rather poor. I have been trying to optimise the kernel with the Visual Profiler, which reports that performance is bound by instruction and memory latency. The kernel profile shows that the texture interpolation (tex1DLayered) is where the kernel stalls, with execution dependency as the stall reason.

Another problem is the high register usage (38 registers/thread), which limits each SM to executing only 1 block at a time, out of a possible 32, on a GTX 980.
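For reference, here is a rough sketch of the arithmetic behind that limit. It assumes the kernel is launched with 1024 threads per block (the launch configuration is not shown in this post) and uses the published per-SM limits for compute capability 5.2: 65,536 registers, 2,048 resident threads and 32 resident blocks. The helper name `residentBlocksPerSM` is made up for illustration, and the calculation ignores register allocation granularity and shared memory, so the result is an upper bound rather than what the profiler will necessarily report.

```cpp
// Host-only arithmetic; compiles with nvcc or any C++11 compiler.

// Published per-SM limits for compute capability 5.2 (GTX 980).
constexpr int kRegistersPerSM  = 65536; // 32-bit registers per SM
constexpr int kMaxThreadsPerSM = 2048;  // resident threads per SM
constexpr int kMaxBlocksPerSM  = 32;    // resident blocks per SM

constexpr int minOf(int a, int b) { return a < b ? a : b; }

// Hypothetical helper: upper bound on resident blocks per SM, taking the
// smallest of the register, thread and block-slot limits.
// Simplification (assumption): register allocation granularity and shared
// memory usage are ignored.
constexpr int residentBlocksPerSM(int regsPerThread, int threadsPerBlock)
{
    return minOf(minOf(kRegistersPerSM / (regsPerThread * threadsPerBlock),
                       kMaxThreadsPerSM / threadsPerBlock),
                 kMaxBlocksPerSM);
}
```

Under these assumptions `residentBlocksPerSM(38, 1024)` gives 1, because two 1024-thread blocks would need 77,824 registers. With 32 registers per thread at the same block size it gives 2, and with 256 threads per block at 38 registers it gives 6.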

I would be very grateful if someone could have a look at the kernel below and suggest some improvements.

Thanks.

Andy

```
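__global__ void kernel_DAS_DynApo7(float2* __restrict__ d_LRI,
                                   const float* __restrict__ d_pixelMapX,
                                   const float* __restrict__ d_pixelMapZ,
                                   const float angle,
                                   const unsigned int imageHeight,
                                   const unsigned int imageWidth,
                                   const unsigned int numChannel)
{
    // Linear pixel index; blockIdx.y selects the receive channel.
    const int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < imageWidth * imageHeight)
    {
        const int i = ind % imageHeight;  // index into d_pixelMapZ
        const int j = ind / imageHeight;  // index into d_pixelMapX
        const int ChannelInd = blockIdx.y;

        // Pixel position and its lateral offset from this channel's element.
        const float zIm = d_pixelMapZ[i];
        const float xIm = d_pixelMapX[j];
        const float xObj = xIm - d_ElePosC[ChannelInd];

        // Angle-dependent apodization weight: sin(k)/k * cos(theta).
        const float apoTheta = atanf(__fdividef(xObj, zIm));
        const float k = 0.7854f * d_pitch * d_t2wl * __sinf(apoTheta);
        float apo;
        if (k == 0.0f) {
            apo = 1.0f;  // avoid 0/0 when k is zero
        }
        else {
            apo = __fdividef(__sinf(k) * __cosf(apoTheta), k);
        }

        if (apo > 0.0f)
        {
            // Transmit and receive path lengths for this pixel/channel pair.
            const float t_tx = zIm * __cosf(angle) + xIm * __sinf(angle);
            const float t_rx = __fsqrt_rn((zIm * zIm) + xObj * xObj);
            // Texture coordinate of the delayed sample.
            const float t_bf = (t_tx + t_rx) * d_t2wl + d_t0;

            // Texture interpolation
            float2 temp = tex1DLayered(d_RF_Layertex, t_bf, ChannelInd);
            temp.x = temp.x * apo;
            temp.y = temp.y * apo;

            // Accumulate into the output pixel. Every channel (blockIdx.y)
            // adds to the same d_LRI[ind], and the update is not atomic.
            d_LRI[ind].x += temp.x;
            d_LRI[ind].y += temp.y;
        }
    }
}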
```