Is texture-based interpolation the best option for volume warping?

Hey everybody,

I am working on a distortion correction algorithm for a 3D volume of 1024x512x512 8-bit pixels. Basically, each pixel is remapped to a new location in the output image, which is about 574x704x704. In each kernel thread I compute the new pixel position, fetch the interpolated pixel value via a 3D texture, and write it to the output array with coalesced accesses.

Yet I noticed during profiling that the texture access takes a lot of time; the Visual Profiler says the kernel is mainly limited by memory bandwidth. So does it make sense to use a texture here, or should I try writing my own interpolation function based on the faster shared memory/registers? (I sketched what I mean by that at the bottom of this post.)

Currently I am at 28 registers per thread plus 60 B of shared memory per block, at 100% occupancy.

Here is a quick timing overview (running on a GT 640):

  1. Writing a constant value to each pixel of the output array - 51 ms
  2. Same as 1., but the new coordinates are computed as well (without being used) - 1559 ms
  3. Calculating the coordinates, fetching the value from the texture, and writing it to the output - 3657 ms
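
In case it matters: I take these timings with CUDA events, roughly along these lines:

cudaEvent_t start, stop;
float elapsedMs = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
d_correctFanBeamDistortion<<<512*numSMs, blockSize>>>(d_output, imageW, imageH, imageD, d_Params);
cudaEventRecord(stop);
cudaEventSynchronize(stop);			// block until the kernel has finished
cudaEventElapsedTime(&elapsedMs, start, stop);	// elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);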

Here is the code by the way:
call:

int blockSize = 1024;
int numSMs, devId = 0;	// query the SM count on device 0
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, devId);
d_correctFanBeamDistortion<<<512*numSMs, blockSize>>>(d_output, imageW, imageH, imageD, d_Params);
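
The texture itself is declared and bound beforehand, roughly like this (trimmed-down sketch; error checking and the cudaMemcpy3D upload of the volume data are left out):

texture<uchar, cudaTextureType3D, cudaReadModeNormalizedFloat> tex_3D;

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<uchar>();
cudaExtent volumeSize = make_cudaExtent(1024, 512, 512);
cudaArray *d_volumeArray;
cudaMalloc3DArray(&d_volumeArray, &channelDesc, volumeSize);

tex_3D.normalized = true;			// coordinates are addressed in [0,1]
tex_3D.filterMode = cudaFilterModeLinear;	// hardware trilinear interpolation
tex_3D.addressMode[0] = cudaAddressModeClamp;
tex_3D.addressMode[1] = cudaAddressModeClamp;
tex_3D.addressMode[2] = cudaAddressModeClamp;
cudaBindTextureToArray(tex_3D, d_volumeArray, channelDesc);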

kernel:

__global__ void		//testing global memory write
d_correctFanBeamDistortion(uchar *d_output, uint imageW, uint imageH, uint imageD, FanBeamCorrParams *d_Params)
{
	// use shared memory as cache
	__shared__ float ShCache[11];	// float cache array for parameters
	__shared__ uint ShCacheInt[4];	// int cache array for constants

	// initialize the cache in shared memory once per block
	if (threadIdx.x == 0) {
		// float cache (note the f suffixes to avoid implicit double-precision arithmetic)
		ShCache[0] = d_Params->d;
		ShCache[1] = 2.0f * ShCache[0] / (float)imageD;
		ShCache[2] = d_Params->ac / (float)imageW;
		ShCache[3] = d_Params->r0;
		ShCache[4] = 0.5f / d_Params->BetaMax;
		ShCache[5] = d_Params->b;
		ShCache[6] = 2.0f * ShCache[5] / (float)imageH;
		ShCache[7] = 0.5f / d_Params->AlphaMax;
		ShCache[8] = ShCache[3] * d_Params->thMax;
		ShCache[9] = 1.0f / d_Params->r_max;
		ShCache[10] = d_Params->zOffset;
		// int cache
		ShCacheInt[0] = imageW;
		ShCacheInt[1] = ShCacheInt[0] * imageH;	// voxels per slice
		ShCacheInt[2] = ShCacheInt[1] * imageD;	// voxels in the whole volume
		ShCacheInt[3] = blockDim.x * gridDim.x;	// grid stride
	}
	__syncthreads();	// make sure the cache is fully initialized before any thread uses it

	uint i, j, k;		// output voxel coordinates

	// grid-stride loop (see Mark Harris's NVIDIA CUDA Pro Tip from April 22, 2013);
	// this way consecutive threads write consecutive addresses, keeping the stores coalesced
	for (uint index = blockIdx.x * blockDim.x + threadIdx.x; 
        index < ShCacheInt[2]; 
        index += ShCacheInt[3])
	{
		// decompose the linear index into (i,j,k) using integer arithmetic on the shared-memory cache
		k = index / ShCacheInt[1];
		j = (index - ShCacheInt[1]*k) / ShCacheInt[0];
		i = index - ShCacheInt[0]*j - ShCacheInt[1]*k;

		// map the output voxel to fan-beam coordinates (all in single precision)
		float y = ShCache[0] - ShCache[1]*k;
		float z = ShCache[2]*i + ShCache[8];
		float beta = atan2f(y, z);
		float l = 0.5f - beta*ShCache[4];
		float x = ShCache[5] - ShCache[6]*j;
		float yz2 = y*y + z*z;
		float alpha = atan2f(x, sqrtf(yz2));
		float h = 0.5f - alpha*ShCache[7];
		float r = sqrtf(x*x + yz2);
		float v = (r - ShCache[3])*ShCache[9] + ShCache[10];

		// read the trilinearly interpolated value from the 3D texture
		float voxel = tex3D(tex_3D, v, h, l);

		// write to output (the texture returns values normalized to [0,1])
		d_output[index] = voxel * 255.0f;
	}
}
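
For comparison, the manual alternative I have in mind would look roughly like this (untested sketch; it assumes the volume stays in global memory as a flat uchar array d_volume of dimensions volW x volH x volD, and that the v, h, l coordinates have already been scaled from the normalized [0,1] range to voxel units and clamped to the volume bounds):

__device__ float trilinear(const uchar *d_volume, float x, float y, float z,
                           uint volW, uint volH, uint volD)
{
	// integer corner voxel and fractional offsets
	uint x0 = (uint)x, y0 = (uint)y, z0 = (uint)z;
	float fx = x - x0, fy = y - y0, fz = z - z0;
	uint x1 = min(x0 + 1, volW - 1);
	uint y1 = min(y0 + 1, volH - 1);
	uint z1 = min(z0 + 1, volD - 1);

	// fetch the 8 neighbouring voxels - 8 scattered (uncoalesced) global loads
	#define V(i,j,k) (float)d_volume[((k)*volH + (j))*volW + (i)]
	float c00 = V(x0,y0,z0)*(1.0f-fx) + V(x1,y0,z0)*fx;
	float c10 = V(x0,y1,z0)*(1.0f-fx) + V(x1,y1,z0)*fx;
	float c01 = V(x0,y0,z1)*(1.0f-fx) + V(x1,y0,z1)*fx;
	float c11 = V(x0,y1,z1)*(1.0f-fx) + V(x1,y1,z1)*fx;
	#undef V

	// interpolate along y, then z, and normalize to [0,1] like the texture read does
	float c0 = c00*(1.0f-fy) + c10*fy;
	float c1 = c01*(1.0f-fy) + c11*fy;
	return (c0*(1.0f-fz) + c1*fz) * (1.0f/255.0f);
}

The eight scattered loads per output voxel are exactly what the texture cache is designed to hide, so I am not convinced this would actually beat tex3D - that is essentially my question.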

So what are your thoughts…?
Cheers