Basic interpolation kernel: should I use shared memory here or not?

Eluri · October 22, 2009, 3:57pm

Hi all,

I hope one of you can help me think about a problem I have:

I have an image (1024*512) that needs to be interpolated row by row, which makes 1024 interpolations. I want to launch a grid of 1024 blocks with 512 threads each, so that every block does the interpolation for one row. The points where the interpolation takes place are stored in host memory, and they are the same for all rows.

I want to store these 512 data points in the shared memory of each block (good idea or not?).

Before actually trying to implement it, I want to think about the general programming strategy, so here’s some pseudocode representing what I want to do. The question is: am I doing something (maybe a lot of things) completely wrong, or should this approach work?

texture<float, 2, cudaReadModeElementType> tex;

__global__ void interpolate(float* points, float* outputMatrix)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float a[512];

    // Transfer the points where interpolation needs to be done from global to shared memory
    a[threadIdx.x] = points[threadIdx.x];

    float x = tex2D(tex, a[threadIdx.x], blockIdx.x);

    outputMatrix[idx] = x;
}

int main()
{
    // ...set up the texture here and bind the cudaArray to it, also load points and invoke the kernel.
}

Thanks in advance!

aplyer · October 22, 2009, 5:34pm

Your approach is good, but in your example shared memory is not needed:

texture<float, 2, cudaReadModeElementType> tex;

__global__ void interpolate(float* points, float* outputMatrix)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread only ever uses its own point, so a register is enough
    float a = points[threadIdx.x];

    float x = tex2D(tex, a, blockIdx.x);

    outputMatrix[idx] = x;
}

This works too (and is simpler).
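The `texture<...>` reference API used in this thread was deprecated and then removed in CUDA 12, so here is a rough sketch of the same one-block-per-row idea using texture objects instead. It is only an illustration under some assumptions: the names `interpolateRows` and `interpolateImage` are made up for this example, the image is assumed to be tightly packed row-major floats, the points are assumed to be pixel coordinates (hence the `+ 0.5f`), and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block per image row; each thread reads its own sample position straight
// from global memory, so no shared memory is involved.
__global__ void interpolateRows(cudaTextureObject_t tex, const float* points,
                                float* out, int numPoints)
{
    int row = blockIdx.x;
    // Block-stride loop: also correct when numPoints exceeds blockDim.x.
    for (int i = threadIdx.x; i < numPoints; i += blockDim.x) {
        // +0.5f addresses texel centers when linear filtering is enabled.
        out[row * numPoints + i] = tex2D<float>(tex, points[i] + 0.5f, row + 0.5f);
    }
}

// Hypothetical host helper: image is row-major, width x height, tightly packed.
std::vector<float> interpolateImage(const std::vector<float>& image, int width,
                                    int height, const std::vector<float>& points)
{
    int n = static_cast<int>(points.size());

    // Copy the image into a cudaArray so the texture unit can filter it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, image.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;   // hardware bilinear interpolation
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;                // coordinates are in pixels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* dPoints = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dPoints, n * sizeof(float));
    cudaMalloc(&dOut, static_cast<size_t>(n) * height * sizeof(float));
    cudaMemcpy(dPoints, points.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    interpolateRows<<<height, 256>>>(tex, dPoints, dOut, n);

    std::vector<float> out(static_cast<size_t>(n) * height);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dPoints);
    cudaFree(dOut);
    return out;
}
```

The block size of 256 is an arbitrary choice for the sketch; the stride loop means it does not have to match the number of points.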

For example, here is MATLAB's Iw = interp2(I, x, y) in CUDA:

#define CB_TILE_W  16
#define CB_TILE_H  16

// Integer division rounded up
#define iDivUp(a,b) (((int)(a) % (int)(b) != 0) ? (((int)(a) / (int)(b)) + 1) : ((int)(a) / (int)(b)))

texture<float, 2, cudaReadModeElementType> texI;

// One thread per output pixel; pitch is the row stride in elements
__global__ void interp2Ker(float* Iw, float* x, float* y, unsigned int col, unsigned int row, unsigned int pitch)
{
  int2 addr;
  addr.x = blockIdx.x * blockDim.x + threadIdx.x;
  addr.y = blockIdx.y * blockDim.y + threadIdx.y;

  float tx, ty;

  if (addr.x < col && addr.y < row) {
    int offset = addr.x + pitch * addr.y;
    tx = x[offset];
    ty = y[offset];
    // +0.5f: with linear filtering, texel centers sit at half-integer coordinates
    Iw[offset] = tex2D(texI, tx + 0.5f, ty + 0.5f);
  }
}

// pitch is the row stride in bytes, as expected by cudaBindTexture2D
void interp2(float *Iw, float *I, float *x, float *y, unsigned int col, unsigned int row, unsigned int pitch)
{
  dim3 grid(iDivUp(col, CB_TILE_W), iDivUp(row, CB_TILE_H));
  dim3 threads(CB_TILE_W, CB_TILE_H);

  cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

  texI.normalized = false;
  texI.filterMode = cudaFilterModeLinear;
  texI.addressMode[0] = cudaAddressModeClamp;
  texI.addressMode[1] = cudaAddressModeClamp;

  cudaBindTexture2D(0, &texI, I, &channelDesc, col, row, pitch);

  // The kernel indexes floats, so convert the pitch from bytes to elements
  interp2Ker<<<grid, threads>>>(Iw, x, y, col, row, pitch / sizeof(float));

  cudaDeviceSynchronize();
  cudaUnbindTexture(texI);
}
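To make the `+0.5f` in the kernel above less mysterious, here is a small reference sketch of what the texture fetch should compute for in-range coordinates when linear filtering and clamp addressing are enabled. The function names `clampIndex` and `bilinearRef` are made up for this illustration, and the result is only approximate: the texture unit quantizes the interpolation weights to a fixed-point format with 8 fractional bits, so small differences from this float version are expected.

```cuda
#include <cmath>

// Clamp an index to [0, n-1], mirroring cudaAddressModeClamp.
__host__ __device__ inline int clampIndex(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Reference for tex2D(texI, tx + 0.5f, ty + 0.5f) on unnormalized coordinates.
// I has `row` rows of `col` pixels; pitchElems is the row stride in elements.
__host__ __device__ inline float bilinearRef(const float* I, int col, int row,
                                             int pitchElems, float tx, float ty)
{
    // Integer part selects the top-left neighbor, fractional part is the weight.
    int x0 = (int)floorf(tx);
    int y0 = (int)floorf(ty);
    float ax = tx - x0;
    float ay = ty - y0;

    int xa = clampIndex(x0, col), xb = clampIndex(x0 + 1, col);
    int ya = clampIndex(y0, row), yb = clampIndex(y0 + 1, row);

    float top    = (1.0f - ax) * I[xa + pitchElems * ya] + ax * I[xb + pitchElems * ya];
    float bottom = (1.0f - ax) * I[xa + pitchElems * yb] + ax * I[xb + pitchElems * yb];
    return (1.0f - ay) * top + ay * bottom;
}
```

Without the half-pixel offset, an integer coordinate such as (1, 0) would land on the boundary between texels and return a blend of neighbors instead of pixel (1, 0) itself.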

Eluri · October 22, 2009, 5:47pm

Oh, OK. Thanks for the quick answer.
So shared memory is not needed here? I read the matrix multiplication example, where it is used, and I thought it would be faster here as well, but that’s not true?
Second thing: should I then use global memory or something else to store the points where every interpolation needs to be done, or also not? I’m a little worried that every thread block will have to load the points from global memory again. Won’t this take a huge amount of time that shared memory (or some other kind of memory) could save?

aplyer · October 24, 2009, 7:01pm

Yes, shared memory is one of the fastest memories on the GPU, but if you don’t share data between the threads of a block, shared memory is not needed.

In my case, I use one thread to interpolate one pixel:

it needs to read its interpolation coordinate from global memory
it needs to read its interpolated value from texture memory
it needs to store the result in global memory

and all of this is done once per pixel.

If the interpolation coordinates are parametric (homographic, affine, quadratic, …), you can use constant memory to share the parameters and reduce the interaction with global memory to storing the result.
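As an illustration of the constant-memory idea, here is a rough sketch for the affine case. Everything in it is an assumption made for the example: the `AffineParams` layout, the names `cAffine`, `affineWarpKer` and `affineWarp`, the tightly packed row-major image, and the use of texture objects rather than the texture references shown earlier in the thread. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical parameter block: sx = a00*x + a01*y + tx, sy = a10*x + a11*y + ty.
struct AffineParams { float a00, a01, tx, a10, a11, ty; };

// All threads read the same six values, which suits constant memory.
__constant__ AffineParams cAffine;

__global__ void affineWarpKer(cudaTextureObject_t tex, float* Iw, int col, int row)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < col && y < row) {
        // Coordinates are computed, not loaded: no x[] / y[] arrays in global memory.
        float sx = cAffine.a00 * x + cAffine.a01 * y + cAffine.tx;
        float sy = cAffine.a10 * x + cAffine.a11 * y + cAffine.ty;
        // The only global memory traffic left is this store.
        Iw[x + col * y] = tex2D<float>(tex, sx + 0.5f, sy + 0.5f);
    }
}

// Hypothetical host helper: I is row-major, col x row, tightly packed.
std::vector<float> affineWarp(const std::vector<float>& I, int col, int row,
                              const AffineParams& p)
{
    cudaMemcpyToSymbol(cAffine, &p, sizeof(AffineParams));

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &desc, col, row);
    cudaMemcpy2DToArray(arr, 0, 0, I.data(), col * sizeof(float),
                        col * sizeof(float), row, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* dIw = nullptr;
    size_t bytes = static_cast<size_t>(col) * row * sizeof(float);
    cudaMalloc(&dIw, bytes);

    dim3 threads(16, 16);
    dim3 grid((col + threads.x - 1) / threads.x, (row + threads.y - 1) / threads.y);
    affineWarpKer<<<grid, threads>>>(tex, dIw, col, row);

    std::vector<float> Iw(static_cast<size_t>(col) * row);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(Iw.data(), dIw, bytes, cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dIw);
    return Iw;
}
```

For instance, `AffineParams{1, 0, 1, 0, 1, 0}` would shift the image one pixel to the left, with the last column repeated because of the clamp addressing mode.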