I hope one of you can help me think through a problem:
I have an image (1024*512) that needs to be interpolated row by row, which makes 1024 interpolations. I want to launch a grid of
1024 blocks with 512 threads each, so that every block does the interpolation for one row. The points where the interpolation
takes place are stored in host memory. These points are the same for all rows.
I want to store these 512 points in the shared memory of each block (good idea or not?).
Before actually implementing it, I want to think about the general programming strategy, so here’s some pseudocode representing
what I want to do. The question is: am I doing something (maybe a lot of things) completely wrong, or should this approach work?
texture<float, 2, cudaReadModeElementType> tex;
__global__ void interpolate(const float* points, float* outputMatrix)
{
    // One block per row, one thread per interpolation point
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float a[512];
    a[threadIdx.x] = points[threadIdx.x]; // Copy the interpolation points from global to shared memory
    // With unnormalized coordinates texel k is centered at k + 0.5, so the row index is
    // offset by 0.5f to avoid blending two adjacent rows when linear filtering is on
    float x = tex2D(tex, a[threadIdx.x], blockIdx.x + 0.5f);
    outputMatrix[idx] = x;
}
int main()
{
    // ...set up the texture here and bind cudaArray to it, also load points and invoke kernel.
}
Oh OK, thanks for the quick answer.
So shared memory is not needed here? I read the matrix multiplication example, where they used it, and
I thought it would be faster here as well, but that’s not true?
Second thing: should I then use global memory or something else to store the points where every interpolation needs to be done,
or not? I’m a little worried about the fact that you will have a store to and a load from global memory for the points in every
thread block. Won’t this take a huge amount of time, a problem that shared or global memory could solve?
Yes, shared memory is one of the faster memories on the GPU, but if you don’t share data between the threads of a block, shared memory is not needed.
In my case, I use one thread to interpolate one pixel:
it needs to read its interpolation coordinate from global memory
it needs to read its interpolated value from texture memory
it needs to store the result in global memory
and all of this is done once per pixel.
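The three steps above could be sketched roughly as follows. This is only a sketch under some assumptions that are not in the thread: it uses a texture object (cudaTextureObject_t) instead of the texture reference from the question, it assumes the points are pixel positions where 0 means the center of the first pixel (hence the + 0.5f), and the names interpolateRows and runInterpolation, the test image and the sample positions are all made up for illustration. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One thread per output pixel: blockIdx.x is the row, threadIdx.x the sample index.
__global__ void interpolateRows(cudaTextureObject_t tex, const float* points,
                                float* out, int numPoints)
{
    int i = threadIdx.x;
    if (i >= numPoints) return;
    float px = points[i];                       // read the coordinate from global memory
    // With unnormalized coordinates, texel k is centered at k + 0.5 (assumed convention).
    float v = tex2D<float>(tex, px + 0.5f, blockIdx.x + 0.5f);  // filtered texture read
    out[blockIdx.x * numPoints + i] = v;        // store the result in global memory
}

// Builds a test image img(x, y) = x + y, interpolates it and returns the max abs error.
float runInterpolation()
{
    const int width = 512, height = 1024, numPoints = 512;
    std::vector<float> img(width * height), pts(numPoints), res(numPoints * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) img[y * width + x] = float(x + y);
    for (int i = 0; i < numPoints; ++i) pts[i] = 0.5f * i;  // hypothetical sample positions

    // Copy the image into a cudaArray, which is what the texture reads from.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc rd = {};
    rd.resType = cudaResourceTypeArray;
    rd.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;       // hardware linear interpolation
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &rd, &td, nullptr);

    float *dPts, *dOut;
    cudaMalloc(&dPts, pts.size() * sizeof(float));
    cudaMalloc(&dOut, res.size() * sizeof(float));
    cudaMemcpy(dPts, pts.data(), pts.size() * sizeof(float), cudaMemcpyHostToDevice);

    // 1024 blocks of 512 threads: one block per row, as in the question.
    interpolateRows<<<height, numPoints>>>(tex, dPts, dOut, numPoints);
    cudaMemcpy(res.data(), dOut, res.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // For a linear test image the interpolated value at (px, y) should be px + y.
    float maxErr = 0.0f;
    for (int y = 0; y < height; ++y)
        for (int i = 0; i < numPoints; ++i)
            maxErr = fmaxf(maxErr, fabsf(res[y * numPoints + i] - (pts[i] + y)));

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dPts);
    cudaFree(dOut);
    return maxErr;
}

int main()
{
    printf("max abs error: %g\n", runInterpolation());
    return 0;
}
```

Note that there is no shared memory here: each thread reads exactly one coordinate, and neighbouring threads read neighbouring addresses, so the global load is coalesced and nothing is reused between threads.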
If the interpolation coordinate is parametric (homographic, affine, quadratic, …), you can use constant memory to share the parameters and reduce the interaction with global memory to storing the result.
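As a rough illustration of the constant memory idea, here is a sketch for the affine case. Everything specific in it is an assumption for the example and not from the thread: the AffineParams struct, the mapping px = scale * i + offset, the parameter values and the test image. It also uses a texture object rather than a texture reference, and skips error checking.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical affine mapping: sample position = scale * i + offset.
struct AffineParams { float scale; float offset; };
__constant__ AffineParams cParams;  // every thread reads the same values (broadcast)

__global__ void interpolateAffine(cudaTextureObject_t tex, float* out, int numPoints)
{
    int i = threadIdx.x;
    if (i >= numPoints) return;
    float px = cParams.scale * i + cParams.offset;  // coordinate is computed, not loaded
    // + 0.5f because texel k is centered at k + 0.5 with unnormalized coordinates.
    out[blockIdx.x * numPoints + i] = tex2D<float>(tex, px + 0.5f, blockIdx.x + 0.5f);
}

// Interpolates a test image img(x, y) = x + y and returns the max abs error.
float runAffine()
{
    const int width = 512, height = 1024, numPoints = 512;
    std::vector<float> img(width * height), res(numPoints * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) img[y * width + x] = float(x + y);

    AffineParams h = {0.5f, 1.0f};                  // made-up parameter values
    cudaMemcpyToSymbol(cParams, &h, sizeof(h));     // upload parameters to constant memory

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc rd = {};
    rd.resType = cudaResourceTypeArray;
    rd.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &rd, &td, nullptr);

    float* dOut;
    cudaMalloc(&dOut, res.size() * sizeof(float));
    interpolateAffine<<<height, numPoints>>>(tex, dOut, numPoints);
    cudaMemcpy(res.data(), dOut, res.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // For a linear test image the interpolated value at (px, y) should be px + y.
    float maxErr = 0.0f;
    for (int y = 0; y < height; ++y)
        for (int i = 0; i < numPoints; ++i)
            maxErr = fmaxf(maxErr, fabsf(res[y * numPoints + i] - (h.scale * i + h.offset + y)));

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return maxErr;
}

int main()
{
    printf("max abs error: %g\n", runAffine());
    return 0;
}
```

Constant memory fits this case because all threads of a warp read the same address. A table of 512 different points indexed by threadIdx.x would not get that benefit, since constant memory reads to different addresses within a warp are serialized.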