[Help] Kernel Optimization Image subsampling

I’d like to ask for some help optimizing my very first CUDA kernel, which I believe could be much faster…

Some preliminaries:

I am using single-channel (greyscale) images that reside in one-dimensional float arrays in host memory.

For an example 6x3 image, the pixels are coded into an array of size 18 as follows:

Original Image

0  1  2  3  4  5

6  7  8  9  10 11

12 13 14 15 16 17

Coded Array

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

So in order to retrieve a pixel p(x, y) from the array one does:

p(x, y) = x + stride*y

where stride = image width

I am loading such an array onto the GPU (allocated with cudaMalloc). My goal is to perform a simple form of image subsampling: halve the image size (width/2 and height/2), keeping only the pixels on even rows and columns.
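For reference, the operation I want is what this host-side sketch computes (a minimal reference implementation; the function name is mine):

```cpp
#include <vector>

// Keep only the pixels on even rows and even columns of a row-major,
// single-channel image: p_out(x, y) = p_in(2x, 2y).
std::vector<float> subsampleCPU(const std::vector<float>& in, int inW, int inH)
{
    const int outW = inW / 2, outH = inH / 2;
    std::vector<float> out(outW * outH);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            out[x + outW * y] = in[(2 * x) + inW * (2 * y)];  // p(x,y) = x + stride*y
    return out;
}
```

For the 6x3 example above this yields the 3x1 image {0, 2, 4}.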

As I imagine it, the kernel for such an operation should have its speed bounded above by the GPU's linear memory <-> linear memory throughput.

My kernel is as follows:

__global__ void subsampleGPU(float* d_in, float* d_out, int inDataW, int inDataH,
                             int outDataW, int outDataH)
{
   const int start    = IMUL(blockIdx.x, THREADS_PER_BLOCK);
   const int writePos = start + threadIdx.x;
   const int outSize  = IMUL(outDataW, outDataH);

   if (writePos < outSize)
   {
      const int ratio   = inDataW / outDataW;
      const int tmp1    = writePos / outDataW;            // output row
      const int readPos = writePos*ratio + inDataW*tmp1;  // source pixel

      d_out[writePos] = d_in[readPos];
   }
}
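Incidentally, the index arithmetic in the kernel checks out: with writePos = x + outDataW*y and ratio = 2, readPos = writePos*ratio + inDataW*tmp1 expands to 2x + 2*outDataW*y + inDataW*y = 2x + inDataW*(2y), i.e. exactly the even-row/even-column source pixel. A quick host-side check (sizes taken from the post; the helper name is mine):

```cpp
// Verify that the kernel's read index hits the even-row/even-column
// source pixel for every output position of a given image size.
bool indexMappingIsCorrect(int inDataW, int outDataW, int outDataH)
{
    for (int writePos = 0; writePos < outDataW * outDataH; ++writePos) {
        const int ratio   = inDataW / outDataW;            // 2 when halving
        const int tmp1    = writePos / outDataW;           // output row y
        const int readPos = writePos * ratio + inDataW * tmp1;

        const int x = writePos % outDataW;                 // output column
        if (readPos != 2 * x + inDataW * (2 * tmp1))       // expected source pixel
            return false;
    }
    return true;
}
```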



And I am calling it as follows:

int blocksize = iDivUp(size, THREADS_PER_BLOCK);

dim3 subsampleGridDim(blocksize);
dim3 subsampleBlockDim(THREADS_PER_BLOCK);

CUDA_SAFE_CALL( cudaThreadSynchronize() );

subsampleGPU<<< subsampleGridDim, subsampleBlockDim >>>(memPoint, temp, prw, prh, w, h);
CUT_CHECK_ERROR("subsampleGPU() failed...\n");

where size is the size of the smaller image (width*height), THREADS_PER_BLOCK is the maximum number of threads per block (384 for the 8800 GTS I have), and w and h are prw/2 and prh/2 respectively. IMUL and iDivUp are as in the examples provided in the SDK.

The idea behind this is that since the kernel is memory <-> memory speed bound, I can have each thread read and write one pixel at a time for maximum throughput (along with some idle threads at the end of the array that do nothing).

Now for a 1280x960 image, two successive subsampling passes (e.g. from 1280x960 to 640x480, then from 640x480 to 320x240) take approximately 2 ms.

Do you think this speed is good? Do you have any suggestions for optimizations I could make to the kernel to speed it up?

On a side note, I have tried altering the kernel to pass it outSize, ratio, inDataW and outDataW directly, in order to save some calculation time in the kernel; however, when I did that I found the kernel was approximately 1 ms slower than the original one… I am baffled by this and cannot explain why it happens…

Any suggestions and points will be greatly appreciated,

thank you for your time,


Well, why not use textures, since you’re just doing image processing? Textures are optimized for this kind of access pattern. For image subsampling, the OpenGL fixed pipeline would do just fine.
If there’s a reason not to use array textures, you could replace the division by outDataW with a float multiply-add and a float-to-int cast; I suspect your kernel is spending too much time on integer division. Using a texture bound to linear memory for the reads would also help.
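The division replacement might look something like this (my interpretation of the suggestion, sketched on the host; the float estimate can be off by one near row boundaries, hence the correction step):

```cpp
// Estimate "writePos / outDataW" with a float multiply plus a
// float-to-int cast; invOutDataW should be precomputed once as
// 1.0f / outDataW. The estimate can miss by one at row boundaries,
// so a cheap integer correction nudges it back.
int fastRow(int writePos, int outDataW, float invOutDataW)
{
    int row = (int)(writePos * invOutDataW);            // cheap estimate
    if (row * outDataW > writePos)             --row;   // overshot
    else if ((row + 1) * outDataW <= writePos) ++row;   // undershot
    return row;
}
```

On the device the multiply and the cast map to cheap float instructions, whereas a 32-bit integer division expands to many instructions on G80-class hardware.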

I calculate that out to be ~2 GB/s of data transferred, which is slow. I didn’t see which device you are running on, but an 8800 GTX is capable of 70 GB/s transfer rates. The reason for your slowdown is that your memory reads are not fully coalesced (see the programming guide for details).
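The bandwidth figure can be reproduced from the numbers in the post (my arithmetic, landing in the same ballpark as the ~2 GB/s quoted; each output pixel costs one 4-byte read plus one 4-byte write):

```cpp
// Effective bandwidth of the two subsampling passes reported above:
// pass 1 produces 640x480 pixels, pass 2 produces 320x240, and every
// output pixel is one float read plus one float write.
double effectiveGBps()
{
    const double bytes = 2.0 * sizeof(float) * (640.0 * 480.0 + 320.0 * 240.0);
    return bytes / 0.002 / 1e9;   // total bytes over the reported 2 ms
}
```

That comes out to roughly 1.5 GB/s, versus the tens of GB/s the hardware can sustain, so there is plenty of headroom.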

Options to increase performance:

1. Use textures for the reads.

2. Use shared memory in a creative way to get fully coalesced reads.

Of the two options, I would guess that 1) is the better choice. It’s simpler to code and will give equal or better performance than option 2.
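For what option 1 might look like with the texture reference API of that era — a sketch under the assumption that the input stays in linear memory; apart from tex1Dfetch and cudaBindTexture, the names are mine, not from the original code:

```cuda
// Texture reference bound to the linear input buffer; tex1Dfetch reads
// go through the texture cache, so the strided read pattern no longer
// has to satisfy the coalescing rules.
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void subsampleTexGPU(float* d_out, int inDataW, int outDataW, int outSize)
{
    const int writePos = IMUL(blockIdx.x, THREADS_PER_BLOCK) + threadIdx.x;
    if (writePos < outSize)
    {
        const int ratio = inDataW / outDataW;
        const int tmp1  = writePos / outDataW;
        d_out[writePos] = tex1Dfetch(texIn, writePos*ratio + IMUL(inDataW, tmp1));
    }
}

// Host side, before the launch:
//   CUDA_SAFE_CALL( cudaBindTexture(0, texIn, d_in, inDataW * inDataH * sizeof(float)) );
```

The writes are already contiguous and stay fully coalesced; only the reads change.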