[Help] Kernel Optimization Image subsampling

exe-gr · July 30, 2007, 2:47am

I’d like to ask you for some help regarding optimizing my very first kernel written in CUDA which I believe could be much faster…

Some preliminaries:

I am using single channel (greyscale) images that reside in one dimensional float arrays on the host memory.

For an example image 6x3, pixels are coded into an array of size 18 as follows:

Original Image

0  1  2  3  4  5

6  7  8  9  10 11

12 13 14 15 16 17

Coded Array

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

So in order to retrieve a pixel p(x, y) from the array one does:

p(x, y) = x + stride*y

where stride = image width

I am loading such an array on the GPU using cudaMalloc. My goal is to perform a simple form of image subsampling, where i want to half the image size (e.g. width/2 and height/2) and keep only the pixels on even rows and columns.

As I can imagine the kernel for such an operation should be upper speed bounded by the throughput of the GPU on the linear memory <–> linear memory speed.

My kernel is as follows:

__global__ void subsampleGPU(float* d_in, float* d_out, int inDataW, int inDataH,

                             int outDataW, int outDataH)

{

   

   const int start = IMUL(blockIdx.x, THREADS_PER_BLOCK);

   const int writePos = start + threadIdx.x;

   const int outSize = IMUL(outDataW, outDataH);

   

   if(writePos < outSize)

   {

     const int ratio = inDataW/outDataW;

      const int tmp1 = writePos/outDataW;

      const int readPos = writePos*ratio + inDataW*tmp1;

      

      d_out[writePos] = d_in[readPos];

   }

}

And I am calling it as follows:

int blocksize = iDivUp(size, THREADS_PER_BLOCK);

dim3 subsampleGridDim(blocksize);

dim3 subsampleBlockDim(THREADS_PER_BLOCK);

CUDA_SAFE_CALL( cudaThreadSynchronize() );

subsampleGPU<<< subsampleGridDim, subsampleBlockDim>>>(

                     memPoint, temp, prw, prh, w, h);

CUT_CHECK_ERROR("subsampleGPU() failed...\n");

where size is the size of the smaller image (width*height), THREADS_PER_BLOCK are the maximum number of threads per block (384 for the 8800 GTS i have), and w and h are correspondingly prw/2 and prh/2. IMUL and iDivUp are as the examples provided in the SDK.

The idea behind this is that since the kernel is memory <–> memory speed bounded, I can have each thread R/W one pixel at a time for maximum throughput (along with some rogue threads at the end of the array that do nothing).

Now for an image of 1280x960, for two successive subsampling passes (eg from 1280x960 to 640x480 and again from 640x480 to 320x240), I am getting a speed of approximately 2ms.

Would you think this speed is good? Would you have any suggestions on optimizations I could perform to the kernel in order to make it faster?

On a side note, I have tried altering the kernel to provide it with outSize, ratio, inDataW and outDataW directly, in order to save some calculation time on the kernel, however when i did that i found that the kernel was approximately 1ms slower than the original one… I am baffled on this one and cannot explain why it happens…

Any suggestions and points will be greatly appreciated,

thank you for your time,

Konstantinos

asadafag · July 30, 2007, 3:32am

Well, why not use textures when you’re just doing image processing? Textures are optimized for this kind of access patterns. For image subsampling, OpenGL fixed pipeline would do just fine.
If there’s a reason not using array textures, you could replace the /outDataW with a float mad and a float-int cast. I suspect your kernel is spending too much time doing integer division. Using a linear memory texture for reading would also help.

MisterAnderson42 · July 30, 2007, 1:43pm

I calculate that out to be ~2GB/s of data transferred, which is slow. I didn’t see which device you are running on, but an 8800 GTX is capable of 70GB/s transfer rates. The reason for your slowdown is that your memory reads are not fully coalesced (see the programming guide for details).

Options to increase performance: 1) Use textures for the read

Use shared memory in a creative way to get fully coalesced reads.

Of the two options, I would guess that 1) is the better choice. It’s simpler to code and will give equal or better performance to option 2.

Topic		Replies	Views
Processing pictures, low load efficiency..? CUDA Programming and Performance	3	528	April 10, 2018
100x Resampling->Sobel->Histogram over Angles: How do get Cuda to do it fast? CUDA Programming and Performance	6	10392	December 20, 2008
The GPU utilization is low CUDA Programming and Performance	3	2016	November 14, 2014
Texture memory fetch extremely slow CUDA Programming and Performance	13	3090	December 21, 2017
Kernel is slow - don't know why CUDA Programming and Performance	4	3175	May 11, 2012
Shared memory problem CUDA Programming and Performance	3	2258	February 8, 2008
Odd performance problem/question CUDA Programming and Performance	3	829	June 3, 2009
CUDA Image Processing Demo & Soure code&Tutorials CUDA Programming and Performance	7	25039	April 2, 2007
GPU vs. CPU GPU is always much slower CUDA Programming and Performance	1	10238	June 5, 2009
CUDA Runs 1/10th the speed of openGL CUDA Programming and Performance	9	7188	September 3, 2008

[Help] Kernel Optimization Image subsampling

Related topics