Calculation of average values of an image sequence

Hello,

my name is Marcus Große and I am working in the field of 3D measurement using structured light. Our group
is evaluating the use of GPUs for image processing tasks. To get to know OpenCL, I have written a kernel that
averages twenty-one gray-value images and writes the results into global device memory for later use (see the kernel code below).

The runtime of the kernel (measured using clGetEventProfilingInfo) is about 113 ms on the GPU.
To get it that fast I am using loop unrolling as described here: http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/Pages/ImageConvolutionUsingOpenCL.aspx (about 10 ms faster than the non-unrolled version).
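
For reference, the kernel time is taken roughly like this on the host side (a minimal sketch: error handling is omitted, the variable names are illustrative, and the command queue must have been created with CL_QUEUE_PROFILING_ENABLE):

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);
cl_ulong t_start = 0, t_end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &t_end, NULL);
double elapsed_ms = (double)(t_end - t_start) * 1.0e-6; //profiling counters are in nanoseconds
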
A CPU implementation takes about 140 ms (no loop unrolling, only one core used). So
for this problem there seems to be no big performance gain from using the GPU. I have a few questions related to this result.

  1. The problem may be that for every memory access only one addition is performed, so that memory bandwidth hinders faster execution. Is this plausible?
  2. As the image dimension (global buffer dimension) is a multiple of 16, memory accesses should be coalesced in my implementation. Is there a way to check this, or can someone point me
    to problems in my kernel code that suppress coalesced memory access?
  3. Are there other options to decrease the execution time?
  4. We use an NVIDIA GeForce 9500 GT. When switching to a more recent model (perhaps the upcoming Fermi cards), what
    speed-up might be achieved for this problem (a factor >10?)?
  5. I addressed the same problem using image_2d and image_3d instead of the one-dimensional buffers. The runtime is about the same as with the two buffers.

Questions not related to the results:

  1. I am also eager to see more examples written in OpenCL that handle image processing. Perhaps someone can
    point me to a link or a book?
  2. If I do not assign the local variables avgl,…,avgl3 to the global buffer avgL, the GPU seems to skip the entire
    calculation of these values, which makes it difficult to separate memory read/write time from calculation time.
    Is there a workaround?

thanks in advance,
Marcus Große

I am using the CUDA Toolkit 3.0 and an NVIDIA GeForce 9500 GT.

Kernel-Code:
//l contains the image data of one camera, r the image data of a second camera;
//average values are computed for both cameras (stored into avgL and avgR)
__kernel void AverageKernel(__global float* avgL, __global float* avgR,
                            __global float* l, __global float* r)
{
    //get position of the work-item in the image
    unsigned int nx = get_global_id(0);
    unsigned int ny = get_global_id(1);
    float inv_pics = 1.0f / 21.0f;
    //accumulators used for loop unrolling
    float avgl  = 0.0f;
    float avgr  = 0.0f;
    float avgl2 = 0.0f;
    float avgr2 = 0.0f;
    float avgl3 = 0.0f;
    float avgr3 = 0.0f;
    //average calculation over 21 images of size 640x480
    for (int c = 0; c < 7; c++) //loop unrolled by a factor of 3
    {
        avgl  += l[nx + 640*ny + (c*3    )*640*480];
        avgr  += r[nx + 640*ny + (c*3    )*640*480];
        avgl2 += l[nx + 640*ny + (c*3 + 1)*640*480];
        avgr2 += r[nx + 640*ny + (c*3 + 1)*640*480];
        avgl3 += l[nx + 640*ny + (c*3 + 2)*640*480];
        avgr3 += r[nx + 640*ny + (c*3 + 2)*640*480];
    }
    //write the results to global device memory
    avgL[nx + 640*ny] = (avgl + avgl2 + avgl3) * inv_pics;
    avgR[nx + 640*ny] = (avgr + avgr2 + avgr3) * inv_pics;
}

P.S.: I posted the same topic at http://www.khronos.org/message_boards/viewforum.php?f=37; I hope to get more feedback by posting it here as well.

I think this is the major problem. The 9500 GT does not have a lot of memory bandwidth (25.6 GB/s or 16 GB/s, depending on the memory type). That is really not much more than the memory bandwidth of a CPU, especially if you are using a Core i7.

Your memory access looks to be coalesced. If you want to verify it, you can use the OpenCL Visual Profiler to check whether there are any uncoalesced accesses.

I think the question here is: is that going to help your total execution time?

Since your computation is very simple, the major advantage of a GPU here is memory bandwidth. However, as the data still has to be copied from main memory to display memory, any memory bandwidth advantage is pretty much useless if there is no data reuse. Even if you use a faster display card, such as a GeForce GTX 285, which will definitely show a better kernel execution time because of its higher bandwidth, most of the real time will be spent copying the data from main memory to display memory and back. For a sense of scale: 42 images of 640x480 floats come to roughly 51.6 MB, which at about 4 GB/s over PCI Express already costs on the order of 13 ms before any computation happens.

If you have some sort of data reuse, for example if you need to average images in a FIFO fashion (i.e. averaging images #0 ~ #20, then images #1 ~ #21, then #2 ~ #22, etc.), then you can take advantage of the superior memory bandwidth of a display card; see the sketch below.
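
To illustrate the idea (a hypothetical sketch, not your code; the 640x480 size and 21-frame window are taken from your kernel above): with a FIFO you only have to touch two frames per update, because the running average can be corrected in place.

//Hypothetical sketch: update a running 21-frame average in FIFO fashion.
//'oldest' is the frame leaving the window, 'newest' the frame entering it;
//'avg' already holds the average of the previous window.
__kernel void UpdateAverage(__global float* avg,
                            __global const float* oldest,
                            __global const float* newest)
{
    unsigned int nx  = get_global_id(0);
    unsigned int ny  = get_global_id(1);
    unsigned int idx = nx + 640 * ny;
    //swap the oldest frame's contribution for the newest one
    avg[idx] += (newest[idx] - oldest[idx]) * (1.0f / 21.0f);
}

Since rounding errors accumulate over many such updates, it would be sensible to recompute the full 21-frame sum from scratch every now and then.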

image_2d and image_3d are textures, so they can take advantage of the texture cache of the GPU. However, this is only useful when you need to read the same pixel multiple times. In your image-averaging kernel every pixel is read only once, so they are not going to help.
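
For completeness, an image-based variant of your kernel would look something like the sketch below (your actual image_2d/image_3d code is not shown in this thread, so all names here are illustrative). Note that it still reads every pixel exactly once, which is exactly why the texture cache cannot help:

//Illustrative sketch of an image3d_t variant of the averaging kernel,
//with the 21 images stacked as slices of a 3D image.
const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void AverageFromImages(__read_only image3d_t imgs, __global float* avg)
{
    int nx = get_global_id(0);
    int ny = get_global_id(1);
    float sum = 0.0f;
    for (int c = 0; c < 21; c++) //one slice per image
        sum += read_imagef(imgs, smp, (int4)(nx, ny, c, 0)).x;
    avg[nx + 640 * ny] = sum * (1.0f / 21.0f);
}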

I think NVIDIA’s GPU Computing SDK has a few OpenCL samples related to image processing. They are basically ported from the CUDA samples, rewritten in OpenCL syntax.

I’m not sure about this. It is basically due to compiler optimization: the compiler sees that the variables are “not used” and optimizes them out. You can try disabling optimization (by passing the build option “-cl-opt-disable”), but that’s not useful for your purpose. Maybe you can try using the volatile qualifier on these variables.
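
On the host side, the build-option route would look like this (a sketch, error handling omitted):

/* disable optimization so that "unused" results are not eliminated */
clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);

For the volatile route, declaring the accumulators as e.g. "volatile float avgl = 0.0f;" inside the kernel should keep the compiler from removing the computation, though it may also inhibit other optimizations.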

Thanks for your detailed response.

The calculation of the average values is the first step in a series of calculations based on the 42 images. The calculated average values will be reused in later stages of the calculation chain. As we would like to achieve real-time data processing, it is important that all calculations are finished within 40 ms. If I consider that the GTX has about 140 GB/s of bandwidth, could we be talking about a roughly tenfold boost for this first step (oclBandwidthTest reports 13 GB/s for our 9500 GT)?

As my background is not in computer science, my understanding of the caching process was like this:

As soon as I try to access the information of pixel (0,0), a bigger chunk of memory is read into the cache, so the cache then also contains the information of pixels (0,1), (0,2), and so on. If I access pixel (0,1) in the next step, it is read from the cache, which saves me time. So what you are saying is that only pixel (0,0) is read, and it will be cached for later use?

I will look into this.

The GPU also always reads bigger chunks of data, but for a different reason. Reads and writes are always issued per half-warp, which means 16 read/write accesses are triggered together. On CUDA hardware < 1.2 this results in one memory transaction if the access pattern is coalesced, or in up to 16 transactions if it is uncoalesced. CUDA hardware >= 1.2 has far better memory management and reduces the number of unnecessary memory accesses. I would strongly recommend using CUDA hardware >= 1.2; see the small example below.
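
To make “coalesced” concrete, here is an illustrative pair of kernels (hypothetical code, just to show the two access patterns):

//Adjacent work-items read adjacent addresses:
//one memory transaction per half-warp.
__kernel void coalesced_copy(__global const float* in, __global float* out)
{
    unsigned int gid = get_global_id(0);
    out[gid] = in[gid];
}

//Adjacent work-items read addresses 'stride' floats apart:
//up to 16 transactions per half-warp on hardware < 1.2.
__kernel void strided_copy(__global const float* in, __global float* out,
                           int stride)
{
    unsigned int gid = get_global_id(0);
    out[gid] = in[gid * stride];
}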

Back to caching: every memory read on the GPU is either used by a thread or it is wasted. If the same thread reads a memory location several times, several DRAM accesses are triggered. With a cache, only the first access would go to DRAM and the rest would be served from the cache.

A cache will not give you any benefit if all your data is “consumed” (used) by your threads exactly once and you never need it a second time. I hope I could make this clear.

For other access patterns (multiple reads of nearby addresses) caches are a big advantage, but on the current hardware generation you have to use shared memory or read-only texture accesses to avoid multiple unnecessary memory accesses. Unfortunately, managing shared memory is an error-prone task, and texture accesses are quite restricted (read-only).

Your idea of caching is true for the CPU. The CPU always fetches at least 64 bytes of data from RAM into the cache, and from there the accesses are very cheap. There are also other techniques on the CPU, such as read-ahead, to improve the cache hit rate. The write-back strategy of the CPU cache also guarantees that recently read and recently written data resides in the cache.

A GeForce GTX 285 can achieve more than 110 GB/s running oclBandwidthTest, so it’s more like an eight-fold boost if you are memory-bandwidth limited.

However, since your 42 images still need to be copied from main memory to the GPU, you have to keep that in mind. With PCI Express 2.0, memory transfer between CPU (paged memory) and GPU is generally around 4 GB/s (note that PCI Express is a bidirectional bus). It can be a bit faster using pinned memory (> 5.5 GB/s is possible).
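
If you want to try pinned memory, the usual pattern in NVIDIA’s OpenCL (the one oclBandwidthTest uses) is to allocate a buffer with CL_MEM_ALLOC_HOST_PTR and map it. A sketch (error handling omitted, names illustrative):

cl_int err;
cl_mem pinned = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               size, NULL, &err);
float* host_ptr = (float*)clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                             0, size, 0, NULL, NULL, &err);
/* fill host_ptr with image data, then transfer to the device buffer */
err = clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0, size, host_ptr,
                           0, NULL, NULL);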