I am currently considering working with CUDA to improve some real-time image processing in software I am developing at work.
Currently I use OpenGL shaders, but I think CUDA is worth considering.
I have read the CUDA Programming Guide and some examples, but I have not yet decided whether I can take the time to learn CUDA while we have so many things to do with our software.
What I am after is some general guidance on good practices, to see whether it would be easy enough to bring CUDA into my code.
Let’s suppose I want to perform some basic operations on an image: one dilation and one erosion (just a simple example).
On Mac OS X, I would use CIFilters with custom kernels.
Otherwise, I can make a texture from the image and use pixel shaders with GLSL.
Considering CUDA, I am wondering what I should do:
- Should I make a CUDA texture from the image, or should I use linear memory or CUDA arrays?
- To process multiple kernels sequentially, streams seem good for chaining operations. But are they really necessary here? As long as my image is mapped to device memory (either as a texture or something else), can’t I call several kernels on it and end with a read from device to host?
- What is the preferred data format for CUDA internals? I suppose float?
- If I want to make a program that uses CUDA when it is available on the machine, and a software mode otherwise, should I use the CUDA emulation mode for the software fallback (code unification), or a custom implementation (would it be more efficient for simple cases)?
- When using CUDA, is it usual to query the device properties to decide how many thread blocks or grid cells should be used?
Sorry if these seem like dumb questions, but I feel I need such clarifications before deciding whether or not to dedicate time to CUDA experiments…
I usually never use textures, but maybe for you a texture + CUDA array would be easier. E.g., when doing a dilation I would copy the data into shared mem and do multiple reads from there. Simply doing it straight from global mem would be slow and would not exploit the data reuse. Using a texture and its cache would be faster than reading from global mem, but slower than staging in shared memory. Then again, it would be easier than staging in shared memory. Textures are oddballs; there are few things for which they’re really the best, at least if you don’t need to do filtering.
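To illustrate the shared-memory staging idea, here is a rough sketch of a 3x3 dilation on an 8-bit image, assuming 16x16 thread blocks. Each block cooperatively loads its tile plus a 1-pixel apron into shared memory, then every thread reads its 3x3 window from there, reusing each loaded pixel up to nine times. All names (`TILE`, `d_in`, `d_out`) are made up for the example; this is a sketch, not a tuned implementation.

```cuda
#define TILE 16  // assumes the kernel is launched with dim3(TILE, TILE) blocks

__global__ void dilate3x3(const unsigned char *d_in, unsigned char *d_out,
                          int width, int height)
{
    // Tile plus a 1-pixel apron on every side.
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    // Cooperative load: the whole block fills the (TILE+2)^2 region,
    // clamping coordinates so reads outside the image replicate the edge.
    for (int ly = threadIdx.y; ly < TILE + 2; ly += blockDim.y)
        for (int lx = threadIdx.x; lx < TILE + 2; lx += blockDim.x) {
            int gx = (int)(blockIdx.x * TILE) + lx - 1;
            int gy = (int)(blockIdx.y * TILE) + ly - 1;
            gx = min(max(gx, 0), width  - 1);
            gy = min(max(gy, 0), height - 1);
            tile[ly][lx] = d_in[gy * width + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        // Dilation: maximum over the 3x3 neighbourhood, read from shared mem.
        int m = 0;
        for (int dy = 0; dy < 3; ++dy)
            for (int dx = 0; dx < 3; ++dx)
                m = max(m, (int)tile[threadIdx.y + dy][threadIdx.x + dx]);
        d_out[y * width + x] = (unsigned char)m;
    }
}
```

An erosion kernel would be identical except for taking the minimum instead of the maximum.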
If you don’t need to copy memory into the GPU while the GPU is busy running a kernel (or something similarly fancy), you don’t need streams. As you say, call several kernels and end with a memcpy.
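In code, the whole pipeline can stay on the device between kernels. Kernels launched into the default stream execute in order, so chaining them needs nothing special. A sketch of the host side, assuming hypothetical `dilate3x3`/`erode3x3` kernels operating on an 8-bit image:

```cuda
unsigned char *d_a, *d_b;
size_t bytes = width * height;
cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
cudaMemcpy(d_a, h_img, bytes, cudaMemcpyHostToDevice);   // one upload

dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
dilate3x3<<<grid, block>>>(d_a, d_b, width, height);     // a -> b
erode3x3 <<<grid, block>>>(d_b, d_a, width, height);     // b -> a
// ...as many chained kernels as you like, ping-ponging buffers...

cudaMemcpy(h_img, d_a, bytes, cudaMemcpyDeviceToHost);   // one readback
cudaFree(d_a);
cudaFree(d_b);
```

The final `cudaMemcpy` implicitly waits for the preceding kernels to finish, so no explicit synchronization is needed either.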
Yes, float is the dominant type, but you have a lot of flexibility. int or uchar4 may be better for you. I won’t lie, there are some nuances regarding which data types and access patterns cause performance trouble, but I wouldn’t worry too much now. If you’re curious, study coalescing and shared memory bank conflicts.
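As a small illustration of why uchar4 can pay off for 8-bit images: having each thread read a uchar4 turns four byte-sized accesses into one 32-bit load, which a warp can coalesce into wide transactions. A hypothetical per-pixel operation (names are made up):

```cuda
// Each thread handles four 8-bit pixels packed in a uchar4.
// n4 = number of uchar4 elements (image size in bytes / 4).
// cudaMalloc'd pointers are suitably aligned for uchar4 access.
__global__ void brighten(const uchar4 *d_in4, uchar4 *d_out4, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 p = d_in4[i];              // one coalesced 32-bit load
        p.x = min(p.x + 10, 255);
        p.y = min(p.y + 10, 255);
        p.z = min(p.z + 10, 255);
        p.w = min(p.w + 10, 255);
        d_out4[i] = p;                    // one coalesced 32-bit store
    }
}
```

The same trick applies to reading: a kernel that would otherwise load individual bytes from global memory usually goes noticeably faster when the loads are widened this way.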
Custom implementation, or the “nvcc -multicore” option that we were promised a while back. Don’t use emulation. Also read up on how to put CUDA into a dynamic library and load it manually on start-up. Otherwise you’ll have a problem when CUDA isn’t installed on a system.
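To make the dynamic-library idea concrete, here is a rough sketch of the host-side dispatch on Linux/Mac OS X (on Windows you would use LoadLibrary/GetProcAddress instead). The library name `libmycuda.so` and the functions `process_image`/`process_cpu` are hypothetical; the point is only that the main binary never links the CUDA runtime directly, so it still starts on machines without CUDA:

```cuda
#include <dlfcn.h>

typedef int (*process_fn)(unsigned char *img, int w, int h);

// CPU fallback, implemented elsewhere in the main binary.
int process_cpu(unsigned char *img, int w, int h);

int process(unsigned char *img, int w, int h)
{
    // Try to bind the CUDA implementation at run time.
    void *lib = dlopen("libmycuda.so", RTLD_NOW);
    if (lib) {
        process_fn f = (process_fn)dlsym(lib, "process_image");
        if (f)
            return f(img, w, h);          // CUDA path
        dlclose(lib);                     // symbol missing: fall through
    }
    return process_cpu(img, w, h);        // software fallback
}
```

You would typically do the `dlopen` once at start-up (and also check `cudaGetDeviceCount` inside the library, since the CUDA runtime can be installed on a machine with no usable device).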
Umm. Yes and no. Mostly, no. Don’t make your blocks too big (resource-wise) and don’t make the grid too small, and you’ll be fine. Then again we don’t know what CUDA devices will exist in the future. But right now, I don’t think most people query anything.
Thank you for those very good answers. One point, though :
What surprises me is that I feel I am between hardware abstraction and hardware awareness…
Let’s suppose that the developer, working on a recent card, chooses a block size and grid dimension that would not work on an older card. He would then have to check the compatibility of both parameters.
I would have expected an “auto-mode” for dispatching blocks and grids: when using N threads, the driver would automatically split N into blocks of maximal size and deduce the grid size (while still letting the developer do it himself). The kernel launch would then be optimal, adaptive to any card, and a single error check would tell whether it can be executed at all…
Grid dimensions work for all cards, as long as you stay within 65535 x 65535.
Block dimensions are resource-dependent (GT200 has twice as many registers as G80/G92), so that is the only factor deciding between older and newer devices. There will probably be more diversity in the future, but grid dimensions in particular look likely to stay constant for a while.
Efficiency does not come from using blocks of maximal size, not exactly. If you use small blocks, the code will run just fine on a larger device (running several blocks per multiprocessor), so there is no reason for the runtime to automate this.
However… for certain algorithms the more shared mem and the more registers you have per block, the better (a good example is matrix-multiply). Also, gmem accesses may be slightly faster if there is one block per multiprocessor (and the accesses are perfectly optimized).
For your type of algorithm, I don’t think having a maximally large block or using a hundred registers has any advantage. A good configuration is: use blocks with 128 threads, 32 registers per thread, and up to 4KB shared memory per block. Launch at least a few thousand of these. This will let your code scale nicely across all devices, including future ones, and is a good balance of occupancy and resource usage on current ones.
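The configuration above amounts to a few lines of host code. The grid size comes from the problem size, not from the device, using the usual round-up division; `numPixels` and `myKernel` are hypothetical names:

```cuda
// 128-thread blocks; grid sized to cover the whole problem.
int threadsPerBlock = 128;
int numBlocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;

// Register use can be capped at build time, e.g. "nvcc --maxrregcount=32",
// so the block always fits on the multiprocessor.
myKernel<<<numBlocks, threadsPerBlock>>>(d_data, numPixels);
```

For a full-resolution image this naturally launches thousands of blocks, which is exactly what lets the hardware scheduler keep any number of multiprocessors busy.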
My rule of thumb is the following: if I have edge effects, I read from a 2D texture, which provides boundary safety and is, apparently, optimized for 2D access. If I don’t, then I can read from linear memory.
More specifically, most operators in image processing are non-punctual: they require additional pixels outside the area of interest. For example, to calculate the output of a 3-tap FIR filter, one needs 2 additional pixels beside the current one. If you process the image in tiles (on a GPU, each block of threads would process one of these tiles), then you need to overlap the tiles so as to cover the erroneous pixels around the borders. I call these erroneous pixels edge effects; the NVIDIA tutorials call the extra pixels that must be brought in the apron. To be efficient, global memory access needs to happen in a coalesced fashion, which entails reading from aligned addresses (in the newer chips the coalescing requirements are more relaxed, but I think the alignment constraint still stands). The need to overlap your tiles makes it difficult to follow the coalescing rules, and that’s why I use 2D textures whenever overlaps (aprons) are needed. Hope it’s all clear :wacko:
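A minimal sketch of the texture approach for the 3-tap FIR example above, using the texture-reference API: the x-1 and x+1 fetches may fall in the apron, and with the clamp address mode the texture unit handles the border for free. Kernel and texture names are made up, and the host-side binding is only outlined:

```cuda
texture<unsigned char, 2, cudaReadModeElementType> texImg;

__global__ void fir3_horizontal(unsigned char *d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Apron fetches (x-1, x+1) are clamped by the texture hardware,
        // so no bounds checks and no coalescing worries.
        float s = 0.25f * tex2D(texImg, x - 1, y)
                + 0.50f * tex2D(texImg, x,     y)
                + 0.25f * tex2D(texImg, x + 1, y);
        d_out[y * width + x] = (unsigned char)(s + 0.5f);
    }
}

// Host side (sketch): copy the image into a cudaArray, set
// texImg.addressMode[0] = texImg.addressMode[1] = cudaAddressModeClamp,
// then cudaBindTextureToArray(texImg, myArray) before launching.
```

The writes to `d_out` are still plain linear memory, which is fine: it is the overlapping reads, not the non-overlapping writes, that made coalescing awkward.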
I have understood the memory hierarchy of CUDA, so your sentence makes sense. But on the other hand, it does not match the examples I have seen. Let me explain:
In the CUDA doc, matrix addition is a basic example: a cudaMalloc() allocates memory on the device (in global mem), and then the kernels read/write that memory to perform the additions.
Do you mean that for better performance, even such a simple example would require preparing the kernels by first copying a sub-matrix into the shared memory of each block, then performing the computation, then copying the shared memory back to global memory, and finally copying back from device to host?
As far as I understand, this would have the following form:
cudaMalloc() the matrices on the device
cudaMemcpy() from host to global device mem
in the kernel:
    if (threadIdx == 0) // only the first thread, to minimize global mem access
        copy from global mem to shared block mem
    perform the additions on the current block’s sub-matrix: they only access shared memory, for better perf.
    if (threadIdx == 0) // only the first thread, to minimize global mem access
        copy from shared block mem back to global mem
cudaMemcpy() from global device mem to host
You do not do a cudaMemcpy inside the kernel. Each thread (not only thread 0!) assigns some element of a global memory array to a shared memory array element. There are plenty of examples in the SDK that do this, for example the reduction example.
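The pattern looks like this, sketched for the matrix-addition example with hypothetical names and assuming 256-thread blocks. Note that for a plain addition each element is touched exactly once, so shared memory buys nothing here; the snippet only illustrates the load pattern, which pays off in kernels with data reuse (like the dilation or the reduction example):

```cuda
__global__ void addTiles(const float *d_a, const float *d_b,
                         float *d_c, int n)
{
    __shared__ float sa[256];   // one element per thread in the block
    __shared__ float sb[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Every thread copies its own element: coalesced, no cudaMemcpy,
        // no "thread 0 only" serialization.
        sa[threadIdx.x] = d_a[i];
        sb[threadIdx.x] = d_b[i];
    }
    __syncthreads();

    if (i < n)
        d_c[i] = sa[threadIdx.x] + sb[threadIdx.x];
}
```

Having a single thread do the copy would serialize what the hardware is designed to do in parallel, and the per-thread loads are what make the global-memory traffic coalesced in the first place.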