I am currently considering working with CUDA to improve some real-time image processing in software I am developing at work.
Currently I use OpenGL shaders, but I think CUDA is worth considering.
I have read the CUDA Programming Guide and some examples, but I have not yet decided whether I can take the time to learn CUDA while we have so many things to do with our software.
What I am after is some general guidance on good practices, to see whether it would be easy enough to bring CUDA into my code.
Let’s suppose I want to perform some basic operations on an image: one dilation and one erosion (just a simple example).
On Mac OS X, I would use CIFilters with custom kernels.
Otherwise, I can make a texture from the image and use pixel shaders with GLSL.
Considering CUDA, I am wondering what I should do:
- Should I make a CUDA texture from the image, or should I use linear memory or CUDA arrays?
- To process multiple kernels sequentially, streams seem good for chaining operations. But are they really necessary here? As long as my image is mapped to device memory (either as a texture or something else), can’t I call several kernels on it and end with a read from device to host?
- What is the preferred data format for CUDA internals? I suppose float?
- If I want to make a program that uses CUDA when it is available on the machine, and a software mode otherwise, should I use the CUDA emulation mode for the software fallback (code unification), or a custom implementation (would it be more efficient for simple cases)?
- When using CUDA, is it usual to query the device properties to decide how many thread blocks or grid cells should be used?
Sorry if these seem like dumb questions, but I feel I need such clarifications before deciding whether or not to dedicate time to CUDA experiments…
I usually never use textures, but maybe for you a texture + CUDA array would be easier. E.g., when doing a dilation I would copy the data into shared mem and do multiple reads from there. Simply doing it straight from global mem would be slow and would not exploit the data reuse. Using a texture and its cache would be faster than reading from global mem, but slower than staging in shared memory. Then again, it would be easier than staging in shared memory. Textures are oddballs; there are few things for which they’re really the best, at least if you don’t need to do filtering.
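To illustrate the shared-memory staging idea, here is a rough sketch of a 3x3 dilation on an 8-bit image, assuming 16x16 thread blocks. Each block cooperatively loads its tile plus a 1-pixel apron into shared memory, then every thread reads its 3x3 window from there, reusing each loaded pixel up to nine times. All names (`TILE`, `d_in`, `d_out`) are made up for the example; this is a sketch, not a tuned implementation.

```cuda
#define TILE 16  // assumes the kernel is launched with dim3(TILE, TILE) blocks

__global__ void dilate3x3(const unsigned char *d_in, unsigned char *d_out,
                          int width, int height)
{
    // Tile plus a 1-pixel apron on every side.
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    // Cooperative load: the whole block fills the (TILE+2)^2 region,
    // clamping coordinates so reads outside the image replicate the edge.
    for (int ly = threadIdx.y; ly < TILE + 2; ly += blockDim.y)
        for (int lx = threadIdx.x; lx < TILE + 2; lx += blockDim.x) {
            int gx = (int)(blockIdx.x * TILE) + lx - 1;
            int gy = (int)(blockIdx.y * TILE) + ly - 1;
            gx = min(max(gx, 0), width  - 1);
            gy = min(max(gy, 0), height - 1);
            tile[ly][lx] = d_in[gy * width + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        // Dilation: maximum over the 3x3 neighbourhood, read from shared mem.
        int m = 0;
        for (int dy = 0; dy < 3; ++dy)
            for (int dx = 0; dx < 3; ++dx)
                m = max(m, (int)tile[threadIdx.y + dy][threadIdx.x + dx]);
        d_out[y * width + x] = (unsigned char)m;
    }
}
```

An erosion kernel would be identical except for taking the minimum instead of the maximum.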
If you don’t need to copy memory into the GPU while the GPU is busy running a kernel (or something similarly fancy), you don’t need streams. As you say, call several kernels and end with a memcpy.
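In code, the whole pipeline can stay on the device between kernels. Kernels launched into the default stream execute in order, so chaining them needs nothing special. A sketch of the host side, assuming hypothetical `dilate3x3`/`erode3x3` kernels operating on an 8-bit image:

```cuda
unsigned char *d_a, *d_b;
size_t bytes = width * height;
cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
cudaMemcpy(d_a, h_img, bytes, cudaMemcpyHostToDevice);   // one upload

dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
dilate3x3<<<grid, block>>>(d_a, d_b, width, height);     // a -> b
erode3x3 <<<grid, block>>>(d_b, d_a, width, height);     // b -> a
// ...as many chained kernels as you like, ping-ponging buffers...

cudaMemcpy(h_img, d_a, bytes, cudaMemcpyDeviceToHost);   // one readback
cudaFree(d_a);
cudaFree(d_b);
```

The final `cudaMemcpy` implicitly waits for the preceding kernels to finish, so no explicit synchronization is needed either.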
Yes, float is the dominant type, but you have a lot of flexibility. int or uchar4 may be better for you. I won’t lie, there are some nuances regarding which data types and access patterns cause performance trouble, but I wouldn’t worry too much now. If you’re curious, study coalescing and shared memory bank conflicts.
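As a small illustration of why uchar4 can pay off for 8-bit images: having each thread read a uchar4 turns four byte-sized accesses into one 32-bit load, which a warp can coalesce into wide transactions. A hypothetical per-pixel operation (names are made up):

```cuda
// Each thread handles four 8-bit pixels packed in a uchar4.
// n4 = number of uchar4 elements (image size in bytes / 4).
// cudaMalloc'd pointers are suitably aligned for uchar4 access.
__global__ void brighten(const uchar4 *d_in4, uchar4 *d_out4, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 p = d_in4[i];              // one coalesced 32-bit load
        p.x = min(p.x + 10, 255);
        p.y = min(p.y + 10, 255);
        p.z = min(p.z + 10, 255);
        p.w = min(p.w + 10, 255);
        d_out4[i] = p;                    // one coalesced 32-bit store
    }
}
```

The same trick applies to reading: a kernel that would otherwise load individual bytes from global memory usually goes noticeably faster when the loads are widened this way.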
Custom implementation, or the “nvcc -multicore” option that we were promised a while back. Don’t use emulation. Also read up on how to put CUDA into a dynamic library and load it manually on start-up. Otherwise you’ll have a problem when CUDA isn’t installed on a system.
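To make the dynamic-library idea concrete, here is a rough sketch of the host-side dispatch on Linux/Mac OS X (on Windows you would use LoadLibrary/GetProcAddress instead). The library name `libmycuda.so` and the functions `process_image`/`process_cpu` are hypothetical; the point is only that the main binary never links the CUDA runtime directly, so it still starts on machines without CUDA:

```cuda
#include <dlfcn.h>

typedef int (*process_fn)(unsigned char *img, int w, int h);

// CPU fallback, implemented elsewhere in the main binary.
int process_cpu(unsigned char *img, int w, int h);

int process(unsigned char *img, int w, int h)
{
    // Try to bind the CUDA implementation at run time.
    void *lib = dlopen("libmycuda.so", RTLD_NOW);
    if (lib) {
        process_fn f = (process_fn)dlsym(lib, "process_image");
        if (f)
            return f(img, w, h);          // CUDA path
        dlclose(lib);                     // symbol missing: fall through
    }
    return process_cpu(img, w, h);        // software fallback
}
```

You would typically do the `dlopen` once at start-up (and also check `cudaGetDeviceCount` inside the library, since the CUDA runtime can be installed on a machine with no usable device).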
Umm. Yes and no. Mostly, no. Don’t make your blocks too big (resource-wise) and don’t make the grid too small, and you’ll be fine. Then again we don’t know what CUDA devices will exist in the future. But right now, I don’t think most people query anything.
Thank you for those very good answers. One point, though :
What surprises me is that I feel I am between hardware abstraction and hardware awareness…
Let’s suppose that the developer, working on a recent card, chooses a block size and grid dimension that would not work on an older card. He would then have to check the compatibility of both parameters.
I would have expected an “auto-mode” for dispatching blocks and grids: when using N threads, the driver would automatically split N into blocks of maximal size and deduce the grid size (while still letting the developer do it himself). The kernel launch would then be optimal, adaptive to any card, and a single error check would tell whether it can be executed at all…
Grid dimensions work for all cards, as long as you stay within 65535 x 65535.
Block dimensions are resource-dependent (GT200 has twice as many registers as G80/G92), so that is the only factor deciding between older and newer devices. There will probably be more diversity in the future, but grid dimensions in particular look likely to stay constant for a while.
Efficiency does not come from using blocks of maximal size, not exactly. If you use small blocks, the code will run just fine on a larger device (running several blocks per multiprocessor), so there is no reason for the runtime to automate this.
However… for certain algorithms the more shared mem and the more registers you have per block, the better (a good example is matrix-multiply). Also, gmem accesses may be slightly faster if there is one block per multiprocessor (and the accesses are perfectly optimized).
For your type of algorithm, I don’t think having a maximally large block or using a hundred registers has any advantage. A good configuration is: use blocks with 128 threads, 32 registers per thread, and up to 4KB shared memory per block. Launch at least a few thousand of these. This will let your code scale nicely across all devices, including future ones, and is a good balance of occupancy and resource usage on current ones.
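The configuration above amounts to a few lines of host code. The grid size comes from the problem size, not from the device, using the usual round-up division; `numPixels` and `myKernel` are hypothetical names:

```cuda
// 128-thread blocks; grid sized to cover the whole problem.
int threadsPerBlock = 128;
int numBlocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;

// Register use can be capped at build time, e.g. "nvcc --maxrregcount=32",
// so the block always fits on the multiprocessor.
myKernel<<<numBlocks, threadsPerBlock>>>(d_data, numPixels);
```

For a full-resolution image this naturally launches thousands of blocks, which is exactly what lets the hardware scheduler keep any number of multiprocessors busy.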
My rule of thumb is the following: if I have edge effects, I read from a 2D texture, which provides boundary safety and is, apparently, optimized for 2D access. If I don’t, then I can read from linear memory.
More specifically, most operators in image processing are non-punctual: they require additional pixels outside the area of interest. For example, to calculate the output of a 3-tap FIR filter, one needs 2 additional pixels beside the current one. If you process the image in tiles (on a GPU, each block of threads would process one of these tiles), then you need to overlap the tiles so as to cover the erroneous pixels around the borders. I call these erroneous pixels edge effects; the NVIDIA tutorials call the extra pixels that must be brought in the apron. To be efficient, global memory access needs to happen in a coalesced fashion, which entails reading from aligned addresses (in the newer chips the coalescing requirements are more relaxed, but I think the alignment constraint still stands). The need to overlap your tiles makes it difficult to follow the coalescing rules, and that’s why I use 2D textures whenever overlaps (aprons) are needed. Hope it’s all clear :wacko:
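A minimal sketch of the texture approach for the 3-tap FIR example above, using the texture-reference API: the x-1 and x+1 fetches may fall in the apron, and with the clamp address mode the texture unit handles the border for free. Kernel and texture names are made up, and the host-side binding is only outlined:

```cuda
texture<unsigned char, 2, cudaReadModeElementType> texImg;

__global__ void fir3_horizontal(unsigned char *d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Apron fetches (x-1, x+1) are clamped by the texture hardware,
        // so no bounds checks and no coalescing worries.
        float s = 0.25f * tex2D(texImg, x - 1, y)
                + 0.50f * tex2D(texImg, x,     y)
                + 0.25f * tex2D(texImg, x + 1, y);
        d_out[y * width + x] = (unsigned char)(s + 0.5f);
    }
}

// Host side (sketch): copy the image into a cudaArray, set
// texImg.addressMode[0] = texImg.addressMode[1] = cudaAddressModeClamp,
// then cudaBindTextureToArray(texImg, myArray) before launching.
```

The writes to `d_out` are still plain linear memory, which is fine: it is the overlapping reads, not the non-overlapping writes, that made coalescing awkward.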
I have understood the memory hierarchy of CUDA, so your sentence makes sense. But on the other hand, it does not match the examples I have seen. Let me explain:
In the CUDA doc, matrix addition is a basic example: a cudaMalloc() allocates memory on the device (in global mem), and then the kernels read/write that memory to perform the additions.
Do you mean that for better performance, even such a simple example would require preparing the kernels by first copying a sub-matrix into the shared memory of each block, then performing the computation, then copying the shared memory back to global memory, and finally copying back from device to host?
As far as I understand, this would have the following form:
cudaMalloc() the matrices on the device
cudaMemcpy() from host to global device mem
in the kernel:
    if (threadIdx == 0) // only the first thread, to minimize global mem access
        copy from global mem to shared block mem
    perform the additions on the current block’s sub-matrix: they only access shared memory, for better perf.
    if (threadIdx == 0) // only the first thread, to minimize global mem access
        copy from shared block mem back to global mem
cudaMemcpy() from global device mem to host
You do not do a cudaMemcpy inside the kernel. Each thread (not only thread 0!) assigns some element of a global memory array to a shared memory array element. There are plenty of examples in the SDK that do this, for example the reduction example.
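The pattern looks like this, sketched for the matrix-addition example with hypothetical names and assuming 256-thread blocks. Note that for a plain addition each element is touched exactly once, so shared memory buys nothing here; the snippet only illustrates the load pattern, which pays off in kernels with data reuse (like the dilation or the reduction example):

```cuda
__global__ void addTiles(const float *d_a, const float *d_b,
                         float *d_c, int n)
{
    __shared__ float sa[256];   // one element per thread in the block
    __shared__ float sb[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Every thread copies its own element: coalesced, no cudaMemcpy,
        // no "thread 0 only" serialization.
        sa[threadIdx.x] = d_a[i];
        sb[threadIdx.x] = d_b[i];
    }
    __syncthreads();

    if (i < n)
        d_c[i] = sa[threadIdx.x] + sb[threadIdx.x];
}
```

Having a single thread do the copy would serialize what the hardware is designed to do in parallel, and the per-thread loads are what make the global-memory traffic coalesced in the first place.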