Ages ago, when I worked on the original OpenCL implementation, the CUDA guys said that using OpenGL for imaging was the way to go, as CUDA performance on imaging lagged behind OpenGL. We found this to be true: OpenGL had more coherent access to block regions of memory, and thread execution was block-oriented for OpenGL versus linear for CUDA, which caused cache coherency issues on the CUDA side.
Has this been fixed in recent generations of NVIDIA hardware with better scheduling and samplers, or is this still the case?
I used to be one of the “CUDA guys” ages ago. Frankly, I have no idea what you may be referring to with “more coherent access to block regions of memory” and “thread execution was block-oriented for OpenGL versus linear for CUDA”.
Whether OpenGL compute shaders or CUDA kernels work better will likely depend on what specific kind of image processing the use case requires, and where the data is needed eventually. There will be OpenGL/CUDA interop overhead if the use case requires passing data back and forth between the two environments, and that overhead could be significant.
My standard 10,000 ft recommendation is: If the application is primarily graphical in nature, try OpenGL or DirectX compute shaders first, unless it turns out that the use case requires features not supported by those graphical environments (that includes ease of programming and cost of maintenance; CUDA is a subset of C++11). If the application is primarily computational in nature, start with CUDA and add CUDA/OpenGL interop if needed.
OpenGL / CUDA interop overhead is what I am avoiding. We had found that OpenGL texture shaders were much faster than CUDA for simple shaders implementing photographic effects like sepia. OpenGL fills primitives in a block fashion, which increases cache coherency for data access because adjacent threads access similar data. I am fairly certain CUDA treated kernel submissions as linear in the past.
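For what it's worth, a CUDA kernel gets the same block-shaped access pattern simply by launching 2D thread blocks. A minimal sketch of a sepia kernel (my own illustration, not code from this thread) where each 16x16 block covers a contiguous 2D tile of the image, so adjacent threads read adjacent pixels:

```cuda
#include <cuda_runtime.h>

// Hypothetical sepia kernel: each 16x16 thread block covers a 16x16
// pixel tile, so neighboring threads touch neighboring pixels.
__global__ void sepia(const uchar4 *in, uchar4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 p = in[y * width + x];
    // Standard sepia weighting, clamped to 255
    float r = 0.393f * p.x + 0.769f * p.y + 0.189f * p.z;
    float g = 0.349f * p.x + 0.686f * p.y + 0.168f * p.z;
    float b = 0.272f * p.x + 0.534f * p.y + 0.131f * p.z;
    out[y * width + x] = make_uchar4((unsigned char)fminf(r, 255.0f),
                                     (unsigned char)fminf(g, 255.0f),
                                     (unsigned char)fminf(b, 255.0f), p.w);
}

// Launch with a 2D grid so the block shape matches the image tiling:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   sepia<<<grid, block>>>(d_in, d_out, width, height);
```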
Before writing all the code in OpenGL texture shaders I wanted to try CUDA first.
If you use cudaArray in CUDA it should be first-order equivalent to what OpenGL uses. It is optimized for 2D storage. I am fairly certain that cudaArray was available in early versions of CUDA.
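A sketch of what that looks like with the texture object API (available since CUDA 5.0; earlier CUDA versions used texture references instead). The function name and parameters here are my own, for illustration:

```cuda
#include <cuda_runtime.h>

// Sketch: stage a float image in a cudaArray and read it through a
// texture object. The cudaArray uses the 2D-optimized storage layout
// that the texture hardware expects.
cudaTextureObject_t makeTexture(const float *hostImg, int width, int height,
                                cudaArray_t *arrOut)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(arrOut, &desc, width, height);
    cudaMemcpy2DToArray(*arrOut, 0, 0, hostImg, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = *arrOut;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;    // no interpolation
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

// In a kernel, reads then go through the texture cache:
//   float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);
```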
Texture caches in GPUs do not provide coherency, no matter which software interface textures are used through. If you need to write to textures, you would want to look at the APIs dealing with surfaces (those were added to CUDA later, ca. 2008, I think). I am guessing your description of data access for adjacent threads may refer to spatial locality?
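For completeness, a sketch of the surface path mentioned above (again my own illustration): the cudaArray must be allocated with the cudaArraySurfaceLoadStore flag, and the kernel writes through surf2Dwrite:

```cuda
#include <cuda_runtime.h>

// Sketch: a cudaArray allocated for surface load/store can be written
// from a kernel via a surface object. Note that surf2Dwrite takes a
// byte offset in x, not a pixel index.
__global__ void fill(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        surf2Dwrite(1.0f, surf, x * (int)sizeof(float), y);
}

void example(int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &resDesc);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    fill<<<grid, block>>>(surf, width, height);
    cudaDeviceSynchronize();

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
}
```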
As I said, OpenGL / CUDA interop overhead is still very much a thing.
Yeah… I have two accounts, one as a former NVIDIA employee and one for personal use… and macOS / Linux machines which log into different accounts, and I haven’t taken the time to fix that because they just work ;)
I think spatial locality is the term we are looking for to describe the same issue.
I am working on some imaging algorithms for FITS images from space telescopes; some algorithms I can’t, or don’t want to, attempt to implement in OpenGL. It’s been a while since I dove into performance work for GPGPU (around G80 and Kepler), but I want to churn through a few thousand images in a dataset as part of this work, so performance would be a good place to start.