Is CUDA better than GLSLang? I need to know more...

Hello everyone.

I’m here to ask some questions and share my point of view.
At the moment I'm implementing exotic multi-asset options pricing on the GPU.
Everything is working fine: the speed is sometimes 10x faster than the CPU (Athlon X2 4600+ vs GeForce 7600GS) and the results are always within 1 sigma (numerically speaking, good results).

I'm wondering whether a switch to CUDA (from plain OpenGL+GLSLang) would be better or worse.
I mean, with the old-school GPGPU approach it's easier for me to think about how to solve problems. Yes, it can be harder to think through and develop good shaders considering that I don't have any threadId, for example, but once complex shaders are developed, I can rely on the quality of the drivers (and nVidia rocks as usual :-) ) and I know that the drivers will optimize not only the code, but even the resource handling.

Why should I switch to CUDA? Even in the case of a gather-scatter computation (and that is not my case) it's easier for me to think of the lattice as a texture being drawn on a flat quad…
And even with multipass software architectures it's easier to split up the computation into simpler passes…

BTW, I need to use bigger and bigger textures to fit my calculations: right now I'm stuck at 2048x1024xRGBAx32Float. This is fine, but I'm asking myself whether with Tesla (and supposedly its multi-SLI) I will be able to use 8192x8192xRGBAx32Float.

Thanks in advance,
Ema. :-)

Ps. I’m on Ubuntu 7.04

Hi, Ema.

In the most general terms, the CUDA programming model is an extension of the “legacy” graphics-based GPGPU model, but it is available only on NVIDIA GPUs starting from the GeForce 8 family. Everything that can be implemented with shaders can be implemented in CUDA, so at the very least you should be able to get the same performance with CUDA as with GPGPU when executed on the same hardware, since the amount of available hardware resources (ALU/texture horsepower) remains the same. By taking advantage of features/resources exposed to CUDA but not visible to graphics, you can speed up and simplify your old GPGPU programs. One of the typical things in a GPGPU==>CUDA transition is a reduction in the number of required “rendering” passes, and thus a reduction in driver/hardware overhead. Another obvious advantage is the transparency of the CUDA API, tailored specifically for compute, which makes the host “infrastructure” code much easier to develop, debug and maintain compared to OpenGL, which is still foreign to most of the problems solved with GPGPU.
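Just to illustrate the “tailored for compute” point, here is a minimal sketch of what a single GPGPU “pass” looks like in CUDA (the kernel name, sizes and scale factor are made up for illustration): no FBO, no quad, no texture binding on the host side, just allocate, copy, launch and copy back.

#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical per-element operation that a fragment shader pass would have done.
__global__ void scalePass(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global "pixel" index
    if (i < n)
        data[i] *= factor;                          // write in place, no render target needed
}

int main()
{
    const int n = 2048 * 1024;                      // same element count as a 2048x1024 texture
    size_t bytes = n * sizeof(float);

    float *h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc((void**)&d, bytes);                  // device allocation instead of a texture/FBO
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    scalePass<<<(n + 255) / 256, 256>>>(d, n, 2.0f);  // the "pass": one kernel launch

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}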

Please note that Ubuntu is not officially supported in CUDA 1.0.

Thanks for answers.

To be honest:

  1. The overhead of a rendering step isn't a problem (drawing a quad is a very easy and fast task).

  2. Yes, it's true that you have to deal with fragment shaders, but you don't have to think about threads and blocks (etc.) - the driver takes care of that.

  3. Is it true that anything you can do with CUDA you can also do with GLSLang? So far I have never read about something you can do with CUDA and not with GLSLang.

  4. The most important one: OpenGL is a standard. I'm not forced to stick with any vendor.

Apart from the fact that I still think nVidia is the best vendor for OpenGL solutions (and I do suggest buying nVidia hardware), I still see that the video cards are mainly produced for video games (so basically designed for 3D realtime graphics), and I know that with GLSLang I have more control over optimization/texture fetch/data fetch and so on.

When I read in the CUDA guide that there are textures, a suspicion came to mind: isn't it just a wrapper on top of GLSLang?

And btw, as I said, I need bigger and bigger textures to fit my calculations: right now I'm stuck at 2048x1024xRGBAx32Float. This is fine, but I'm asking myself whether with Tesla (and supposedly its multi-SLI) I will be able to use 8192x8192xRGBAx32Float. Is the Tesla architecture able to do this?

Can I use standard OpenGL on the Tesla architecture?

Thanks again,

Ema.

As I've already said, the CUDA programming model is a superset of the graphics GPGPU model. In CUDA each thread belongs to a thread block, can cooperate with the other threads in the block, and can write its results to arbitrary memory addresses. In legacy GPGPU, pixels are in fact threads running in an independent fashion and able to write only to a fixed number of output buffers and to fixed locations (bound to the thread/pixel coordinates). CUDA is not a wrapper built on top of graphics but a dedicated operating mode of the chip, and the programming model is intended to expose as much of the chip's capabilities as possible.
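As a rough sketch of the two things a fragment shader does not give you (the kernel name, the 256-thread block size and the index array are invented for illustration), consider a per-block sum: threads cooperate through shared memory and barriers, and the result is then scattered to an address chosen by the program, not by the pixel position.

// Per-block parallel sum: threads cooperate through shared memory and barriers,
// then one thread scatters the block's result to an arbitrary output slot.
__global__ void blockSum(const float *in, float *out, const int *outIndex, int n)
{
    __shared__ float partial[256];                    // per-block scratch, not visible to GLSL

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                  // cooperation: wait for the whole block

    // Tree reduction inside the block (blockDim.x assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Scatter: the destination address comes from data, not from the pixel coordinates.
    if (threadIdx.x == 0)
        out[outIndex[blockIdx.x]] = partial[0];
}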

If staying vendor/hardware independent is one of your primary priorities, then of course you can’t drop legacy GPGPU codepath.

Apart from the amount of available video memory, 2D texture size in CUDA running on the GeForce 8 family is limited to 2^16 x 2^15 texels. DirectX 10 mandates a minimum of 8192 x 8192 texels. It is possible to use Tesla for off-screen graphics rendering.

Textures are optional in CUDA. The only reason to use them is to leverage the dedicated texture hardware on the GPU, such as the texture cache and the interpolator. You can just as well write programs which access memory as a standard C-style linear array. As a GPGPU neophyte, this was the nicest part of CUDA for me: you can write code which is basically C (with some C++ features, like templates), and the compiler turns it into assembly for the GPU (bypassing GLSLang entirely).
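As a sketch of the “basically C with templates” point (the kernel name and parameters here are invented), a kernel can be an ordinary templated function operating on a plain pointer, with no texture setup at all:

// Hypothetical example: one templated kernel instantiated for float and int,
// reading and writing plain linear device memory (no textures involved).
template <typename T>
__global__ void saxpyLike(T *y, const T *x, T a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];     // ordinary C-style array indexing
}

// Host-side launches (the float and int versions come from the same source):
//   saxpyLike<float><<<blocks, 256>>>(d_yf, d_xf, 2.0f, n);
//   saxpyLike<int>  <<<blocks, 256>>>(d_yi, d_xi, 3,    n);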

The tradeoff is that CUDA makes it easy to write code which has poor performance if you pretend that a GPU is just like a CPU. Shader languages constrain you at the language level, so it is a little harder to write slow code. Getting access to the 16 KB of per-multiprocessor shared memory with CUDA can make some algorithms much faster and easier to write, especially if you want threads to collaborate.

Very good, so I will be able to use Tesla with GPGPU (GLSLang), right?

What do you mean, mate? Do you mean that it's harder to think in terms of shaders, but once you do they are definitely faster than naively C-ported CUDA code? I mean, if to write fast CUDA code you have to follow GPGPU principles anyway (as if you were writing a fragment shader program), isn't it better just to write a fragment shader directly?

Why impose a sort of abstraction (CUDA) that then forces you to write code like a fragment shader anyway?

Or am I totally wrong?

Yes, that's the only drawback I see when comparing CUDA to GPGPU.

But I think that having to cope with threads without shared memory makes your algorithm faster than synchronizing them: yes, it's harder to turn a shared-memory threaded algorithm into a completely thread-independent one, but once that's done you avoid every imposed synchronization, and the calculations should be way faster than the ones with sync.

Thanks again,

Ema.

You don’t have to write a program that looks like a fragment shader to get good performance in CUDA. You do have to be aware of the hardware design, to know where bottlenecks can occur, though. When I said that CUDA makes it easy to write slow code, I was thinking in terms of the person completely new to GPGPU (as I was several months ago) who needs to understand that you must have a data-parallel algorithm of some kind to make good use of a GPU, regardless of whether you express the algorithm in CUDA or GLSLang.

You are correct for most cases. But if you can modify an algorithm to store intermediate results in shared memory, thereby avoiding some reads and writes to global memory, CUDA will be faster. Shared memory has much lower latency and, by being local to each multiprocessor, much greater aggregate bandwidth. In fact, there is enough shared memory bandwidth to feed each stream processor (128 of them on an 8800 GTX) 1 word per clock (1.35 GHz), compared to 12 words for the entire GPU per global memory transfer (900 MHz x 2).
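Spelling that arithmetic out (back-of-the-envelope only, assuming 4-byte words and the 8800 GTX clocks quoted above):

#include <stdio.h>

int main()
{
    // Shared memory: 128 stream processors, 1 four-byte word per clock at 1.35 GHz.
    double shared_GBps = 128 * 4 * 1.35e9 / 1e9;      // ~691 GB/s aggregate

    // Global memory: 12 four-byte words per transfer at 900 MHz DDR (x2).
    double global_GBps = 12 * 4 * 900e6 * 2 / 1e9;    // ~86.4 GB/s (the 8800 GTX spec figure)

    printf("shared ~ %.0f GB/s, global ~ %.1f GB/s, ratio ~ %.0fx\n",
           shared_GBps, global_GBps, shared_GBps / global_GBps);
    return 0;
}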

This is, in most cases, pretty inaccurate, for two reasons (in no particular order):

  1. Synchronization is not expensive. Unless your threads diverge (which they rarely do within a warp if the parallelization is well thought out), synchronization costs you one cycle plus however long you have to wait. So, on a G80, in the worst case without divergence you'd have to wait 512/32 = 16 cycles. That's almost trivial for all but the shortest kernels. Plus, in those cases you can reduce the thread block size, reducing the wait penalty.

  2. Think of the shared memory as an application-managed cache (see the sketch after this list). Again, only the most trivial algorithms do not benefit from caches. A well-used cache will often give you at least an order of magnitude improvement. That's true for any computing architecture (just time two versions of a program on a Core2 - one that is cache-optimal and one with a pretty much random access pattern - and you'll see a big difference).
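As a sketch of point 2 (the kernel name, the 256-thread tile and the 3-point stencil are invented for illustration), here is shared memory used as an explicit cache: each input element is read once from global memory but reused by several threads.

// 3-point average: out[i] = (in[i-1] + in[i] + in[i+1]) / 3.
// Each block stages its tile (plus a one-element halo on each side) in shared
// memory, so every global value is loaded once but read three times.
__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];                   // blockDim.x assumed to be 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Interior elements of the tile.
    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;

    // Halo elements loaded by the first and last thread of the block.
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();                                  // the "cache" is now filled

    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}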

Paulius

I would add that, at the same time, the caching strategy of the texture units is unknown to an application developer. After all, it varies from vendor to vendor and from GPU model to GPU model; the only thing an application graphics developer needs to know is that the texture cache is optimized for 1D/2D locality. But since the texture units are a sort of hardware interface built on top of global (device) memory, with behaviour fixed in silicon, in most cases peak texture fetch performance is lower than that of raw global memory. If global memory reads and writes are coalescable (which is very likely for “streaming” algorithms like Monte Carlo, especially given that it's already implemented with GPGPU), the observed performance is very likely to be higher than with textures, even if there is no need for thread communication and shared memory at all and the data fetched from global memory just stays in local registers.

PS. In many cases multiple rendering passes in GPGPU mean pixel synchronization and results exchange…
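To make the coalescing point concrete, here is a hedged sketch of a streaming Monte Carlo-style kernel (the path update and parameter names are invented): consecutive threads read consecutive floats, so the loads of a half-warp collapse into a single memory transaction, with no texture involved.

// One thread per Monte Carlo path. Reads and writes are indexed by the global
// thread id, so consecutive threads touch consecutive addresses (coalesced).
__global__ void stepPaths(float *paths, const float *normals, int nPaths,
                          float drift, float vol)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPaths)
        return;

    float s = paths[i];      // coalesced global read, value stays in a register
    float z = normals[i];    // coalesced read of a pre-generated normal draw

    s *= expf(drift + vol * z);   // hypothetical log-Euler step

    paths[i] = s;            // coalesced global write
}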

To introduce my research and simulations, I need to tell you that I simulate independent Monte Carlo paths on the GPU. There are so many of them (like 1M paths) that I can't get any benefit from the local shared memory (shared memory isn't anything like 4 MB - 1M x 4 bytes per float - is it?).

They have just a very strict local dependency (a very small gather factor) and I'm using a lot of uniform variables. My case is the one where you have to do calculations on a large set of data.

So as you can see, point (2) is way too algorithm-dependent a statement: it's not so general that 16 KB is enough for every algorithm… and 16 KB is (just) 4096 float numbers…

Tbh I'd like to know whether Tesla will support GLSLang.

Well, what I'm trying to say is that the GPU rocks on embarrassingly parallel kinds of algorithms. I mean, everyone can implement every kind of algorithm on the GPU, but you always have to consider whether it's worth it. For example, to help me understand better: which kind of algorithm would benefit from a 16 KB cache?

Thanks again,

Ema.

Ps. So I won't be able to try out CUDA on my 7600GS?

Indeed, and knowing this one can maximize the throughput even with GPGPU.

Knowing that 3D graphics cards are optimized for multi-texture reads, one can exploit this even further…

Ema.

Unfortunately, no. The CUDA programming model depends on the new “unified architecture” of the GeForce 8 series (incl. Quadro 5600, 4600 and Tesla).

You can try compiling CUDA code in emulation mode, but that really isn’t very interesting except as a first debugging stage. It isn’t a full simulation of the card, so it gives you no real information on actual performance.

Any algorithm that “touches” a data element more than once, which includes most algorithms. Matrix multiplication is something that's used frequently in practice, and it benefits greatly from shared memory (just as it does from cache optimization on standard CPU architectures). Even something as simple as matrix transpose benefits (because shared memory facilitates coalescing reads/writes), exhibiting a 10x increase in performance.

Even if your app reads a data element once and never uses it again, shared memory can help increase effective global memory bandwidth by allowing you to coalesce reads/writes (in some cases the improvement is as high as 10x, as was the case with matrix transpose).
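A hedged sketch of that transpose trick (the 16x16 tile, the kernel name and the padding are a standard pattern written from memory, not taken from any sample): reads and writes both stay coalesced because the “corner turn” happens while the tile sits in shared memory.

// Transpose of a width x height matrix in 16x16 tiles, one tile per block
// (launched with a 16x16 thread block and a grid covering the matrix).
__global__ void transpose16(float *out, const float *in, int width, int height)
{
    // 17 columns instead of 16 avoids shared memory bank conflicts on the read-back.
    __shared__ float tile[16][17];

    int x = blockIdx.x * 16 + threadIdx.x;   // input column
    int y = blockIdx.y * 16 + threadIdx.y;   // input row
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];       // coalesced read

    __syncthreads();

    x = blockIdx.y * 16 + threadIdx.x;       // output column (runs over 'height')
    y = blockIdx.x * 16 + threadIdx.y;       // output row    (runs over 'width')
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];     // coalesced write
}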

CUDA will not run on pre-8xxx cards. If an 8800 is not an option for trying things out, you can always get an 8600, which is cheaper (you just have to keep in mind that the 8800 would be quite a bit faster when evaluating performance).

Paulius

Thanks, so this is not my case.

Btw, is there any paper/study that compares GPGPU and CUDA? I'd like to see a comparison between a really well-optimized GPGPU implementation and a CUDA one.

What about memory reads? If I have to get data from a long array like

float myArray[1024*1024];

how could I code random access to it in CUDA?

Right now with GPGPU I use a 2D texture to access it, and it's optimized; how could I do it with CUDA?

Should I still use a texture?

Thanks again,

Ema.

You’d access the array element just like you would in any C program:

myArray[i]=3;

That's one of the CUDA benefits: the code is pretty much straightforward C. Textures are still available, for example if you want to use filtering.
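As a small sketch of how that array from your question might be set up and accessed in device memory (the kernel and the index array are hypothetical):

// A 1024*1024 float array lives in global device memory; each thread gathers
// one element at a data-dependent index - no texture setup required.
__global__ void gather(float *out, const float *myArray, const int *indices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = myArray[indices[i]];     // random (gather) access, plain C indexing
}

// Host side (error checking omitted for brevity):
//   float *d_myArray;
//   cudaMalloc((void**)&d_myArray, 1024 * 1024 * sizeof(float));
//   cudaMemcpy(d_myArray, h_myArray, 1024 * 1024 * sizeof(float), cudaMemcpyHostToDevice);
//   gather<<<(n + 255) / 256, 256>>>(d_out, d_myArray, d_indices, n);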

Paulius

Thanks.

One more question: now I use a lot of uniform GLSLang variables (mat4 and so on).

Should I make them all shared?

I don't need to write to them, just read them, and they don't change between blocks… Which kind of variable should I use in CUDA?

Thanks again,

Ema.

Variables declared with the __constant__ modifier are probably the best choice; use the cudaMemcpyToSymbol() call to update their values.
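A minimal sketch of that (the 4x4 matrix name and the kernel are made up, standing in for the mat4 uniforms mentioned above):

// Read-only parameters shared by all blocks, analogous to GLSL uniforms.
__constant__ float c_mat4[16];        // a 4x4 matrix in constant memory

__global__ void applyMat4(float4 *points, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float4 p = points[i];
    // Multiply by the constant matrix (row-major), just as a uniform mat4 would be used.
    points[i] = make_float4(
        c_mat4[0]*p.x  + c_mat4[1]*p.y  + c_mat4[2]*p.z  + c_mat4[3]*p.w,
        c_mat4[4]*p.x  + c_mat4[5]*p.y  + c_mat4[6]*p.z  + c_mat4[7]*p.w,
        c_mat4[8]*p.x  + c_mat4[9]*p.y  + c_mat4[10]*p.z + c_mat4[11]*p.w,
        c_mat4[12]*p.x + c_mat4[13]*p.y + c_mat4[14]*p.z + c_mat4[15]*p.w);
}

// Host side: upload the values once, before launching the kernel.
//   float h_mat4[16] = { /* ... */ };
//   cudaMemcpyToSymbol(c_mat4, h_mat4, sizeof(h_mat4));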

Sorry, it should be more like a large vector of float values (256 or so): is constant fine for that too?

Cheers,

Ema.

All the required info, including the resource limits of the GeForce 8 family, is contained in the Programming Guide…

The total size of all constant variables is limited to 64 KB.

The nature of constant variables in CUDA is the same as that of uniform variables in GLSL, so one could reasonably assume that the uniform variables of GLSL shaders running on a GeForce 8 live in the same place on the chip as the constant variables of a CUDA program, and that if the resource limit wasn't hit by a GLSL fragment program then it's unlikely to be hit by its CUDA counterpart. :)

Hi,

concerning the first question (the comparison between CUDA and graphics): if the application can benefit from the rasterization hardware built into the graphics pipeline (for matrix-vector multiplication, for instance), you can obtain better performance with the graphics approach than with CUDA, which uses only the processors. I've experienced a 3x ratio in that case!

D.H.