Verdict: GLSL vs CUDA, a kind of not-so-dead post-mortem

Disclaimer
I really think this is an important issue for all of us who have a graphics background and are looking to CUDA to widen our problem-solving capabilities. So, sorry for the long post.

Please, I'd love to hear comments from the CUDA experts around here regarding my experiences, and maybe some good advice from you guys and Nvidia as well!

Nevertheless, I hope I can help all the guys who are starting out and have had their own share of problems.

Thank you all!


So I’ll go ahead and ask the dreaded question: is it possible to achieve better performance with CUDA than with GLSL?

My answer is "yes", Nvidia's is "not so easy!". Let me explain, so sit back and enjoy the ride.

So I have a nice happy problem to solve that can be mapped (quite naively, I admit) onto the "old" GPGPU paradigm of "render a fullscreen quad and trigger your fragment shaders".

Then I get X frames-per-second.

Methinks: this could go really fast if I use CUDA! So, a little experiment: same hardware, same code, what could go wrong? And I port the exact same code and make it work in CUDA.

A few days later, after struggling with the stubborn compiler about why using local memory is a Bad Idea™, and why inlined device functions shouldn't increase register usage by a million, everything finally looks OK.

Then I get from 50% to 90% of the performance of the GLSL code. And I'd like to note that just launching an empty kernel already puts CUDA way behind GLSL. What an overhead! So I'm only looking at reasonable data sizes here.

The main reason: register pressure. I’m bottlenecked at 40 registers, which equates to 25% occupancy. Here’s why I think so:

. Memory reads:
Mainly random, so using textures (more than doubled performance when compared to global memory). Found 1D textures to be faster than 2D, because of no addressing math. But really wished for 3D textures :)

. Memory writes:
All coalesced at the end of kernel execution. Can it get any better than this?

. Constant memory:
Found no use for it yet. Too small to fit my input data, unless I restrict the working set. Tried placing time-invariant kernel parameters in there, but it didn't make any difference (a sketch of what I mean follows this list).

. Shared memory:
Currently not used, so only about 140bytes from kernel parameters. Tried to use it to save registers, but never managed any gain. Maybe someone has advice in this department? ;)

. Divergent branches:
Only about 10% of total branch count. And here I was thinking this would be my doom…
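
For reference, here is a minimal sketch of what I mean by putting time-invariant parameters in constant memory (names and sizes are made up for illustration, not my actual code):

```
// Hypothetical sketch: time-invariant parameters uploaded once to constant
// memory instead of being passed as kernel arguments on every launch.
#include <cuda_runtime.h>

__constant__ float c_params[32];          // tiny, well under the 64 KB limit

__global__ void useParams(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // All threads read the same element, so the constant cache
        // broadcasts it instead of each thread doing its own load.
        out[i] = c_params[0] * i + c_params[1];
}

void uploadParams(const float *h_params)
{
    // Copy the host-side parameter block to the __constant__ symbol once,
    // before the main kernel-launch loop.
    cudaMemcpyToSymbol(c_params, h_params, 32 * sizeof(float));
}
```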

By all means, I'm no hardware telepath and I'm not absolutely convinced about any of this. It just seems to me that occupancy is the bottleneck. Another argument: when I decreased the register count from 43 to 40, occupancy went up by 8% and final performance by 15%.

So the question becomes: where (and most importantly, how) is GLSL performing its magic?

The code is 99% the same. The memory layout is similar:

GLSL                  ----  CUDA
varying/uniforms      ----  kernel parameters (shared memory)
input textures 3D/1D  ----  input textures 1D
output to FB          ----  coalesced global memory writes

Afterwards I have a device-to-device memcpy to a PBO, but that is really fast.

The logical conclusion: the compiler?!

So let me rephrase my fundamental question: is it possible to emulate similar GLSL behavior in hardware using CUDA?

As in: is it possible in principle, but right now the CUDA compiler is behind the GLSL one? (I know some hardware functionality isn't exposed, like the rasterizer, filtering, ROPs, etc.)

Or is it a lost cause and I should have known better? Must I always go back to the drawing board and reformulate a new algorithm suited to CUDA to solve the same problem? Or is "thinking like graphics" ever a valid way to go?

I suspect the answer is: it depends on the problem! In other words: just try and see what you get every time… Am I right?

Finally, to go back to my initial proposition: I do believe it is possible to achieve better performance with CUDA than with GLSL. Two choices:

  1. You can try the same algorithms once the CUDA compiler becomes as smart as the GLSL one. I.e. me reordering lines of code shouldn't reduce the register count by 3!

  2. You can work hard to come up with a new way of solving your problem that better matches the CUDA paradigm.

Anyway, all this does not mean I’m giving up on CUDA. I know it has its uses. I just hope to find a way to make it work for me. Until then, back to the drawing board!

Thanks for listening!

Well, speaking as someone who started with CUDA, I cannot really add a lot of useful advice, I am afraid. The only thing I can say is that I write to global memory as soon as I can, so that writes issued early can be finished by the time the kernel ends.

Also, I think that the biggest gain compared to previous techniques can be found in algorithms where some inter-thread communication is needed. There, the use of shared memory & __syncthreads() is likely to be a big win.
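
To make that concrete, here's a minimal sketch (not related to the original poster's kernel) of the kind of inter-thread cooperation that shared memory and __syncthreads() enable and that a fragment shader simply cannot express:

```
// Minimal sketch: per-block sum reduction in shared memory.
__global__ void blockSum(const float *in, float *blockResults)
{
    __shared__ float s_data[256];            // assumes blockDim.x == 256

    int tid = threadIdx.x;
    s_data[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                         // all loads visible to the block

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s_data[tid] += s_data[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockResults[blockIdx.x] = s_data[0];
}
```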

You did not tell us a lot about your algorithm or your grid and block dimensions, but I guess the latter two are probably well chosen.

Apart from that, for me personally the big advantage of CUDA is that I felt confident I could program in it from the start, whereas I have no such faith in myself when it comes to GLSL/Cg, etc. (which might actually be a false feeling and a lack of looking into the details, but it still prevented me from trying to use the GPU before).

I also tried to minimize register count with shared memory, but even though I got my occupancy up, my total performance actually went down. Since the data you are reading from texture isn't well arranged, it stands to reason that some of the threads read the same memory. In that case, first reading the data from texture into shared memory, and then using that for all the threads, will give you a real boost.

Try to find out where the bottleneck is… although it's a bit tricky, since CUDA will use memory read latency to do other things.

Hope this helps.

Given the register count, I'm at a 16x12 block size, with a grid that depends on the input size but is on the order of 32x32. When I have time I'll post some more details on the implementation and maybe you guys could help me out more :)
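
Just to make the configuration concrete, the launch looks roughly like this (kernel name and arguments are placeholders, not my real code):

```
// Rough sketch of the launch configuration described above.
dim3 block(16, 12);                                  // 192 threads per block
dim3 grid((width + block.x - 1) / block.x,           // ~32x32 for my input sizes
          (height + block.y - 1) / block.y);
myKernel<<<grid, block>>>(d_output, width, height);
```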

As for the shared memory, I don't think I can predict which data will be accessed by more than one thread. Hence, I don't think it's possible (with the current algorithm) to load it from texture into shared memory before working with it. But I'll look into this.

Thanks for the insights!

There's no fundamental reason that CUDA shouldn't be able to match or beat GLSL, unless you're taking advantage of a GPU hardware feature in OpenGL but not in CUDA. Over the years I've had to battle many, many bugs in different vendors' GLSL compilers, so I would never use it for computations where the numerical result actually matters. It might be fun for a toy problem or something you run for your own use with a specific GPU card and driver combination, but GLSL is not going to be safe to use for numerical calculation in a production piece of software.

That said, there are various reasons why your CUDA implementation may be running slower than your GLSL one. Without seeing your code, or significantly more details about how it's structured and how you're launching your kernels, it's pretty hard to judge the reason for your performance gap. Given that you mentioned 3D textures in GLSL, that may be one specific source of performance loss, since emulating them in CUDA 1.1 would eat registers and FLOPS that GLSL gets from the texture hardware. Many of us have asked for 3D textures in a future version of CUDA, so if this is a source of your performance gap, it may be addressed when a future rev of CUDA is released with support for 3D textures.

You mentioned that the launch time for your CUDA kernel was significantly longer than for your GLSL code. Can you be specific? How long are we talking about, and how are you timing both of them? (Are you doing a glFlush() or something in the GLSL case to ensure command completion?)

Cheers,
John

Sorry I don’t have much time to digg into this.

All I can state is that, keeping the surrounding code but commenting out the kernel call, I get 5400 fps. When I call an empty kernel with the same setup, I get 1800 fps. The GLSL version easily runs at more than 5400 fps, even when performing a few computations.

Empty kernel from the profiler: GPU time 15 us / CPU time 192 us.

When I can, I’ll time the code manually.

Quick note: the GLSL version uses only 12 registers (I had forgotten about using nvemulate to check that).

So we have GLSL at 12 versus CUDA at 40. Language differences aside, I think they shouldn’t be that far off…

When you say the GLSL version uses 12 registers, do you mean 12 vector registers? Note that CUDA reports the number of scalar registers.

Oh, I think I may have overlooked that part. :"> (this is what I get for messing with code late at night)

Yes, it is 12 four-component vector registers. Thanks, Simon, for correcting my mistake. So GLSL actually uses more scalar registers. :blink:

This just makes things stranger. I'm assuming the hardware is the same for CUDA and GLSL. Maybe that's not true, and that's the price we pay for more flexibility. Some hypotheses:

. some parts of the code are currently written in a more "vector-like" way and this is somehow faster in GLSL?
. GLSL does some behind-the-curtain "automagic" things that in CUDA we cannot rely upon? Or do not even have access to.
. GLSL actually has more resources available, i.e. more registers?
. GLSL is faster at graphics stuff, like reading data from textures.

I'm sorry if I'm coming across as picky or something like that. All I'm trying to do is figure out this performance difference I'm getting. It gets hard to justify CUDA when I cannot even state why it is slower, and most importantly, if it is slower, how it could become faster than GLSL.

My only conclusion so far is that I need to approach my problem with a different algorithm, one that "fits" better into CUDA.

Yup! The hardware is the same, even the underlying microcode is the same. G8x/G9x is a completely unified architecture.

If you use vectors, the resulting code will just have three/four scalar evaluations. There is no possible speed-up on G80 from using vectors.

Possible, if you use 3D-engine things like stencils, depth buffer tricks, cube textures, framebuffer blending, multiple render targets, etc. Those are not exposed in CUDA. Then again, that doesn't mean the 3D alternative is faster than a straightforward CUDA implementation.

Certainly not. In GLSL you don't have shared memory available. 12*4 = 48, which is more than 40 registers.

Depends on what you’re doing. But in the end GLSL is optimized for graphics stuff, CUDA is optimized for GPGPU things. That’s why OpenGL<->CUDA interoperability exists.
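
On the vector point above, a tiny illustration (a sketch, not taken from anyone's code here): on G80's scalar ALUs a float4 operation is simply evaluated component by component, so vectorizing the math buys nothing by itself.

```
// A float4 add on G80 compiles to four independent scalar adds; the vector
// type is a convenience (and a wider load/store), not SIMD arithmetic.
__global__ void addFloat4(const float4 *a, const float4 *b, float4 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 va = a[i], vb = b[i];
        c[i] = make_float4(va.x + vb.x, va.y + vb.y,
                           va.z + vb.z, va.w + vb.w);
    }
}
```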

FWIW, as someone with relatively long experience doing GPGPU the "old" way (with GLSL and FBOs; in fact I started with pbuffers! shudder), I was also initially disappointed by the performance of my CUDA tests. The advantage of CUDA seems to lie not so much in being able to run algorithms that fit the GLSL/PBO way of doing things faster, but in enabling some that simply weren't possible before (with performance that makes sense, i.e. faster than the host).

What I don’t buy is the “easier to program” mantra that gets repeated often. It’s easier to get something running with CUDA because you don’t have the whole graphics API baggage to deal with, but extracting full performance is a different matter entirely.

The main thing graphics has that CUDA does not, is address generation outside the shader/kernel.

The main thing CUDA has that graphics does not, is shared memory.

There are probably some contrived cases where the iterators' address generation gives a greater benefit than the latency reduction of shared memory, but in our experience implementing equivalent CUDA and graphics shaders, CUDA can be made at least competitive and usually faster. There were early reports that SAXPY in graphics was faster than SAXPY in CUDA, but we were able to make some changes to our CUDA implementation that made them both bandwidth bound.
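
For reference, SAXPY in CUDA is just the kernel below; with coalesced accesses and a large enough vector it is purely memory-bandwidth bound, which is why both versions could be brought to the same speed.

```
// SAXPY: y = a*x + y, one element per thread. Fully coalesced, so the kernel
// is limited by memory bandwidth rather than arithmetic.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```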

If you are doing image processing with graphics, you have to rely on the texture cache to deliver the benefits of reuse; but it is designed more to conserve bandwidth than reduce latency. Shared memory reduces latency. By staging texture reads into shared memory, the two can complement one another.
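
A minimal sketch of that staging idea, assuming a 1D texture and one tile per block (the texture reference, sizes, and the filter itself are placeholders, written against the texture-reference API of the CUDA versions of the time):

```
// Each thread fetches one texel through the texture cache (saving bandwidth)
// and parks it in shared memory (saving latency on every later reuse).
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void stagedFilter(float *out, int n)
{
    __shared__ float tile[256];              // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = tex1Dfetch(texIn, i);        // one texture read per thread
    __syncthreads();

    // Toy reuse: each thread also touches its neighbours' staged values,
    // which would otherwise mean redundant texture fetches.
    float left  = tile[tid > 0 ? tid - 1 : tid];
    float right = tile[tid < blockDim.x - 1 ? tid + 1 : tid];
    if (i < n)
        out[i] = 0.25f * left + 0.5f * tile[tid] + 0.25f * right;
}
```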

If your image processing kernel uses shared memory, remember that you minimize bank conflicts by having the threads read consecutive 32-bit words in memory. Naively reading/writing bytes in shared memory will cause 4-way bank conflicts. Promote to int, or unroll your loops 4x (as sobel8 does).
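
A sketch of the "promote to int" advice (illustration only): a byte-sized shared array indexed by threadIdx.x would put threads 0..3 in the same 32-bit bank, a 4-way conflict, whereas one 32-bit word per thread gives every thread its own bank.

```
// Stage packed 8-bit data one 32-bit word per thread and unpack in registers,
// instead of indexing shared memory byte-by-byte.
__global__ void smoothBytes(const unsigned int *in, unsigned int *out, int n)
{
    __shared__ unsigned int s_pix[256];        // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s_pix[tid] = (i < n) ? in[i] : 0u;         // conflict-free word store
    __syncthreads();

    // Reuse a neighbour's word (also conflict-free) and unpack in registers.
    unsigned int a  = s_pix[tid];
    unsigned int b  = s_pix[tid + 1 < blockDim.x ? tid + 1 : tid];
    unsigned int lo = ((a & 0xffu) + (b & 0xffu)) >> 1;   // average first byte

    if (i < n)
        out[i] = (a & ~0xffu) | lo;
}
```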

Blocking parameters can have a dramatic effect on performance of kernels that read through texture. For sobel8, the kernel was authored as blocking-agnostic and optimal blocking parameters were determined empirically. (This can be done with cuEventRecord/cuEventElapsedTime.) I think the optimal block size was found to be 16x4.
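
For anyone who wants to reproduce that kind of measurement, this is roughly what it looks like with the runtime API (the post refers to the driver-API cuEventRecord/cuEventElapsedTime; the kernel and its arguments here are placeholders):

```
// Time one kernel launch with CUDA events; repeat for each candidate block
// size and keep the fastest configuration.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_out, width, height);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("block %ux%u: %.3f ms\n", block.x, block.y, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```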

FWIW, I had my first Matlab routine converted to CUDA in 1 week. The routine was extremely simple (that's why it took only a week), but it took a long, long time in Matlab. CUDA made it 100 times faster (including memcopies and double->float, float->double conversions). It is completely memory-bandwidth bound (as all my kernels seem to be so far). So it really depends on the type of algorithm you are implementing ;)

But as you said, the amount of learning you have to do to get something working in CUDA is so low, that a lot more people are now reaching the point of having to worry about how to parallelize their complicated algorithms.

And then, after reading these forums for a long time, and implementing a small algo, a very big algo, and now starting to do raytracing on the GPU, I am still often doing things 5 different ways and benchmarking to find out what works best. I have been surprised so often that things were slow when I thought they would be fast and vice versa that it is not funny anymore. To quote MrAnderson: first implement it the simple way; it often turns out to be the fastest way.

Thanks, everyone, for your replies! Maybe I should mention what I've been doing: basically, I'm working on my Master's degree on ray tracing on the GPU.

My first serious attempt at CUDA, as related in this thread, was to port my GLSL ray tracing code to CUDA. I was expecting some speed-up, but right now the CUDA version is at most as fast as the GLSL one. If anyone is interested I can share some details about it.

@wumpus:
Your arguments are much appreciated, thanks! I should stop being so paranoid :)

@Durante:
My experience so far seems to be exactly the same as yours. I can get things going in CUDA rather quickly, but it takes quite some time to tune it up to achieve decent performance. Maybe some parallel computing classes would help? :rolleyes:

@nwilt:
It's funny you mention texture vs shared memory, as this technique has recently improved my latest undertaking. However, the performance I'm getting is still far from great. I think I should start another thread to discuss the specifics there. :D

@DenisR
It’s nice to hear that mapping Matlab to CUDA is fairly straightforward! Since I’ve been looking into raytracing, maybe we could share some experiences and tips?

Well, my raytracing port has only recently started, so I am not sure I can be of much help. Otherwise, converting Matlab parts to CUDA and using them from Matlab is the path I am taking at this time. Some of the generated CUDA code will be reused in a big C framework; the other part will probably stay linked to Matlab forever.

The funny thing is that I found out I am using some kind of BVH/BIH in my Matlab code without ever having known about them :D

Reviving this old thread…

I know nothing about GLSL… But from this thread, I learn that it has “no” shared memory… Ooops…

Are there applications out there that still use GLSL?
Can someone tell me if there is a reason why one should even consider GLSL?
Will it offer portability?
Like, can I expect my OpenGL GLSL program to run on both NVIDIA and ATI?

Thanks for any input,

Best Regards,
Sarnath

Hi,

If you're looking for portability you can use OpenCL: the functionality is portable (results will be correct on both NVIDIA and ATI), but performance is not portable (most of the time the optimal kernels are not the same).