CUDA vs DirectX in terms of performance?


I’ve just started looking into CUDA as an option. I’ve already got the pricing models we use running through DirectX and the shader pipeline, and I’m very interested in performance. I’ve had a cursory look at CUDA and it seems quite different from what I’d expect after working with shaders. If I’ve already got the algorithms vectorized and take advantage of all the swizzling ops and packing values into registers to minimize ops, will I see any difference in performance moving from DirectX to CUDA?

The second part of my question deals with the fact that with DirectX it’s very difficult to have a lot of different data in flight, as you get very little information about what’s going on on the graphics card. So going beyond “set vertex buffer, draw, get video memory back, set new vertex buffer” requires a lot of trickery. Is it much simpler to have a lot of different data sets in flight at one time in CUDA?

If so, would I be better off using D3D interoperability with CUDA to do the memory handling and sticking with D3D for the calcs, or switching over wholeheartedly? Or should I just stick with DirectX?

Are there any other major selling points of CUDA that I should know about when weighing up the options?

Thanks for taking the time to read this and any help is much appreciated.

Thanks again,

Andy =)

Be ready that with CUDA you will only get 20% (at most 25%) of the GPU’s performance, and that’s if your application is designed perfectly for CUDA.

I don’t know about DirectX, but it works differently than CUDA. :-)


While I am no expert on DirectX computing, I’ll try to answer your questions. Generally with CUDA you have full control over what is going on on the GPU, and everything is basically not much more complicated to use than C programming (with the obvious exception that you have to put effort into vectorizing your algorithm and explicitly manage shared memory usage). But you don’t have to use any ‘tricks’ to do what you want. It is also no problem to have as many data structures on the GPU as you want, as long as you have enough memory. As far as I know there is no advantage to using D3D interoperability with CUDA in order to do the memory handling. You basically call something like cudaMalloc to allocate memory in the main memory of the GPU and cudaMemcpy to copy data from the host to the GPU.
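To make the pattern above concrete, here is a minimal sketch of the allocate / copy-in / copy-out sequence with the CUDA runtime API. The array size and contents are placeholders, not anything from the original post:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Minimal sketch of the memory-handling pattern described above.
// N and the data are illustrative placeholders.
int main(void)
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, bytes);                        // allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device

    // ... launch kernels that operate on d_data here ...

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

That is the whole interface for basic memory handling; there is no hidden driver-managed staging you need to trick your data through, as with vertex buffers.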
I would expect a CUDA version to be as fast as or faster than a D3D version of your program, since with CUDA you can use basically all hardware features of the GPU directly, without any overhead. The exception to that rule is if you do something that can be done relatively easily with DirectX and you are not able to write code as efficient as that of the people who wrote the GPU driver.

One possible selling point for CUDA is that you can use it without D3D, hence you are not limited to Windows as your operating system.
One drawback is obviously that you can’t use GPUs from other vendors for your code.

As an end note: if you were able to use the GPU by misusing the graphics interface, you will probably have no hard time adapting to CUDA.

best regards

Edit: I do not know how Dmitry comes to the conclusion that you can only expect 20% of the full GPU performance. There are examples out there which are FLOP-limited or bandwidth-limited, and in both cases it is possible to come relatively close to theoretical peak performance with algorithms that are heavily dependent on one of the two.

In that case I have some GPU’s for sale that have 4x the performance of the currently available generation…

Your statement is total BS.

Yeah, I don’t understand his statement either, especially when we hit something like 95% of theoretical peak performance in DGEMM.


Thanks for the replies. I went away and did some research and thought I’d feed back my results. NVIDIA GPUs don’t natively support vector instructions (SIMD) or swizzling, so there is no benefit from the shader assembler implementation. That means there is no reason any DirectX implementation should be faster, assuming they’re equal on memory transfer. From my investigations even this is not true: CUDA is far more capable in this regard. So to sum it up, any good implementation in CUDA should be at least as fast, if not faster, than any equivalent DirectX version of the same thing.

When I initially ran Black-Scholes in DirectX versus CUDA (a custom implementation, as I had to fit it into our framework), my DirectX implementation was 12 times faster. As per the above, I figured this meant I was doing something wrong, and after reading more and tinkering with the code (especially the memory transfer), my CUDA implementation is now 400 times faster than my DirectX implementation.

Lesson learned: CUDA is great, but achieving maximum performance is a bit of an art form.


Andy =)

You’re saying that between your initial implementation in CUDA and your final version you were able to optimize the run time by a factor of 12*400 = 4800 ?

While I agree that optimization in CUDA is a form of art, it may also be a bit of an art to create an initial version that is so slow… ;)

Let me guess: You used something like 1 thread and a grid of just one block initially.


I did change it from a single grid with N threads to N/64 blocks of 64 threads (I could still do with a good resource on exactly what makes grid-to-thread ratios more favourable, as my initial thought was that since I wasn’t grouping anything, it should have been able to split them up optimally).

The rest of it was all to do with setup and memory management, but yes, especially in those terms my first implementation was pretty naive. Then again, it was the first thing I’d ever written in CUDA, so I’m not beating myself up about it. When compared with a DirectX version, however, for the most part either it won’t work or you’ve done it, and there’s probably nothing you can do to get anywhere near the difference I managed between two versions in CUDA. That’s why I said it’s great but a bit of an art form. But thanks for getting me to explain myself, I really like doing that… :blink:

It doesn’t split up by itself. It does exactly what you tell it to do, so in your first version only one multiprocessor was used (since a block is always sent to only one multiprocessor, regardless of the number of threads in that block). As a GTX 280 has 30 multiprocessors (with 8 cores each), your first implementation would have used only 1/30th of the possible cores of the GPU, whereas the DirectX version probably used all of them.
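To illustrate how explicit this is, here is a sketch (an assumed example, not the poster’s pricing code) showing the two launch configurations side by side. The launch parameters in `<<<...>>>` are exactly what the hardware gets; nothing is redistributed for you:

```cuda
// A trivial kernel: each thread handles one element.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// One block of N threads: the whole block lands on a single
// multiprocessor, leaving the other 29 on a GTX 280 idle (and N is
// capped by the per-block thread limit anyway):
//     scale<<<1, N>>>(d_x, 2.0f, N);

// N/64 blocks of 64 threads: blocks are distributed across all
// multiprocessors, which is what produced the speedup described above:
//     scale<<<(N + 63) / 64, 64>>>(d_x, 2.0f, N);
```

The `(N + 63) / 64` is just integer round-up division, so the last partial block is still launched and the `if (i < n)` guard discards the excess threads.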

Best regards


Ah thanks, that makes complete sense! I had it a bit backwards. So surely then, having N blocks of 1 thread should be the fastest, as long as there is no shared memory, etc., as this way you are specifying that it can split them up however it likes, since there is no relationship between them, and it should be able to make maximum use of your hardware. Is this correct?

So my thinking from now on should be to minimise the number of threads in a block to the largest extent possible, as this would give the greatest flexibility in how parallel the execution is?


Andy =)

Sorry, you still seem to be a bit confused!

CUDA doesn’t do any “splitting up” of blocks, it uses exactly the number of blocks and threads per block that you specify. You need at least 32 threads per block, otherwise the processors within each multiprocessor will not be fully utilized. The CUDA “Best Practices Guide” (which is included with the latest toolkit) has a good section “Thread and Block Heuristics” which should help explain this.

Well, CUDA does “split up” the blocks in the sense that it distributes them among the different multiprocessors however it likes. Once you have enough blocks to keep all the MPs busy, you have to consider how to get optimal usage out of each multiprocessor, which has more to do with the size of your thread blocks: you need at least 32 threads (the warp size), and probably a multiple of 32 threads per block, such as 256, to get the best performance.