Is it doable with CUDA?

Hello,
I’m new to this forum, so hello to all.
I’m not even a beginner: we haven’t even started with CUDA yet. So please forgive the naive question.
Also, I haven’t scanned (in parallel, ha ha) all the topics of this forum yet, so please forgive me again if the answers have already been posted elsewhere.


We have an application that is very well suited to parallelism. For instance, on a representative case, the computation took 24 seconds without multi-threading and now takes less than 2 seconds with multi-threading.
That’s already an achievement, all on a 4-core i7. We’ve ordered a 12-core AMD Threadripper beast and we’ll probably get under one second.
But the BIG achievement will be to go real time. For that we need to get under a few tenths of a second.
Of course we could wait for the next beast with 64 cores or more, but it would be unaffordable.

So we’re evaluating CUDA, as you can guess. We haven’t started yet, we’ve just downloaded it.

First of all, the software already works in parallel, so the memory is already well separated. There are no mutexes, atomics or anything of the sort: there is no memory shared for writing at all. For reading: yes.
But it’s std::vectors, and the functions create and resize many big std::vectors.
Is that an issue?
Sorry again: it’s probably a beginner question.

Secondly, the algorithms are not small. It’s not a matrix computation or something like that; it’s a full, real library. Are there limitations there?
There is also a lot of arithmetic on doubles. Limitations there too?

Generally speaking, any info about limitations compared with “normal” C++ code is welcome, and links to forum threads that already discuss this are very welcome too.
Thanks a lot,
Olivier


But it’s std::vectors, and the functions create and resize many big std::vectors.
Is that an issue?

The C++ STL doesn’t compile in device code. You could, however, write your own functions to mimic the STL (https://stackoverflow.com/questions/10375680/using-stdvector-in-cuda-device-code), or you might want to check out Thrust (https://developer.nvidia.com/thrust). Another approach is to allocate an array in device memory at the maximum size your problem will ever need and then reuse it.
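For example, something along these lines (a minimal sketch, not your actual code; the sizes and names are placeholders):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
    // Option 1: thrust::device_vector looks a lot like std::vector, but its
    // storage lives in GPU memory (create, resize and copy it from host code).
    thrust::host_vector<double>   h_data(100000, 1.0);
    thrust::device_vector<double> d_data = h_data;        // host -> device copy

    // Option 2: allocate one plain device buffer at the maximum size the
    // problem will ever need and reuse it between kernel launches.
    const std::size_t maxElems = 1 << 20;                 // placeholder upper bound
    double* d_buffer = nullptr;
    cudaMalloc(&d_buffer, maxElems * sizeof(double));

    // ... kernels would work on thrust::raw_pointer_cast(d_data.data())
    //     or on d_buffer here ...

    cudaFree(d_buffer);
    return 0;
}
```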

Secondly, the algorithms are not small. It’s not a matrix computation or something like that; it’s a full, real library. Are there limitations there?

I don’t understand what you mean by a “full, real library”. Again, like the STL, some code can’t be compiled in device code. The best thing you can do is try.

There is also a lot of arithmetic on doubles. Limitations there too?

While performance is higher using single precision, there is no reason why you can’t use double. Note that double-precision performance is much better on Tesla and Quadro cards. You can always develop on a GeForce card and then compile for a Tesla/Quadro when you’re ready. It’s good to make sure your development and target architectures are the same, i.e. Turing, Volta, Pascal, etc.
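For instance (a minimal sketch; sm_75 is just an example target, pick whatever matches your card):

```cpp
// Writing the literal as 0.5f keeps this kernel entirely in single precision;
// a plain 0.5 literal is a double and would promote the arithmetic.
__global__ void halve(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5f;
}

// Compile for the architecture you will actually run on, e.g. Turing:
//   nvcc -arch=sm_75 halve.cu -o halve
```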

The idea behind GPU programming is to hide latency by feeding the CUDA cores a lot of work. You need to ask yourself if your problem will scale, not to hundreds or thousands, but to tens/hundreds of thousands of threads.

If the problem is too small, you will not fully utilize the GPU and it’s quite possible the CPU version will run faster.
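As a rough illustration of that scale, the common grid-stride-loop pattern lets a single launch keep tens of thousands of threads busy over an array of any size (the computation in the body is just a placeholder):

```cpp
// Each thread starts at its global index and strides by the total number of
// threads in the grid, so the kernel covers n elements no matter how large.
__global__ void processAll(const double* in, double* out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = in[i] * in[i];   // placeholder for the real computation
    }
}
```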

I think a great place to start is the NVIDIA DLI Accelerated Computing courses (https://www.nvidia.com/en-us/deep-learning-ai/education/). They’re a great way to see all the possible ways to do parallel programming with CUDA. It’s quite possible you could just use OpenACC compiler directives to increase performance without ever touching CUDA. Going through the courses first will jumpstart your thinking about where in your code you should start.
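To give a flavour of what an OpenACC directive looks like (a minimal sketch, built with an OpenACC-capable compiler such as nvc++ -acc; the function here is just an example, not from any real code):

```cpp
#include <cstddef>

// The single pragma asks the compiler to offload and parallelize the loop
// on the GPU; no CUDA code is written by hand.
void scale(double* data, std::size_t n, double factor)
{
    #pragma acc parallel loop
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```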

A more accurate version of the statement above: note that double-precision performance is much better on Tesla and some Quadro cards.

You mentioned the application is multi-threaded. Is it also fully vectorized using AVX2? If it is not vectorized, you are leaving a lot of performance on the table.
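For what it’s worth, here is a minimal illustration of what “vectorized” means here (the function and flags are just an example, not taken from your code): a loop with independent iterations over contiguous arrays, built with AVX2 enabled, lets the compiler process several doubles per instruction.

```cpp
#include <cstddef>

// Build with AVX2 enabled, e.g.  g++ -O3 -mavx2  or MSVC  /O2 /arch:AVX2.
// Independent iterations and contiguous access are what the auto-vectorizer
// can turn into 256-bit AVX2 instructions (four doubles at a time).
void axpy(double a, const double* x, double* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```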

Thanks a lot for your answers!
Very informative.
@mnnicely:

  • I didn’t phrase it well. It’s a .lib because that’s how the code is organized now, but of course we have the source code, and we understand well that it should be organized differently for CUDA and recompiled. I just meant that it’s not one single function but something substantial in terms of volume of code.
  • About std::vectors: it’s actually possible to use buffers instead.
  • We are at hundreds or thousands, not tens/hundreds of thousands.
    But it behaves very well with parallelism, i.e. the time for one is about the same as the time for… 8 on my 4-core i7 CPU :)
    At the moment I have about 30 to 40 threads. That could be increased.
    I bought a PC with a 24-core AMD 3960X CPU to check; I’ll receive it in about 15 days.
    But first it costs a hell of a lot, and secondly it’s not even certain it’ll be enough to do everything in one pass.
    We can’t ask users to throw their brand new i7 in the trash for a costly PC like that anyway.
    On the other hand, we can ask them to buy an NVIDIA card.
    CUDA, if suitable, would be a much more efficient solution.

Thanks a lot again for the advice. I’m gonna check all that and will come back to you with the results.

Hi,
What do you mean exactly? Well… I looked up AVX2, which seems to be “Advanced Vector Extensions 2”, but I wonder: is it included in recent compilers / processors? How can we get more out of this resource?!
Yes, the application is actually multi-threaded. Well… ridiculously little multi-threading compared with the hundreds of threads you guys are used to dealing with, but we had amazing improvements by fully using all the cores of the CPU.
By the way, I also wondered how that works with CUDA. Is there a race to grab all the threads and memory of the GPU? Do we have to partition the resources at the beginning?
I’m just beginning with CUDA, you know. Having a bit of a hard time.

In case it helps:
I’ve identified a computation that could be massively parallelized.
It’s an intersection computation between triangles, more precisely between triangles and edges.
The code has already been organized to be parallelized: there is an array of something like 100,000 pairs of triangles and edges, and we must know whether they intersect or not, and where. Normally it’s all independent, the result of one pair doesn’t affect any other, so it should be suitable for CUDA.

But well… I still don’t get those blocks, threadIdx.x and such, LOL.
Has nobody invented a higher-level function, like an “operator”, that calculates the blocks and other things for you? :)
For instance, my 100,000 triangles and edges, plus many other parameters: I must divide them up, OK. But how can I know into how many cells I must divide them, and how, to be the most efficient?
I still haven’t found a clear answer to that in the documentation.
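(For reference, the usual pattern for a case like this is one thread per pair, with the grid size obtained by a rounding-up division. A minimal sketch, where Pair, HitResult and the intersection routine are placeholders rather than the real code:)

```cpp
#include <cuda_runtime.h>

struct Pair      { float3 triA, triB, triC, edgeP, edgeQ; };  // one triangle + one edge
struct HitResult { bool hit; float3 point; };

// Placeholder for the real triangle/edge intersection test.
__device__ HitResult intersectTriangleEdge(const Pair& p)
{
    HitResult r;
    r.hit   = false;                      // real geometry test goes here
    r.point = make_float3(0.f, 0.f, 0.f);
    return r;
}

__global__ void intersectAll(const Pair* pairs, HitResult* results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pair
    if (i < n)                                       // guard the last, partial block
        results[i] = intersectTriangleEdge(pairs[i]);
}

// Host-side launch: the "division into blocks" is just this rounding-up split.
void launchIntersections(const Pair* d_pairs, HitResult* d_results, int n)
{
    const int blockSize = 256;                              // typical starting point, tune later
    const int gridSize  = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize)
    intersectAll<<<gridSize, blockSize>>>(d_pairs, d_results, n);
}
```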

Edit: I had a look at Thrust’s vectors
https://docs.nvidia.com/cuda/thrust/index.html
Very interesting.