Is it doable with CUDA?

Hello,
I’m new to this forum, so hello to all.
I’m not even a beginner: we haven’t even started with CUDA yet. So please forgive the naive question.
Also, I haven’t scanned (in parallel, ha ha) all the topics of this forum yet, so please forgive me again if the answers have already been posted elsewhere.


We have an application that is very well suited to parallelism. For instance, on a representative case, the computation took 24 seconds without multi-threading and now takes less than 2 seconds with multi-threading.
That’s already an achievement, all on a 4-core i7. We’ve ordered a 12-core AMD Threadripper beast and we’ll probably get under one second.
But the BIG achievement will be to go real time. For that we need to get under a few tenths of a second.
Of course we could wait for the next beast with 64 cores or more, but it would be unaffordable.

So we’re evaluating CUDA, as you can guess. We haven’t started yet, we’ve just downloaded it.

First of all, the software already works in parallel, so the memory is already well separated. There are no mutexes, atomics or anything of the sort: there is no memory shared for writing at all. For reading: yes.
But it’s std::vectors, and the functions create and resize many big std::vectors.
Is that an issue?
Sorry again: it’s probably a beginner question.

Secondly, the algorithms are not small. It’s not a matrix computation or something like that; it’s a full, real library. Are there limitations there?
There is also a lot of arithmetic on doubles. Limitations there too?

Generally speaking, any info about limitations compared with “normal” C++ code is welcome, and links to forum threads that already discuss this are very welcome too.
Thanks a lot,
Olivier


But it’s std::vectors, and the functions create and resize many big std::vectors.
Is that an issue?

The C++ STL doesn’t compile in device code. You could, however, write your own functions to mimic the STL (https://stackoverflow.com/questions/10375680/using-stdvector-in-cuda-device-code), or you might want to check out Thrust (https://developer.nvidia.com/thrust). Another approach is to allocate an array in device memory at the maximum size your problem will ever need and then reuse it.
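For example, something along these lines (a minimal sketch, not your actual code; the sizes and names are placeholders):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
    // Option 1: thrust::device_vector looks a lot like std::vector, but its
    // storage lives in GPU memory (create, resize and copy it from host code).
    thrust::host_vector<double>   h_data(100000, 1.0);
    thrust::device_vector<double> d_data = h_data;        // host -> device copy

    // Option 2: allocate one plain device buffer at the maximum size the
    // problem will ever need and reuse it between kernel launches.
    const std::size_t maxElems = 1 << 20;                 // placeholder upper bound
    double* d_buffer = nullptr;
    cudaMalloc(&d_buffer, maxElems * sizeof(double));

    // ... kernels would work on thrust::raw_pointer_cast(d_data.data())
    //     or on d_buffer here ...

    cudaFree(d_buffer);
    return 0;
}
```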

Secondly, the algorithms are not small. It’s not a matrix computation or something like that; it’s a full, real library. Are there limitations there?

I don’t understand what you mean by a “full, real library”. Again, like the STL, some code can’t be compiled in device code. The best thing you can do is try.

There is also a lot of arithmetic on doubles. Limitations there too?

While performance is higher using single precision, there is no reason why you can’t use double. Note that double-precision performance is much better on Tesla and Quadro cards. You can always develop on a GeForce card and then compile for a Tesla/Quadro when you’re ready. It’s good to make sure your development and target architectures are the same, i.e. Turing, Volta, Pascal, etc.
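For instance (a minimal sketch; sm_75 is just an example target, pick whatever matches your card):

```cpp
// Writing the literal as 0.5f keeps this kernel entirely in single precision;
// a plain 0.5 literal is a double and would promote the arithmetic.
__global__ void halve(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5f;
}

// Compile for the architecture you will actually run on, e.g. Turing:
//   nvcc -arch=sm_75 halve.cu -o halve
```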

The idea behind GPU programming is to hide latency by feeding the CUDA cores a lot of work. You need to ask yourself if your problem will scale, not to hundreds or thousands, but to tens/hundreds of thousands of threads.

If the problem is too small, you will not fully utilize the GPU and it’s quite possible the CPU version will run faster.
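As a rough illustration of that scale, the common grid-stride-loop pattern lets a single launch keep tens of thousands of threads busy over an array of any size (the computation in the body is just a placeholder):

```cpp
// Each thread starts at its global index and strides by the total number of
// threads in the grid, so the kernel covers n elements no matter how large.
__global__ void processAll(const double* in, double* out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = in[i] * in[i];   // placeholder for the real computation
    }
}
```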

I think a great place to start is the NVIDIA DLI Accelerated Computing courses (https://www.nvidia.com/en-us/deep-learning-ai/education/). They’re a great way to see all the possible ways to do parallel programming with CUDA. It’s quite possible you could just use OpenACC compiler directives to increase performance without ever touching CUDA. Going through the courses first will jumpstart your thinking about where in your code you should start.
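To give a flavour of what an OpenACC directive looks like (a minimal sketch, built with an OpenACC-capable compiler such as nvc++ -acc; the function here is just an example, not from any real code):

```cpp
#include <cstddef>

// The single pragma asks the compiler to offload and parallelize the loop
// on the GPU; no CUDA code is written by hand.
void scale(double* data, std::size_t n, double factor)
{
    #pragma acc parallel loop
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```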

A more accurate version of the statement above: note that double-precision performance is much better on Tesla and some Quadro cards.

You mentioned the application is multi-threaded. Is it also fully vectorized using AVX2? If it is not vectorized, you are leaving a lot of performance on the table.
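For what it’s worth, here is a minimal illustration of what “vectorized” means here (the function and flags are just an example, not taken from your code): a loop with independent iterations over contiguous arrays, built with AVX2 enabled, lets the compiler process several doubles per instruction.

```cpp
#include <cstddef>

// Build with AVX2 enabled, e.g.  g++ -O3 -mavx2  or MSVC  /O2 /arch:AVX2.
// Independent iterations and contiguous access are what the auto-vectorizer
// can turn into 256-bit AVX2 instructions (four doubles at a time).
void axpy(double a, const double* x, double* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```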

Thanks a lot for your answers!
Very informative.
@mnnicely:

  • I didn’t phrase it well. It’s a .lib because that’s how the code is organized now, but of course we have the source code, and we understand well that it should be organized differently for CUDA and recompiled. I just meant that it’s not one single function but something substantial in terms of volume of code.
  • About std::vectors: it’s actually possible to use buffers instead.
  • We are at hundreds or thousands, not tens/hundreds of thousands.
    But it behaves very well with parallelism, i.e. the time for one is about the same as the time for… 8 on my 4-core i7 CPU :)
    At the moment I have about 30 to 40 threads. That could be increased.
    I bought a PC with a 24-core AMD 3960X CPU to check; I’ll receive it in about 15 days.
    But first it costs a hell of a lot, and secondly it’s not even certain it’ll be enough to do everything in one pass.
    We can’t ask users to throw their brand new i7 in the trash for a costly PC like that anyway.
    On the other hand, we can ask them to buy an NVIDIA card.
    CUDA, if suitable, would be a much more efficient solution.

Thanks a lot again for the advice. I’m gonna check all that and will come back to you with the results.

Hi,
What do you mean exactly? Well… I looked up AVX2, which seems to be “Advanced Vector Extensions 2”, but I wonder: is it included in recent compilers / processors? How can we get more out of this resource?!
Yes, the application is actually multi-threaded. Well… ridiculously little multi-threading compared with the hundreds of threads you guys are used to dealing with, but we had amazing improvements by fully using all the cores of the CPU.
By the way, I also wondered how that works with CUDA. Is there a race to grab all the threads and memory of the GPU? Do we have to partition the resources at the beginning?
I’m just beginning with CUDA, you know. Having a bit of a hard time.

In case it helps:
I’ve identified a computation that could be massively parallelized.
It’s an intersection computation between triangles, more precisely between triangles and edges.
The code has already been organized to be parallelized: there is an array of something like 100,000 pairs of triangles and edges, and we must know whether they intersect or not, and where. Normally it’s all independent, the result of one pair doesn’t affect any other, so it should be suitable for CUDA.

But well… I still don’t get those blocks, threadIdx.x and such, LOL.
Has nobody invented a higher-level function, like an “operator”, that calculates the blocks and other things for you? :)
For instance, my 100,000 triangles and edges, plus many other parameters: I must divide them up, OK. But how can I know into how many cells I must divide them, and how, to be the most efficient?
I still haven’t found a clear answer to that in the documentation.
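(For reference, the usual pattern for a case like this is one thread per pair, with the grid size obtained by a rounding-up division. A minimal sketch, where Pair, HitResult and the intersection routine are placeholders rather than the real code:)

```cpp
#include <cuda_runtime.h>

struct Pair      { float3 triA, triB, triC, edgeP, edgeQ; };  // one triangle + one edge
struct HitResult { bool hit; float3 point; };

// Placeholder for the real triangle/edge intersection test.
__device__ HitResult intersectTriangleEdge(const Pair& p)
{
    HitResult r;
    r.hit   = false;                      // real geometry test goes here
    r.point = make_float3(0.f, 0.f, 0.f);
    return r;
}

__global__ void intersectAll(const Pair* pairs, HitResult* results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pair
    if (i < n)                                       // guard the last, partial block
        results[i] = intersectTriangleEdge(pairs[i]);
}

// Host-side launch: the "division into blocks" is just this rounding-up split.
void launchIntersections(const Pair* d_pairs, HitResult* d_results, int n)
{
    const int blockSize = 256;                              // typical starting point, tune later
    const int gridSize  = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize)
    intersectAll<<<gridSize, blockSize>>>(d_pairs, d_results, n);
}
```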

Edit: I had a look at Thrust’s vectors
https://docs.nvidia.com/cuda/thrust/index.html
Very interesting.