CUDA blocks and threads not sufficient to parallelize big, complex code: what is the alternative?

Hi everyone, I really need some advice, please help.
I want to achieve parallelism using blocks and threads, but my code is too big and complex.
If blocks and threads are not sufficient, how can I achieve partial parallelism, with the remaining code running serially?
Thanks in advance.

Have you looked into OpenACC + MPI? While I understand the challenges of tackling complex software, blocks and threads in conjunction with language primitives are in fact flexible enough to handle just about any situation; after all, all high-level approaches, including OpenACC, use them under the hood.

If your application requires continued performance growth, in the longer run you would probably want to invest time into re-factoring the code base so it becomes more amenable to parallel processing. That might include fundamental changes like changing key algorithms.

The path for performance growth in serially executed tasks has almost come to an end. At this point, we are talking low single-digit percentage improvements per year. The future of high-performance computation is parallel, independent of the existence of CUDA.


Thank you so much for your valuable information. So far, I have only parallelized basic programs such as matrix multiplication using CUDA C. I was not aware of OpenACC + MPI.
I will look into it so that some parts of the code can be parallelized and the remaining parts run serially.

Is it possible by using only CUDA C and not OpenACC?

I suggested looking at OpenACC because the original question stated that CUDA’s blocks and threads are “not sufficient”, so I assumed (possibly incorrectly) that you were looking for something with a higher level of abstraction. You can obviously also use CUDA + MPI, or just CUDA by itself if a single machine is sufficient for your use case.
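To illustrate what "just CUDA by itself" looks like for partial parallelism, here is a minimal sketch (the `scale` kernel and all sizes are made up for illustration): only the data-parallel hot spot runs on the GPU, while setup and post-processing stay serial on the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Parallel part: each thread scales one element of the array.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;   // serial setup on the host

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Only this step is parallel: 256 threads per block,
    // enough blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Serial post-processing on the host.
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += h[i];
    printf("sum = %.1f\n", sum);

    cudaFree(d);
    free(h);
    return 0;
}
```

The same pattern scales to a large application: profile it, move the few compute-intensive loops into kernels one at a time, and leave everything else as ordinary serial host code.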

Consider adding details about your use case if you are looking for more specific recommendations.