CUDA blocks and threads not sufficient to parallelize big, complex code: what is the alternative?

Hi everyone, I really need some advice, please help.
I want to achieve parallelism using blocks and threads, but my code is too big and complex.
If blocks and threads are not sufficient, how can I achieve partial parallelism, with the remaining code running serially?
Thanks in advance.

Have you looked into OpenACC + MPI? While I understand the challenges of tackling complex software, blocks and threads in conjunction with language primitives are in fact flexible enough to handle just about any situation; after all, all high-level approaches, including OpenACC, use them under the hood.

If your application requires continued performance growth, in the longer run you would probably want to invest time into re-factoring the code base so it becomes more amenable to parallel processing. That might include fundamental changes like changing key algorithms.

The path for performance growth in serially executed tasks has almost come to an end. At this point, we are talking low single-digit percentage improvements per year. The future of high-performance computation is parallel, independent of the existence of CUDA.


Thank you so much for your valuable information. So far, I have only parallelized basic programs such as matrix multiplication using CUDA C. I was not aware of OpenACC + MPI.
I will look into it so that some parts of the code can be parallelized and the remaining parts run serially.

Is it possible by using only CUDA C and not OpenACC?

I suggested looking at OpenACC because the original question stated that CUDA’s blocks and threads are “not sufficient”, so I assumed (possibly incorrectly) that you were looking for something with a higher level of abstraction. You can obviously also use CUDA + MPI, or just CUDA by itself if a single machine is sufficient for your use case.
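To illustrate what "just CUDA by itself" looks like for partial parallelism, here is a minimal sketch (the `scale` kernel and all sizes are made up for illustration): only the data-parallel hot spot runs on the GPU, while setup and post-processing stay serial on the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Parallel part: each thread scales one element of the array.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;   // serial setup on the host

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Only this step is parallel: 256 threads per block,
    // enough blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Serial post-processing on the host.
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += h[i];
    printf("sum = %.1f\n", sum);

    cudaFree(d);
    free(h);
    return 0;
}
```

The same pattern scales to a large application: profile it, move the few compute-intensive loops into kernels one at a time, and leave everything else as ordinary serial host code.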

Consider adding details about your use case if you are looking for more specific recommendations.