I hope it is OK to ask a high-level question here. I have a time-stepping code with the following structure:

1. initialize
2. enter loop over M outer steps
   2a. enter loop over N inner steps
       2.a.1 compute time step
   2b. output data
3. de-allocate, finalize, etc.

I wish to execute step 2.a on the GPU. The…

Advice on porting to an HPC application to GPU


MatColgrove July 29, 2024, 10:20pm 6

I’m not certain of the details, but my understanding is that there’s no limit on the total number of kernels that can be launched. There is, however, a limit on how many kernels can sit in the launch queue at once. Once the queue fills up, the next kernel launch has to block on the host until a slot opens up.
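
To make the blocking behaviour a bit more concrete, here is a rough host-only toy model in Python. It is not CUDA code and says nothing about how the driver actually implements its queue. The function name `simulate_launches`, the parameters `queue_depth`, `launch_cost` and `kernel_time`, and the assumption that kernels run one at a time in launch order are all made up for illustration.

```python
def simulate_launches(num_kernels, queue_depth, launch_cost, kernel_time):
    """Toy model of asynchronous launches into a bounded queue.

    Assumptions (hypothetical, for illustration only):
      - at most `queue_depth` kernels may be pending or running at once,
      - each launch call costs the host `launch_cost` time units,
      - kernels execute one after another, `kernel_time` units each.

    Returns (host_time_after_last_launch, total_host_stall, device_finish_time).
    """
    host_time = 0
    stall = 0
    finish_times = []  # device completion time of each kernel, in launch order

    for k in range(num_kernels):
        # A slot is free only once kernel (k - queue_depth) has completed.
        if k >= queue_depth:
            slot_free = finish_times[k - queue_depth]
            if slot_free > host_time:
                # Queue is full: the host blocks until the slot opens.
                stall += slot_free - host_time
                host_time = slot_free

        # The launch itself is asynchronous and only costs the launch overhead.
        host_time += launch_cost

        # The kernel starts once it is enqueued and the previous one has finished.
        previous_done = finish_times[-1] if finish_times else 0
        start = max(host_time, previous_done)
        finish_times.append(start + kernel_time)

    return host_time, stall, finish_times[-1]
```

In this model, `simulate_launches(4, 2, 1, 10)` has the host stalling for 18 time units because the queue of depth 2 keeps filling up, while `simulate_launches(4, 100, 1, 10)` has no stall at all. The device finishes at the same time in both cases, so the total number of launches is not what is limited. What changes is how far ahead of the GPU the host can run.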

I found this post which you might find helpful:
