Concurrent kernels

Hi,

as far as I understood, I can run kernels concurrently on a GPU if I follow some guidelines (resource use / use of non-default streams). Is it a requirement that the kernels are in the same loaded module, or can they be in different modules loaded onto the same GPU?

Example:
I load 2 different modules and run a single kernel from each of them - both in a non-default stream. Will they run concurrently or sequentially?

Thanks,

Kernels from two different modules can execute concurrently. If you write a simple single-threaded wait kernel and compile it into two different modules, it should be possible to get these to run concurrently. OS, resources per grid, etc. can all impact concurrent execution.
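
Here is a minimal, hedged sketch of the idea (CUDA runtime API in a single source file rather than two driver-API modules; kernel names and the cycle count are made up for illustration). Two single-threaded wait kernels are launched into two non-default streams; with enough free resources, the profiler timeline should show them overlapping.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Single-threaded "wait" kernel: spins for roughly the given number of cycles.
__global__ void waitKernelA(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

__global__ void waitKernelB(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // One block, one thread each, in two different non-default streams.
    waitKernelA<<<1, 1, 0, s1>>>(100000000LL);
    waitKernelB<<<1, 1, 0, s2>>>(100000000LL);

    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```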


Just an additional question regarding this issue.
As far as I understand, I can use streams to fill the ‘gap’ during async memory operations. If there are no memory operations, I just get alternating streams - only one stream is active at a time.
I have a kernel that accesses a lot of global memory at random positions. Unfortunately I do not have enough work in the kernel to hide the latency completely. I tried different block and grid sizes to hide the memory access latency without much success.
If I now run another, different kernel that does not need much memory access, would that kernel be able to use the time spent on memory access in the first kernel? Or would this kernel also run ‘alternating’, as I see it with several streams?

Thanks.

About combining memory-bound and compute-bound kernels

Whether you really profit from this combination depends on where the memory speed is lost. Is it really the latency or the bandwidth? See the section below.

You could either use some SMs for one kernel and some for the other (when the bottleneck is at the L2 or global memory level, as those resources are shared between the SMs). Or you could try to get each SM partition (each SM has 4 partitions) to run both kernels. For that you have to set up the resource requirements (registers, shared memory, launch size) so that just enough resources stay free to fill up the SM partitions with the second kernel.
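
As a rough, hedged sketch of how to check the resource side of this (the kernel names are placeholders, not from this thread): the occupancy API reports how many blocks of a kernel can be resident per SM for a given block size, which tells you how much room is left for a second kernel.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels; substitute your real memory-bound and compute-bound kernels.
__global__ void memoryBoundKernel() {}
__global__ void computeBoundKernel() {}

int main() {
    int blocksA = 0, blocksB = 0;
    // Maximum resident blocks per SM for block size 256 and 0 bytes of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksA, memoryBoundKernel, 256, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksB, computeBoundKernel, 256, 0);
    printf("resident blocks per SM: A = %d, B = %d\n", blocksA, blocksB);
    // If kernel A is launched so that it uses only part of this budget per SM
    // (fewer blocks, or fewer warps per block), the remaining registers, shared
    // memory and warp slots stay free for kernel B on the same SM.
    return 0;
}
```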

About memory access being slow

Slow memory access can have different causes. Could you use Nsight Compute to find out the bottleneck?

Is it a latency or a bandwidth problem? Non-ideal accesses can reduce your bandwidth:

The global memory and the L2 cache deal in sectors of 32 aligned bytes. If your access size is smaller, you lose bandwidth. You can sometimes rearrange data structures, e.g. instead of having separate w, x, y, z arrays, combine them into an array of structs of w, x, y, z. (Often the opposite is recommended for CUDA and often makes sense, as it helps to have coalesced memory access between threads.)
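
A small layout sketch of that suggestion (the type and kernel names are illustrative): if each thread reads all four fields of one random element, the array-of-structs layout fetches one aligned 16-byte chunk per element instead of four scattered 4-byte reads.

```cpp
// AoS: one aligned 16-byte record per element; a thread reading element i
// touches one or two 32-byte sectors instead of four.
struct __align__(16) Particle { float w, x, y, z; };

// SoA (the usual CUDA recommendation): better when neighbouring threads read
// neighbouring elements, because the accesses coalesce.
struct ParticlesSoA { float *w, *x, *y, *z; };

__global__ void readRandom(const Particle* p, const int* idx, float* out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        Particle q = p[idx[t]];          // one 16-byte load per thread
        out[t] = q.w + q.x + q.y + q.z;
    }
}
```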

The L1 cache deals in cache lines of 128 aligned bytes. If all threads within a warp (lanes) access a different cache line, you could need 32 wavefronts instead of one. This lowers the throughput to/from L1 by a factor of 32.

Thanks a lot for your fast and detailed reply. That is very helpful.

Since the access pattern to global memory is random, it is not possible to have coalesced access to it.

Just one question about how to share SMs: is it as easy as calculating how many threads/warps fit onto a single SM and setting the block size so that there must be SMs left over? Let's assume (for this example) that I have a single SM that can run 64 warps → 2048 threads. If I set the block size to 1024, I should have 32 warps left for a different kernel that may run in parallel. Is that correct?

In the case that I did it correctly, should I see both streams executing at the same time, or will the profiler show the streams executing alternately?

Thanks.

You should find out (with Nsight Compute) whether you have a memory bandwidth or a latency problem, and at which stage (L1, L2).

If you have 20 SMs and a maximum of 1024 threads per SM (both values depend on your GPU), you can e.g. choose the grid size of a first kernel as 16 blocks and the block size as 768. Then each SM can hold only one such block with 24 warps, and each partition of those 16 SMs would have 6 warps. Typically the warps are evenly distributed over the partitions, if possible. 4 SMs would be totally free.

The kernel itself can have an outer loop to do more work (compensating for the small grid size).

You can start a second kernel with grid size 4 and block size 1024. It would fill the other 4 SMs.

And you can start a third kernel with grid size 16 and block size 256 (i.e. 8 warps). Each partition would get 2 additional warps.

All three kernels would fit on the GPU and run concurrently.

It can be important which kernel starts first.
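
A hedged sketch of the whole example (20 SMs and 1024 threads per SM assumed; kernel bodies are placeholders for your real work, each with an outer loop over its data):

```cpp
#include <cuda_runtime.h>

__global__ void kernel1() { /* memory-bound work, outer/grid-stride loop */ }
__global__ void kernel2() { /* other work */ }
__global__ void kernel3() { /* compute-bound work */ }

int main() {
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);

    kernel1<<<16,  768, 0, s1>>>();  // 16 SMs, one block of 24 warps each (6 per partition)
    kernel2<<< 4, 1024, 0, s2>>>();  // fills the 4 remaining SMs
    kernel3<<<16,  256, 0, s3>>>();  // 8 more warps next to kernel1 (2 per partition)

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaStreamDestroy(s3);
    return 0;
}
```

Note that the hardware scheduler, not the launch configuration, ultimately decides block placement, so this is a best-effort layout.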

Alternatively you could do all the work in one kernel, but put a switch block at the beginning which chooses the actual function to execute according to threadIdx / warp number. Warps 0…3 do function 1, warps 4…7 do function 2, warps 8…11 do function 1 again, and so on in steps of 4 warps for the 4 SM partitions. If some warps complete early, that would be totally fine.
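
Sketched out (the function names are placeholders), the warp index picks the function, alternating in steps of 4 warps:

```cpp
__device__ void functionOne(int warpId) { /* e.g. memory-bound part */ }
__device__ void functionTwo(int warpId) { /* e.g. compute-bound part */ }

__global__ void combinedKernel() {
    int warpId = threadIdx.x / 32;   // warp index within the block
    // Warps 0..3 -> function 1, warps 4..7 -> function 2, warps 8..11 -> function 1, ...
    if ((warpId / 4) % 2 == 0)
        functionOne(warpId);
    else
        functionTwo(warpId);
}
```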

You can read out the time and the SM on which each warp of each kernel is running, to test whether your chosen parameters behave as you intend.
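
One possible way to do that (the output buffers smids and starts are hypothetical, sized to the number of warps in the grid): lane 0 of each warp records the SM ID from the %smid special register and a clock64() timestamp.

```cpp
// Returns the ID of the SM the calling thread is running on.
__device__ unsigned int getSmId() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void instrumentedKernel(unsigned int* smids, long long* starts) {
    int warpInGrid = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (threadIdx.x % 32 == 0) {          // lane 0 of each warp
        smids[warpInGrid]  = getSmId();
        starts[warpInGrid] = clock64();   // per-SM cycle counter (not synchronized across SMs)
    }
    // ... actual kernel work ...
}
```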

You can try to run another kernel, but be aware that the first kernel to run will define the resource allocation on the SMs. If you launch a number of thread blocks >= the SM count, then another grid may not be able to fit on the SMs even if you think you designed the kernels correctly.

EXAMPLE 1 - L1/SHMEM carveout

Grid 1 uses 0 shared memory (only driver allocation if on newer architectures)
Grid 2 uses 32 KiB of shared memory

Grid 1 and Grid 2 both launch 1 CTA per SM and the sum of 1 Grid 1 CTA + 1 Grid 2 CTA does not appear to use all SM resources.

If Grid 1 launches first, then the carveout may be set to PREFER_L1. In this case Grid 2 may not fit on the SM because the carveout is dynamically too small. Grid 2 cannot change the carveout until the SM is idle.

If Grid 2 launches first then concurrency may be reached.

When trying to fully exploit concurrency it is recommended to review the following:

  1. Review and set the L1/SHMEM carveout (see the sketch after this list).
  2. Use CUDA events to force the order of launch.
  3. Consider launching fewer thread blocks than the SM count and using a loop, to improve availability of SMs in the worst case.
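
A hedged sketch for recommendation 1 (gridUsingShmem is a placeholder kernel; 50 percent is just an example value): setting the preferred carveout before the first launch avoids the PREFER_L1 situation described above.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel that statically uses 32 KiB of shared memory (like Grid 2).
__global__ void gridUsingShmem() {
    __shared__ float buf[8192];          // 8192 * 4 B = 32 KiB
    buf[threadIdx.x] = 0.0f;
}

void configureCarveout() {
    // Hint the driver to reserve (up to) 50% of the L1/shared storage as shared
    // memory for this kernel, instead of letting an earlier PREFER_L1 launch pin the split.
    cudaFuncSetAttribute(gridUsingShmem,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
}
```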

Hi Greg,

thanks a lot for your detailed reply. A lot to try for me this weekend. Some questions came up:

  • When you write 1 CTA per SM - what does that mean? Is 1 CTA per SM = (number of warps per SM) * (threads per warp)?

  • Is it possible to share shared memory between two kernels running concurrently on the same SM?

  • To set up the required grid/block configuration, I understand that I should make sure to have some SMs left over in the calculation. So I would set the grid size to the number of SMs I want to use (compensating for the missing work with a loop in the kernel) and set the block size for the first kernel to: blocksize << (warps per SM * 32)? The second kernel uses the remaining SMs.

  • Can I run the kernel in the same context in a different stream?

  • Assume I have a kernel with a first, memory-heavy part and a second, compute-heavy part, each taking 50% of the complete kernel execution time. Would it make sense to run this kernel twice concurrently, so that the memory part of one instance overlaps with the compute part of the other, to allow good use of the SMs?

Thanks a lot.

CTA, Cooperative Thread Array, is the architecture name used for a thread block.

No. Shared memory is owned by the thread block. Another thread block cannot access the same shared memory. Clusters allow threads to read/write shared memory in other thread blocks in the same cluster. This is called distributed shared memory.
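
For reference, a heavily hedged sketch of distributed shared memory (requires a cluster-capable GPU with compute capability 9.0 and compiling for sm_90; the array size and output buffer are illustrative):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Cluster of 2 thread blocks; each block can map the other block's shared memory.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(int* out) {
    __shared__ int smem[32];
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x < 32)
        smem[threadIdx.x] = threadIdx.x;             // write own block's shared memory
    cluster.sync();                                  // make writes visible in the cluster

    unsigned int partner = cluster.block_rank() ^ 1;
    int* remote = cluster.map_shared_rank(smem, partner);  // partner block's shared memory
    if (threadIdx.x == 0)
        out[cluster.block_rank()] = remote[0];       // read distributed shared memory
    cluster.sync();                                  // keep remote smem alive until all are done
}
```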

Absolutely. Streams are the primary method to achieve concurrent execution. Concurrent execution can also be achieved using CUDA Graphs and CUDA Dynamic Parallelism.

Consistently achieving concurrency is a difficult task and is not guaranteed to be portable. I would recommend first optimizing the kernels then deciding if you want concurrency. Please note that you may have to re-optimize the kernels to ensure sufficient resources to try to run concurrently.

It is hard to provide advice without knowing the duration of each section of the kernel, the size of the grids, etc.