CUDA 12.1 Supports Large Kernel Parameters

Originally published at: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/

CUDA 12.1 offers you the option of passing up to 32,764 bytes using kernel parameters, which can be used to simplify applications as well as gain performance improvements.
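As a rough sketch of what that enables (the struct name, sizes, and kernel are illustrative, not from the post; it assumes a CUDA 12.1 toolkit and an sm_70 or newer target): a payload well above the old 4,096-byte limit can now be passed by value as an ordinary kernel parameter, instead of being staged through `__constant__` memory with `cudaMemcpyToSymbol`.

```
#include <cuda_runtime.h>

struct LargeParams {
    float data[8000];   // 32,000 bytes: well above the old 4,096-byte limit
};

// The whole struct is passed by value; with CUDA 12.1 the kernel
// parameter space allows up to 32,764 bytes, so no cudaMemcpyToSymbol
// staging is needed.
__global__ void scale(const LargeParams p, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * p.data[i];   // read the parameter directly
}

int main() {
    LargeParams h{};                        // host-side payload
    for (int i = 0; i < 8000; ++i) h.data[i] = float(i);

    float* d_out;
    cudaMalloc(&d_out, 8000 * sizeof(float));
    scale<<<(8000 + 255) / 256, 256>>>(h, d_out, 8000);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```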

Thank you for the explanation.
The suggested scenario works well when we only use the default stream, because access to constant memory is serialized. But imagine we launch two kernels concurrently in two different streams, and each kernel needs its own set of large kernel parameters. We then need to partition constant memory somehow so that both sets of parameters reside in constant memory without interleaving, because constant memory is shared between the two kernels. Synchronizing access to constant memory is another layer of complexity, correct? This is different from default kernel parameters, which are also allocated in constant memory, but where the runtime automatically allocates separate, non-interleaved constant banks for each set of parameters, in my understanding.

Are there practical ways to use this method of passing large kernel parameters to concurrent kernels launched in different streams?

Related StackOverflow question

Kernel parameters don’t reside in the same memory space as what’s used for __constant__ (kernel parameters reside in constant banks managed by the CUDA driver). Constant memory (i.e., __constant__) accesses from independent kernels are not serialized with respect to each other; they are handled independently by the GPU hardware.

If I understand correctly, in your scenario, two kernels concurrently access __constant__ memory, with each kernel accessing 32 KB. This should be okay as long as the aggregate constant memory usage of both kernels is less than 64 KB (the __constant__ memory limit). Accesses should not be serialized, since they come from independent kernels.

The scenario in the first snippet of the blog post, which uses __constant__ memory to copy over larger parameters, should hold for the two-kernel scenario as well (provided the aggregate constant memory usage is under 64 KB).
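For illustration, here is a minimal sketch of that two-stream setup (the struct, kernel, and variable names are illustrative, not from the blog post): each kernel reads only its own __constant__ symbol, so the two parameter sets occupy separate, non-overlapping regions, and the stream-ordered copies ensure each kernel launches only after its own parameters have been written. No synchronization between the two streams is needed. (For the copies to be truly asynchronous, the host buffers would additionally need to be pinned.)

```
#include <cuda_runtime.h>

struct KParams { float data[4096]; };      // 16 KB each; 32 KB total < 64 KB

__constant__ KParams c_params_a;           // region used only by kernelA
__constant__ KParams c_params_b;           // region used only by kernelB

__global__ void kernelA(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = c_params_a.data[i];
}

__global__ void kernelB(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = c_params_b.data[i];
}

int main() {
    KParams h_a{}, h_b{};                  // host-side parameter sets
    float *d_a, *d_b;
    cudaMalloc(&d_a, 4096 * sizeof(float));
    cudaMalloc(&d_b, 4096 * sizeof(float));

    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);

    // Each copy is ordered before the kernel launch within its own stream,
    // so each kernel sees a fully written, private parameter block.
    cudaMemcpyToSymbolAsync(c_params_a, &h_a, sizeof(KParams), 0,
                            cudaMemcpyHostToDevice, sa);
    kernelA<<<16, 256, 0, sa>>>(d_a);      // 16 * 256 = 4096 threads

    cudaMemcpyToSymbolAsync(c_params_b, &h_b, sizeof(KParams), 0,
                            cudaMemcpyHostToDevice, sb);
    kernelB<<<16, 256, 0, sb>>>(d_b);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sa); cudaStreamDestroy(sb);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```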

After reading this blog, I modified Roger Allen’s CUDA version of Ray Tracing in One Weekend by passing all the scene data and camera data as const kernel parameters. The code now runs 10x faster.

Thanks for this feature.
