CUDA 12.1 Supports Large Kernel Parameters

Originally published at: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/

CUDA 12.1 offers you the option of passing up to 32,764 bytes using kernel parameters, which can be used to simplify applications as well as gain performance improvements.
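As a rough sketch of what that enables (the struct name, sizes, and kernel are illustrative, not from the post; it assumes a CUDA 12.1 toolkit and an sm_70 or newer target): a payload well above the old 4,096-byte limit can now be passed by value as an ordinary kernel parameter, instead of being staged through `__constant__` memory with `cudaMemcpyToSymbol`.

```
#include <cuda_runtime.h>

struct LargeParams {
    float data[8000];   // 32,000 bytes: well above the old 4,096-byte limit
};

// The whole struct is passed by value; with CUDA 12.1 the kernel
// parameter space allows up to 32,764 bytes, so no cudaMemcpyToSymbol
// staging is needed.
__global__ void scale(const LargeParams p, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * p.data[i];   // read the parameter directly
}

int main() {
    LargeParams h{};                        // host-side payload
    for (int i = 0; i < 8000; ++i) h.data[i] = float(i);

    float* d_out;
    cudaMalloc(&d_out, 8000 * sizeof(float));
    scale<<<(8000 + 255) / 256, 256>>>(h, d_out, 8000);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```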

Thank you for the explanation.
The suggested scenario works well when we only use the default stream, because access to constant memory is serialized. But imagine we launch two kernels concurrently in two different streams, and each kernel needs its own set of large kernel parameters. We then need to partition constant memory somehow so that both sets of parameters reside in constant memory without interleaving, because constant memory is shared between the two kernels. Synchronizing access to constant memory is another layer of complexity, correct? This is different from default kernel parameters, which are also allocated in constant memory, but where the runtime automatically allocates separate, non-interleaved constant banks for each set of parameters, in my understanding.

Are there practical ways to use this method of passing large kernel parameters to concurrent kernels launched in different streams?

Related StackOverflow question

Kernel parameters don’t reside in the same memory space as what’s used for __constant__ (kernel parameters reside in constant banks managed by the CUDA driver). Constant memory (i.e., __constant__) accesses from independent kernels are not serialized with respect to each other; they are handled independently by the GPU hardware.

If I understand correctly, in your scenario, two kernels concurrently access __constant__ memory, with each kernel accessing 32 KB. This should be okay as long as the aggregate constant memory usage of both kernels is less than 64 KB (the __constant__ memory limit). Accesses should not be serialized, since they come from independent kernels.

The scenario in the first snippet of the blog post, which uses __constant__ memory to copy over larger parameters, should hold for the two-kernel scenario as well (provided the aggregate constant memory usage is under 64 KB).
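For illustration, here is a minimal sketch of that two-stream setup (the struct, kernel, and variable names are illustrative, not from the blog post): each kernel reads only its own __constant__ symbol, so the two parameter sets occupy separate, non-overlapping regions, and the stream-ordered copies ensure each kernel launches only after its own parameters have been written. No synchronization between the two streams is needed. (For the copies to be truly asynchronous, the host buffers would additionally need to be pinned.)

```
#include <cuda_runtime.h>

struct KParams { float data[4096]; };      // 16 KB each; 32 KB total < 64 KB

__constant__ KParams c_params_a;           // region used only by kernelA
__constant__ KParams c_params_b;           // region used only by kernelB

__global__ void kernelA(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = c_params_a.data[i];
}

__global__ void kernelB(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = c_params_b.data[i];
}

int main() {
    KParams h_a{}, h_b{};                  // host-side parameter sets
    float *d_a, *d_b;
    cudaMalloc(&d_a, 4096 * sizeof(float));
    cudaMalloc(&d_b, 4096 * sizeof(float));

    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);

    // Each copy is ordered before the kernel launch within its own stream,
    // so each kernel sees a fully written, private parameter block.
    cudaMemcpyToSymbolAsync(c_params_a, &h_a, sizeof(KParams), 0,
                            cudaMemcpyHostToDevice, sa);
    kernelA<<<16, 256, 0, sa>>>(d_a);      // 16 * 256 = 4096 threads

    cudaMemcpyToSymbolAsync(c_params_b, &h_b, sizeof(KParams), 0,
                            cudaMemcpyHostToDevice, sb);
    kernelB<<<16, 256, 0, sb>>>(d_b);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sa); cudaStreamDestroy(sb);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```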

After reading this blog, I modified Roger Allen’s CUDA version of Ray Tracing in One Weekend by passing all the scene data and camera data as const kernel parameters. The code now runs 10x faster.

Thanks for this feature.
