Kernel launch params and allocation

Hey folks. I’m launching a __global__ function with several parameters through the runtime API. Based on the (out of memory) failure I get, I’m guessing that under the hood a buffer is allocated via cudaMalloc to hold my parameters. Does this happen only for larger parameter lists, or does it happen even if I only pass e.g. a single pointer? I want to avoid calls to cudaMalloc and cudaFree whenever possible and manage my allocations myself, but on the other hand, I’d like to keep using the runtime API. Any pointers?

I don’t want my response to be interpreted as confirming your analysis. Stated directly, I’m not sure what the source of your (out of memory) failure is.

However, kernel arguments are passed via constant memory. As documented, the maximum aggregate size of the arguments is 4KB.
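For reference, here is a minimal sketch (the kernel name and sizes are made up for illustration). Every argument, including a lone pointer, is marshalled into the constant-memory parameter bank by the launch itself; the only cudaMalloc in the program is for the data the kernel operates on.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    // This allocation is for the data the kernel works on,
    // not for the kernel arguments themselves.
    cudaMalloc(&d_data, n * sizeof(float));

    // The pointer, the int and the float are placed in the constant-memory
    // parameter bank by the launch; no explicit allocation is needed for them.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_data);
    return 0;
}
```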

Constant memory is a type of device memory (so it requires an allocation in device memory), and reads from it are served through a per-SM constant cache.
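As a small illustration of constant memory in general, here is a user-declared __constant__ variable, which is a related use of the same memory type (names are hypothetical). Its storage is reserved in device memory when the module is loaded, and kernel reads of it go through the per-SM constant cache:

```cpp
#include <cuda_runtime.h>

__constant__ float c_gain;            // lives in device constant memory

__global__ void apply(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= c_gain;     // read served via the per-SM constant cache
}

int main()
{
    const int n = 256;
    float gain = 3.0f;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // Populate the __constant__ variable; note there is no cudaMalloc for it.
    cudaMemcpyToSymbol(c_gain, &gain, sizeof(float));

    apply<<<1, n>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```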

Certainly, if you exceed 4KB for the aggregate size of the kernel arguments, that is going to be a problem, and I would expect it to produce an error.
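One way to guard against that is to check the aggregate argument size at build time. This is just a sketch; the struct name and contents are made up, and in my experience nvcc itself also rejects kernels whose parameter block exceeds the limit, though the exact diagnostic may depend on the toolkit version:

```cpp
#include <cuda_runtime.h>

// "Params" is a hypothetical struct standing in for a large argument list.
struct Params {
    double coeffs[500];   // 4000 bytes
    float *in;
    float *out;
    int n;
};

// Guard the aggregate argument size against the documented 4KB limit.
static_assert(sizeof(Params) <= 4096,
              "kernel parameter block exceeds the 4KB constant-memory limit");

__global__ void bigArgKernel(Params p)   // passed by value via constant memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) p.out[i] = static_cast<float>(p.coeffs[i % 500]) * p.in[i];
}
```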

It’s not obvious to me that other usages would cause an error.

There is no way to avoid the usage of device/constant memory to pass function arguments to a kernel launch.

I’m also not suggesting that this implies a cudaMalloc/cudaFree on every launch. You can use a profiler to get a detailed list of the API calls that are made, and in many cases it is evident that there is no cudaMalloc/cudaFree associated with a kernel launch.
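For example, with Nsight Systems you could run something like `nsys profile --trace=cuda ./app` (hypothetical binary name) and then `nsys stats` on the generated report; the CUDA API summary shows one `cudaLaunchKernel` entry per launch, without a paired cudaMalloc/cudaFree. Exact options may vary with the tool version.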
