Increasing memory footprint with large task graphs

I have been trying to understand CUDA task graphs and how they differ from ordinary stream launches. I noticed that GPU memory usage increases with the size of the graph. The experiments were performed with driver 470.16 and CUDA 11.2.

I create the graph using the cudaStreamBeginCapture ... cudaStreamEndCapture APIs. The code snippet is:

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int j = 0; j < KL; j++) {
    k3<<<SIZE/256, 256, 0, stream>>>(a, b, c, SIZE);
}
cudaStreamEndCapture(stream, &graph);

KL is an input that I varied. I ran the above snippet and recorded GPU memory usage with nvidia-smi. I am not sure whether the content of the kernel makes a difference, but here the kernel k3 is empty.
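
For completeness, here is roughly how the footprint could be read from inside the program with cudaMemGetInfo instead of nvidia-smi (just a sketch; the instantiation step and the declarations of stream, graph, a, b, c, SIZE, and KL are assumed, not my exact code):

size_t freeBefore, freeAfter, total;
cudaMemGetInfo(&freeBefore, &total);          // free device memory before capture

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int j = 0; j < KL; j++) {
    k3<<<SIZE/256, 256, 0, stream>>>(a, b, c, SIZE);
}
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 11.x signature

cudaMemGetInfo(&freeAfter, &total);           // free device memory afterwards
printf("device memory taken by capture + instantiation: %zu MB\n",
       (freeBefore - freeAfter) / (1024 * 1024));

This lumps the capture and instantiation costs together; it is only meant to show the same growth trend without relying on nvidia-smi.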

I saw the following trend: a graph with 500 nodes shows a memory usage of 362 MB, 1000 nodes shows 395 MB, 2000 nodes shows 458 MB, and so on.

Can you provide any pointers or insights as to why this additional memory is needed?

I think this is expected behavior. AFAIK, there is no mechanism in a CUDA graph to abstract a loop. Furthermore, stream capture doesn't have fine-grained knowledge of your host code. So issuing a kernel in a loop:

for (int i = 0; i < 4; i++)
  kernel<<<...>>>(...);

is equivalent to:

kernel<<<...>>>(...);
kernel<<<...>>>(...);
kernel<<<...>>>(...);
kernel<<<...>>>(...);

Each such invocation must have a record associated with it in the captured work description. Therefore, increasing the loop count increases memory usage, because the graph capture process needs memory to store the information about each node.
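
As a quick check (a sketch, reusing the graph produced by your capture snippet), you can ask the captured graph how many nodes it ended up with; it should match the loop trip count:

size_t numNodes = 0;
cudaGraphGetNodes(graph, nullptr, &numNodes);             // pass nullptr to query only the count
printf("captured graph contains %zu nodes\n", numNodes);  // expect KL, one per launch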

Thanks Robert.
I understand better now. I was under the impression that the memory for a graph is something needed on the CPU side to implicitly capture the dependencies, kernel parameters, and arguments. Only the kernels themselves would reside on the GPU (same as with traditional streams), and the driver would then call the respective kernels in sequence with the arguments captured earlier, eliminating the cudaLaunchKernel overheads (e.g., reducing the switches from user space to kernel space).
However, it seems there is a lot more going on under the hood. I am trying to understand the implementation specifics. Are there any (public) pointers to what exactly the additional memory holds and why it is needed on the GPU?

If you take a look at the CUDA graph API, you will see some of the data that needs to be stored for a kernel node; it is more than just dependencies. Whatever that data may be, it takes up storage space. Since it must be stored for each kernel invocation captured in stream capture mode, capturing more kernel invocations takes more space.
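
For example, this is roughly what building a single kernel node explicitly looks like; the function, the launch configuration, and a copy of the argument pointers all have to be recorded per node (a sketch only, with argument types assumed from your snippet):

cudaKernelNodeParams p = {};
int n = SIZE;                            // assumed: the last argument is passed by value
void *args[] = { &a, &b, &c, &n };       // addresses of the kernel arguments
p.func           = (void*)k3;            // which kernel to launch
p.gridDim        = dim3(SIZE / 256);
p.blockDim       = dim3(256);
p.sharedMemBytes = 0;
p.kernelParams   = args;
p.extra          = nullptr;

cudaGraphNode_t node;
cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);   // no dependencies for this node

Stream capture has to record the equivalent of this structure for every launch it observes.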

I’m not aware of any.