How to compute peak memory of a conv op?

Hi, I am developing a diffusion model that uses a VAE composed of conv ops (nn.Conv3d, nn.Conv2d). While profiling memory usage, I found the VAE's memory usage is relatively low most of the time, so simply increasing the batch size may not be the best choice for this workload: a VAE forward pass alternates between a high-memory stage and a low-memory stage. The memory spikes come from buffers used temporarily by the convolution operators. So I want to dispatch data into different streams, scheduling one VAE's high-memory stage on stream A while a low-memory stage runs on stream B.
So my questions are:

  1. Why does conv need these temp buffers, and how can I determine the memory spike value of a Conv3d or Conv2d operation given the conv parameters and the input shape?
  2. Has anyone done this kind of interleaved dispatch? The approach I can think of now is to manually identify the boundary between the two stages, split one forward function into two parts, and schedule them onto two different streams.

When I use torch.profiler, nn.Conv2d dispatches to a cuDNN conv and an aten::add_; the aten::add_ comes from the bias:


and the cuDNN conv launches these kernels:

So does cuDNN use im2col and turn the conv op into a GEMM? If so, is the temp buffer im2col's output?

I would appreciate it if someone could give me some help :)

  1. Temporary Buffers in Convolution Operations:

    • cuDNN picks a convolution algorithm per call (implicit GEMM, GEMM with explicit im2col, FFT, Winograd), and most of these algorithms request a temporary workspace:
      • im2col/GEMM: the input is unfolded into a matrix with in_channels * prod(kernel_size) rows and one column per output position, which can be far larger than the input itself; this matches your im2col guess.
      • FFT: padded frequency-domain copies of the input and filters are held during the transform.
      • Winograd: transformed input and filter tiles are staged in scratch memory.
    • The workspace goes through PyTorch's caching allocator, lives only for the duration of the kernel, and is freed right after, which is why it shows up as a spike rather than as steady usage. You can measure it directly, as in the sketch below.
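On the measurement side, here is a minimal sketch that isolates the transient workspace of a single conv using PyTorch's peak-memory counters. The layer and input shapes are arbitrary placeholders; substitute your VAE's real ones:

```python
import torch

conv = torch.nn.Conv2d(128, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 128, 256, 256, device="cuda")

torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()

y = conv(x)
torch.cuda.synchronize()

peak = torch.cuda.max_memory_allocated()    # highest point during the conv
persistent = torch.cuda.memory_allocated()  # input + weights + output

# Everything above the persistent allocations is transient workspace.
print(f"peak:       {peak / 2**20:.1f} MiB")
print(f"persistent: {persistent / 2**20:.1f} MiB")
print(f"transient:  {(peak - persistent) / 2**20:.1f} MiB")
```

Running this per layer (or per block) of the VAE is a quick way to locate which convs produce the spike.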
  2. Calculating Memory Spike Values for Conv2d and Conv3d:

    • To estimate the spike analytically, work through the following steps (a sketch follows this list):
      1. Gather parameters: input shape, kernel size, stride, padding, dilation, and dtype.
      2. Compute sizes: bytes for the input, weight, and output tensors, plus the algorithm's workspace; for the im2col path, the unfold buffer holds roughly batch * (in_channels * prod(kernel_size)) * prod(output_spatial) elements.
      3. Sum everything: peak ≈ input + weight + output + workspace, and check the total against the memory available on the device. Because cuDNN's algorithm choice (and therefore the workspace) depends on shapes, dtype, and available memory, treat this as an estimate and prefer the empirical measurement above when you need an exact number.
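Here is a sketch of that arithmetic for Conv2d, assuming the explicit im2col/GEMM path; the function below is my own illustration, not a PyTorch API:

```python
import torch

def conv2d_peak_estimate(batch, in_ch, out_ch, h, w, k,
                         stride=1, padding=1, dtype=torch.float32):
    """Rough peak-memory estimate for a Conv2d, assuming an im2col/GEMM
    algorithm. FFT or Winograd have different workspace footprints, and
    cuDNN may pick any algorithm for a given shape."""
    elem = torch.finfo(dtype).bits // 8
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1

    input_b = batch * in_ch * h * w * elem
    weight_b = out_ch * in_ch * k * k * elem
    output_b = batch * out_ch * out_h * out_w * elem
    # im2col unfolds each output position into a column of in_ch*k*k values.
    im2col_b = batch * (in_ch * k * k) * (out_h * out_w) * elem

    peak = input_b + weight_b + output_b + im2col_b
    for name, v in [("input", input_b), ("weight", weight_b),
                    ("output", output_b), ("im2col", im2col_b),
                    ("peak", peak)]:
        print(f"{name:>7}: {v / 2**20:8.1f} MiB")

conv2d_peak_estimate(batch=8, in_ch=128, out_ch=128, h=256, w=256, k=3)
```

For Conv3d the same arithmetic applies with the depth dimension included in both the kernel product and the output-position count. Comparing this estimate against the measured transient value above tells you whether cuDNN actually took the im2col path for your shapes.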
  3. Strategies for Interleaving Dispatch in CUDA:

    • The following approaches help manage operation dispatch (a two-stream sketch follows this list):
      • Utilize CUDA Streams: schedule the two stages on separate streams so the low-memory stage of one batch overlaps the high-memory stage of another.
      • Separate Data Transfers: give host-device copies their own stream so they do not serialize with compute.
      • Synchronize with Events: record an event when a stage finishes and make the other stream wait on it, instead of synchronizing the whole device.
      • Allocate Pinned Host Memory: page-locked buffers are required for copies to actually run asynchronously.
      • Use CUDA Graphs: capture a stable kernel sequence into a graph to reduce launch overhead.
      • Use Non-Default Streams: CUDA graph capture must run on a side (non-default) stream, which torch.cuda.graph handles for you.
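Finally, a minimal two-stream sketch of the interleave. high_mem_stage and low_mem_stage are hypothetical placeholders for the two halves of the split forward; finding the real boundary is the manual step you described:

```python
import torch

def high_mem_stage(x):  # placeholder for the conv-heavy half of the forward
    return x * 2

def low_mem_stage(x):   # placeholder for the cheap half of the forward
    return x + 1

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()
done_high = torch.cuda.Event()

x1 = torch.randn(8, 128, 64, 64, device="cuda")
x2 = torch.randn(8, 128, 64, 64, device="cuda")

with torch.cuda.stream(stream_a):
    h1 = high_mem_stage(x1)         # memory spike happens on stream A
    done_high.record(stream_a)

with torch.cuda.stream(stream_b):
    l2 = low_mem_stage(x2)          # cheap work overlaps the spike
    stream_b.wait_event(done_high)  # wait only where a real dependency exists
    h1.record_stream(stream_b)      # tell the allocator h1 is now used on B
    l1 = low_mem_stage(h1)

torch.cuda.synchronize()
```

One caveat: both streams draw from the same caching-allocator pool, so interleaving shifts when the spikes occur rather than shrinking any single spike, and actual overlap depends on whether the kernels leave SMs free.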

These insights should help you address memory usage and operation dispatch for the VAE in your diffusion model.