Hi, I am developing a diffusion model that uses a VAE composed of convolution ops (nn.Conv3d, nn.Conv2d). While profiling memory usage, I found that the VAE's average memory usage is relatively low, but its forward pass alternates between a high-memory stage and a low-memory stage, so simply increasing the batch size may not be the best choice for this workload. The memory spikes are buffers used temporarily by the convolution operators. I would therefore like to dispatch data onto different streams, scheduling one VAE's high-memory stage on stream A while another's low-memory stage runs on stream B.
So my questions are:
Why do convolutions need these temporary buffers, and how can I determine the memory spike of a Conv3d or Conv2d operation given the conv parameters and input shape?
Is there an established way to do this interleaved dispatch? The approach I can think of now is to manually identify the boundary between the two stages, split the forward function into two parts, and schedule them onto two different streams.
Why Convolutions Need Temporary Buffers:
Convolution operators require temporary workspace for several reasons:
Algorithm Workspace: GPU backends such as cuDNN implement convolution with algorithms (implicit/explicit GEMM, im2col, FFT, Winograd) that need scratch memory; im2col, for example, materializes an unfolded copy of the input that can be many times larger than the input itself (see the sketch after this list).
Intermediate Results: Buffers hold intermediate results during processing, such as transformed inputs and weights and partial accumulations, on top of the activation tensors themselves.
Memory Optimization: Frameworks reduce the steady-state footprint by reusing memory blocks for activation tensors and freeing transient workspace as soon as the kernel completes, which is why the spike is brief rather than sustained.
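As a concrete illustration, here is a rough Python estimate of the im2col buffer for a Conv2d. Treat it as a sketch, not the exact workspace: the actual transient allocation depends on which algorithm the backend selects at runtime.

```python
def conv2d_im2col_workspace_bytes(n, c_in, h, w, k, stride=1, padding=0,
                                  dilation=1, dtype_bytes=2):
    """Rough upper bound for the im2col buffer of a Conv2d.

    The unfolded input has shape (n, c_in*k*k, h_out*w_out); the real
    workspace depends on the algorithm the backend picks, so this is
    an estimate, not a guarantee.
    """
    h_out = (h + 2 * padding - dilation * (k - 1) - 1) // stride + 1
    w_out = (w + 2 * padding - dilation * (k - 1) - 1) // stride + 1
    return n * (c_in * k * k) * (h_out * w_out) * dtype_bytes

# A 3x3 conv on a 1x128x256x256 fp16 input: the unfolded buffer is
# roughly 9x the size of the 16 MiB input tensor.
print(conv2d_im2col_workspace_bytes(1, 128, 256, 256, 3, padding=1) / 2**20, "MiB")
```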
Calculating Memory Spike Values for Conv2d and Conv3d:
To estimate the memory spike value, consider the following steps (a measurement-based sketch follows the list):
Understand Parameters: Note the input shape, kernel size, padding, stride, dilation, and dtype, since all of these determine the tensor and workspace sizes.
Size Determination: Compute the sizes of the input tensor, weight tensor, output tensor, and any intermediate workspace; the workspace depends on the convolution algorithm the backend chooses, so it is often easier to measure than to derive.
Total Memory Calculation: Sum the memory that is live at the same time; the spike is that peak, and it must fit within the memory available on the device.
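In practice the most reliable approach is to measure the spike directly with the CUDA caching allocator's peak counter. A minimal sketch (the conv module and input shape below are placeholders for your own):

```python
import torch
import torch.nn as nn

def conv_peak_mib(conv: nn.Module, input_shape):
    """Measure the memory spike of one conv forward, in MiB above the
    pre-call baseline. Includes the backend's transient workspace."""
    conv = conv.cuda().half()
    x = torch.randn(*input_shape, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    baseline = torch.cuda.memory_allocated()
    with torch.no_grad():
        conv(x)
    torch.cuda.synchronize()
    return (torch.cuda.max_memory_allocated() - baseline) / 2**20

print(conv_peak_mib(nn.Conv3d(64, 64, 3, padding=1), (1, 64, 16, 128, 128)))
```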
Strategies for Interleaving Dispatch in CUDA:
Implement the following approaches to manage operation dispatch (a two-stream sketch follows the list):
Utilize CUDA Streams: Organize tasks into streams for concurrent execution.
Separate Data Transfers: Put host-device copies and computation on different streams so transfers overlap with kernels instead of serializing behind them.
Synchronize with Events: Record events and make streams wait on them to express cross-stream dependencies without blocking the host.
Allocate Pinned Host Memory: Page-locked host buffers are required for truly asynchronous host-device copies.
Use CUDA Graphs: Capture a fixed sequence of operations into a graph to cut per-kernel launch overhead on replay.
Implement Non-Blocking Streams: CUDA graph capture must run on a non-default stream, so keep the work you intend to capture off the default stream.
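Your manual-split idea matches the usual pattern. Below is a minimal PyTorch sketch of it, assuming the VAE forward has already been split into hypothetical stage_high (the spike) and stage_low halves; those method names are illustrative, not a real API:

```python
import torch

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

def interleaved_forward(vae, x1, x2):
    """Overlap the high-memory stage of one micro-batch with the
    low-memory stage of the other, so only one spike is live at a time."""
    spike_done = torch.cuda.Event()
    with torch.cuda.stream(stream_a):
        h1 = vae.stage_high(x1)          # memory spike on stream A
        spike_done.record(stream_a)
        y1 = vae.stage_low(h1)           # cheap stage continues on A
    with torch.cuda.stream(stream_b):
        stream_b.wait_event(spike_done)  # delay B's spike until A's has finished
        h2 = vae.stage_high(x2)          # overlaps A's low-memory stage
        y2 = vae.stage_low(h2)
    # Rejoin the default stream before the caller touches the outputs.
    torch.cuda.current_stream().wait_stream(stream_a)
    torch.cuda.current_stream().wait_stream(stream_b)
    return y1, y2
```

One caveat: the caching allocator tracks memory per stream, so a tensor allocated on one stream and consumed or freed on another should call Tensor.record_stream (or ownership should stay on one stream) to avoid premature memory reuse.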
These points should help you address the memory spikes and interleaved dispatch for your diffusion model's VAE.