cuDNN slow convolution operation

I’m having trouble using the cuDNN convolution API: it is taking far longer than expected. I’ve also written a simple custom convolution kernel myself, and the time comparison between the two for the same image and kernel size is:

Example: image size (1536, 2592) with a 5×5 kernel

cuDNN average execution time: 0.796805 ms
Custom kernel average execution time: 0.331958 ms

cuDNN is slower than the custom kernel on all image shapes/sizes I’ve tried. Here are the two files as well, in case anyone wants to have a look to get a better idea: GitHub - talha-10xE/Cuda-convolution

GPU: RTX 3050 Ti laptop
nvcc version: cuda_11.5
cuDNN version: 9.6.0

Performance Analysis with cuDNN

  1. Known Issues: Performance regressions and other known issues have been reported in cuDNN 9.x releases (notably 9.0.0). Check the release notes for your installed version to rule out a known regression.

Optimization Techniques for cuDNN Convolution Operations

To improve the performance of your cuDNN convolution operations, consider the following optimization techniques:

  1. Use Tensor Cores: Ensure input and output channel counts meet tensor-core alignment requirements (divisible by 8 for FP16, or by 4 for TF32) and opt in via the convolution math type; see the math-type sketch after this list.

  2. Select Optimal Convolution Algorithms: cuDNN implements several convolution algorithms (implicit GEMM, FFT, Winograd, and others), and the default is not always the fastest for a given shape; benchmark them on your hardware, as in the algorithm-benchmark sketch after this list.

  3. Memory Layout Considerations: Choose the data layout (NCHW or NHWC) deliberately; on tensor-core GPUs, NHWC is generally the faster layout, and a mismatched layout can cost extra transposes. The math-type sketch after this list shows how to declare NHWC tensors.

  4. Batch Size Optimization: Adjusting the batch size can maximize GPU utilization. Larger batch sizes improve throughput, but finding a balance is crucial to avoid memory limitations.

  5. Layer Fusion Techniques: Combine operations wherever possible (e.g., convolution + bias + activation in a single call) to avoid extra passes over memory; see the fusion sketch after this list.

  6. Mixed Precision Training: Using both FP16 and FP32 can enhance training speed while preserving accuracy. This method reduces memory requirements and leverages tensor cores for faster computation.

  7. Version Compatibility: Ensure that cuDNN, the CUDA Toolkit, and any deep learning frameworks you use are mutually compatible; the official cuDNN support matrix lists the recommended combinations.

  8. Profiling Tools: Use NVIDIA’s profilers (Nsight Systems for the application timeline, Nsight Compute for per-kernel metrics) to find where the time actually goes, and make sure your own measurements exclude one-time setup cost; see the timing sketch after this list.
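
Math-type sketch (items 1, 3, and 6): a minimal example of opting in to tensor-core math and declaring an NHWC tensor with the legacy descriptor API. The shapes here are illustrative, and whether tensor cores actually engage depends on the data type, layout, and channel alignment:

```cpp
// Sketch: enable tensor-core math and use the NHWC layout.
// Assumes the descriptors were already created with
// cudnnCreateTensorDescriptor / cudnnCreateConvolutionDescriptor;
// error checking is omitted for brevity.
#include <cudnn.h>

void configure_for_tensor_cores(cudnnTensorDescriptor_t xDesc,
                                cudnnConvolutionDescriptor_t convDesc)
{
    // NHWC FP16 tensor: the channel count should be a multiple of 8 for
    // FP16 tensor cores (a multiple of 4 suffices for TF32). C=8 is
    // illustrative, not taken from the original single-channel benchmark.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF,
                               /*N=*/1, /*C=*/8, /*H=*/1536, /*W=*/2592);

    // Opt in to FP16 tensor-core kernels. With FP32 data on an Ampere GPU
    // (like the RTX 3050 Ti), CUDNN_DEFAULT_MATH already permits TF32.
    cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);
}
```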
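
Algorithm-benchmark sketch (item 2): let cuDNN time its own forward algorithms with cudnnFindConvolutionForwardAlgorithm. This is part of the legacy API, which cuDNN 9 deprecates in favor of the graph API but still ships; all descriptors are assumed to be set up for your problem size:

```cpp
// Sketch: benchmark every applicable forward-convolution algorithm on the
// actual shapes and print the results, sorted fastest-first by cuDNN.
#include <cudnn.h>
#include <cstdio>

void pick_fastest_algo(cudnnHandle_t handle,
                       cudnnTensorDescriptor_t xDesc,
                       cudnnFilterDescriptor_t wDesc,
                       cudnnConvolutionDescriptor_t convDesc,
                       cudnnTensorDescriptor_t yDesc)
{
    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;

    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                         &returned, perf);

    for (int i = 0; i < returned; ++i) {
        printf("algo %d: status=%d time=%f ms workspace=%zu bytes\n",
               (int)perf[i].algo, (int)perf[i].status,
               perf[i].time, perf[i].memory);
    }
    // perf[0].algo is the fastest; pass it (plus a workspace of
    // perf[0].memory bytes) to cudnnConvolutionForward.
}
```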
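
Fusion sketch (item 5): a hedged example of the fused convolution + bias + activation entry point in the legacy API. Descriptors and device buffers are assumed to exist, and the fused path is typically restricted to the IMPLICIT_PRECOMP_GEMM algorithm:

```cpp
// Sketch: fused conv + bias + ReLU in one cuDNN call, avoiding two extra
// passes over memory. actDesc is assumed configured for CUDNN_ACTIVATION_RELU.
#include <cudnn.h>

void fused_conv_bias_relu(cudnnHandle_t handle,
                          cudnnTensorDescriptor_t xDesc, const void* x,
                          cudnnFilterDescriptor_t wDesc, const void* w,
                          cudnnConvolutionDescriptor_t convDesc,
                          void* workspace, size_t workspaceBytes,
                          cudnnTensorDescriptor_t biasDesc, const void* bias,
                          cudnnActivationDescriptor_t actDesc,
                          cudnnTensorDescriptor_t yDesc, void* y)
{
    const float alpha1 = 1.0f, alpha2 = 0.0f;
    // Computes y = ReLU(alpha1 * conv(x, w) + alpha2 * z + bias);
    // the z term is aliased to y and zeroed out via alpha2 = 0.
    cudnnConvolutionBiasActivationForward(
        handle, &alpha1, xDesc, x, wDesc, w, convDesc,
        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
        workspace, workspaceBytes,
        &alpha2, yDesc, y,
        biasDesc, bias, actDesc, yDesc, y);
}
```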
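
Timing sketch (item 8): the first cuDNN call pays one-time setup cost (kernel selection, initialization), so warm up before measuring and average over many iterations with CUDA events. A minimal sketch; the launch callable stands in for your cudnnConvolutionForward invocation:

```cpp
// Sketch: average GPU time with CUDA events, excluding one-time setup cost.
#include <cuda_runtime.h>

template <typename F>
float time_gpu_ms(F launch, int warmup = 10, int iters = 100)
{
    // Warm-up runs absorb cuDNN's one-time initialization cost,
    // which would otherwise skew the average.
    for (int i = 0; i < warmup; ++i) launch();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}
```

Call it as, e.g., `time_gpu_ms([&]{ /* cudnnConvolutionForward(...) */ });`. For deeper analysis, `nsys profile ./your_app` captures the timeline and `ncu ./your_app` reports per-kernel metrics.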

Additional Resources

For more detailed information, see the NVIDIA cuDNN documentation, in particular the developer guide and the release notes for your installed version.

By applying these strategies, you should see the performance of your cuDNN convolution operations improve.