About convolution performance

I have some questions about convolution performance:
1) How can I optimize the performance of convolution in a CNN? Is there any general solution? I have already applied texture memory in my application, but I feel it is not enough.
2) I have a large-scale convolution problem: my image is larger than 4096 x 4096, and my kernel is larger than 128 x 128. Is there any way to overcome this?

Could any expert give me some suggestions?
Thank you very much.

You might want to mention which CUDA version you are using, as well as which GPU.

What performance do you observe currently (actual measured numbers, relevant metrics)? What performance did you expect, and why? What does the CUDA profiler tell you about bottlenecks in your application?

Large-scale convolutions are often the best candidates for conversion to the frequency domain (i.e. FFT → elementwise multiply → IFFT).

There is a CUDA sample code that demonstrates this concept for the 1D case.
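For reference, here is a minimal sketch of that frequency-domain pipeline using cuFFT (not the sample itself). It assumes a single-channel float image and a kernel that has already been zero-padded to the full 4096 x 4096 image size; the function name fftConvolve and the fixed sizes are placeholders, and error checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 4096                     // image rows (also padded kernel rows)
#define NY 4096                     // image columns
#define NFREQ (NX * (NY / 2 + 1))   // complex values produced by a 2D R2C transform

// Pointwise complex multiply in the frequency domain, with the 1/(NX*NY)
// normalization of cuFFT's unnormalized transforms folded in.
__global__ void pointwiseMulAndScale(cufftComplex *a, const cufftComplex *b,
                                     int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex va = a[i], vb = b[i];
        cufftComplex r;
        r.x = (va.x * vb.x - va.y * vb.y) * scale;
        r.y = (va.x * vb.y + va.y * vb.x) * scale;
        a[i] = r;
    }
}

// d_img and d_kernel are device buffers of NX*NY floats; the (circular)
// convolution result overwrites d_img.
void fftConvolve(float *d_img, float *d_kernel)
{
    cufftComplex *d_imgF, *d_kerF;
    cudaMalloc((void **)&d_imgF, NFREQ * sizeof(cufftComplex));
    cudaMalloc((void **)&d_kerF, NFREQ * sizeof(cufftComplex));

    cufftHandle planFwd, planInv;
    cufftPlan2d(&planFwd, NX, NY, CUFFT_R2C);
    cufftPlan2d(&planInv, NX, NY, CUFFT_C2R);

    cufftExecR2C(planFwd, d_img, d_imgF);      // image  -> frequency domain
    cufftExecR2C(planFwd, d_kernel, d_kerF);   // kernel -> frequency domain

    int threads = 256;
    int blocks = (NFREQ + threads - 1) / threads;
    pointwiseMulAndScale<<<blocks, threads>>>(d_imgF, d_kerF, NFREQ,
                                              1.0f / (NX * NY));

    cufftExecC2R(planInv, d_imgF, d_img);      // product -> spatial domain

    cufftDestroy(planFwd);
    cufftDestroy(planInv);
    cudaFree(d_imgF);
    cudaFree(d_kerF);
}
```

With a kernel this large, the arithmetic cost drops from roughly O(N^2 K^2) for direct convolution to O(N^2 log N), so the extra transforms usually pay for themselves many times over.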

To go to a higher level of abstraction (considering the CNN as a whole, not just the convolution), you might take a look at the cuDNN library. It's a non-trivial undertaking, but it is aggressively optimized.
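If you do go the cuDNN route, a heavily simplified sketch of a single-image, single-channel forward convolution might look like the code below. The descriptor setters follow the newer (v7/v8-style) cuDNN API and their exact signatures differ between cuDNN versions; the function name cudnnConvolve, the fixed algorithm choice, and the kh/2, kw/2 padding are just assumptions for illustration, and error checking is omitted.

```cpp
#include <cuda_runtime.h>
#include <cudnn.h>

// d_in: h x w input image, d_filter: kh x kw filter; d_out must be large
// enough for the output size computed below. All are device pointers.
void cudnnConvolve(float *d_in, float *d_filter, float *d_out,
                   int h, int w, int kh, int kw)
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t inDesc, outDesc;
    cudnnFilterDescriptor_t filtDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&inDesc);
    cudnnCreateTensorDescriptor(&outDesc);
    cudnnCreateFilterDescriptor(&filtDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // NCHW layout, batch of 1, single input/output channel.
    cudnnSetTensor4dDescriptor(inDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, h, w);
    cudnnSetFilter4dDescriptor(filtDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 1, 1, kh, kw);
    // Roughly centered padding, unit stride and dilation.
    cudnnSetConvolution2dDescriptor(convDesc, kh / 2, kw / 2, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, oh, ow;
    cudnnGetConvolution2dForwardOutputDim(convDesc, inDesc, filtDesc, &n, &c, &oh, &ow);
    cudnnSetTensor4dDescriptor(outDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, oh, ow);

    // Pick one algorithm explicitly, then query and allocate its workspace.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, inDesc, filtDesc, convDesc,
                                            outDesc, algo, &wsSize);
    void *d_ws = NULL;
    cudaMalloc(&d_ws, wsSize);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, inDesc, d_in, filtDesc, d_filter,
                            convDesc, algo, d_ws, wsSize, &beta, outDesc, d_out);

    cudaFree(d_ws);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(filtDesc);
    cudnnDestroyTensorDescriptor(outDesc);
    cudnnDestroyTensorDescriptor(inDesc);
    cudnnDestroy(handle);
}
```

cuDNN implements several convolution algorithms (GEMM-based, FFT-based, Winograd), and cudnnFindConvolutionForwardAlgorithm can benchmark them for you, though not every algorithm supports a filter as large as 128 x 128.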

Hello njuffa,
My GPU is a GeForce GT 740M, and the CUDA version is 7.0.
My test is like this: the image size is 4096 x 4096 and the kernel size is 128 x 128.
I've applied texture and constant memory.
The execution time is 15366 ms.
The profiler told me "Low Kernel Concurrency".
I understand this message; I just want to test it with a single stream.
For now, I want to improve performance by about 50% as a first step, for example shorten the time to under 7000 ms.
Are there any further suggestions?
Thank you very much.