About convolution performance

I have some questions about convolution performance:
1) How can I optimize the performance of convolution in a CNN? Is there any general solution? I have already applied texture memory in my application, but I feel it is not enough.
2) I have a large-scale convolution problem: my image is larger than 4096 x 4096, and my kernel is larger than 128 x 128. Is there any way to overcome this?

Could any expert give me some suggestions?
Thank you very much.

You might want to mention which CUDA version you are using, as well as which GPU.

What performance do you observe currently (actual measured numbers, relevant metrics)? What performance did you expect, and why? What does the CUDA profiler tell you about bottlenecks in your application?

Large-scale convolutions are often the best candidates for conversion to the frequency domain (i.e. FFT → elementwise multiply → IFFT).

There is a CUDA sample code that demonstrates this concept for the 1D case.
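For reference, here is a minimal sketch of that frequency-domain pipeline using cuFFT (not the sample itself). It assumes a single-channel float image and a kernel that has already been zero-padded to the full 4096 x 4096 image size; the function name fftConvolve and the fixed sizes are placeholders, and error checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 4096                     // image rows (also padded kernel rows)
#define NY 4096                     // image columns
#define NFREQ (NX * (NY / 2 + 1))   // complex values produced by a 2D R2C transform

// Pointwise complex multiply in the frequency domain, with the 1/(NX*NY)
// normalization of cuFFT's unnormalized transforms folded in.
__global__ void pointwiseMulAndScale(cufftComplex *a, const cufftComplex *b,
                                     int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex va = a[i], vb = b[i];
        cufftComplex r;
        r.x = (va.x * vb.x - va.y * vb.y) * scale;
        r.y = (va.x * vb.y + va.y * vb.x) * scale;
        a[i] = r;
    }
}

// d_img and d_kernel are device buffers of NX*NY floats; the (circular)
// convolution result overwrites d_img.
void fftConvolve(float *d_img, float *d_kernel)
{
    cufftComplex *d_imgF, *d_kerF;
    cudaMalloc((void **)&d_imgF, NFREQ * sizeof(cufftComplex));
    cudaMalloc((void **)&d_kerF, NFREQ * sizeof(cufftComplex));

    cufftHandle planFwd, planInv;
    cufftPlan2d(&planFwd, NX, NY, CUFFT_R2C);
    cufftPlan2d(&planInv, NX, NY, CUFFT_C2R);

    cufftExecR2C(planFwd, d_img, d_imgF);      // image  -> frequency domain
    cufftExecR2C(planFwd, d_kernel, d_kerF);   // kernel -> frequency domain

    int threads = 256;
    int blocks = (NFREQ + threads - 1) / threads;
    pointwiseMulAndScale<<<blocks, threads>>>(d_imgF, d_kerF, NFREQ,
                                              1.0f / (NX * NY));

    cufftExecC2R(planInv, d_imgF, d_img);      // product -> spatial domain

    cufftDestroy(planFwd);
    cufftDestroy(planInv);
    cudaFree(d_imgF);
    cudaFree(d_kerF);
}
```

With a kernel this large, the arithmetic cost drops from roughly O(N^2 K^2) for direct convolution to O(N^2 log N), so the extra transforms usually pay for themselves many times over.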

To go to a higher level of abstraction (considering the CNN as a whole, not just the convolution), you might take a look at the cuDNN library. It's a non-trivial undertaking, but it is aggressively optimized.
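If you do go the cuDNN route, a heavily simplified sketch of a single-image, single-channel forward convolution might look like the code below. The descriptor setters follow the newer (v7/v8-style) cuDNN API and their exact signatures differ between cuDNN versions; the function name cudnnConvolve, the fixed algorithm choice, and the kh/2, kw/2 padding are just assumptions for illustration, and error checking is omitted.

```cpp
#include <cuda_runtime.h>
#include <cudnn.h>

// d_in: h x w input image, d_filter: kh x kw filter; d_out must be large
// enough for the output size computed below. All are device pointers.
void cudnnConvolve(float *d_in, float *d_filter, float *d_out,
                   int h, int w, int kh, int kw)
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t inDesc, outDesc;
    cudnnFilterDescriptor_t filtDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&inDesc);
    cudnnCreateTensorDescriptor(&outDesc);
    cudnnCreateFilterDescriptor(&filtDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // NCHW layout, batch of 1, single input/output channel.
    cudnnSetTensor4dDescriptor(inDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, h, w);
    cudnnSetFilter4dDescriptor(filtDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 1, 1, kh, kw);
    // Roughly centered padding, unit stride and dilation.
    cudnnSetConvolution2dDescriptor(convDesc, kh / 2, kw / 2, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, oh, ow;
    cudnnGetConvolution2dForwardOutputDim(convDesc, inDesc, filtDesc, &n, &c, &oh, &ow);
    cudnnSetTensor4dDescriptor(outDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, oh, ow);

    // Pick one algorithm explicitly, then query and allocate its workspace.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, inDesc, filtDesc, convDesc,
                                            outDesc, algo, &wsSize);
    void *d_ws = NULL;
    cudaMalloc(&d_ws, wsSize);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, inDesc, d_in, filtDesc, d_filter,
                            convDesc, algo, d_ws, wsSize, &beta, outDesc, d_out);

    cudaFree(d_ws);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(filtDesc);
    cudnnDestroyTensorDescriptor(outDesc);
    cudnnDestroyTensorDescriptor(inDesc);
    cudnnDestroy(handle);
}
```

cuDNN implements several convolution algorithms (GEMM-based, FFT-based, Winograd), and cudnnFindConvolutionForwardAlgorithm can benchmark them for you, though not every algorithm supports a filter as large as 128 x 128.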

Hello njuffa,
My GPU is a GeForce GT 740M, and the CUDA version is 7.0.
My test is like this: the image size is 4096 x 4096 and the kernel size is 128 x 128.
I've applied texture and constant memory.
The execution time is 15366 ms.
The profiler told me "Low Kernel Concurrency".
I understand this message; I just want to test it with a single stream.
For now, I want to improve performance by about 50% as a first step, for example shorten the time to under 7000 ms.
Are there any further suggestions?
Thank you very much.