Why is 2-D convolution slower than the matrix product?

This might sound like an apples vs oranges comparison at first, but it isn’t.

On various devices, I noticed that 2-D convolution from CUDNN is slower than SGEMM from CUBLAS. For example, on my GTX 980, I get up to 4 TFLOPS with SGEMM and never more than 2 TFLOPS with conv2d (assuming the data is already on the device).

This is especially puzzling, because for some input geometries, conv2d is exactly equivalent to SGEMM: one can just call SGEMM instead of conv2d. (This happens when the input’s height and width are equal to the filter’s height and width, respectively.) And yet, I see a 2-fold difference in performance in these cases.
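To make the equivalence concrete, here is a minimal pure-Python sketch (not cuDNN or cuBLAS code). It assumes an NCHW layout, stride 1 and no padding, and all function names are made up for illustration. When the filter covers the whole input, each output is a single dot product over C*H*W values, so the batch of outputs is the product of an N x (C*H*W) matrix with the transpose of a K x (C*H*W) matrix.

```python
import random

def cross_correlate_valid(x, f):
    """Direct 'valid' cross-correlation. x is [N][C][H][W], f is [K][C][R][S]."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, R, S = len(f), len(f[0][0]), len(f[0][0][0])
    P, Q = H - R + 1, W - S + 1  # output height and width
    out = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                acc += x[n][c][p + r][q + s] * f[k][c][r][s]
                    out[n][k][p][q] = acc
    return out

def flatten(t):
    """Flatten one [C][H][W] tensor into a row of length C*H*W."""
    return [v for plane in t for row in plane for v in row]

def matmul_nt(a, b):
    """Return a @ b^T, where a is [M][D] and b is [K][D]."""
    return [[sum(u * v for u, v in zip(ra, rb)) for rb in b] for ra in a]

def random_tensor(rng, d0, d1, d2, d3):
    """Small integer-valued floats, so both code paths give exactly equal sums."""
    return [[[[float(rng.randint(-3, 3)) for _ in range(d3)]
              for _ in range(d2)] for _ in range(d1)] for _ in range(d0)]

rng = random.Random(0)
x = random_tensor(rng, 4, 2, 3, 3)  # N=4 images, C=2 channels, 3x3 pixels
f = random_tensor(rng, 5, 2, 3, 3)  # K=5 filters, same 3x3 size as the input

direct = cross_correlate_valid(x, f)  # shape [4][5][1][1]
gemm = matmul_nt([flatten(t) for t in x], [flatten(t) for t in f])  # [4][5]
same = all(direct[n][k][0][0] == gemm[n][k] for n in range(4) for k in range(5))
print("conv2d matches GEMM:", same)
```

In this geometry the output has a single spatial position per image and filter, which is why the GEMM call needs no data rearrangement at all.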

I thought that both SGEMM and conv2d were arithmetic-bound, and that the CUDNN implementation was well-optimized.

(Technically, I’m talking about and timing cross-correlation, rather than convolution.)

I am not familiar with CUDNN, but I was the engineer who originally created CUBLAS. Based on that I will speculate that your observation boils down to a question of maturity.

CUBLAS has been around for many years, and initial iterations were not fully optimized because the focus was mostly on developing a comprehensive feature set. After the feature set was mostly complete around 2009 or so, there was more time to optimize the code and these days CUBLAS is extensively optimized including some architecture-specific optimizations. Even so there is likely still room for improvement in lesser used functions or configurations (transpose modes, matrix aspect ratios, etc), or on brand new architectures for which there is a learning curve for the engineers. In contrast to CUBLAS, CUDNN is a very new library and probably still offers many opportunities for performance improvements.

In practical terms, you would want to file RFEs (requests for enhancement) for the performance of any CUDNN feature where it can be reasonably determined that it could perform better (such as the GEMM comparison you cite above). You can file enhancement requests through the bug reporting form linked from the registered developer website. Simply prefix the synopsis with “RFE:” so it can easily be distinguished from an actual bug.

Just to clarify, I am not especially concerned about the case where conv2d can be replaced by SGEMM (since I can choose to call the latter myself), but I think it illustrates that the overall performance is suboptimal.

By the way, comments like the following (from the Torch mailing list) suggest that the performance of conv2d and SGEMM should have already been similar:

As I stated, the best path of action is to file RFEs for functions, or configurations of functions, for which performance is deemed to be suboptimal. In CUBLAS at least, a single API function may map to any number of kernels, based on size and transpose mode, for example. It is thus helpful to be as specific as possible when stating use cases in an RFE.

RFEs are valuable for prioritizing work in that they inform the library team about customer usage patterns: where exactly are the needs for more performance or additional functionality. Getting issues tracked formally through the bug reporting system is the first step to future improvements. Since everybody has different needs for their projects, it is generally not safe to assume that some other user has already filed a request for the functionality one is interested in.

I’ve inspected the disassembled cuDNN code as well as talked to the engineers involved in writing it about this very topic. The main reason it’s so much slower than sgemm is that it’s compiled from cuda-c. There is hence very little hiding of the shared and global memory latencies and poor management of register bank conflicts. They’re working on a hand assembled version I believe.

It would be really nice if ptx exposed more control over these issues: a mode that doesn’t try and get smart about reordering your instructions and lets you specify banks for registers. If these were in place I wouldn’t have been motivated to write my Maxwell assembler. Oh, and better detection of warp uniform predicates would be nice too.

But did you file an RFE? :-)

I did in fact do this back in March (or a bug report at least). I’ve yet to hear anything on these suggested improvements to ptxas. So instead I reverse engineered their hardware and wrote my own assembler. This is not really an option for most but I happened to have some free time.

It’s worth noting that not all conv operations are the same in the same way that not all SGEMM operations are the same. SGEMM may do well for reasonably large square same-sized matrices, but may achieve lower throughput for matrices with skewed dimensions.

There is a large space of possible inputs to a library like this, and typically multiple specialized code paths are required to cover the space. Usually libraries will be better tuned for the most common inputs, and less well tuned for uncommon inputs.

I wouldn’t read too much into a convolution library’s performance on an SGEMM operation, because that certainly isn’t the use case that it was designed for. This isn’t to say that cuDNN couldn’t be improved, only that it would be prudent to compare performance on neural network filter operations.

@scottgray is correct about the performance limitations of cuda-c kernels, but in my experience that only accounts for the difference between 75% and 95% computational efficiency (kernel TFLOPS / device peak TFLOPS) on Maxwell GPUs. It is not enough to explain all the performance problems with cuDNN.
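For anyone who wants to reproduce this kind of number, here is a rough sketch of the efficiency calculation (kernel TFLOPS / device peak TFLOPS). The function names, layer shape, core count, clock, and timing below are all placeholders I made up for illustration, not measurements. It assumes the usual convention that one multiply-add counts as two floating point operations and that peak is one FMA per core per clock.

```python
def conv_flops(n, c, k, p, q, r, s):
    """FLOPs of a direct convolution: N images, C input channels, K filters,
    P x Q output pixels, R x S filter taps; 2 FLOPs per multiply-add."""
    return 2 * n * k * p * q * c * r * s

def gemm_flops(m, n, k):
    """FLOPs of an (m x k) by (k x n) matrix product."""
    return 2 * m * n * k

def peak_flops(cores, clock_hz):
    """Theoretical single-precision peak: one FMA (2 FLOPs) per core per clock."""
    return 2 * cores * clock_hz

def efficiency(flops, seconds, peak):
    """Achieved FLOP/s divided by the device peak FLOP/s."""
    return flops / seconds / peak

# Hypothetical example: substitute your own device specs and measured kernel time.
peak = peak_flops(cores=2048, clock_hz=1.0e9)            # placeholder specs
work = conv_flops(n=128, c=64, k=64, p=56, q=56, r=3, s=3)
measured_seconds = 10.0e-3                                # placeholder timing
print("efficiency: {:.1%}".format(efficiency(work, measured_seconds, peak)))
```

Note that this counts only the arithmetic of the direct algorithm, so an FFT-based implementation measured this way can report an "efficiency" that is not comparable.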

cuDNN v.1 used way too many integer instructions, upwards of 30% of all instructions. cuDNN v.2 improves on this, but you will see the percentage vary between layers of the network. Computational efficiency of cuDNN v.2 varies between 30% - 75%.

The body of Scott’s Maxas SGEMM is 98.5% floating point instructions.

I created an efficient convolution using the Maxas SGEMM assembly code as the starting point, and using constant memory to look up the offsets of each pixel in an image patch. This keeps the floating point instruction density close to the original Maxas SGEMM, and predictably, the result has about the same computational efficiency, 95%. Here it is: https://github.com/eBay/maxDNN
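The offset-table idea can be sketched in plain Python. This is only an illustration of the indexing scheme, not the maxDNN kernel: the function names are hypothetical, the image is a single flat CHW array, and stride 1 with no padding is assumed. The point is that the offset of every patch pixel relative to the patch origin is computed once up front, so the inner loop does a table lookup and a multiply-add instead of recomputing indices with integer arithmetic.

```python
def patch_offsets(C, H, W, R, S):
    """Offset of each (c, r, s) patch element relative to the patch's top-left
    pixel in a flat CHW image. On a GPU this table could sit in constant memory."""
    return [c * H * W + r * W + s
            for c in range(C) for r in range(R) for s in range(S)]

def conv_with_offset_table(img, filt, C, H, W, K, R, S):
    """img: flat list of length C*H*W. filt: K rows, each of length C*R*S,
    in the same (c, r, s) order as the table. Returns a flat K*P*Q list."""
    P, Q = H - R + 1, W - S + 1
    offsets = patch_offsets(C, H, W, R, S)  # computed once, reused everywhere
    out = [0.0] * (K * P * Q)
    for p in range(P):
        for q in range(Q):
            base = p * W + q  # top-left pixel of this patch in channel 0
            patch = [img[base + o] for o in offsets]  # pure table lookups
            for k in range(K):
                # Same work as one row of a GEMM: a dot product of length C*R*S.
                out[(k * P + p) * Q + q] = sum(w * v for w, v in zip(filt[k], patch))
    return out

# Tiny usage example: one 3x3 single-channel image, one 2x2 filter of ones,
# so each output is the sum of a 2x2 patch.
image = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0,
         7.0, 8.0, 9.0]
ones = [[1.0, 1.0, 1.0, 1.0]]
result = conv_with_offset_table(image, ones, C=1, H=3, W=3, K=1, R=2, S=2)
print(result)
```

Because the gathered patch is consumed exactly like a column of a GEMM operand, the multiply-add loop can stay essentially the same as in the matrix multiply.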

But before I wrote assembly, I took the ideas from Maxas SGEMM and used them to write a cuda-c convolution kernel. That gets about 75% computational efficiency across a variety of network shapes. Not that it was easy; it took a lot of trial and error to get the register count under control. Anyway, I think of assembly as how you get the last 20% of efficiency.

I’m just finishing up my own work on convolutions. It seems Andrew and I hit upon the same technique more or less in parallel. I guess it never occurred to the authors of cuDNN that it is possible to do small matrix multiply at full utilization. It was actually Maddy’s post that set me on the right track:

https://devtalk.nvidia.com/default/topic/776043/cuda-programming-and-performance/whats-new-in-maxwell-sm_52-gtx-9xx-/post/4341194/#4341194

Anyway, stay tuned for my full featured convolution work… though it seems fb might be seriously catching up now with their fft based approach. I suspect there will still be a demand for both though.

I have no insight into the genesis of cuDNN, but I would caution against speculating what may or may not have occurred to the authors. In an industrial setting not everything that is being thought about can be realized in a production library immediately. There are often multiple goals to be satisfied, some of them conflicting.

Generally speaking, the first goal for any new library is typically to be largely feature complete at performance levels that provide meaningful speedup to customer applications, then improve the performance further in subsequent releases. In the specific context of CUDA, this task isn’t made any easier by having to worry about multiple GPU architecture generations.

GTC might be a good opportunity to share your results and findings with NVIDIA’s engineers (if you are not already in touch).

I’m not really into deep neural networks, but I suppose the convolution kernels there are always square and not that big (kernel radius <= 15?). So I am wondering why NVIDIA does not use the 2-D convolution functions from the NPP library inside cuDNN. Maybe (if the input image is not that big) one could run multiple 2-D convolutions on different CUDA streams in order to utilize the GPU better.

njuffa: that comment wasn’t meant to be in any way negative, but was rather an observation of the strategy they seemed to employ: setting up in shared memory the biggest MM they could manage. The recent Catanzaro et al paper was actually enormously helpful for me in planning out my own strategy and I’m extremely grateful to them for their work (not to mention all the gemm work of theirs that I largely borrowed). That technique is actually still likely the best approach for single image inference and I’ll likely be attempting to duplicate it soon, though with assembly level optimizations. I’m pretty confident I can get it close to full utilization.

I’ll be at GTC this year, and I’ll give a talk provided nvidia allows my late entry. Still waiting for feedback on that. Seeing that they may be announcing GM200 at that event, my work would be a nice complement to that. Single chip convolution at 6+ Tflops is a pretty big deal. If it supports fp16x2 and they want to send me an engineering sample, I can probably squeeze a lot more flops out of it than that.

So here’s a quick writeup on my limited precision gemm and convolution work:

https://github.com/NervanaSystems/nervana-lib-gpu-performance-preview

It’s really just a performance preview, the full library should be ready in short order. With it you’ll be able to build full convolution networks with a very user friendly python api. Precision management should be as hands off as floating point (with the option to be more involved if you want to explore). More kernels with different shapes to better optimize various tensor dimensions are also forthcoming.

Note that I’m getting about double the performance of cuDNN using half the memory.

Congratulations! This looks amazing. I knew it was possible to create full utilization convnets, just didn’t know how close anybody was to actually doing it. ;) It should be no surprise that Scott was the one to get there first.

This might be a non sequitur, but will there be multi gpu support in the Python framework? If not, how hard would it be to integrate into cuda-convnet2?

This library will be fully integrated into the Nervana Framework package, which already has a lot of data and model parallelism built in (along with a really nice network yaml configuration engine).

We also currently have cuda-convnet2 fully integrated into our framework as well.

Seems like there is a lot of demand for a C api for these kernels as well. The new cuda7 release should make that much easier (for the dynamically generated compound element-wise kernels).

Finally some head to head benchmarks of my convolution work. Bear with us a bit longer before we’re ready for a wider release.