maxwell_sgemm_128x128_nn achieving only 25% Occupancy

Hi,

I’m using a GTX 980 for neural network work that involves matrix multiplications in Torch.
While profiling it, I found that the maxwell_sgemm_128x128 calls (a large fraction of my application’s runtime) have only 25% theoretical occupancy because they are limited by registers: the kernel uses about 120 registers per thread, which appears to be too high.

Is this normal? Is there a way to increase the occupancy?

Best regards.

That kernel is designed to achieve near full utilization at that occupancy. It has extremely high ILP to make up for its low TLP. What’s probably holding performance back for you is that one of your outer-product dims is 128, the minibatch size. That means you’ll only have as many blocks as the number of input/output feature maps divided by 128. There’s not much you can do about that, though I have built (and am still building more) kernels meant as a replacement for cublas. In a lot of dims common in DNNs I’m getting double the performance. I also support fp16 hgemm, which you won’t find in cublas yet. With fp16 stochastic rounding enabled we’re seeing accuracy actually better than that of fp32 networks. You can check out that work here:

https://github.com/NervanaSystems/nervanagpu

And you can try it in our framework here:

You can see benchmarks of convnets here:

I’ll have some gemm benchmarks posted soon when I finish the gemm work. I’m working on some specialized kernels designed to perform well with small minibatch sizes (important for very large recurrent networks).

Torch integration is coming soon.

What do you mean by full utilization? With 25% occupancy, only 25% of the possible threads are active, right, even if the memory is fully used?

This also happens when I multiply matrices with dimensions greater than 1024, so I don’t think the batch size is at fault here.

25% occupancy on Maxwell means that the kernel can run with 4 warps active per scheduler, or 16 warps per SM. But achieving even that occupancy means you need enough blocks to fill up your SMs, and that depends on the outer-product dims of the matrix multiply. In all supervised networks that I’m aware of there are 3 important dims that can be part of the outer product: K, C, N (ofm, ifm, minibatch). Calculating the gradient with respect to the weights reduces over N and gives you K and C as the outer-product dims, which gives you plenty of blocks to fill the SMs. But for the other two operations (fprop and bprop) N is one of the outer-product dims and severely limits the amount of work that can be divided up among SMs.
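As a rough back-of-the-envelope check (just arithmetic, assuming the kernel launches 256-thread blocks, which is consistent with the numbers you quoted):

#include <stdio.h>

int main(void)
{
    /* Register-limited occupancy on GM204 (GTX 980): 65536 registers and a
     * 64-warp maximum per SM. ~120 registers/thread gets rounded up to 128
     * by the allocator; 256 threads per block is an assumption here. */
    const int regs_per_sm     = 65536;
    const int max_warps       = 64;
    const int threads         = 256;
    const int regs_per_thread = 128;                                /* 120 rounded up */

    int blocks_per_sm = regs_per_sm / (threads * regs_per_thread);  /* 2  */
    int warps_per_sm  = blocks_per_sm * threads / 32;               /* 16 */

    printf("occupancy = %d/%d warps = %.0f%%\n",
           warps_per_sm, max_warps, 100.0 * warps_per_sm / max_warps);

    /* Grid size for a 128x128-tile gemm on an M x N output is
     *   ceil(M/128) * ceil(N/128) blocks.
     * With minibatch N = 128 that is just ceil(M/128), so the 16 SMs of a
     * GTX 980 (2 resident blocks each) only fill once M reaches ~4096. */
    return 0;
}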

As scottgray said, you don’t need 100% occupancy for maximum performance.
If you want to read up on this:
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Be aware that the paper uses older architectures, so some details may not hold for newer hardware, but the core concept is still valid.
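Volkov’s point in a toy CUDA sketch (not the actual sgemm code, just the idea): give each thread several independent accumulator chains so latency is hidden by ILP rather than by lots of warps.

// Each thread keeps four independent accumulators, so the four FMAs in one
// loop iteration don't depend on each other and even a few warps per
// scheduler can keep the FMA pipe busy.
__global__ void dot_ilp4(const float *a, const float *b, float *partial, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;   // independent chains

    int i = tid;
    for (; i + 3 * stride < n; i += 4 * stride) {
        acc0 = fmaf(a[i],              b[i],              acc0);
        acc1 = fmaf(a[i +     stride], b[i +     stride], acc1);
        acc2 = fmaf(a[i + 2 * stride], b[i + 2 * stride], acc2);
        acc3 = fmaf(a[i + 3 * stride], b[i + 3 * stride], acc3);
    }
    for (; i < n; i += stride)                               // remainder
        acc0 = fmaf(a[i], b[i], acc0);

    partial[tid] = acc0 + acc1 + acc2 + acc3;                // per-thread partial sum
}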

This looks very interesting!

Is much of the functionality (convolutions) only accessible via Python?

Can multiple GPUs be used with nervana_c_api.h? Do you have to load the kernels for every device? (I am not too familiar with the driver API, which is what Nervana seems to be using.)

Only the gemm kernels are wrapped in C at this time. And that’s only because Baidu did the work for us. But, yes, the kernels need to be loaded to each device.
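With the driver API, loading them looks roughly like this (a minimal sketch with error checking omitted; the cubin file name and kernel name below are placeholders, not the actual NervanaGPU names):

#include <cuda.h>

int main(void)
{
    cuInit(0);

    int ndev = 0;
    cuDeviceGetCount(&ndev);

    for (int i = 0; i < ndev; ++i) {
        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction fn;

        cuDeviceGet(&dev, i);
        cuCtxCreate(&ctx, 0, dev);                         /* one context per device  */
        cuModuleLoad(&mod, "sgemm_nn_128x128.cubin");      /* placeholder cubin name  */
        cuModuleGetFunction(&fn, mod, "sgemm_nn_128x128"); /* placeholder kernel name */

        /* ... allocate device buffers, set up params, launch with cuLaunchKernel ... */

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
    }
    return 0;
}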

It is actually possible to embed these cubins into object files that can be linked with runtime code. That would allow you to call the kernels with runtime syntax. I’ve just been too busy writing new kernels to do any interface work outside of Python.

For runtime integration see the comments I posted at the bottom of this file:

I figured that out prior to switching to Linux development, but the gist should be the same.

But I guess it would be good to announce to the wider cuda community that these new kernels are available (particularly hgemm). I’ve just been holding off till I finish the last ones I want to do.

Thanks, I get it. I’m not doing convolutions, but rather language modelling with RNNs, and I have huge matrices, so I can achieve maximum occupancy with this kernel. But do you think your kernels can improve the speed in that case?
Also, what resources would you recommend for getting a better understanding of GPU architecture (preferably material applicable to the latest GPUs)?

In all the cases I’ve measured, my kernels are faster than cublas, sometimes substantially so (2-3x). If you’re using a minibatch smaller than 128, the kernels are currently tailored for that dim to be contiguous. I’ll have new kernels out soon tailored to the non-contiguous case.

I learned CUDA by reading CUDA by Example, The CUDA Handbook, all the NVIDIA docs, and lots of trial and error with a heavy emphasis on examining SASS output. I also spent a fair amount of time examining NVIDIA’s hand-assembled sgemm implementations. And then I wrote my own assembler and started probing the hardware directly.

Oh, and I wrote these a while ago, which go into some depth on GPU arch:


I’ve learned a lot since I wrote that. One of these days I’ll update those to my current understanding of the Maxwell arch (which I think is fairly close to complete at this point).

Thanks!

Thanks, I don’t plan to get into machine code yet, but it’s still helpful to understand more about this.