Performance Benefit of Const Keyword on DNN Inference

Right now, in some code that I am developing alongside other institutions, we are storing our DNN weights in constant arrays. CMSSW uses a package called Alpaka that acts as a wrapper around CUDA, but the code compiles with NVCC like normal CUDA code.

Specifically, ignoring the Alpaka naming, our weights are stored with the __device__ const keywords. We use our DNN weights for inference here: cmssw/RecoTracker/LSTCore/src/alpaka/NeuralNetwork.h at master · cms-sw/cmssw · GitHub
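In plain CUDA terms, the setup looks roughly like this (a simplified sketch with made-up names and sizes, not the actual CMSSW code):

    // Weights baked into the source as literals; __device__ const puts
    // them in global memory, but the compiler knows the values.
    __device__ const float wgt[4]  = {0.1285f, -1.1627f, 0.5f, -0.25f};
    __device__ const float bias[2] = {0.01f, -0.02f};

    // Tiny fully-connected layer: out = W * in + b
    __global__ void infer(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        out[2 * i + 0] = wgt[0] * in[2 * i] + wgt[1] * in[2 * i + 1] + bias[0];
        out[2 * i + 1] = wgt[2] * in[2 * i] + wgt[3] * in[2 * i + 1] + bias[1];
    }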

I am trying to improve on this by loading the weights from binary files at the start of code execution rather than storing them in plain text as constant arrays. To my surprise, however, doing this is much slower, and the reason seems to be the const keyword used on the weight arrays. You can see my timing results below: without the const keyword, the inference time becomes much slower. The const keyword also provides more speedup than using __constant__ memory, which was also a surprise to me.
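For reference, the file-loading version does roughly this at startup (a simplified sketch with made-up names and sizes; error handling omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Non-const device array: the weight values are no longer
    // known to the compiler at compile time.
    __device__ float wgt_loaded[4];

    void loadWeights(const char* path) {
        float host[4];
        FILE* f = std::fopen(path, "rb");
        std::fread(host, sizeof(float), 4, f);
        std::fclose(f);
        // Copy into the __device__ array before any inference kernels run.
        cudaMemcpyToSymbol(wgt_loaded, host, sizeof(host));
    }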

I was hoping someone could tell me whether this is expected, and whether there are any ideas for how I can recover this performance while still loading the weights from a file at the start of the run?

edit: I forgot to add that one of our motivations for doing this was to see if lower-precision network weights could improve timing performance. Maybe someone can correct me if this is wrong, but I don’t think you can define a half-precision const array in CUDA with literals?

From just the data definition it seems impossible to diagnose what is going on here.

My guess is that the use of const (= read-only) allows the compiler to propagate the data into literal constants incorporated directly into the instructions generated for (some of) the code that uses this data. Since the Volta architecture (I think), FP32 instructions can incorporate one full FP32 literal constant, applied to one of the instruction operands.

The way to confirm or refute this hypothesis would be to examine the SASS (machine code) generated for the code that operates on this data.
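Concretely, using the sketch from the first post (file name and architecture are placeholders; the commands assume a Volta-class target):

    // nn.cu -- compile and disassemble with:
    //   nvcc -arch=sm_70 -cubin nn.cu -o nn.cubin
    //   cuobjdump -sass nn.cubin
    // If the propagation happened, the weight values appear as immediate
    // operands of FFMA instructions instead of being loaded from memory.
    __device__ const float wgt[2] = {0.12852770f, -1.16267049f};

    __global__ void infer(const float* in, float* out) {
        out[0] = wgt[0] * in[0] + wgt[1] * in[1];
    }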

If I recall recent forum discussions correctly, that is correct at present: half-precision literal constants are not supported. But there is hope for the not-too-distant future, now that ISO C++23 has incorporated support for half precision. In general, the adoption of new C++ standards by the CUDA toolchain has been fairly swift, so I would expect the CUDA compiler to add C++23 support with the next major version. One possible quirk is that host compilers could be lagging.


It does seem to be possible with some effort. See here.
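One workaround along those lines (my own sketch; the linked post may do it differently) is to store the raw 16-bit patterns as integer literals and reinterpret them on the device:

    #include <cuda_fp16.h>

    // fp16 bit patterns as integer literals: 0x3C00 is 1.0, 0xC000 is -2.0
    __device__ const unsigned short wgt_h[2] = {0x3C00, 0xC000};

    __global__ void infer_h(const __half* in, __half* out) {
        // __ushort_as_half reinterprets the bits as __half (sm_53+).
        __half w0 = __ushort_as_half(wgt_h[0]);
        __half w1 = __ushort_as_half(wgt_h[1]);
        out[0] = __hadd(__hmul(w0, in[0]), __hmul(w1, in[1]));
    }

Whether the compiler can still fold such values into immediates is something I would verify in the SASS, as suggested above.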

Marking this as a solution since I see NN weights embedded in the FFMA instructions when const is used:

    /*4fd0*/ FFMA.FTZ R63, R99, -1.1626704931259155273, R60 ;  /* 0xbf94d263633f7823 */
    /*4fc0*/ FFMA.FTZ R62, R100, 0.12852770090103149414, R65 ; /* 0x3e039cc4643e7823 */