Right now, in some code that I am developing alongside other institutions, we are storing our DNN weights in constant arrays. CMSSW uses a library called Alpaka that acts as a portability wrapper around CUDA, and the code compiles with NVCC like normal CUDA code.
Specifically, ignoring the Alpaka naming, our weights are stored with the `__device__ const` qualifiers. We use these DNN weights for inference here: cmssw/RecoTracker/LSTCore/src/alpaka/NeuralNetwork.h at master · cms-sw/cmssw · GitHub
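To make the pattern concrete, here is a minimal sketch of what I mean (hypothetical names and values, not the actual contents of NeuralNetwork.h): the weights are baked into the binary as compile-time constants, so the compiler sees their values when it compiles the inference kernels.

```cuda
// Sketch of the current pattern: weights hard-coded as __device__ const arrays.
// Names and sizes are made up for illustration.
__device__ const float kLayer0Weights[2][3] = {
    {0.12f, -0.53f, 0.07f},
    {0.91f,  0.04f, -0.33f},
};
__device__ const float kLayer0Bias[3] = {0.01f, -0.02f, 0.0f};

__device__ void denseLayer(float const (&in)[2], float (&out)[3]) {
  for (int j = 0; j < 3; ++j) {
    float acc = kLayer0Bias[j];
    for (int i = 0; i < 2; ++i)
      acc += in[i] * kLayer0Weights[i][j];  // values known at compile time
    out[j] = acc;
  }
}
```

Because the values are visible at compile time, NVCC is free to fully unroll these loops and fold the weights into immediate operands rather than issuing memory loads.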
I am trying to improve on this by loading the weights from binary files at the start of code execution rather than hard-coding them as constant arrays. To my surprise, however, doing this is much slower, and the reason seems to be the `const` qualifier on the weight arrays: you can see my timing results below, but without the `const` qualifier the inference time becomes much slower. The `const` qualifier also provides more speedup than using `__constant__` memory, which was another surprise to me.
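For reference, this is roughly what the file-loading variant looks like (a minimal sketch with hypothetical names and sizes, copying into `__constant__` memory once at startup):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical size; a __constant__ array still needs a compile-time extent.
constexpr int kNumWeights = 2 * 3;
__constant__ float cLayer0Weights[kNumWeights];

// Read raw float32 weights from a binary file and copy them into constant
// memory once, before any inference kernels launch.
bool loadWeights(const char* path) {
  std::vector<float> host(kNumWeights);
  FILE* f = std::fopen(path, "rb");
  if (!f) return false;
  size_t n = std::fread(host.data(), sizeof(float), host.size(), f);
  std::fclose(f);
  if (n != host.size()) return false;
  return cudaMemcpyToSymbol(cLayer0Weights, host.data(),
                            host.size() * sizeof(float)) == cudaSuccess;
}
```

My understanding of the slowdown is that even in constant memory the values are opaque to the compiler at kernel compile time, so it must issue actual loads instead of folding the weights into the instruction stream, but I would appreciate confirmation.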
I was hoping someone could tell me whether this is expected, and whether there are any ideas for how I can recover this performance while still loading the weights from a file at the start of the run?
edit: I forgot to add that one of our motivations for doing this was to see if lower-precision network weights could improve timing performance. Maybe someone can correct me if this is wrong, but I don’t think you can define a half-precision (`__half`) const array in CUDA with literals?
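The obstacle, as I understand it, is that the `__half(float)` constructor is not constexpr, and `__device__` variables require constant initialization. One workaround I have seen (sketch below; array name and bit patterns are my own illustration) is to store the raw fp16 bit patterns via `__half_raw`, which is aggregate-initializable, and reinterpret them in the kernel:

```cuda
#include <cuda_fp16.h>

// Raw fp16 bit patterns as a constant initializer.
// 0x3C00 = 1.0, 0xBC00 = -1.0, 0x3800 = 0.5 in IEEE half precision.
__device__ const __half_raw kWeightsRaw[3] = {{0x3C00}, {0xBC00}, {0x3800}};

__device__ float useWeight(int i, float x) {
  __half w = kWeightsRaw[i];  // __half is constructible from __half_raw
  return x * __half2float(w);
}
```

Whether this still gets the same compile-time folding benefit as the float `const` arrays is exactly the kind of thing I would like to understand better.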