Best way to load the weights in the application


We have ported a PyTorch ML model (trained for a classification task) to C by hand-coding the functions and storing the weights in a global array (lookup table). The outputs match the PyTorch model.

Now we want to optimize this C code for inference using CUDA. I have a few questions about this.

  1. We have already optimized the C code with AVX2 intrinsics, and it currently consumes about 100 MHz worth of CPU. If we port it to CUDA, can we expect it to get faster still?
  2. What is the best way to load the weights (into device or constant memory) only once, at initialization time, in CUDA?
  3. cudaMalloc is taking a huge amount of time. How can we reduce the cudaMalloc overhead?
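For questions 2 and 3, this is the kind of initialization we have in mind: copy the (small, read-only) weights into `__constant__` memory once, and do every `cudaMalloc` up front so the per-inference path contains no allocations. All names and sizes here are our own placeholders, not the real model:

```cuda
#include <cuda_runtime.h>

/* Sketch, not our production code: one-time weight upload plus
 * one-time device allocations done at startup. */
#define N_WEIGHTS 1024   /* placeholder size */

__constant__ float d_weights[N_WEIGHTS];  /* cached, broadcast-friendly */
static float *d_input  = NULL;            /* allocated once, reused */
static float *d_output = NULL;

static const float h_weights[N_WEIGHTS] = { /* ... trained values ... */ };

int init_inference(int max_batch)
{
    /* One-time copy into constant memory (limited to 64 KB total). */
    if (cudaMemcpyToSymbol(d_weights, h_weights, sizeof(h_weights)) != cudaSuccess)
        return -1;
    /* One-time device allocations, sized for the largest batch, so the
     * steady-state inference path never calls cudaMalloc. */
    if (cudaMalloc(&d_input,  max_batch * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_output, max_batch * sizeof(float)) != cudaSuccess) return -1;
    return 0;
}
```

Is this the recommended pattern? In particular, if the weights exceed the 64 KB constant-memory limit, should we fall back to a single large `cudaMalloc` plus `cudaMemcpy` into global memory at init, and is a pooled allocator (e.g. `cudaMallocAsync`) the right way to hide the allocation cost?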