We have ported a PyTorch ML model (trained for a classification task) to C by hand-coding the functions and storing the weights in global arrays (tables). The outputs match the PyTorch model.
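For context, the hand-ported C code looks roughly like this (a simplified sketch; the real weight tables and layer functions are larger, and the names and dimensions here are made up):

```c
/* Illustrative sketch of the hand-ported C code (names and sizes are illustrative). */
#include <stddef.h>

#define IN_DIM  128
#define OUT_DIM 10

/* Weights exported from the trained PyTorch model into global tables. */
static const float fc_weight[OUT_DIM][IN_DIM] = { /* exported values */ };
static const float fc_bias[OUT_DIM]           = { /* exported values */ };

/* Hand-coded fully connected layer: out = W * in + b */
static void fc_forward(const float in[IN_DIM], float out[OUT_DIM])
{
    for (size_t o = 0; o < OUT_DIM; ++o) {
        float acc = fc_bias[o];
        for (size_t i = 0; i < IN_DIM; ++i)
            acc += fc_weight[o][i] * in[i];
        out[o] = acc;
    }
}
```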
Now we want to optimize this C code for inference using CUDA. I have a few questions regarding this.
- We have already optimized the C code using AVX2 intrinsics, and it is taking 100 MHz of CPU. If we optimize it further using CUDA, will it be faster?
- What is the best way to load the weights (into device or constant memory) only once, at initialization time, in CUDA? (The first sketch after this list shows what we have in mind.)
- cudaMalloc is taking a long time. How can we reduce the cudaMalloc overhead? (The second sketch below shows the reuse pattern we are considering.)
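For the second question, what we have in mind is something like the following (a minimal sketch, not our real model; the symbol names and sizes are assumptions): small read-only tables go into `__constant__` memory via `cudaMemcpyToSymbol`, larger ones into ordinary device memory allocated once, and this runs only at startup.

```cuda
// Minimal sketch: copy the weights to the GPU once at initialization.
// Names and sizes are illustrative, not from our real model.
#include <cuda_runtime.h>

#define FC_W_SIZE (10 * 128)   // hypothetical weight table size
#define FC_B_SIZE 10

__constant__ float d_fc_bias[FC_B_SIZE];   // small table -> constant memory
static float *d_fc_weight = NULL;          // larger table -> global device memory

// Called once at startup, before any inference calls.
void init_weights(const float *h_fc_weight, const float *h_fc_bias)
{
    cudaMalloc((void **)&d_fc_weight, FC_W_SIZE * sizeof(float));
    cudaMemcpy(d_fc_weight, h_fc_weight, FC_W_SIZE * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_fc_bias, h_fc_bias, FC_B_SIZE * sizeof(float));
}
```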
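For the third question, the approach we are considering is to pay the cudaMalloc cost only once: allocate the input/output device buffers at initialization and reuse them on every inference call, so the per-call path is just memcpy plus the kernel launch. (If allocation inside the hot path really cannot be avoided, the stream-ordered allocator `cudaMallocAsync` with a memory pool, available since CUDA 11.2, might also help, but reuse seems like the simplest fix.) Again a rough sketch with made-up names:

```cuda
// Minimal sketch: avoid per-inference cudaMalloc by allocating device buffers
// once and reusing them for every call. Buffer names and sizes are illustrative.
#include <cuda_runtime.h>

#define IN_DIM  128
#define OUT_DIM 10

static float *d_in  = NULL;
static float *d_out = NULL;

// One-time allocation, e.g. right after init_weights().
void init_buffers(void)
{
    cudaMalloc((void **)&d_in,  IN_DIM  * sizeof(float));
    cudaMalloc((void **)&d_out, OUT_DIM * sizeof(float));
}

// Per-inference path: only copies and a kernel launch, no allocation.
void run_inference(const float *h_in, float *h_out)
{
    cudaMemcpy(d_in, h_in, IN_DIM * sizeof(float), cudaMemcpyHostToDevice);
    // fc_forward_kernel<<<1, OUT_DIM>>>(d_fc_weight, d_in, d_out);  // hypothetical kernel
    cudaMemcpy(h_out, d_out, OUT_DIM * sizeof(float), cudaMemcpyDeviceToHost);
}
```

Is this the right direction, or is there a better-established pattern for one-time weight loading and buffer reuse in CUDA inference code?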