I am using the l4t-pytorch image to run a PyTorch model on a Jetson AGX Xavier. The same code is run on CUDA and on the CPU with the NVP power model set to MAXN, but the CPU beats CUDA by a wide margin. Is that expected?
Using cpu device
epoch : 1/10, time = 327.4649398326874, loss = 0.054947
Using cuda device
epoch : 1/10, time = 562.3225209712982, loss = 0.053142
You can also call .cuda() or .to(device) on the criterion, and create your DataLoader with pin_memory=True. It may be that this model is too small/simple to benefit much from GPU acceleration. If you time the difference with a convnet (for example ResNet18), you should find the GPU to be much faster than the CPU.
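To make those suggestions concrete, here is a minimal sketch of a training loop that moves both the model and the criterion to the device and uses a pinned-memory DataLoader. The tiny MLP and random data are placeholders, not the original poster's model; the pattern is what matters.

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model standing in for the original ANN
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
criterion = nn.CrossEntropyLoss().to(device)  # move the loss module too
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# pin_memory=True stages batches in page-locked host memory,
# which speeds up host-to-GPU copies (it only matters on CUDA)
dataset = TensorDataset(torch.randn(1024, 64), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64,
                    pin_memory=(device.type == "cuda"))

start = time.time()
for x, y in loader:
    # non_blocking=True lets the copy overlap compute when memory is pinned
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(f"one epoch: {time.time() - start:.3f}s on {device}")
```

Note that for a network this small, per-batch kernel-launch and transfer overhead can still dominate on the GPU, which is consistent with the timings above.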
The GPU shares the same physical RAM with the CPU on Jetson, so on the AGX Xavier it has access to the full 32GB (minus a small amount reserved for the kernel).
I tried all the suggestions. None of them improved performance, so it is likely down to the simplicity of the model.
This is quite interesting, because I had created a foreground-background subtraction algorithm that is quite accurate and almost linear, and runs in OpenCV in about 2 ms. I thought I could keep or improve that accuracy while reducing the computation time with a simple ANN, so I am really surprised by this result.