The tensor.cuda() call is very slow. I am using PyTorch 1.1.0 and CUDA 10.0. Interestingly, the call time for 5 different tensors, ranging in shape from (1,3,400,300) to (1,3,800,600), varies from 0.003 to 1.48 seconds.
Shouldn’t this be fast, since the GPU and CPU share the same memory?
Is there any way to speed it up?
Hi @pranay731, is it the very first call to tensor.cuda() that is taking the longest?
In my experience, the first time you use GPU in torch, it can take a bit of extra time to initialize.
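To separate that one-time initialization cost from the copies themselves, here is a rough sketch of how the timing could be done. The helper name `time_to_device` and the tiny warm-up shape are just illustrative assumptions, not something from your script. CUDA calls can return before the work is finished, so the sketch calls torch.cuda.synchronize() around the timed region, and it falls back to the CPU if no GPU is visible.

```python
import time
import torch

def time_to_device(shape, device):
    """Return the seconds taken to copy a fresh CPU tensor of `shape` to `device`."""
    x = torch.rand(*shape)
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish any pending GPU work before timing starts
    start = time.perf_counter()
    y = x.to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait until the copy has actually completed
    return time.perf_counter() - start

# Falls back to the CPU so the script still runs on a machine without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Warm-up: the first GPU use pays for initialization, so it is timed separately.
warmup = time_to_device((1, 3, 8, 8), device)
print(f"warm-up copy: {warmup:.4f} s")

# The smallest and largest shapes mentioned in the question.
shapes = [(1, 3, 400, 300), (1, 3, 800, 600)]
timings = {shape: time_to_device(shape, device) for shape in shapes}
for shape, seconds in timings.items():
    print(f"{shape}: {seconds:.4f} s")
```

If the large variation is still there after the warm-up and with synchronization in place, the initialization cost is probably not the explanation.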
Hi @dusty_nv
It’s not the first call. I am copying these tensors after copying the model.
Moreover, the first tensor takes the least time, and it is also the smallest. I thought it might be a storage issue or something, but almost 2 GB of memory is always unused. I also tried deleting the previous tensor before copying the next one, with the same results.
I don’t believe PyTorch takes advantage of CUDA zero-copy memory, so it may be allocating the CUDA device memory and then performing cudaMemcpy() operations. PyTorch does, however, support pinned memory for fast CPU<->GPU memory copies.
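A minimal sketch of the pinned-memory route, in case it helps. The function name `copy_with_pinned_memory` is made up for illustration; `pin_memory()` and `cuda(non_blocking=True)` are the PyTorch calls I have in mind. Treat it as something to experiment with, not a guaranteed speed-up.

```python
import torch

def copy_with_pinned_memory(cpu_tensor):
    """Copy a CPU tensor to the GPU through page-locked (pinned) host memory.

    Returns the tensor unchanged when CUDA is unavailable, so the sketch
    still runs on a CPU-only machine.
    """
    if not torch.cuda.is_available():
        return cpu_tensor
    pinned = cpu_tensor.pin_memory()             # page-locked copy on the host
    gpu_tensor = pinned.cuda(non_blocking=True)  # copy can run asynchronously to the host
    torch.cuda.synchronize()                     # make sure the copy has finished
    return gpu_tensor

# Largest shape mentioned in the question.
image = torch.rand(1, 3, 800, 600)
result = copy_with_pinned_memory(image)
print(result.device, tuple(result.shape))
```

Note that pin_memory() makes its own host-side copy, so pinning pays off mainly when the pinned buffer is reused or when the data comes from a DataLoader created with pin_memory=True.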
In case the clocks have gone idle, have you tried running sudo jetson_clocks beforehand?
Thanks for the help.
I already set the mode to the all-cores maximum setting in the drop-down menu at the top right corner, beside the clock. Are the two the same, or does jetson_clocks do something more?
The drop-down menu sets the nvpmodel, which sets the min/max clock frequencies and the number of CPU cores that are online. Frequency scaling is still enabled, which dynamically scales the frequencies at runtime based on workload.
jetson_clocks disables frequency scaling and locks the clocks to their maximums for the current nvpmodel. So they do different things.
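If you want to see the difference for yourself, one option is to read the standard Linux cpufreq files before and after running jetson_clocks. This is only a sketch: it assumes the usual sysfs location for cpu0 is present on your board, the function name `read_cpufreq` is made up, and it covers the CPU clock only, not the GPU or memory clocks.

```python
from pathlib import Path

# Assumed location: the standard Linux cpufreq sysfs directory for cpu0.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")

FIELDS = ("scaling_governor", "scaling_cur_freq",
          "scaling_min_freq", "scaling_max_freq")

def read_cpufreq(base=CPUFREQ_DIR):
    """Return the cpufreq settings for cpu0; a value is None if its file is unreadable."""
    info = {}
    for name in FIELDS:
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            info[name] = None  # file missing or not readable on this system
    return info

cpufreq = read_cpufreq()
for name, value in cpufreq.items():
    print(f"{name}: {value}")
```

With frequency scaling still active, I would expect scaling_cur_freq to move between the min and max values depending on load; with the clocks locked, it should stay at the maximum.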
Also, while running your PyTorch script, do you get any kernel log messages from dmesg?