I have a simple PyTorch network, about 50 MB in size, that I am running on a Jetson Nano. When I profile the code, I get different numbers each time; they vary from 90 msec to 250 msec. I tried to get detailed profile information using “nvprof --metrics all”. With that, the run takes about 600 seconds, which could be because nvprof is trying to collect all the profiling information.
I am attaching the profiles taken for 2 runs. I see major differences in L2 Throughput (Reads), L2 Throughput (Writes), gld_requested_throughput, gst_requested_throughput, gld_throughput, gst_throughput, tex_cache_throughput, l2_tex_read_throughput, l2_tex_write_throughput, l2_read_throughput, l2_write_throughput.
My observation is that whenever the network takes less time, the throughput values above are also lower. Can I assume that when there are fewer accesses to global memory and the L2 cache, the throughput is lower and therefore the time is shorter?
A few more observations:
- The above profiling was done with one inference placed outside the timed loop, to avoid the initial latencies caused by global memory fetches. I then placed 10 more inferences outside the loop. I still see variation in the measured time, for example 100 msec to 200 msec, but it is smaller than before.
- When I use a small network, the time taken by the network is more consistent, and I don't see much difference in the above metrics when profiled using nvprof.
Please let me know how I can confirm whether these differences in profile numbers are due to global memory accesses, cache misses, etc.
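For reference, the measurement described above (a few untimed warm-up inferences, then a timed loop) can be sketched roughly as follows. This is a minimal sketch, not the actual code: `benchmark` and its parameters are hypothetical names, and the workload is a stand-in. For a PyTorch model on the GPU one would presumably pass `torch.cuda.synchronize` as `sync`, since CUDA kernels are launched asynchronously and the timer would otherwise stop before the GPU work has finished.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=50, sync=None):
    """Time fn() `iters` times after `warmup` untimed calls; stats are in msec.

    `sync` is an optional callable that blocks until queued work is done
    (assumption: torch.cuda.synchronize when fn runs a model on the GPU).
    """
    def run_once():
        fn()
        if sync is not None:
            sync()

    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in CPU workload; replace with one inference of the real network.
    stats = benchmark(lambda: sum(range(100_000)), warmup=3, iters=20)
    for name, value in stats.items():
        print(f"{name:>6}: {value:.3f} msec")
```

Reporting the median and standard deviation alongside the minimum and maximum makes it easier to see whether the spread comes from a few outliers or from every run.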
2.txt (72.9 KB)
1.txt (72.9 KB)
Have you fixed the clock rate first?
By default, Jetson uses dynamic frequency scaling to save power.
To profile, it’s recommended to maximize the performance and fix the clocks as below:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
Thanks for the reply. It helped a lot. Now the time varies between 83 msec and 110 msec when the network is run multiple times. I have taken 2 profiles of the network; please find the files attached. Can I now assume that the variation is because of cache misses? If so, which metrics in the files point to that? How do I analyze the files to understand the time deviation for each run?
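One possible starting point for comparing two such files is to rank the metrics by how much they differ between the runs. The sketch below rests on assumptions: it expects the usual nvprof metric table layout (a `Kernel:` line followed by rows of invocations, metric name, description, Min, Max, Avg), it reads only the Avg column, and it skips rows whose value is not a plain number with an optional `B/s`, `KB/s`, `MB/s`, `GB/s`, or `%` suffix. The names `parse_metrics` and `compare_runs` are hypothetical.

```python
import re

# Scale throughput suffixes to bytes/s; percentages and bare numbers stay as-is.
_UNITS = {"B/s": 1.0, "KB/s": 1e3, "MB/s": 1e6, "GB/s": 1e9, "%": 1.0, None: 1.0}
_VALUE = re.compile(r"^([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)(GB/s|MB/s|KB/s|B/s|%)?$")


def parse_metrics(text):
    """Return {(kernel, metric_name): avg_value} from an nvprof-style dump."""
    metrics = {}
    kernel = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Kernel:"):
            kernel = stripped[len("Kernel:"):].strip()
            continue
        tokens = stripped.split()
        # A metric row needs: invocations, name, description, Min, Max, Avg.
        if len(tokens) < 6 or not tokens[0].isdigit():
            continue
        match = _VALUE.match(tokens[-1])  # Avg is assumed to be the last column
        if match is None:
            continue  # e.g. utilization levels such as "Low (1)"
        metrics[(kernel, tokens[1])] = float(match.group(1)) * _UNITS[match.group(2)]
    return metrics


def compare_runs(text_a, text_b, top=10):
    """Return up to `top` rows (key, a, b, relative_change), largest change first."""
    a, b = parse_metrics(text_a), parse_metrics(text_b)
    rows = []
    for key in a.keys() & b.keys():
        if a[key] == 0:
            continue  # relative change is undefined for a zero baseline
        rows.append((key, a[key], b[key], (b[key] - a[key]) / a[key]))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows[:top]
```

It could be called as `compare_runs(open("1_clockfix.txt").read(), open("2_clockfix.txt").read())`. A large relative change only shows which metrics moved together with the run time; by itself it does not establish that cache misses are the cause.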
Thanks for the support you are providing.
1_clockfix.txt (73.0 KB)
2_clockfix.txt (73.0 KB)
I can’t tell you how to differentiate, but it would be a case of:
- Cache misses.
- Scheduler policies (a form of competition).
- Memory competition (another side of scheduling).
The hardware and scheduler architecture is not capable of deterministic repetition.
Thanks for the reply. Could you please elaborate on memory competition?
There is one memory controller for RAM. When two or more processes transfer data from memory at once, one of them has to wait. A CPU or GPU can only do its job if data is available to work on. In a typical scenario, say a game, one might have the CPU saturate at 100% while the GPU runs at 50%; in that case either the CPU does not produce enough data for the GPU, or the data is not yet available to the GPU. The time at which memory becomes available differs from run to run, and thus timings end up with a larger standard deviation even when the mean is constant. The timing is non-deterministic; a scheduler could reduce the variation for one process, but only at the cost of another.