Hello,
I’m writing a piece of software that performs repeated calculations on video data using CUDA and, at the same time, writes massive amounts of data (the results) to a file.
I thought of using the Denver cores to “isolate” the program from the rest of the system to ensure stable performance.
However, I noticed the opposite happening. When running on Denver2, the file writing appears to interfere with memcpy. Below are some stats. The stats on Denver2 and A57 are nearly the same when I remove the last task (so that nothing is written to file).
Power mode I used:
sudo nvpmodel -m 0
sudo jetson_clocks
Running on Denver:
CPU [21%@1920,100%@1958,42%@1959,7%@1920,2%@1918,3%@1920] (according to tegrastats)
FPS: 26.69 Err_to_target: 11.03 Cycle_variance: 32.40
Timings of tasks (single-threaded):
< ... previous cycle ... >
Duration [us]: 1
Duration [us]: 4622 <- memcpy
Duration [us]: 2
Duration [us]: 3970 <- memcpy
Duration [us]: 1
Duration [us]: 4151 <- memcpy
Duration [us]: 2
Duration [us]: 4860 <- memcpy
Duration [us]: 1085 <- kernel execution
Duration [us]: 5
Duration [us]: 5 <- notifying worker thread about new data
< ... repeat ... >
Running on A57 (cores 3 and 4):
CPU [25%@1912,0%@1957,0%@1959,60%@1927,54%@1930,16%@1920]
FPS: 30.00 Err_to_target: 0.00 Cycle_variance: 0.37
Timings of tasks (single-threaded):
< ... previous cycle ... >
Duration [us]: 0
Duration [us]: 1823 <- memcpy
Duration [us]: 1
Duration [us]: 1718 <- memcpy
Duration [us]: 1
Duration [us]: 1641 <- memcpy
Duration [us]: 1
Duration [us]: 1886 <- memcpy
Duration [us]: 332 <- kernel execution
Duration [us]: 3
Duration [us]: 20 <- notifying worker thread about new data
< ... repeat ... >
I tried to recreate those results with a small test program; the results are below, and the code is attached:
main.cpp (3.1 KB)
I could not completely recreate the behavior, however there are some similarities:
- The Denver cores are “lazy” in terms of load sharing: only one core does all of the work, similar to my main software, where the second core “only does the rest” (about 50% load). They stay this way and never switch, whereas on the A57s the cores seem to switch between tasks all the time.
- The memcpy duration doubles on Denver when the file write runs in parallel, while on A57 it only increases (I ran it on a single A57 core, too, to have a fair comparison against the scheduling behavior of Denver).

Also interesting: the Denver cores seem to be faster at memcpy, while file writing is always faster on the A57s.
Running on Denver:
CPU [8%@1913,100%@1958,0%@1958,8%@1907,100%@1913,5%@1907]
Filewrite duration [ms]: 16763
memcpy duration [ms]: 15675
memcpy duration [ms]: 8544 (only memcpy, no filewrite in parallel)
Running on A57:
CPU [15%@1908,0%@1959,0%@1959,93%@1929,84%@1920,12%@1920]
Filewrite duration [ms]: 11089
memcpy duration [ms]: 12420
memcpy duration [ms]: 12441 (only memcpy, no filewrite in parallel)
Running on A57 (Core 0 only):
CPU [100%@1919,0%@1959,0%@1959,5%@1919,4%@1920,5%@1919]
Filewrite duration [ms]: 12362
memcpy duration [ms]: 17266
memcpy duration [ms]: 12380 (only memcpy, no filewrite in parallel)
Now my guess for the performance problem of my software is that the Denver cores behave like in my single-core example here: core 1 does all of the memcpy and file writing, while core 2 handles only some of the many additional lightweight threads I have (which I did not model in the example code).
My question: Is there any known way to fix this, i.e., to make the Denver cores more cooperative so that both of them take on heavy loads?
System setup: L4T R32.5, CUDA 10.2