Denver2 weird scheduling and lower (concurrent) memory performance


I’m working on software that performs repeated calculations on video data using CUDA while at the same time writing massive amounts of data (the results) to memory.
I thought of using the Denver cores to “isolate” the program from the rest of the system to ensure stable performance.
However, I observed the opposite. When running on Denver2, the file writing appears to interfere with memcpy. Some stats are below. When I remove the last task (so that nothing is written to file), the stats on Denver2 and A57 are nearly the same.

Power mode I used:
sudo nvpmodel -m 0
sudo jetson_clocks

Running on Denver:
    CPU [21%@1920,100%@1958,42%@1959,7%@1920,2%@1918,3%@1920] (according to tegrastats)
    FPS: 26.69  Err_to_target: 11.03  Cycle_variance: 32.40
    Timings of tasks (single-threaded):
        < ... previous cycle ... >
        Duration [us]: 1
        Duration [us]: 4622     <- memcpy
        Duration [us]: 2
        Duration [us]: 3970     <- memcpy
        Duration [us]: 1
        Duration [us]: 4151     <- memcpy
        Duration [us]: 2
        Duration [us]: 4860     <- memcpy
        Duration [us]: 1085     <- kernel execution
        Duration [us]: 5
        Duration [us]: 5        <- notifying worker thread about new data
        < ... repeat ... >

Running on A57 (core 3 and 4):
    CPU [25%@1912,0%@1957,0%@1959,60%@1927,54%@1930,16%@1920]
    FPS: 30.00  Err_to_target: 0.00  Cycle_variance: 0.37
    Timings of tasks (single-threaded):
        < ... previous cycle ... >
        Duration [us]: 0
        Duration [us]: 1823     <- memcpy
        Duration [us]: 1
        Duration [us]: 1718     <- memcpy
        Duration [us]: 1
        Duration [us]: 1641     <- memcpy
        Duration [us]: 1
        Duration [us]: 1886     <- memcpy
        Duration [us]: 332      <- kernel execution
        Duration [us]: 3
        Duration [us]: 20       <- notifying worker thread about new data
        < ... repeat ... >

I tried to recreate those results with a small test program. The results are below, and the code is attached:
main.cpp (3.1 KB)
I could not completely recreate the behavior, but there are some similarities:

  • The Denver cores are “lazy” in terms of load sharing: one core does all of the work. This is similar to my main software,
    where the second core “only does the rest” (only 50% load). The cores stay this way and never switch, whereas on A57
    the cores seem to switch between tasks all the time.
  • The memcpy duration nearly doubles on Denver when the file write is running, whereas on A57 it increases only when
    everything runs on a single core (I ran that case as well, for a fair comparison with the scheduling behavior of Denver).
    Also interesting: Denver seems to be better at memcpy, while file writing is always faster on A57.

Running on Denver:
    CPU [8%@1913,100%@1958,0%@1958,8%@1907,100%@1913,5%@1907]
    Filewrite duration [ms]: 16763
    memcpy duration [ms]: 15675
    memcpy duration [ms]: 8544 (only memcpy, no filewrite in parallel)

Running on A57:
    CPU [15%@1908,0%@1959,0%@1959,93%@1929,84%@1920,12%@1920]
    Filewrite duration [ms]: 11089
    memcpy duration [ms]: 12420
    memcpy duration [ms]: 12441 (only memcpy, no filewrite in parallel)

Running on A57 (Core 0 only):
    CPU [100%@1919,0%@1959,0%@1959,5%@1919,4%@1920,5%@1919]
    Filewrite duration [ms]: 12362
    memcpy duration [ms]: 17266
    memcpy duration [ms]: 12380 (only memcpy, no filewrite in parallel)
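
For readers without access to the attachment, the kind of measurement described above can be sketched roughly as follows. This is not the attached main.cpp: the function names, buffer sizes, file path, and the 64 MiB file cap are all hypothetical choices for illustration. It times a memcpy loop with an optional parallel file-writing thread, and includes a small helper for the relative slowdown.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

// Relative increase in percent when going from base_ms to loaded_ms.
double percent_increase(double base_ms, double loaded_ms) {
    return (loaded_ms - base_ms) / base_ms * 100.0;
}

// Hypothetical stand-in for the file-writing worker: keeps writing `chunk`
// to `path` until `stop` is set. Rewinds after 64 MiB (arbitrary cap) so the
// file does not grow without bound.
void write_until_stopped(const char* path, const std::vector<char>& chunk,
                         const std::atomic<bool>& stop) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return;  // Nothing to write to; the memcpy timing still runs.
    const std::size_t cap = std::size_t{64} << 20;
    std::size_t written = 0;
    while (!stop.load()) {
        written += std::fwrite(chunk.data(), 1, chunk.size(), f);
        if (written >= cap) {
            std::rewind(f);
            written = 0;
        }
    }
    std::fclose(f);
}

// Times `reps` copies of `bytes` bytes. If write_path is non-null, a second
// thread writes to that file for the whole duration of the copy loop.
double time_memcpy_ms(std::size_t bytes, int reps, const char* write_path) {
    if (bytes == 0 || reps <= 0) return 0.0;
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    std::vector<char> chunk(std::size_t{1} << 20, 2);
    std::atomic<bool> stop{false};
    std::thread writer;
    if (write_path) {
        writer = std::thread([&] { write_until_stopped(write_path, chunk, stop); });
    }
    volatile char sink = 0;  // Read back results to discourage eliding the copy.
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        src[0] = static_cast<char>(i);
        std::memcpy(dst.data(), src.data(), bytes);
        sink = static_cast<char>(sink + dst[0] + dst[bytes - 1]);
    }
    const auto t1 = std::chrono::steady_clock::now();
    stop = true;
    if (writer.joinable()) writer.join();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Applied to the numbers above, percent_increase(8544, 15675) gives roughly an 83% slowdown for memcpy on Denver, and percent_increase(12380, 17266) roughly 39% on a single A57 core. Running the binary under taskset with different masks would reproduce the per-cluster comparison.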

My guess for the performance problems of my software is that the Denver cores behave like the single-core example here: core 1 does all of the memcpy and file writing, while core 2 handles some of the many additional lightweight threads I have, which I did not model in the example code.

My question: Is there a known way to fix this, i.e. to make the Denver cores more cooperative so that both of them handle heavy loads?

System setup: R32.5, Cuda 10.2

Our team will investigate, and we will update the status soon.

Could you check whether the patch below works for you?

The results for my test script are similar: memcpy takes about 12 s on A57 and about 8 s on Denver2. When I enable the parallel thread that performs file writes, the memcpy duration again increases by 30% on A57 and by 100% on Denver2 when restricted to a single core (taskset 0x2/0x4). With two cores, there is no increase on A57 but still a 100% increase on Denver2. And again the second core stayed idle, even though both are enabled and assigned (taskset 0x6).
When running my main software, both Denver2 cores are actually doing something, but one runs at 100% and the other at 40%, constantly. When running on two A57 cores, the load switches between the cores frequently and the performance is far superior.
The kernel launch latency doesn’t seem to be an issue here; it’s just the memcpy that’s eating all the time.

This is a constraint of the Denver cores, which is why we disable them in the default release. To use them, you would need to manually schedule tasks onto the cores and experiment to find better throughput.
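
One possible way to do such manual scheduling from within the program, rather than with taskset on the whole process, is to pin each heavy thread to its own core. The following is only a minimal sketch for Linux using pthread_setaffinity_np. The helper names are hypothetical, and it assumes the Denver2 cores are CPUs 1 and 2, as the taskset masks 0x2/0x4 used earlier in this thread imply.

```cpp
#include <pthread.h>
#include <sched.h>

// Returns the lowest-numbered CPU the calling thread is allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success,
// false if the CPU index is out of range or the kernel rejects the request
// (for example because that core is offline or not permitted).
bool pin_current_thread_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Under the assumption above, the memcpy thread could call pin_current_thread_to_cpu(1) at startup and the file-writing thread pin_current_thread_to_cpu(2), so that the two heavy tasks no longer share one Denver core. Whether this actually improves throughput would need to be measured.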
