Hi all,
I am facing a strange scenario that I need clarification on:
- I am running 8 CPU threads.
- One of them issues a CUDA kernel call that lasts 14 seconds.
- The rest of the CPU threads perform the same computation as the GPU and also need 14 seconds to complete.
So I am basically substituting a GPU for one of the CPUs :-)
When I set the device flag ScheduleYield, I would expect top to show an occupancy above 700% but below 800%, since the GPU-calling CPU thread should yield its quantum each time it spins to check for GPU completion.
When I set the device flag BlockingSync instead of Yield, I would expect top to show an occupancy of ~700%, since the GPU-calling CPU thread should stay asleep for the 14 seconds.
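For reference, a minimal sketch of how these two wait policies can be selected, assuming the CUDA runtime API and its current flag names (cudaDeviceScheduleYield, cudaDeviceScheduleBlockingSync). The kernel busyKernel, its cycle count, and the "yield"/"block" command-line switch are hypothetical stand-ins for illustration only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in for the long-running kernel: spins on the GPU clock
// until the requested number of cycles has elapsed.
__global__ void busyKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main(int argc, char** argv) {
    // Choose the wait policy: "yield" on the command line, otherwise blocking.
    unsigned int flags = cudaDeviceScheduleBlockingSync;
    if (argc > 1 && std::strcmp(argv[1], "yield") == 0) {
        flags = cudaDeviceScheduleYield;
    }

    // Set the flags first, before any call that initializes the device;
    // older runtimes reject the call once the device is already active.
    check(cudaSetDeviceFlags(flags), "cudaSetDeviceFlags");

    // Launch is asynchronous; the host thread only waits in the sync call.
    busyKernel<<<1, 1>>>(2000000000LL);  // assumed cycle count, tune as needed
    check(cudaGetLastError(), "kernel launch");

    // The wait policy applies here, while the host waits for the device.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    std::puts("kernel finished");
    return 0;
}
```

Checking the return value of cudaSetDeviceFlags matters in a test like this: if the call fails, the device keeps its previous wait policy.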
What I really see:
ON YIELD: In top I see one of the CPU cores at ~40-50% user occupancy, which is what I expected, plus another ~50% of system time. I suppose this system time comes from the spinning.
ON BLOCKING: In top I see one of the CPU cores at ~100% user occupancy, which is what I expected, plus ~0% system time. Is this OK? Isn't blocking supposed to avoid this?
Other data:
AVERAGE CPU OCCUPANCY (using time command) ON BLOCKING SYNC: 790%
Other data from time command:
User time (seconds): 3784.60
System time (seconds): 4.49
Percent of CPU this job got: 790%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:59.42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 699488
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 109329
Voluntary context switches: 3139
Involuntary context switches: 322089
and
AVERAGE CPU OCCUPANCY (using time command) ON YIELD: 790%
Other data from time command:
User time (seconds): 3572.11
System time (seconds): 218.34
Percent of CPU this job got: 790%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:59.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 699536
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 113562
Voluntary context switches: 2910
Involuntary context switches: 322372
What is happening here? Is this the expected behavior? Neither flag seems to free up a CPU core.
Thanks in advance.