Asking for clarification (Thread yield or block)

Hi all,

I am facing a strange scenario that I need clarification on:

  • I am running 8 CPU threads.
  • One of them issues a cuda kernel call that lasts 14 seconds.
  • The rest of the CPU threads are performing the same computation as the GPU, and they also need 14 seconds to complete.

So, I am basically substituting a GPU for one of the CPU threads :-)
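
For context, this is roughly the structure of my code (a stripped-down sketch; heavyKernel, cpuWork and the loop counts are placeholders, not the real 14-second computations):

    // Sketch of the setup: 8 host threads, thread 0 drives the GPU,
    // threads 1..7 do the same work on the CPU.
    #include <pthread.h>
    #include <cuda_runtime.h>

    __global__ void heavyKernel(float *out)
    {
        out[threadIdx.x] = (float)threadIdx.x;   // stand-in for the ~14 s kernel
    }

    static void cpuWork(void)
    {
        volatile double acc = 0.0;               // stand-in for the ~14 s CPU work
        for (long i = 0; i < 2000000000L; ++i) acc += 1.0;
    }

    static void *worker(void *arg)
    {
        long id = (long)(size_t)arg;
        if (id == 0) {
            float *d_out;
            cudaMalloc((void **)&d_out, 256 * sizeof(float));
            heavyKernel<<<1, 256>>>(d_out);
            cudaThreadSynchronize();   // the spin/yield/block happens while waiting here
            cudaFree(d_out);
        } else {
            cpuWork();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];
        for (long i = 0; i < 8; ++i)
            pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
        for (int i = 0; i < 8; ++i)
            pthread_join(t[i], NULL);
        return 0;
    }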

When I set the device flag cudaDeviceScheduleYield, I would expect top to show a total CPU usage above 700% but below 800%, since the GPU-calling CPU thread should yield its quantum each time it spins to check for GPU completion.
When I set the device flag cudaDeviceBlockingSync instead of Yield, I would expect top to show ~700%, since the GPU-calling CPU thread should stay asleep (blocked) for the whole 14 seconds.
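
In case the call site matters, this is roughly how I set the flag in the GPU thread of the sketch above (as far as I understand, it has to be the first CUDA call that thread makes, before the context is created):

    // In the GPU thread (id == 0 in the sketch above), before any other CUDA call:
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
    // ...or, for the blocking test:
    // cudaError_t err = cudaSetDeviceFlags(cudaDeviceBlockingSync);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));

    heavyKernel<<<1, 256>>>(d_out);   // launch is asynchronous
    cudaThreadSynchronize();          // the wait policy chosen above applies here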

What I really see:

ON YIELD: In top I do see one of the CPU cores at ~40-50% user time, which is what I expected, but another ~50% of system time appears on top of it. I suppose this system time comes from the spinning/yielding.

ON BLOCKING: In top I see one of the CPU cores at ~100% user time and ~0% system time. Is this OK? Isn't blocking supposed to avoid keeping that core busy?

Other data:

AVERAGE CPU OCCUPANCY (using time command) ON BLOCKING SYNC: 790%

Other data from time command:

User time (seconds): 3784.60
System time (seconds): 4.49
Percent of CPU this job got: 790%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:59.42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 699488
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 109329
Voluntary context switches: 3139
Involuntary context switches: 322089

and

AVERAGE CPU OCCUPANCY (using time command) ON YIELD: 790%

Other data from time command:

User time (seconds): 3572.11
System time (seconds): 218.34
Percent of CPU this job got: 790%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:59.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 699536
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 113562
Voluntary context switches: 2910
Involuntary context switches: 322372
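
(If I am reading these numbers right, the extra system time under Yield, 218.34 s vs 4.49 s, i.e. roughly 214 s over a ~480 s wall-clock run, is about 45% of one core, which matches the extra ~50% of system time that top shows for the yielding core.)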

What is happening here? Is this the expected behavior? It does not seem to free a CPU core, whatever flag I apply.

Thanks in advance.

Also, if I add the environment variable:

export CUDA_LAUNCH_BLOCKING=1

I get ~0% occupancy on the CUDA-launching core… uhmmm

From the CUDA 3 Programming Guide I read this:

"3.2.6.6 Synchronous Calls

When a synchronous function is called, control is not returned to the host thread

before the device has completed the requested task. Whether the host thread will

then yield, block, or spin can be specified by calling cudaSetDeviceFlags()with

some specific flags (see reference manual for details) before any other CUDA calls is

performed by the host thread."

Does this mean that, in order to yield my CPU so another thread can use it, I must use the environment variable to force synchronous kernel calls and then set the Yield flag with cudaSetDeviceFlags()???
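
In other words, is the intended combination something like this (just my interpretation of the quoted paragraph, not something I have verified, reusing the placeholder names from the sketches above)?

    // Shell, before running the program: force every kernel launch to be synchronous
    //   export CUDA_LAUNCH_BLOCKING=1

    // Host thread, before any other CUDA call: choose how that synchronous
    // wait is performed (yield instead of spin)
    cudaSetDeviceFlags(cudaDeviceScheduleYield);

    heavyKernel<<<1, 256>>>(d_out);   // with CUDA_LAUNCH_BLOCKING=1 this call itself
                                      // would not return until the kernel finishes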