Is Hyper-Q needed to overlap these types of streams?

Because my code is quite spaghetti-like, I can only describe what I have tried:

When two streams each issue

  • host-to-device copy (async) + kernel (stream1) + device-to-host copy (async) + cuStreamSynchronize(stream1) (repeated 100 times)
  • host-to-device copy (async) + kernel (stream2) + device-to-host copy (async) + cuStreamSynchronize(stream2) (repeated 100 times)

is Hyper-Q needed for at least one pair of their operations (kernel + dtoh, or kernel + htod) to overlap?
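For reference, each CPU thread's loop looks roughly like this (a minimal driver-API sketch; the buffer names, kernel handle, and launch dimensions are placeholders, not my actual code):

```cpp
// Per-CPU-thread loop: htod + kernel + dtoh + sync, repeated 100 times.
// devBuf, hostBuf, bytes, kernel, grid, block and n are placeholders.
for (int rep = 0; rep < 100; ++rep) {
    cuMemcpyHtoDAsync(devBuf, hostBuf, bytes, stream);   // async H->D
    void* args[] = { &devBuf, &n };
    cuLaunchKernel(kernel, grid, 1, 1, block, 1, 1,
                   0, stream, args, nullptr);            // kernel in same stream
    cuMemcpyDtoHAsync(hostBuf, devBuf, bytes, stream);   // async D->H
    cuStreamSynchronize(stream);                         // blocks this CPU thread only
}
```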

I have 2x Quadro K420, one card in TCC mode and the other in WDDM. The streams can't overlap when the arrays are pinned, but when they are not pinned, they do overlap (although everything is slower).

If the arrays are pinned and there is no synchronization over a long chain of commands, then they overlap again.

Does cuStreamSynchronize(stream) disrupt the driver's overlapping mechanisms?

  • each stream runs on its own CPU thread. The streams belong to the same device (3 per K420, for example) but are independent of the other CPU threads, so they just run in no particular order.

  • in TCC mode, pageable-array transfer + kernel overlapping is better than in WDDM, but the two modes behave the same on pinned arrays, even though TCC doesn't batch commands.

  • tried the per-thread default-stream sync too

  • I guess the Quadro K420, having only CC 3.0, can't do Hyper-Q.

How can I overlap 2 streams using async commands, 1 sync command per stream, and pinned arrays?
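To be concrete about the setup I mean, here is a sketch of the per-stream allocation (driver API; sizes are placeholders):

```cpp
// Pinned host buffer + non-blocking stream, driver API.
void* hostBuf;
cuMemHostAlloc(&hostBuf, bytes, 0);               // page-locked host memory
CUdeviceptr devBuf;
cuMemAlloc(&devBuf, bytes);
CUstream stream;
cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);  // no implicit sync with stream 0
```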

I also observed that the first batch (copy+kernel+copy+sync) overlaps, while the rest are strictly ordered, like a staircase per device across all streams.

  • all CPU threads (each with its own streams) take commands from a list as soon as possible and never wait for any other stream/CPU thread.

  • Why would the pageable-array version overlap while the pinned one cannot?

  • Why do pinned and pageable arrays show the same bandwidth (2.9 GB/s htod, 3.2 GB/s dtoh) in the NVIDIA Visual Profiler, even though the pinned version completes much quicker?

  • the streams are of the non-blocking type

  • tried the spin-wait/blocking/yield scheduling flags for the contexts (not only per device but per stream too)

  • also tried 1 context per device instead of per stream; same result

  • tried giving a different kernel object / different kernel name per stream

  • the data is different per stream

Is cuEventSynchronize needed to get the desired overlapping behavior on pinned arrays with 1 sync per repeat?

Only 1 stream synchronization after 100 repeats (I don't want this, but just to show it) (pinned arrays)

1 synchronization per repeat (I need this because of a callback function per repeat) (pinned arrays again) (would Hyper-Q help here?)

1 synchronization per repeat (pageable arrays)

Only 1 stream synchronization after 100 repeats (pageable arrays)

How much heavier is cuStreamSynchronize(stream) than cuEventSynchronize(event)?

Edit: Just tested cuEventSynchronize (using the blocking flag and an event recorded in the stream) instead of cuStreamSynchronize(stream); the behavior is the same.
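The event-based variant described above can be sketched like this (driver API; the surrounding enqueue calls are elided):

```cpp
// Event-based sync: record an event after the D->H copy, then block the
// CPU on the event instead of on the whole stream.
CUevent ev;
cuEventCreate(&ev, CU_EVENT_BLOCKING_SYNC);  // sleep rather than spin while waiting
// ... enqueue htod + kernel + dtoh into `stream` ...
cuEventRecord(ev, stream);
cuEventSynchronize(ev);                      // returns once work up to the event is done
```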

Edit 2: cuStreamWaitEvent did not work because the data is not consistent between CPU and GPU, but it did let overlapping happen again. Maybe there should be a separate stream that uses this, and the CPU should wait on that stream instead, so the compute stream can overlap with other compute streams while the CPU-side waiting happens on the waiting stream. I'll test this.

Edit: cuStreamWaitEvent waits on the device side; it does not do any waiting on the CPU side.
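To illustrate the difference (a sketch; `otherStream` and `ev` are placeholder handles): cuStreamWaitEvent only inserts a GPU-side dependency, so it never blocks the host thread, which is why the host data was not consistent yet.

```cpp
// cuStreamWaitEvent stalls `otherStream` on the device until `ev` fires;
// the CPU thread is never blocked, so host buffers are not yet safe to read.
cuEventRecord(ev, stream);              // ev fires when stream's prior work is done
cuStreamWaitEvent(otherStream, ev, 0);  // device-side dependency only
// Host-side consistency still needs cuEventSynchronize(ev) or
// cuStreamSynchronize(stream) before the CPU touches the results.
```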

SetEnvironmentVariable("CUDA_DEVICE_MAX_CONNECTIONS", "16");

seems to work for TCC mode. Adding this made it overlap, but WDDM hasn't changed.
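One detail worth stating (as far as I know, CUDA_DEVICE_MAX_CONNECTIONS is read when the context is initialized): the variable has to be set before cuInit/context creation to take effect.

```cpp
// Set before any CUDA initialization, or it has no effect.
// SetEnvironmentVariable is the Win32 call; use setenv() on Linux.
SetEnvironmentVariable("CUDA_DEVICE_MAX_CONNECTIONS", "16");
cuInit(0);
// ... create contexts / streams afterwards ...
```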

TCC mode overlapping with 7 streams, max connections set to 16, and 1 sync per repeat of (htod+kernel+dtoh) with pinned arrays: