Unpredictable nature of GPU action timing in Nsight

I have a question about the picture below, which is using 3 GPU to compute an answer over time. All of the data resides on GPU 0 then is copied to GPU 1&2. Then computation then transfered back to GPU 0.
-Using Unified Virtual Addressing.

P2P/TCC was not used for the data movement - even more problems with timing and incorrect results occured when turning TCC on. (saving that to figure out later, unless there is a known bug?)

What would cause the variance that the picture shows such as:
–Memory copies taking longer randomly
–Pauses/gaps between actions

For time referance the pause in blue section is ~3 ms. Then Purple ~ 3.6 ms

I’m looking for suggestions on what to possibly look at to understand the behavior.
Or is this normal?

Addition:
After much more testing it just seems that the runtime is not perfectly consistant, but it still seems strange to have random 3 ms pauses. Or to have memory transfers some times take 2-5x their normal time (this occured, not in image though)

Any thoughts? Or anyone who has also seen this?

i take it you are running windows?

you mention both random gaps in processing, and memory copies taking too long

with regards to the latter, i suppose you could accumulate all transfers across the pci bus, for all contexts (you seem to have 3), for a particular time window, to ascertain whether memory copies take too long

it also seems that you are mostly registering the ‘error’ in a single context, as opposed to multiple contexts
personally, i would take note of this

the host also seems to spend a lot of time on cudaStreamSynchronize()
i am wondering whether cudaStreamSynchronize() is the best approach, when managing multiple gpus - are you sure this is not affecting the host’s ability to manage all streams/ devices effectively, without spending too much time on a single instance at any time?

Thank you for your input. Yes I am using windows, the goal is to have the 3 compute GPU running in TCC mode but debugging still. TCC seems to change how my code is run.

Hmm, With out TCC it might be more benifical to transfer to the host once then to all GPU.

As for the cudaStreamSynchronize() I made a snipit of the logic below. The fisrt group of memory copies is the longest. (could copy alternating segments of each list to GPU to over lap?) I was trying to follow multi GPU guides but they often only deal with host to GPU memory copies. Eventually more computation will be added to this project and the ability to overlap data and compute time might increase.

			for (int k = 0; k < CALC_GPU_COUNT; k++)
			{
				int i = deviceNumber[k];
                                    cudaSetDevice(i);
                                    //Using UVA copy from main GPU to temp data structures on all GPU - I herd there is no P2P memory broadcast?
                                    //Cant use async copy becuas of copying from the same data source - there is data overlap required for computation
				cudaMemcpy(dataA[i], main_dataA, count*sizeof(float)*3, cudaMemcpyDefault);
				cudaMemcpy(dataB[i], main_dataB, count2*sizeof(UINT), cudaMemcpyDefault);
				cudaMemcpy(dataC[i], main_dataC, count2*sizeof(UINT), cudaMemcpyDefault);
			}
			
			// Compute on picked GPU
			for (int k = 0; k < CALC_GPU_COUNT; k++)
			{
				int i = deviceNumber[k];cudaSetDevice(i);
				calculate<< <blocks, threads, 0, streams[i] >> >(dataA[i], dataB[i], dataC[i], other_values, workGroup_step*i);
			}

// Finish compute before data merge
for (int k = 0; k < CALC_GPU_COUNT; k++){
int i = deviceNumber[k];
cudaSetDevice(i);
cudaStreamSynchronize(streams[i]);
}

                            // Collect Data on main GPU
			cudaSetDevice(MainGPU);
			int total = SEGMENT_COUNT_CU;
			int loopGroup = 0;
			for (int k = 0; k < CALC_GPU_COUNT; k++)
			{
				int i = deviceNumber[k];
				/*  Merge Data  */
			}

If theres no clear explination no need to think to far into the issue, I should work out TCC mode before I look at this more minor artifact.

i am not really a windows activist; hence my knowledge/ experience of said os is limited
however, i get the impression, that with tcc on/ off, synchronization, and the method of synchronization, is even more important, as windows/ tcc might impact what (work) gets out of the door when

“Cant use async copy becuas of copying from the same data source - there is data overlap required for computation”

data overlap meaning?
or, are you essentially referring to synchronization

and i think you are certainly paying a price for using synchronous memory copies
i would take it as implying the host can not schedule - get to scheduling - any kernels, until all memory copies are completed
this seems unnecessary, and limiting

i have to double check whether stream events are not cross-device
this would allow the host to better schedule work, whilst preserving what i perceive is a cross-device synchronization requirement only

When I said:
"Cant use async copy because of copying from the same data source
I read that if multiple memory copies were coping from the same location they could not overlap.
Before reading that I had tried it and it resulted in errors.
From what I understand async memcpy are normally from different sources to different destination if they are overlapping.

Then I meant this as a second part

  • there is data overlap required for computation"
    Even though the computation is being divided across GPU, the work required often needs a majority of the data. The computation per calculation needs a set of neighboring data.

I have made independent streams for each device.

Another image showing streams and data rates.

The selected and the red line are the same data, then the black and blue are about the same data size.

There are times when the high/low speeds are flipped for GPU 1&2. Also times when they are both at 9 GB/s

Had an error and it was posted 2 times, wasnt sure how to delete this one

so, with ‘data overlap’ you essentially mean that the same data array/ matrix - or parts thereof - is needed by, and must be copied to multiple devices

“I read that if multiple memory copies were coping from the same location they could not overlap.”

i am not going to question this; however, in observing:
a) if the memory copies have the same direction (h2d seems safest), i am wondering whether the driver would not simply mutex lock the memory
b) if the overlap is in the source, and not the destination, i do not see the potential for harm done

“From what I understand async memcpy are normally from different sources to different destination if they are overlapping.”

the important point for me is what is implied in terms of the host’s action by a synchronous/ asynchronous memory operation
asynchronous memory copies in the same stream, are still synchronous with respect to each other; hence, issues of overlap should not be a concern
here, a consistent stream is implied to enforce synchronization; there are other methods as well
asynchronous memory copies in terms of the host, would also imply that the host would not wait on each memory copy; hence, it can issue all memory copies, and immediately move on to issuing the kernels

in returning to the point of overlap, i am confident that you can still use asynchronous memory copies, even if you have overlap, and see it as mandatory to guard against it
guarding against overlap would then simply imply additional synchronization required - a memory transaction can not commence, before certain others have completed
synchronous memory transactions is one way of securing such synchronization, but may not be a very effective method of synchronization

Okay thanks I will try testing more.

I can try to make each GPU’s memcpy asynchronous again across different GPU, but the work on a single GPU needs to be synchronized due to the kernel needing the data before running.

I get partial data errors in the area between the work load divides when changing the initial copies to:

for (int k = 0; k < CALC_GPU_COUNT; k++)
{
int i = deviceNumber[k];
cudaSetDevice(i);
cudaMemcpyAsync(dataA[i], main_dataA, countsizeof(float)3, cudaMemcpyDefault, streams[i]);
cudaMemcpyAsync(dataB[i], main_dataB, count2
sizeof(UINT), cudaMemcpyDefault, streams[i]);
cudaMemcpyAsync(dataC[i], main_dataC, count2
sizeof(UINT), cudaMemcpyDefault, streams[i]);
}

have double-checked; events are cross-device

from the pg:

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.

“but the work on a single GPU needs to be synchronized due to the kernel needing the data before running”

i do not see a problem; little you can’t do without cudaStreamWaitEvent() (in more ways than one)

I tested asynchronous copies from the main GPU0 to assisting GPU1&2 using the assisting GPU’s stream
-This resulted in all of the copies from the main GPU0 being synchronous on a single stream, but while on independent streams on the reviving GPU1&2

Then I tested launching the copies from the perspective of the main GPU0 on different streams, to the other GPU
-This resulted in the copies on different GPU0 streams but still synchronous

From this I will guess that only one memory transfer off of a GPU can be done at a time, forced synchronous.


So next I tested copying the data from GPU0 to the CPU, then transferring it to GPU1&2 each on their own steams
-This resulted in true asynchronous behavior for the copy. And no data errors

^that worked really well actually

even though events are device-global, streams are device-local, if i am not mistaken
thus, i am not certain about your first implementation

the second implementation seems solid

if the memory transactions are more asynchronous, you should also note more asynchronous kernel launches, with kernels then commencing after their preceding memory transactions in a more timely manner

i would also go as far as postulating that tcc should be less of an issue now, as you have now essentially rendered the host far more asynchronous, relative to the devices