Unpredictable nature of GPU action timing in Nsight

sobe118 · October 17, 2015, 5:39pm

I have a question about the picture below, which shows 3 GPUs computing an answer over time. All of the data resides on GPU 0 and is copied to GPU 1 and 2. The computation then runs and the results are transferred back to GPU 0.
-Using Unified Virtual Addressing.

P2P/TCC was not used for the data movement - even more problems with timing and incorrect results occurred when turning TCC on. (Saving that to figure out later, unless there is a known bug?)

What would cause the variance that the picture shows such as:
–Memory copies taking longer randomly
–Pauses/gaps between actions

For time reference, the pause in the blue section is ~3 ms and the one in the purple section is ~3.6 ms.

I’m looking for suggestions on what to possibly look at to understand the behavior.
Or is this normal?

Addition:
After much more testing, it just seems that the runtime is not perfectly consistent, but it still seems strange to have random 3 ms pauses, or to have memory transfers sometimes take 2-5x their normal time (this occurred, though it is not in the image).

Any thoughts? Or anyone who has also seen this?

little_jimmy · October 18, 2015, 6:59am

i take it you are running windows?

you mention both random gaps in processing, and memory copies taking too long

with regards to the latter, i suppose you could accumulate all transfers across the pci bus, for all contexts (you seem to have 3), for a particular time window, to ascertain whether memory copies take too long

it also seems that you are mostly registering the ‘error’ in a single context, as opposed to multiple contexts
personally, i would take note of this

the host also seems to spend a lot of time on cudaStreamSynchronize()
i am wondering whether cudaStreamSynchronize() is the best approach, when managing multiple gpus - are you sure this is not affecting the host’s ability to manage all streams/ devices effectively, without spending too much time on a single instance at any time?

sobe118 · October 18, 2015, 7:43am

Thank you for your input. Yes, I am using Windows. The goal is to have the 3 compute GPUs running in TCC mode, but I am still debugging; TCC seems to change how my code runs.

Hmm, without TCC it might be more beneficial to transfer to the host once and then to all GPUs.

As for the cudaStreamSynchronize(), I made a snippet of the logic below. The first group of memory copies is the longest. (Could I copy alternating segments of each list to the GPUs to overlap?) I was trying to follow multi-GPU guides, but they often only deal with host-to-GPU memory copies. Eventually more computation will be added to this project, and the ability to overlap data transfer and compute time might increase.

    for (int k = 0; k < CALC_GPU_COUNT; k++)
    {
        int i = deviceNumber[k];
        cudaSetDevice(i);
        // Using UVA, copy from the main GPU to temp data structures on all GPUs - I heard there is no P2P memory broadcast?
        // Can't use async copy because of copying from the same data source - there is data overlap required for computation
        cudaMemcpy(dataA[i], main_dataA, count*sizeof(float)*3, cudaMemcpyDefault);
        cudaMemcpy(dataB[i], main_dataB, count2*sizeof(UINT), cudaMemcpyDefault);
        cudaMemcpy(dataC[i], main_dataC, count2*sizeof(UINT), cudaMemcpyDefault);
    }

    // Compute on picked GPU
    for (int k = 0; k < CALC_GPU_COUNT; k++)
    {
        int i = deviceNumber[k];
        cudaSetDevice(i);
        calculate<<<blocks, threads, 0, streams[i]>>>(dataA[i], dataB[i], dataC[i], other_values, workGroup_step*i);
    }

    // Finish compute before data merge
    for (int k = 0; k < CALC_GPU_COUNT; k++)
    {
        int i = deviceNumber[k];
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }

    // Collect data on main GPU
    cudaSetDevice(MainGPU);
    int total = SEGMENT_COUNT_CU;
    int loopGroup = 0;
    for (int k = 0; k < CALC_GPU_COUNT; k++)
    {
        int i = deviceNumber[k];
        /*  Merge Data  */
    }

If there’s no clear explanation, no need to think too far into the issue; I should work out TCC mode before I look at this more minor artifact.

little_jimmy · October 18, 2015, 8:49am

i am not really a windows activist; hence my knowledge/ experience of said os is limited
however, i get the impression, that with tcc on/ off, synchronization, and the method of synchronization, is even more important, as windows/ tcc might impact what (work) gets out of the door when

“Can’t use async copy because of copying from the same data source - there is data overlap required for computation”

data overlap meaning?
or, are you essentially referring to synchronization

and i think you are certainly paying a price for using synchronous memory copies
i would take it as implying the host can not schedule - get to scheduling - any kernels, until all memory copies are completed
this seems unnecessary, and limiting

i have to double check whether stream events are not cross-device
this would allow the host to better schedule work, whilst preserving what i perceive is a cross-device synchronization requirement only

sobe118 · October 18, 2015, 10:43am

When I said:
"Can’t use async copy because of copying from the same data source"
I read that if multiple memory copies were copying from the same location they could not overlap.
Before reading that I had tried it, and it resulted in errors.
From what I understand, async memcpys are normally from different sources to different destinations if they are overlapping.

Then I meant this as a second part:

"there is data overlap required for computation"
Even though the computation is being divided across GPUs, the work required often needs a majority of the data. Each calculation needs a set of neighboring data.

I have made independent streams for each device.

Another image showing streams and data rates.

The selected and the red line are the same data, then the black and blue are about the same data size.

There are times when the high/low speeds are flipped for GPU 1&2, and also times when they are both at 9 GB/s.

little_jimmy · October 18, 2015, 12:59pm

so, with ‘data overlap’ you essentially mean that the same data array/ matrix - or parts thereof - is needed by, and must be copied to multiple devices

“I read that if multiple memory copies were copying from the same location they could not overlap.”

i am not going to question this; however, in observing:
a) if the memory copies have the same direction (h2d seems safest), i am wondering whether the driver would not simply mutex lock the memory
b) if the overlap is in the source, and not the destination, i do not see the potential for harm done

“From what I understand, async memcpys are normally from different sources to different destinations if they are overlapping.”

the important point for me is what is implied in terms of the host’s action by a synchronous/ asynchronous memory operation
asynchronous memory copies in the same stream, are still synchronous with respect to each other; hence, issues of overlap should not be a concern
here, a consistent stream is implied to enforce synchronization; there are other methods as well
asynchronous memory copies in terms of the host, would also imply that the host would not wait on each memory copy; hence, it can issue all memory copies, and immediately move on to issuing the kernels

in returning to the point of overlap, i am confident that you can still use asynchronous memory copies, even if you have overlap, and see it as mandatory to guard against it
guarding against overlap would then simply imply additional synchronization required - a memory transaction can not commence, before certain others have completed
synchronous memory transactions are one way of securing such synchronization, but may not be a very effective method of synchronization
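
A minimal sketch of the stream-ordering point above, on a single device and with a pinned host buffer as the source. The kernel `add_one` and the function `run_ordered_stream` are hypothetical names for illustration, not code from this thread. The copy-in, kernel, and copy-out are all queued on one stream, so they run in issue order while the host blocks only once at the end.

```cuda
#include <cuda_runtime.h>

// Early-return helper for the sketch; it leaks on failure, which is fine here.
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical kernel for illustration only: adds 1.0f to every element.
__global__ void add_one(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Queues copy-in, kernel, and copy-out on one stream, then blocks once.
// Returns true if every element comes back as its input value plus one.
bool run_ordered_stream(int n) {
    const size_t bytes = n * sizeof(float);
    float* h_buf = nullptr;  // pinned host memory, so the copies can be asynchronous
    float* d_buf = nullptr;
    cudaStream_t stream;

    CHECK(cudaMallocHost(&h_buf, bytes));
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaStreamCreate(&stream));
    for (int j = 0; j < n; ++j) h_buf[j] = static_cast<float>(j);

    // These three calls return to the host right away.
    // The stream executes them in the order they were issued.
    CHECK(cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream));
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    CHECK(cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream));

    // The only point where the host waits.
    CHECK(cudaStreamSynchronize(stream));

    bool ok = true;
    for (int j = 0; j < n; ++j) {
        ok = ok && (h_buf[j] == static_cast<float>(j) + 1.0f);
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return ok;
}
```

With one such stream per device, the host could issue the copies and kernels for every GPU before waiting on any of them.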

sobe118 · October 18, 2015, 1:15pm

Okay thanks I will try testing more.

I can try to make each GPU’s memcpy asynchronous again across different GPUs, but the work on a single GPU needs to be synchronized due to the kernel needing the data before running.

I get partial data errors in the area between the work load divides when changing the initial copies to:

    for (int k = 0; k < CALC_GPU_COUNT; k++)
    {
        int i = deviceNumber[k];
        cudaSetDevice(i);
        cudaMemcpyAsync(dataA[i], main_dataA, count*sizeof(float)*3, cudaMemcpyDefault, streams[i]);
        cudaMemcpyAsync(dataB[i], main_dataB, count2*sizeof(UINT), cudaMemcpyDefault, streams[i]);
        cudaMemcpyAsync(dataC[i], main_dataC, count2*sizeof(UINT), cudaMemcpyDefault, streams[i]);
    }

little_jimmy · October 18, 2015, 1:21pm

have double-checked; events are cross-device

from the programming guide:

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.

“but the work on a single GPU needs to be synchronized due to the kernel needing the data before running”

i do not see a problem; little you can’t do without cudaStreamWaitEvent() (in more ways than one)
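
A minimal sketch of how the cross-device event mechanism quoted above might be used, assuming a 64-bit process with UVA so that cudaMemcpyDefault can infer the devices from the pointers. The kernel `fill` and the function `copy_after_event` are hypothetical names for illustration. A stream on the destination device waits on an event recorded on the source device, so the copy cannot start before the source data is ready, and the host never blocks in between.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Early-return helper for the sketch; it leaks on failure, which is fine here.
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical producer kernel: writes one value into every element.
__global__ void fill(float* data, int n, float value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = value;
}

// Produces data on srcDev, then copies it to dstDev only after the producer
// has finished. Returns true if dstDev ends up with the produced values.
bool copy_after_event(int srcDev, int dstDev, int n) {
    const size_t bytes = n * sizeof(float);
    float* d_src = nullptr;
    float* d_dst = nullptr;
    cudaStream_t srcStream, dstStream;
    cudaEvent_t srcReady;

    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaMalloc(&d_src, bytes));
    CHECK(cudaStreamCreate(&srcStream));
    CHECK(cudaEventCreateWithFlags(&srcReady, cudaEventDisableTiming));

    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaMalloc(&d_dst, bytes));
    CHECK(cudaStreamCreate(&dstStream));

    // Producer side: the event is recorded on the stream of the device that owns it.
    CHECK(cudaSetDevice(srcDev));
    fill<<<(n + 255) / 256, 256, 0, srcStream>>>(d_src, n, 42.0f);
    CHECK(cudaEventRecord(srcReady, srcStream));

    // Consumer side: the destination stream waits on the other device's event.
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaStreamWaitEvent(dstStream, srcReady, 0));
    CHECK(cudaMemcpyAsync(d_dst, d_src, bytes, cudaMemcpyDefault, dstStream));

    // Block once, then read the result back for checking.
    CHECK(cudaStreamSynchronize(dstStream));
    std::vector<float> h(n);
    CHECK(cudaMemcpy(h.data(), d_dst, bytes, cudaMemcpyDeviceToHost));

    bool ok = true;
    for (int j = 0; j < n; ++j) ok = ok && (h[j] == 42.0f);

    cudaStreamDestroy(dstStream);
    cudaFree(d_dst);
    cudaSetDevice(srcDev);
    cudaEventDestroy(srcReady);
    cudaStreamDestroy(srcStream);
    cudaFree(d_src);
    return ok;
}
```

The same pattern also runs with srcDev equal to dstDev, which makes it easy to try on a single-GPU machine.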

sobe118 · October 18, 2015, 2:22pm

I tested asynchronous copies from the main GPU0 to the assisting GPU1&2 using the assisting GPU’s stream
-This resulted in all of the copies from the main GPU0 being synchronous on a single stream, while appearing on independent streams on the receiving GPU1&2

Then I tested launching the copies from the perspective of the main GPU0 on different streams, to the other GPUs
-This resulted in the copies being on different GPU0 streams, but still synchronous

From this I would guess that only one memory transfer off of a GPU can be done at a time, so the copies are forced to be synchronous.

So next I tested copying the data from GPU0 to the CPU, then transferring it to GPU1&2, each on their own streams
-This resulted in true asynchronous behavior for the copy, and no data errors

^that worked really well actually
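
A rough sketch of the third approach described above, staging through the host. This is not the code from the project: `broadcast_via_host` and all buffer names are hypothetical, and it assumes a pinned, portable host buffer is acceptable as the staging area. The data leaves the main GPU once, then each target GPU receives it through its own stream.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Early-return helper for the sketch; it leaks on failure, which is fine here.
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Copies n floats from mainDev to every device in targets via pinned host
// memory. Returns true if every target ends up with an identical copy.
bool broadcast_via_host(int mainDev, const std::vector<int>& targets, int n) {
    const size_t bytes = n * sizeof(float);

    // Stand-in for the data that lives on the main GPU.
    std::vector<float> init(n);
    for (int j = 0; j < n; ++j) init[j] = static_cast<float>(j);
    float* d_main = nullptr;
    CHECK(cudaSetDevice(mainDev));
    CHECK(cudaMalloc(&d_main, bytes));
    CHECK(cudaMemcpy(d_main, init.data(), bytes, cudaMemcpyHostToDevice));

    // Step 1: one copy off the main GPU into pinned host memory.
    float* h_stage = nullptr;
    CHECK(cudaHostAlloc(&h_stage, bytes, cudaHostAllocPortable));
    CHECK(cudaMemcpy(h_stage, d_main, bytes, cudaMemcpyDeviceToHost));

    // Step 2: queue one host-to-device copy per target, each on its own stream.
    std::vector<float*> d_copy(targets.size(), nullptr);
    std::vector<cudaStream_t> streams(targets.size());
    for (size_t k = 0; k < targets.size(); ++k) {
        CHECK(cudaSetDevice(targets[k]));
        CHECK(cudaMalloc(&d_copy[k], bytes));
        CHECK(cudaStreamCreate(&streams[k]));
        CHECK(cudaMemcpyAsync(d_copy[k], h_stage, bytes,
                              cudaMemcpyHostToDevice, streams[k]));
    }

    // Step 3: wait for each stream, then verify and clean up.
    bool ok = true;
    std::vector<float> back(n);
    for (size_t k = 0; k < targets.size(); ++k) {
        CHECK(cudaSetDevice(targets[k]));
        CHECK(cudaStreamSynchronize(streams[k]));
        CHECK(cudaMemcpy(back.data(), d_copy[k], bytes, cudaMemcpyDeviceToHost));
        ok = ok && (back == init);
        cudaStreamDestroy(streams[k]);
        cudaFree(d_copy[k]);
    }

    cudaFreeHost(h_stage);
    cudaSetDevice(mainDev);
    cudaFree(d_main);
    return ok;
}
```

In a real pipeline the kernels would be launched on the same per-device streams right after the copies, so each GPU starts computing as soon as its own data has arrived.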

little_jimmy · October 18, 2015, 3:20pm

even though events are device-global, streams are device-local, if i am not mistaken
thus, i am not certain about your first implementation

the second implementation seems solid

if the memory transactions are more asynchronous, you should also note more asynchronous kernel launches, with kernels then commencing after their preceding memory transactions in a more timely manner

i would also go as far as postulating that tcc should be less of an issue now, as you have now essentially rendered the host far more asynchronous, relative to the devices
