more touch, more time

dgetrf · April 23, 2010, 6:56am

Hi there

I have a simple function like this

foo (float * b )
{
cudaMalloc((void**)&d_x, NMsizeof(float));
/*
write some value to d_x…
*/

      cudaMemcpy (b, d_x, M*N*sizeof(float), cudaMemcpyDeviceToDevice);

}

So d_x is like a buffer for b in the input parameter, which is also a buffer on GPU allocated by cudaMalloc.

Since the cudaMemcpy has nothing to do with the write-to-d_x part in the middle, I’d assume the cudaMemcpy should take same time for same M and N.

Now it turns out that the more value is written to d_x (different index), the slower the cudaMemcpy is. For example, when M=10240, N=1, if no touch to d_x at all, the time spent in cudaMemcpy is roughly 0.163 msec. If 1024 of M are touched, as the result of CublasSgemm, then the time of cudaMemcpy goes up to 27.733 msec.

Also, if d_x is a smaller buffer, say 1024x1, but the whole d_x is touched P times, then the time of cudaMemcpy is the same as if d_x is size 1024*P and all elements are touched just once.

Can anyone please explain to me why this is so? Looks to me like a cache interference… :huh:

Thanks a lot!

avidday · April 23, 2010, 7:56am

How and where are you timing this?

dgetrf · April 23, 2010, 8:46am

I put non-device-kernel into a .cpp file and use TAU, and to make sure of the overhead, I compare the performance with and without TAU, quite close.

Then any function can be timed by TAU

MisterAnderson42 · April 23, 2010, 12:06pm

Kernel launches are asynchronous, and cudaMemcpy inserts an implicit synchronization. Thus, the more time your kernel takes, the more time cudaMemcpy will take to execute. Wall clock timings in CUDA are ONLY correct if cudaThreadSynchronize() is called just before making the wall clock measurement.

dgetrf · April 23, 2010, 5:18pm

I put cudaThreadSynchronize() to the end of every function that I’m profiling and it seems to make sense now. Just I’m surprised that this would even affect function like CublasSgemm which I assume to have a cudaThreadSynchronize() by the end of it, no? but how?

dgetrf · April 23, 2010, 5:24pm

also ,could you explain a little bit more that “cudaMemcpy inserts an implicit synchronization. Thus, the more time your kernel takes, the more time cudaMemcpy will take to execute”…

if I only have one cudaMemcpy at the end, how does the kernel running before it influence the time of cudaMemcpy?

MisterAnderson42 · April 23, 2010, 5:42pm

Why would they do that? asynchronous launches are key to getting high performance in GPU codes. Any library that adds a cudaThreadSynchronize() in every call is doing so for no good reason.

I can try, but I already explained it.

cudaThreadSynchronize()

t1 = walltime()

call kernel that takes N milliseconds

t2 = walltime()

call cudamemcpy whos actual memory copy takes M milliseconds

t3 = walltime()

Because launches are asynchronous: t2-t1 = 0 (or really close to 0). But cudamempy has to implicitly syncrhonize as you might be copying results that are outputs from the kernel. So the N milliseconds of the kernel launch happens “inside” the cudamemcpy, and t3-t1 = M + N. Thus, as you increase the workload N, the cudaMemcpy appears to take longer and longer as you described.

dgetrf · April 23, 2010, 5:49pm

I think I get it about the longer cudamemcpy.

But to cublasSgemm, if the routine is not assumed synchronous, when can the user be sure that the result C is safe to use?

Why would they do that? asynchronous launches are key to getting high performance in GPU codes. Any library that adds a cudaThreadSynchronize() in every call is doing so for no good reason.

I can try, but I already explained it.
cudaThreadSynchronize()

t1 = walltime()

call kernel that takes N milliseconds

t2 = walltime()

call cudamemcpy whos actual memory copy takes M milliseconds

t3 = walltime()
Because launches are asynchronous: t2-t1 = 0 (or really close to 0). But cudamempy has to implicitly syncrhonize as you might be copying results that are outputs from the kernel. So the N milliseconds of the kernel launch happens “inside” the cudamemcpy, and t3-t1 = M + N. Thus, as you increase the workload N, the cudaMemcpy appears to take longer and longer as you described.

Tigga · April 23, 2010, 5:53pm

A kernel or memory copy in the same stream will wait for previous asynchronous kernels or memory copies in the same stream to finish before executing.

dgetrf · April 23, 2010, 6:11pm

any chance we could put cublas routine like cublassgemm and our own kernel into different stream?

Topic		Replies	Views
cudaMemcpy() Best approach when you need to call it many times? CUDA Programming and Performance	8	25062	March 8, 2010
Kernel dimension influences cudaMemcpy? CUDA Programming and Performance	4	2413	September 26, 2007
Do the non-async calls sleep or burn CPU? CUDA Programming and Performance	20	22044	January 13, 2008
Possibly Studpid question bout cudaMemcpy CudaMemcpy getting slow by time CUDA Programming and Performance	4	2001	February 26, 2010
cudaMemcpyAsync cpu Load? CUDA Programming and Performance cuda	2	559	April 24, 2023
bets way to return a float value sync or assync CUDA Programming and Performance	26	10314	May 7, 2009
cudaMalloc() vs cudaMallocManaged() wrt to cudaMemcpy() CUDA Programming and Performance	9	4534	October 11, 2018
Trying to run cudaMemsetAsync in a more timely manner CUDA Programming and Performance	8	1263	September 15, 2019
cudaMemcpyAsync makes code faster even when using the default stream 0 CUDA Programming and Performance	1	1402	January 10, 2022
cudaMemcpy is slow the first time used in a loop CUDA Programming and Performance cuda	3	1654	October 12, 2021

more touch, more time

Related topics