more touch, more time

Hi there

I have a simple function like this

void foo(float *b)
{
    float *d_x;                                   /* device scratch buffer */
    cudaMalloc((void **)&d_x, M * N * sizeof(float));

    /*
     * write some values to d_x…
     */

    cudaMemcpy(b, d_x, M * N * sizeof(float), cudaMemcpyDeviceToDevice);
}

So d_x acts as a staging buffer for the input parameter b, which is also a GPU buffer allocated with cudaMalloc.

Since the cudaMemcpy has nothing to do with the write-to-d_x part in the middle, I'd assume the cudaMemcpy should take the same time for the same M and N.

Now it turns out that the more values are written to d_x (at different indices), the slower the cudaMemcpy is. For example, with M=10240 and N=1, if d_x is not touched at all, the time spent in cudaMemcpy is roughly 0.163 msec. If 1024 of the M elements are touched, as the output of cublasSgemm, the cudaMemcpy time goes up to 27.733 msec.

Also, if d_x is a smaller buffer, say 1024x1, but the whole of d_x is touched P times, then the cudaMemcpy takes the same time as when d_x has size 1024*P and every element is touched just once.

Can anyone please explain to me why this is so? Looks to me like cache interference… :huh:

Thanks a lot!

How and where are you timing this?

I put the non-device (host) code into a .cpp file and time it with TAU. To check TAU's overhead, I compared the runs with and without TAU, and they are quite close.

That way any function can be timed by TAU.

Kernel launches are asynchronous, and cudaMemcpy inserts an implicit synchronization. Thus, the more time your kernel takes, the more time cudaMemcpy will take to execute. Wall clock timings in CUDA are ONLY correct if cudaThreadSynchronize() is called just before making the wall clock measurement.
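
For concreteness, here is a minimal sketch of that timing pattern; the kernel, the launch configuration, and the walltime() helper below are just placeholders, not code from this thread:

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

/* Simple host wall-clock timer in seconds (placeholder implementation). */
static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Stand-in for whatever kernel writes to d_x. */
__global__ void touchKernel(float *d_x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_x[i] *= 2.0f;
}

void timedCopy(float *b, float *d_x, int n)
{
    cudaThreadSynchronize();               /* drain any earlier asynchronous work */
    double t1 = walltime();

    touchKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaThreadSynchronize();               /* kernel time now lands in t2 - t1    */
    double t2 = walltime();

    cudaMemcpy(b, d_x, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaThreadSynchronize();               /* copy time now lands in t3 - t2      */
    double t3 = walltime();

    printf("kernel: %g s, memcpy: %g s\n", t2 - t1, t3 - t2);
}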

I put a cudaThreadSynchronize() at the end of every function that I'm profiling and the numbers make sense now. I'm just surprised that this even affects a function like cublasSgemm, which I assumed has a cudaThreadSynchronize() at the end of it, no? But how?

Also, could you explain a little bit more about "cudaMemcpy inserts an implicit synchronization. Thus, the more time your kernel takes, the more time cudaMemcpy will take to execute"?

If I only have one cudaMemcpy at the end, how does the kernel that runs before it influence the time of the cudaMemcpy?

Why would they do that? Asynchronous launches are key to getting high performance in GPU codes. Any library that adds a cudaThreadSynchronize() to every call is doing so for no good reason.

I can try, but I already explained it.

cudaThreadSynchronize()

t1 = walltime()

call kernel that takes N milliseconds

t2 = walltime()

call cudaMemcpy whose actual memory copy takes M milliseconds

t3 = walltime()

Because launches are asynchronous, t2-t1 = 0 (or really close to 0). But cudaMemcpy has to implicitly synchronize, since you might be copying results that are outputs of the kernel. So the N milliseconds of the kernel run happen "inside" the cudaMemcpy, and t3-t1 = M + N. Thus, as you increase the kernel workload N, the cudaMemcpy appears to take longer and longer, as you described.
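
Not mentioned above, but another way to see the true split without juggling host-side synchronization is CUDA events, which are timestamped on the GPU itself and therefore not fooled by the asynchronous launch. The kernel and names below are just placeholders, a sketch rather than anyone's actual code:

#include <cuda_runtime.h>
#include <stdio.h>

/* Stand-in for whatever writes to d_x (in the original post, cublasSgemm). */
__global__ void writeKernel(float *d_x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_x[i] += 1.0f;
}

void timeWithEvents(float *b, float *d_x, int n)
{
    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start, 0);
    writeKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(afterKernel, 0);
    cudaMemcpy(b, d_x, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaEventRecord(afterCopy, 0);
    cudaEventSynchronize(afterCopy);       /* wait until all of the above is done */

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);    /* the "N" part */
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);  /* the "M" part */
    printf("kernel: %.3f ms, memcpy: %.3f ms\n", kernelMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
}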

I think I get it now about the longer cudaMemcpy.

But regarding cublasSgemm: if the routine is not synchronous, when can the user be sure that the result C is safe to use?

A kernel or memory copy in the same stream will wait for previous asynchronous kernels or memory copies in the same stream to finish before executing.
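
A small sketch of what that ordering guarantee means for cublasSgemm, using the old handle-less CUBLAS API; the buffer names and sizes below are just placeholders:

#include <cublas.h>
#include <cuda_runtime.h>

/* C = A * B on the device, then fetch C to the host. d_A, d_B, d_C are device
   buffers; h_C is host memory. */
void gemmThenCopy(const float *d_A, const float *d_B, float *d_C,
                  float *h_C, int n)
{
    /* Asynchronous: cublasSgemm returns to the host before the GEMM finishes. */
    cublasSgemm('n', 'n', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);

    /* Queued in the same (default) stream, so it cannot start until d_C is
       fully written. Since a device-to-host cudaMemcpy also blocks the host,
       h_C is safe to use as soon as this call returns. */
    cudaMemcpy(h_C, d_C, (size_t)n * n * sizeof(float), cudaMemcpyDeviceToHost);
}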

Any chance we could put a CUBLAS routine like cublasSgemm and our own kernel into different streams?