So d_x is the buffer for b in the input parameters; it is also a buffer on the GPU, allocated by cudaMalloc.
Since the cudaMemcpy has nothing to do with the write to d_x in the middle, I’d assume the cudaMemcpy should take the same time for the same M and N.
Now it turns out that the more values are written to d_x (at different indices), the slower the cudaMemcpy is. For example, when M=10240, N=1: if d_x is not touched at all, the time spent in cudaMemcpy is roughly 0.163 msec. If 1024 of the M elements are touched, as the result of cublasSgemm, then the time of cudaMemcpy goes up to 27.733 msec.
Also, if d_x is a smaller buffer, say 1024x1, but the whole of d_x is touched P times, then the time of cudaMemcpy is the same as if d_x were of size 1024*P with all elements touched just once.
Can anyone please explain to me why this is so? Looks to me like a cache interference… :huh:
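To make the question concrete, here is a minimal sketch of the situation as I understand it. The buffer names, matrix roles, and leading dimensions are my guesses reconstructed from the description above, not the actual code:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int M = 10240, N = 1;

    float *d_A, *d_b, *d_x;
    cudaMalloc(&d_A, (size_t)M * M * sizeof(float)); // ~400 MB at M=10240
    cudaMalloc(&d_b, (size_t)M * N * sizeof(float));
    cudaMalloc(&d_x, (size_t)M * N * sizeof(float));
    float *h_x = (float *)malloc((size_t)M * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // x = A * b. This call returns to the CPU immediately; the GEMM
    // runs asynchronously on the GPU.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, M,
                &alpha, d_A, M, d_b, M, &beta, d_x, M);

    // Wall-clock timing of just this line actually measures
    // GEMM time + copy time, because cudaMemcpy blocks until all
    // previously launched device work has finished.
    cudaMemcpy(h_x, d_x, (size_t)M * N * sizeof(float),
               cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_b); cudaFree(d_x);
    free(h_x);
    return 0;
}
```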
Kernel launches are asynchronous, and cudaMemcpy inserts an implicit synchronization. Thus, the more time your kernel takes, the more time cudaMemcpy will take to execute. Wall clock timings in CUDA are ONLY correct if cudaThreadSynchronize() is called just before making the wall clock measurement.
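A sketch of the correct timing pattern (walltime() is a placeholder for whatever wall-clock source you use; cudaThreadSynchronize() has since been renamed cudaDeviceSynchronize(), which does the same thing here):

```cuda
cudaDeviceSynchronize();          // drain any previously queued work
double t1 = walltime();

myKernel<<<grid, block>>>(d_x);   // returns immediately (asynchronous)
cudaDeviceSynchronize();          // block until the kernel is done
double t2 = walltime();           // t2 - t1 = kernel time

cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
double t3 = walltime();           // t3 - t2 = copy time only
```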
I put cudaThreadSynchronize() at the end of every function that I’m profiling, and the numbers make sense now. I’m just surprised that this would even affect a function like cublasSgemm, which I assumed would have a cudaThreadSynchronize() at the end of it. No? But how?
Why would they do that? Asynchronous launches are key to getting high performance in GPU code. Any library that adds a cudaThreadSynchronize() to every call is doing so for no good reason.
I can try, but I already explained it.
t1 = walltime()
call kernel that takes N milliseconds
t2 = walltime()
call cudaMemcpy whose actual memory copy takes M milliseconds
t3 = walltime()
Because launches are asynchronous, t2 - t1 = 0 (or really close to 0). But cudaMemcpy has to implicitly synchronize, since you might be copying results that are outputs of the kernel. So the N milliseconds of the kernel execution happen “inside” the cudaMemcpy, and t3 - t1 = M + N. Thus, as you increase the kernel workload N, the cudaMemcpy appears to take longer and longer, as you described.
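An alternative way to separate the two costs is to timestamp on the GPU itself with CUDA events, which sidesteps the implicit-sync pitfall entirely. This is a sketch under the same assumptions as the timeline above (myKernel, grid, block, and bytes are placeholders):

```cuda
cudaEvent_t start, afterKernel, afterCopy;
cudaEventCreate(&start);
cudaEventCreate(&afterKernel);
cudaEventCreate(&afterCopy);

cudaEventRecord(start);
myKernel<<<grid, block>>>(d_x);            // the N-millisecond kernel
cudaEventRecord(afterKernel);
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(afterCopy);
cudaEventSynchronize(afterCopy);           // wait for the last event

float kernelMs, copyMs;
cudaEventElapsedTime(&kernelMs, start, afterKernel);   // ~N msec
cudaEventElapsedTime(&copyMs, afterKernel, afterCopy); // ~M msec

cudaEventDestroy(start);
cudaEventDestroy(afterKernel);
cudaEventDestroy(afterCopy);
```

Because the events are recorded into the same stream as the kernel and the copy, the elapsed times reflect when the work actually ran on the device, not when the CPU-side calls returned.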