cudaMemcpy() Best approach when you need to call it many times?

zeus13i · March 8, 2010, 2:42am

Hi all.

My program is structured as such:

#define N       1000
#define BIG_NUM 1000000

for (int n = 0; n < BIG_NUM; ++n) {
	int array[N];
	kernel<<<...>>>(...);
	cudaMemcpy(array, array_d, ..., cudaMemcpyDeviceToHost);
	printArray(array);
}

That is, I print out the contents of my array[N] after each iteration of the loop (this is necessary as I need to know what it looks like at each time-step).

Some simple experimentation suggests that this cudaMemcpy()+print takes up >90% of the program run-time (making it slower than its CPU counterpart!). As such, I’d like to implement this more efficiently somehow.

Is there a standard solution to this?

Two ideas that I have are:

1. Allocate a massive amount of memory for my array [say, a few GB (I have a C1060)] and then just do a cudaMemcpy when it is full. (But this might take just as long?)
2. Somehow use asynchronous kernel calls? I have never used these before and I find the documentation on them somewhat confusing.

If anyone has any ideas or could lead me in the right direction I’d very much appreciate it.

Thank you in advance.

tmurray · March 8, 2010, 3:49am

cudaMemcpy probably isn’t actually taking that long; the call synchronizes and waits for the kernel to complete, so the kernel’s run time gets counted against it. Launching a kernel is (almost) always asynchronous; when you call kernel<<<…>>>(…);, it’s actually just queuing work for the GPU to perform at some point. It won’t block the CPU and wait for that kernel to finish or anything like that. However, since cudaMemcpy is a synchronous function, it implies that you want the results to be visible, so that will block the CPU until the GPU becomes idle (indicating that all of your work has completed).

If you want more sensible measurements of whether it’s your kernel or the memcpy that’s taking a long time without a lot of effort, set the environment variable CUDA_LAUNCH_BLOCKING to 1. This will force kernel calls to act synchronously, which means that the memcpy timing will represent only the memcpy and not both the memcpy and the kernel.

zeus13i · March 8, 2010, 4:04am

Thanks for the reply Tim.

How does one do this?

But, how can it be the kernel if, when I comment out the cudaMemcpy lines it takes 10% of the time?

tmurray · March 8, 2010, 4:07am

In Linux, run CUDA_LAUNCH_BLOCKING=1 ./application. In Windows, go to a command prompt, type set CUDA_LAUNCH_BLOCKING=1, then run your app.

It would only take 10% of the time because you’re not waiting for the GPU to complete if you’re not either synchronizing explicitly with cudaThreadSynchronize or calling another synchronous function later on.
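For per-call numbers rather than whole-program wall time, CUDA events are another option. The following is a minimal sketch, not code from this thread: busyKernel, Timings and timeStep are hypothetical names, the kernel is a stand-in for the real one, and error checking is left out for brevity. Events are recorded on the default stream, so the interval between t0 and t1 covers only the kernel and the interval between t1 and t2 covers only the device-to-host copy.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: writes 2*i into out[i].
__global__ void busyKernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
}

struct Timings {
    float kernelMs;  // GPU time spent in the kernel
    float copyMs;    // GPU time spent in the device-to-host copy
};

// Times one kernel launch and one cudaMemcpy separately.
// dev must point to n ints of device memory, host to n ints of host memory.
Timings timeStep(int* dev, int* host, int n) {
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);                       // before the kernel
    busyKernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaEventRecord(t1);                       // after the kernel, before the copy
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(t2);                       // after the copy
    cudaEventSynchronize(t2);                  // wait until t2 has actually happened

    Timings t;
    cudaEventElapsedTime(&t.kernelMs, t0, t1);
    cudaEventElapsedTime(&t.copyMs, t1, t2);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    return t;
}
```

Calling timeStep once per iteration and summing the two fields should show whether the kernel or the copy dominates, without relying on commenting lines out.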

zeus13i · March 8, 2010, 4:18am

As a matter of fact, I am using cudaThreadSynchronize() (not sure what it does, just playing it safe).

inline void solveCable(TYPE* ivolts, TYPE* volts, const unsigned nBlocks){
	#pragma unroll
	for (unsigned b=0; b<nBlocks; ++b){
		solveCable_B<<<1,NUM_TPB_cS_B>>>(ivolts, volts, b, nBlocks);
		cudaThreadSynchronize();
	}
}

for (unsigned step = 0; step < SPC; ++step){
	solveCable(ivolts, volts, nBlocks);
	cudaMemcpy(volts,..., D2H);
	printArray(volts);
}

./application took 36s

CUDA_LAUNCH_BLOCKING=1 ./application took 35s

CUDA_LAUNCH_BLOCKING=1 ./application && commenting out cudaThreadSynchronize() took 34s

I’m confused… what does this mean?

tmurray · March 8, 2010, 4:22am

Well, I assumed you were doing some other timing to see which calls in your app were taking a specific amount of time. Comparing with/without cudaThreadSynchronize doesn’t necessarily mean anything.

If you’re only launching one block you are using 1/(number between 2 and 30) of the GPU’s processing power. You really need to launch at least as many blocks as you have SMs (read the programming guide if you don’t know what this is), preferably many times that amount.

zeus13i · March 8, 2010, 4:30am

I realise I’m under-utilising the GPU when only using 1 block, however I believe the nature of this particular algorithm (which is serial by nature) restricts me to doing it like so. What happens is I iterate over an array and the computation for each element in the array is dependent on the previous one… so I don’t think that can be parallelised :(
Thanks for pointing that out, though.

A question that still remains: will there be a difference in time between copying, say, 10 MB ten times and 100 MB once?
And what can I do about my problem, aside from optimizing the kernel function itself?

tmurray · March 8, 2010, 4:34am

That probably could be parallelized with scan: http://en.wikipedia.org/wiki/Prefix_sum
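To illustrate the idea, here is a minimal sketch under a loud assumption: the actual recurrence was not posted, so this assumes it has the first-order linear form x[i] = a[i]*x[i-1] + b[i]. Each step is then an affine map, composing affine maps is associative, and an inclusive scan over the maps gives every x[i] at once. Affine, Compose and solveRecurrence are hypothetical names; the scan itself uses thrust::inclusive_scan from the Thrust library. A recurrence of a different shape would need a different operator, or may not fit this pattern at all.

```cuda
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>

// One step of the assumed recurrence, viewed as the map x -> a*x + b.
struct Affine {
    float a;
    float b;
};

// Composition "apply f first, then g". Associative, but not commutative.
struct Compose {
    __host__ __device__
    Affine operator()(const Affine& f, const Affine& g) const {
        Affine r;
        r.a = g.a * f.a;        // g(f(x)) = g.a*(f.a*x + f.b) + g.b
        r.b = g.a * f.b + g.b;
        return r;
    }
};

// Returns x[0..n-1] for x[i] = a[i]*x[i-1] + b[i], with x0 as the value before step 0.
thrust::host_vector<float> solveRecurrence(const thrust::host_vector<Affine>& steps,
                                           float x0) {
    thrust::device_vector<Affine> d(steps);
    // After the scan, d[i] is the composition of steps 0..i.
    thrust::inclusive_scan(d.begin(), d.end(), d.begin(), Compose());

    thrust::host_vector<Affine> prefix(d);
    thrust::host_vector<float> x(prefix.size());
    for (std::size_t i = 0; i < prefix.size(); ++i)
        x[i] = prefix[i].a * x0 + prefix[i].b;  // apply the composed map to x0
    return x;
}
```

For example, with steps (2,1), (3,0), (1,5) and x0 = 1, the serial loop gives 3, 9, 14, and the scan version returns the same values. With general floating-point inputs the two can differ slightly, because the scan regroups the arithmetic.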

Copying one large chunk is generally faster than copying the same amount of data as many small chunks, because each copy carries some fixed driver overhead that is amortized as you copy more data per call.
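Applied to the loop in the first post, that suggests buffering several time-steps on the device and copying them back together. A minimal sketch, assuming the results of BATCH steps fit in device memory: stepKernel, BATCH and runBatched are hypothetical names, the kernel body is a placeholder (a real time-stepping kernel would presumably read the previous step's slot), and error checking is omitted.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define N     1000  // elements per time-step, as in the original post
#define BATCH 100   // hypothetical: time-steps kept on the device between copies

// Placeholder for the real per-step kernel: fills this step's slot.
__global__ void stepKernel(int* out, int step) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = step + i;
}

// Runs numSteps steps and stores all results in host (numSteps * N ints).
// Returns the number of cudaMemcpy calls that were needed.
int runBatched(int numSteps, int* host) {
    int* dev = 0;
    cudaMalloc(&dev, sizeof(int) * N * BATCH);

    int copies = 0;
    for (int base = 0; base < numSteps; base += BATCH) {
        int remaining = numSteps - base;
        int count = remaining < BATCH ? remaining : BATCH;

        // Launches are asynchronous: they queue up without blocking the CPU.
        for (int s = 0; s < count; ++s)
            stepKernel<<<(N + 255) / 256, 256>>>(dev + s * N, base + s);

        // One blocking copy per batch instead of one per step.
        cudaMemcpy(host + (std::size_t)base * N, dev,
                   sizeof(int) * N * count, cudaMemcpyDeviceToHost);
        ++copies;
    }

    cudaFree(dev);
    return copies;
}
```

The printing can then run over a whole batch on the host after each copy. Whether this helps in practice depends on how much of the run time is really the copy rather than the kernel or the printing itself.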

zeus13i · March 8, 2010, 4:51am

Thank you very much.
