cudaMemcpy() Best approach when you need to call it many times?

Hi all.

My program is structured as such:

#define N			 1,000

#define BIG_NUM 1,000,000

for (int n=0; n<BIG_NUM; ++n){

	int array[N];

	kernel<<<...>>>(...);

   cudaMemcpy(array, array_d,..., 'D2H');

   printArray(array);

}

That is, I print out the contents of my array[N] after each iteration of the loop (this is necessary as I need to know what it looks like at each time-step).

Some simple experimentation suggests that this simple cudaMemcpy()+print take up >90% of the program run-time (making it slower than its CPU counterpart!). As such, I’d like to, somehow, implement this more efficiently.

Is there a standard solution to this?

Two ideas that I have are:

  1. Allocate a massive amount of memory for my array [say, a few GB (I have a C1060)] and then just do a cudaMemcpy when it is full. {But this might take just as long?}

  2. Somehow use Asynchronous kernel calls? I have never used this before and I find the documentation on it somewhat confusing.

If anyone has any ideas or could lead me in the right direction I’d very much appreciate it.

Thank you in advance.

cudaMemcpy probably isn’t actually taking that long–that will synchronize and wait for the kernel to complete. Launching a kernel is (almost) always asynchronous; when you call kernel<<<…>>>(…);, it’s actually just queuing work for the GPU to perform at some point. It won’t block the CPU and wait for that kernel to finish or anything like that. However, since cudaMemcpy is a synchronous function, it implies that you want the results to be visible, so that will block the CPU until the GPU becomes idle (indicating that all of your work has completed).

If you want more sensible measurements of whether it’s your kernel or the memcpy that’s taking a long time without a lot of effort, set the environment variable CUDA_LAUNCH_BLOCKING to 1. This will force kernel calls to act synchronously, which means that the memcpy timing will represent only the memcpy and not both the memcpy and the kernel.

Thanks for the reply Tim.

How does one do this?

But, how can it be the kernel if, when I comment out the cudaMemcpy lines it takes 10% of the time?

in Linux, CUDA_LAUNCH_BLOCKING=1 ./application. in Windows, go to a command prompt, type set CUDA_LAUNCH_BLOCKING=1, then run your app.

It would only take 10% of the time because you’re not waiting for the GPU to complete if you’re not either synchronizing explicitly with cudaThreadSynchronize or calling another synchronous function later on.

As a matter of fact, I am using cudaThreadSynchronise() (Not sure what it does, just playing it safe :/)

inline void solveCable(TYPE* ivolts, TYPE* volts, const unsigned nBlocks){	

	#pragma unroll

	for (int b=0; b<nBlocks; ++b){

		solveCable_B<<<1,NUM_TPB_cS_B>>>(ivolts, volts, b, nBlocks);

		cudaThreadSynchronize();

	}

}

for (unsigned step = 0; step < SPC; ++step){

		solveCable(ivolts, volts, nBlocks);

		cudaMemcpy(volts,..., D2H);

		printArray(volts);

./application took 36s

CUDA_LAUNCH_BLOCKING=1 ./application took 35s

CUDA_LAUNCH_BLOCKING=1 ./application && commenting out cudaThreadSynchronize() took 34s

I’m confused… what does this mean?

well, I assumed you were doing some other timing to see which calls in your app were taking a specific amount of time. comparing with/without cudaThreadSynchronize doesn’t necessarily mean anything.

if you’re only launching one block you are using 1/(number between 2 and 30) of the GPU’s processing power. you really need to launch at least as many blocks as you have SMs (read the programming guide if you don’t know what this is), preferably many times that amount.

I realise I’m under-utilising the GPU when only using 1 block, however I believe the nature of this particular algorithm (which is serial by nature) restricts me to doing it like so. what happens is I iterate over an array and the computation for each element in the array is dependant on the previous one… so I don’t think that can be parallelised :(
Thanks for pointing that out, though.

A question that still remains is will there be a difference in time to copy, say, 10mb 10 times or 100mb once?
And what can I do about my problem, aside from optimizing the kernel function itself?

that probably could be parallelized with scan: http://en.wikipedia.org/wiki/Prefix_sum

copying large chunks is always faster than small ones because there is some level of driver overhead that is amortized as you copy more data.

Thank you very much.