Best way to return a float value, sync or async

Given the following:

  1. I want to return a single float value from a kernel.
  2. I don’t need the value right away. It is used to decide when to exit a loop in which several kernels are called, but I don’t care if the loop runs one more time than it should.
  3. I want my program to resume calling other kernels as soon as possible.

Right now I’m using cudaMemcpy.

Is there a better way of doing this?

I’m guessing cudaMemcpyAsync would take more time than just calling cudaMemcpy and waiting for the value. Is this right?
Is cudaMemcpyToSymbol faster?

Thanks

Leave it somewhere in global memory where all kernels can access.

something like this:

int *pLoopCounter;

cudaMalloc( (void**)&pLoopCounter, sizeof(int) );

KernelToCalculateLoopCounter<<<...>>>( <all sort of params>, pLoopCounter );   // --> This kernel calculates pLoopCounter[0] and exits.

KernelA<<<....>>>( <all sort of params>, <more params>, pLoopCounter );  // --> Read pLoopCounter[0] in the kernel

KernelB<<<....>>>( <all sort of params>, <more params>, pLoopCounter );  // --> Read pLoopCounter[0] in the kernel

...

Hope this is what you meant :)

eyal

Actually no, I need the value on the host (it’s the host that determines when to stop the loop in which the kernels are being called).

Also, I was measuring times: without copying the value to the host, my application took around 95.877480 ms to run the loop 1000 times.
When I start copying the value to the host, this went up to 371.656189 ms.

:(

I don’t know if async would take less time, since I’m using Vista (CUDA 2.1).

So I have a new question: is it possible to copy a value to the host from within a kernel? That way I could solve my problem.


There has got to be a better way. A single call to cudaMemcpy takes around 100 ms, which is more than the time I need to run several kernels 1000 times.

Do you have a maximum number of times those kernels should be called? A maximum number of iterations to be done on the host?

eyal

Something is fishy with those benchmark numbers. In all tests I’ve done, cudaMemcpying a single 4-byte value only takes ~10-20 microseconds, roughly the same as the empty kernel launch overhead.

No, I want to stop when a given error is achieved. So I need to pass the error to the host.

I think the results are correct (I’m using Vista 64-bit). Here are the results when I don’t copy the float from the device to the host (the loop runs 1000 times):

device...................: 0

Name.....................: GeForce GTX 280 [1296Mhz - supports CUDA 1.3]

Multi-Processors.........: 30

Global mem...............: 1073741824

Const mem................: 65536

Shared mem per block.....: 16384

Regs per block...........: 16384

Max threads per block....: 512

Max threads dim..........: (512,512,64)

Max grid size............: (65535,65535,1)

Warp size................: 32

Mem pitch................: 262144

Texture Alignment........: 256

Device overlap...........: 0

kernel Timeout Enabled...: 0

Size of floating type....: 4

Processing time: 98.540314 (ms)

and now when I include the following code inside the loop (run 1000 times)

cudaMemcpy(&rms, d_rms, sizeof(float), cudaMemcpyDeviceToHost);

to get the float (i.e., I read the value back from the device):

Processing time: 376.773438 (ms)

If I try to get the value every 512 times (in this case 1 time):

if ((epoch & 511) == 0) cudaMemcpy(&rms, d_rms.Pointer(), sizeof(CUDA_FLOATING_TYPE), cudaMemcpyDeviceToHost);
Processing time: 185.019043 (ms)

and if I try to get the value every 128 times:

if ((epoch & 127) == 0) cudaMemcpy(&rms, d_rms.Pointer(), sizeof(CUDA_FLOATING_TYPE), cudaMemcpyDeviceToHost);
Processing time: 199.616013 (ms)

Sorry, once again I replied in the wrong place (please see the message below). I think my benchmarks are correct.

Usually MisterAnderson is correct :) the timings indeed look very high.

Can you please post the host code?

eyal

Driver overhead on Vista is enormous compared to other OSes.

(zero-copy is the answer)

Zero-copy is a CUDA 2.2 feature, right?

yes.
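For reference, a minimal zero-copy sketch (CUDA 2.2+, assuming the device reports canMapHostMemory; names are illustrative): the kernel writes the error through a device alias of pinned host memory, so the host can read it without any cudaMemcpy.

```cuda
// Zero-copy sketch (CUDA 2.2+). The kernel writes straight into
// mapped, pinned host memory; the host just dereferences a pointer.
float *h_rms = 0;   // host pointer (pinned + mapped)
float *d_rms = 0;   // device alias of the same allocation

cudaSetDeviceFlags(cudaDeviceMapHost);    // must precede context creation
cudaHostAlloc((void**)&h_rms, sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_rms, h_rms, 0);

// Pass d_rms to the kernel exactly like a cudaMalloc'd pointer:
//   kB<<<NA, dimC, smem>>>(..., d_rms);

cudaThreadSynchronize();   // make sure the kernel has finished writing
float rms = *h_rms;        // no cudaMemcpy needed
```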

Here it goes:

// ...
float rms;
int i = 0;

do {
	kA<<<NA, dimA, LA * sizeof(float)>>>(dA, dB, dC);
	kA<<<NA, dimB, LB * sizeof(float)>>>(dC, dD, dE);

	kB<<<NA, dimC, (LC + LD) * sizeof(float)>>>(dE, dF, dG, dH, dI, dRMS);
	kC<<<NA, dimD, (LD * (LE + 1)) * sizeof(float)>>>(dE, dF, dI, dJ);
	kC<<<NA, dimE, (LE * (LF + 1)) * sizeof(float)>>>(dC, dD, dJ, dK);

	kD<<<dimC, NA, 2 * NA * sizeof(float)>>>(dE, dI, dF, dL, dM, dN, U, D, M, dRMS);
	kE<<<dimB, NA, NA * sizeof(float)>>>(dC, dE, dJ, dD, dO, dP, dQ, U, D, M);
	kE<<<dimA, NA, NA * sizeof(float)>>>(dA, dC, dK, dB, dR, dS, dT, U, D, M);

	i++;

	cudaMemcpy(&rms, dRMS, sizeof(float), cudaMemcpyDeviceToHost);
} while (i < 1000); // (rms > RMS_STOP);

Indeed, but I didn’t think it was that massive. 280 us on Vista vs ~10 us on Linux? I’ll benchmark this independently on my Vista box tomorrow to confirm.

You don’t actually include the timing calls. Are you calling cudaThreadSynchronize() before every wall clock time measurement? If you are not, then your processing time without the cudaMemcpy is probably missing out on the time of several hundred queued kernel calls. The implicit sync inside cudaMemcpy would be acting as the cudaThreadSynchronize() and thus causing the time of all those queued kernels to be measured.
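A quick way to test that theory (a sketch using the same cutil timer API; the kernel name and launch configuration are placeholders): synchronize before starting and stopping the timer, so every queued launch is actually counted.

```cuda
// Sketch: kernel launches are asynchronous, so without a synchronize
// the timer only measures queueing time, not execution time.
unsigned int timer = 0;
cutCreateTimer(&timer);

cudaThreadSynchronize();          // drain anything already queued
cutStartTimer(timer);

for (int i = 0; i < 1000; ++i)
	someKernel<<<grid, block>>>(/* ... */);   // placeholder launch

cudaThreadSynchronize();          // wait for all queued kernels to finish
cutStopTimer(timer);
printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
```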

One way to reduce the overhead would be to only check for convergence every 5-10 iterations or so. You could even set up a stream, have an async cudaMemcpy copy the value back, and then queue up the next 5 kernels. Your condition would lag behind by 5 iterations, but the memcpy would be completely overlapped with kernel executions and its time hidden.

Regardless of which method you use, zero-copy will indeed make life easier for this kind of check.
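A rough sketch of that lagged check (illustrative only: it assumes pinned host memory for the async copy, a MAX_ITERS cap, and that testing a value a few iterations stale is acceptable, as discussed):

```cuda
// Sketch of a lagged convergence check. cudaMemcpyAsync needs
// page-locked host memory, so the host-side value lives in pinned memory.
float *h_rms;
cudaMallocHost((void**)&h_rms, sizeof(float));
*h_rms = RMS_STOP + 1.0f;                 // force at least one iteration

cudaStream_t stream;
cudaStreamCreate(&stream);

const int CHECK_EVERY = 5;
int i = 0;
do {
	// ... launch this iteration's kernels in `stream`,
	//     the last one writing the error into dRMS ...

	if (i % CHECK_EVERY == 0)
		// Queued copy: it overlaps with the kernels of the following
		// iterations, so by the time the host tests *h_rms again the
		// value has landed (lagging by roughly CHECK_EVERY iterations).
		cudaMemcpyAsync(h_rms, dRMS, sizeof(float),
		                cudaMemcpyDeviceToHost, stream);
	i++;
} while (i < MAX_ITERS && *h_rms > RMS_STOP);

cudaStreamDestroy(stream);
cudaFreeHost(h_rms);
```

Strictly speaking, to guarantee the copy has completed before reading *h_rms you would record an event after the memcpy and poll it with cudaEventQuery; the sketch relies on the CHECK_EVERY-iteration lag instead.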

Async is available for Vista in 2.2, right? I am on Vista also, and I have only a 1.1 device. :'(

Yes, async is supported in Vista as of 2.2.

CUDA 2.2 is out :)

No, I’m not calling cudaThreadSynchronize().

I have something like this:

// ...
unsigned int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);

int i = 0;
do {
	kA<<<NA, dimA, LA * sizeof(float)>>>(dA, dB, dC);
	kA<<<NA, dimB, LB * sizeof(float)>>>(dC, dD, dE);

	kB<<<NA, dimC, (LC + LD) * sizeof(float)>>>(dE, dF, dG, dH, dI, dRMS);
	kC<<<NA, dimD, (LD * (LE + 1)) * sizeof(float)>>>(dE, dF, dI, dJ);
	kC<<<NA, dimE, (LE * (LF + 1)) * sizeof(float)>>>(dC, dD, dJ, dK);

	kD<<<dimC, NA, 2 * NA * sizeof(float)>>>(dE, dI, dF, dL, dM, dN, U, D, M, dRMS);
	kE<<<dimB, NA, NA * sizeof(float)>>>(dC, dE, dJ, dD, dO, dP, dQ, U, D, M);
	kE<<<dimA, NA, NA * sizeof(float)>>>(dA, dC, dK, dB, dR, dS, dT, U, D, M);

	i++;

	cudaMemcpy(&rms, dRMS, sizeof(float), cudaMemcpyDeviceToHost);
} while (i < 1000); // (rms > RMS_STOP);

cutStopTimer(timer);
printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);

I’ll try on linux or XP.