I want to return a single float value from a kernel.
I don’t need the value right away. The value is used to determine when to exit a loop in which several kernels are called, but I don’t care if the loop executes one more time than it should.
I want my program to resume calling other kernels as soon as possible.
Right now I’m using cudaMemcpy.
Is there a better way of doing this?
I’m guessing cudaMemcpyAsync wouldn’t gain much over just calling cudaMemcpy and waiting for the value. Is this right?
Is cudaMemcpyToSymbol faster?
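For reference, here is a minimal sketch of the blocking approach described above; the kernel launch, the `d_result` buffer, and the convergence threshold are placeholders for whatever your application actually uses:

```cuda
// Sketch (assumed names): each loop iteration launches the kernels,
// then a blocking cudaMemcpy fetches the single convergence float.
#include <cuda_runtime.h>

int main()
{
    float *d_result;
    float h_result = 1.0f;
    cudaMalloc((void **)&d_result, sizeof(float));

    while (h_result > 1e-6f) {     // placeholder exit condition
        // launch the iteration's kernels here, e.g.
        // myKernel<<<grid, block>>>(..., d_result);

        // Blocking copy: implicitly synchronizes with all queued kernels.
        cudaMemcpy(&h_result, d_result, sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_result);
    return 0;
}
```

Note that the implicit synchronization in cudaMemcpy is what stalls the host here, which is relevant to the timing discussion below.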
Actually no, I need the value on the host (it’s the host that determines when to stop the loop in which the kernels are being called).
Also, I was measuring times: without copying the value to the host, my application took around 95.877480 (ms) to run the loop 1000 times.
When I started copying the value to the host, this went up to 371.656189 (ms).
:(
Don’t know if async would take less time, since I’m using Vista (CUDA 2.1).
So I have a new question. Is it possible to copy a value to the host from within a kernel? That way I could solve my problem.
Something is fishy with those benchmark numbers. In all tests I’ve done, cudaMemcpying a single 4-byte value only takes ~10-20 microseconds, roughly the same as the empty kernel launch overhead.
I think the results are correct (I’m using vista 64 bits). Here are the results when I don’t copy the float from the device to the host (loop is done 1000 times):
device...................: 0
Name.....................: GeForce GTX 280 [1296Mhz - supports CUDA 1.3]
Multi-Processors.........: 30
Global mem...............: 1073741824
Const mem................: 65536
Shared mem per block.....: 16384
Regs per block...........: 16384
Max threads per block....: 512
Max threads dim..........: (512,512,64)
Max grid size............: (65535,65535,1)
Warp size................: 32
Mem pitch................: 262144
Texture Alignment........: 256
Device overlap...........: 0
kernel Timeout Enabled...: 0
Size of floating type....: 4
Processing time: 98.540314 (ms)
And now when I include the copy inside the loop (it’s done 1000 times):
Indeed, but I didn’t think it was that massive. 280 us on Vista vs ~10 us on Linux? I’ll benchmark this independently on my Vista box tomorrow to confirm.
You don’t actually include the timing calls. Are you calling cudaThreadSynchronize() before every wall clock time measurement? If you are not, then your processing time without the cudaMemcpy is probably missing out on the time of several hundred queued kernel calls. The implicit sync inside cudaMemcpy would be acting as the cudaThreadSynchronize() and thus causing the time of all those queued kernels to be measured.
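To illustrate the point about synchronization, a sketch of a correct timing harness (the kernel and the `wallClock()` timer are placeholders; any host-side wall-clock timer will do):

```cuda
// Sketch: kernel launches are asynchronous, so the host timer must be
// bracketed by cudaThreadSynchronize() calls (the CUDA 2.x API) or it
// only measures launch overhead, not kernel execution.
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaThreadSynchronize();           // drain any previously queued work
    // double t0 = wallClock();        // placeholder wall-clock timer

    for (int i = 0; i < 1000; ++i) {
        // myKernel<<<grid, block>>>(...);   // returns immediately
    }

    cudaThreadSynchronize();           // wait for all 1000 kernels to finish
    // double t1 = wallClock();
    // printf("Processing time: %f (ms)\n", (t1 - t0) * 1000.0);
    return 0;
}
```

Without the second synchronize, the 1000 launches are merely queued when the timer stops, which would explain a suspiciously low 95 ms figure.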
One way to reduce the overhead would be to only check for convergence every 5-10 iterations or so. You could even set up a stream and have an async cudaMemcpy copy the value back and then queue up the next 5 kernels. So your condition would be lagging behind by 5 iterations, but the memcpy would be completely overlapped with kernel executions and its time hidden.
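A sketch of that lagged check, assuming placeholder names (`d_result`, `myKernel`, `CHECK_EVERY`) and a pinned host buffer, which cudaMemcpyAsync requires to be truly asynchronous:

```cuda
// Sketch: batch CHECK_EVERY kernels per convergence check, queue an
// async copy of the single float behind the batch, and only then wait.
// The value read is CHECK_EVERY iterations stale, as described above.
#include <cuda_runtime.h>

#define CHECK_EVERY 5

int main()
{
    float *d_result, *h_result;
    cudaMalloc((void **)&d_result, sizeof(float));
    cudaMallocHost((void **)&h_result, sizeof(float)); // pinned memory
    *h_result = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (*h_result > 1e-6f) {        // placeholder exit condition
        for (int i = 0; i < CHECK_EVERY; ++i) {
            // myKernel<<<grid, block, 0, stream>>>(..., d_result);
        }
        // Queue the copy behind the batch; it overlaps with any kernels
        // queued after it, so its cost is hidden.
        cudaMemcpyAsync(h_result, d_result, sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        // (Queue the NEXT batch of kernels here before synchronizing
        //  to get the full overlap described above.)
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_result);
    cudaFree(d_result);
    return 0;
}
```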
Regardless of which method you use, zero-copy will indeed make life easier for this kind of check.
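A minimal sketch of the zero-copy variant (CUDA 2.2+, needs a device that supports mapped host memory; the kernel and variable names are placeholders):

```cuda
// Sketch: with zero-copy, the kernel writes the convergence value
// straight into mapped, pinned host memory, so no explicit memcpy
// is needed -- only a synchronize before reading it on the host.
#include <cuda_runtime.h>

__global__ void writeResult(float *result)   // placeholder kernel
{
    *result = 0.5f;                          // e.g. store the residual
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede first CUDA call

    float *h_result, *d_result;
    cudaHostAlloc((void **)&h_result, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_result, h_result, 0);

    writeResult<<<1, 1>>>(d_result);
    cudaThreadSynchronize();                 // make the write visible
    // *h_result now holds the value written by the kernel

    cudaFreeHost(h_result);
    return 0;
}
```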
Async is available for Vista in 2.2, right? I am on Vista also, and I have only a 1.1 device. :'(