Best way to return a float value, sync or async

Well, then you aren’t actually measuring the time taken to execute the kernels. Kernel launches are asynchronous; the CPU code continues to execute while the GPU runs the kernels. The driver is capable of queuing up hundreds of kernel calls in a row without blocking the CPU.

I’m interested in measuring the overall time needed to solve a particular problem (not in how long a given kernel takes to execute). I need to know the RMS to decide when to stop, and fetching those 4 bytes increases the time needed to solve the problem enormously. :(

He has a call to cudaMemcpy at the end of the loop:

cudaMemcpy(&rms, dRMS, sizeof(float), cudaMemcpyDeviceToHost);

so the measurements should be fine, no?

I have a similar problem (simulated annealing with neural networks) and I found that it was faster to set a flag along with the RMS calculation, which tells the device what to do next. For me, it seemed like the only solution was to make the decision on the device.

I’m new to this though, so there is probably a better solution.
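Roughly, the idea is something like the sketch below (just an illustration; the kernel and the names d_rms, d_stop and target are placeholders, not actual code from this project):

__global__ void checkStop(const float *d_rms, int *d_stop, float target)
{
    // one thread is enough once the RMS has already been reduced to a single value
    if (blockIdx.x == 0 && threadIdx.x == 0 && *d_rms <= target)
        *d_stop = 1;   // subsequent kernels (or the host, eventually) read this and stop
}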

Tim

Yes, I know. And what I keep trying to tell you is that if you have

measure time1

for (int i = 1; i < 1000; i++)

	kernel<<<grid,threads>>>(args)

measure time2

then time2 - time1 is NOT the time taken to execute 1000 kernels. It is the time taken to execute N kernels, where N < 1000 and is undefined (likely only a couple of hundred, maybe as many as 900). Therefore you are not measuring the time for your whole problem; you are measuring an undefined quantity.
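In other words, you have to force the GPU to drain its queue before taking the second timestamp. A minimal, self-contained sketch using CUDA events (the kernel here is just a stand-in for the real one):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *data) { data[threadIdx.x] *= 2.0f; }   // stand-in for the real kernel

int main()
{
    float *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < 1000; i++)
        kernel<<<1, 256>>>(d_data);      // launches only get queued here
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // block until all 1000 kernels have actually finished

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("1000 kernels took %f ms\n", ms);

    cudaFree(d_data);
    return 0;
}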

Indeed. When the memcpy is there, the measurements are correct. However, this entire discussion is attempting to answer the question of why the time is so much higher with the memcpy than when it is commented out :)

I can make the decision on the device, but how will I inform the host to stop? (That’s why I asked before whether I could access host memory from a kernel - but I think it’s not possible.) Did you solve this?

Hopefully version 2.2 will allow me to solve this situation with zero-copy.
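For what it’s worth, the zero-copy path in 2.2 should look roughly like this (a sketch only, assuming the device can map host memory; the kernel and variable names are made up): the host allocates mapped pinned memory, the kernel writes the RMS straight into it, and the host only has to synchronize before reading it.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void computeRMS(float *rmsOut)           // stand-in for the real RMS reduction
{
    if (threadIdx.x == 0)
        *rmsOut = 0.123f;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);           // must be set before the context is created

    float *hRMS, *dRMS;
    cudaHostAlloc((void **)&hRMS, sizeof(float), cudaHostAllocMapped);   // pinned, mapped host memory
    cudaHostGetDevicePointer((void **)&dRMS, hRMS, 0);                   // device-side alias of the same buffer

    computeRMS<<<1, 32>>>(dRMS);                     // kernel writes the RMS directly into host memory
    cudaThreadSynchronize();                         // make sure the write has landed before reading

    printf("rms = %f\n", *hRMS);
    cudaFreeHost(hRMS);
    return 0;
}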

CUDA 2.2 seems to alleviate this problem. Results for a different device, a GeForce 8600 GT:

device...................: 0

Name.....................: GeForce 8600 GT [1188Mhz - supports CUDA 1.1]

Multi-Processors.........: 4

Global mem...............: 536870912

Const mem................: 65536

Shared mem per block.....: 16384

Regs per block...........: 8192

Max threads per block....: 512

Max threads dim..........: (512,512,64)

Max grid size............: (65535,65535,1)

Warp size................: 32

Mem pitch................: 262144

Texture Alignment........: 256

Device overlap...........: 1

kernel Timeout Enabled...: 0

Size of floating type....: 4

Processing time: 780.588196 (ms) <= without cudaMemcpy

Processing time: 1154.239868 (ms)  <= with cudaMemcpy in every epoch

Processing time: 797.087158 (ms)  <= with cudaMemcpy every 256 epochs
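For reference, the “cudaMemcpy every 256 epochs” figure above comes from a loop pattern along these lines (the kernels here are placeholder stubs; only the cudaMemcpy call is the one quoted earlier in the thread):

#include <cuda_runtime.h>

__global__ void trainEpoch(float *net) { }                        // stand-in for the real training kernel
__global__ void reduceRMS(const float *net, float *rmsOut)        // stand-in for the real RMS reduction
{
    if (threadIdx.x == 0) *rmsOut = 0.5f;
}

void solve(float *dNet, float *dRMS, float target, int maxEpochs)
{
    float rms = 1.0e30f;
    for (int epoch = 0; epoch < maxEpochs && rms > target; epoch++) {
        trainEpoch<<<64, 256>>>(dNet);
        reduceRMS<<<1, 32>>>(dNet, dRMS);

        if ((epoch & 255) == 255)   // read the RMS back only once every 256 epochs
            cudaMemcpy(&rms, dRMS, sizeof(float), cudaMemcpyDeviceToHost);   // blocking copy, drains the queue
    }
}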