Well, then you aren’t actually measuring the time taken to execute the kernels. Kernel launches are asynchronous: the CPU code continues to execute while the GPU runs kernels. The driver is capable of queueing up hundreds of kernel calls in a row without blocking the CPU.
I’m interested in measuring the overall time needed to solve a particular problem (not in how long a given kernel takes to execute). I need to know the RMS to decide when to stop, and reading back these 4 bytes increases the time needed to solve the problem enormously. :(
He has a call to cudaMemcpy at the end of the loop
cudaMemcpy(&rms, dRMS, sizeof(float), cudaMemcpyDeviceToHost);
so the measurements should be fine, no?
I have a similar problem (simulated annealing for neural networks), and I found that it was faster to set a flag along with the RMS calculation to tell the device what to do next. For me, the only solution seemed to be making the decision on the device.
I’m new to this though, so there is probably a better solution.
Tim
Yes, I know. And what I keep trying to tell you is if you have
measure time1
for (int i = 1; i < 1000; i++)
kernel<<<grid,threads>>>(args);
measure time2
then time2 - time1 is NOT the time taken to execute 1000 kernels. It is the time taken to execute N kernels where N < 1000 and undefined (likely only a couple hundred, maybe as much as 900). Therefore, you are not measuring the time for your whole problem, you are measuring an undefined quantity.
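A minimal sketch of a correct measurement, using host-side wall-clock timestamps as in the pseudocode above: the key is a synchronization barrier before taking `time2`, so all queued launches have actually finished. `cudaThreadSynchronize()` is the CUDA 2.x name for this call (later releases renamed it `cudaDeviceSynchronize()`); the kernel and launch configuration are placeholders for the poster’s actual code.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Placeholder for the real kernel under discussion.
__global__ void kernel(float *data) { }

static double wallMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main()
{
    float *dData;
    cudaMalloc((void **)&dData, 1024 * sizeof(float));

    double t1 = wallMs();                 // measure time1
    for (int i = 0; i < 1000; i++)
        kernel<<<64, 256>>>(dData);
    cudaThreadSynchronize();              // block until all 1000 kernels finish
    double t2 = wallMs();                 // measure time2

    // Without the synchronize call, t2 - t1 would only cover the time
    // to *enqueue* the launches, not to execute them.
    printf("1000 kernels: %f ms\n", t2 - t1);

    cudaFree(dData);
    return 0;
}
```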
Indeed. When the memcpy is there, the measurements are correct. However, this entire discussion is attempting to answer the question why the time is so much higher with the memcpy compared to when the memcpy is commented out :)
I can make the decision on the device, but how will I inform the host to stop? (That’s why I asked earlier whether I could access host memory from a kernel - but I think it’s not possible.) Did you solve this?
Hopefully version 2.2 will let me solve this situation with zero-copy.
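For reference, a sketch of how the CUDA 2.2 zero-copy path could look here: the host allocates mapped pinned memory with `cudaHostAlloc`, the kernel writes the RMS through the matching device pointer, and the host reads it directly with no `cudaMemcpy`. The kernel body and the threshold are placeholders, not the poster’s actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in: a real RMS reduction would go here.
__global__ void computeRMS(float *rmsOut)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *rmsOut = 0.00005f; // placeholder result below the stop threshold
}

int main()
{
    // Must be set before any CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *hRMS, *dRMS;
    cudaHostAlloc((void **)&hRMS, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dRMS, hRMS, 0);
    *hRMS = 1.0f;

    do {
        computeRMS<<<1, 32>>>(dRMS);
        cudaThreadSynchronize(); // make the device's write visible to the host
    } while (*hRMS > 0.0001f);

    printf("final rms: %f\n", *hRMS);
    cudaFreeHost(hRMS);
    return 0;
}
```

Zero-copy still requires a synchronization point before the host reads the value, but it avoids the per-epoch `cudaMemcpy` call overhead.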
CUDA 2.2 seems to alleviate this problem. Results for a different device, a GeForce 8600 GT:
device...................: 0
Name.....................: GeForce 8600 GT [1188Mhz - supports CUDA 1.1]
Multi-Processors.........: 4
Global mem...............: 536870912
Const mem................: 65536
Shared mem per block.....: 16384
Regs per block...........: 8192
Max threads per block....: 512
Max threads dim..........: (512,512,64)
Max grid size............: (65535,65535,1)
Warp size................: 32
Mem pitch................: 262144
Texture Alignment........: 256
Device overlap...........: 1
kernel Timeout Enabled...: 0
Size of floating type....: 4
Processing time: 780.588196 (ms) <= without cudaMemcpy
Processing time: 1154.239868 (ms) <= with cudaMemcpy in every epoch
Processing time: 797.087158 (ms) <= with cudaMemcpy every 256 epochs
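A sketch of the loop structure behind that last timing line, where the device-to-host copy is amortized over 256 epochs; the kernel, launch configuration, and epoch limit are placeholders for the actual program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder: one training epoch that updates *dRMS on the device.
__global__ void epochKernel(float *dRMS) { }

int main()
{
    float *dRMS, rms = 1.0f;
    cudaMalloc((void **)&dRMS, sizeof(float));
    cudaMemset(dRMS, 0, sizeof(float));

    int epoch = 0;
    while (rms > 0.0001f && epoch < 100000) {
        epochKernel<<<64, 256>>>(dRMS);
        // Pay the synchronizing readback only once every 256 epochs;
        // in exchange, the run may overshoot the target by up to 255 epochs.
        if (++epoch % 256 == 0)
            cudaMemcpy(&rms, dRMS, sizeof(float),
                       cudaMemcpyDeviceToHost);
    }

    printf("stopped after %d epochs, rms = %f\n", epoch, rms);
    cudaFree(dRMS);
    return 0;
}
```

This matches the measurements above: checking every epoch costs ~48% extra, while checking every 256 epochs brings the total time within ~2% of the copy-free run.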