I want to return a single float value from a kernel.
I don’t need the value right away. The value is used to determine when to exit a loop in which several kernels are called, but I don’t care if the loop executes one more time than it should.
I want my program to resume calling other kernels as soon as possible.
Right now I’m using cudaMemcpy.
Is there a better way of doing this?
I’m guessing cudaMemcpyAsync wouldn’t gain much over just calling cudaMemcpy and waiting for the value. Is this right?
Is cudaMemcpyToSymbol faster?
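For reference, here is a minimal sketch of the blocking approach described above; the kernel launch, the `d_result` buffer, and the convergence threshold are placeholders for whatever your application actually uses:

```cuda
// Sketch (assumed names): each loop iteration launches the kernels,
// then a blocking cudaMemcpy fetches the single convergence float.
#include <cuda_runtime.h>

int main()
{
    float *d_result;
    float h_result = 1.0f;
    cudaMalloc((void **)&d_result, sizeof(float));

    while (h_result > 1e-6f) {     // placeholder exit condition
        // launch the iteration's kernels here, e.g.
        // myKernel<<<grid, block>>>(..., d_result);

        // Blocking copy: implicitly synchronizes with all queued kernels.
        cudaMemcpy(&h_result, d_result, sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_result);
    return 0;
}
```

Note that the implicit synchronization in cudaMemcpy is what stalls the host here, which is relevant to the timing discussion below.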
Actually no, I need the value on the host (it’s the host that determines when to stop the loop in which the kernels are being called).
Also, I was measuring times: without copying the value to the host, my application took around 95.877480 (ms) to run the loop 1000 times.
When I started copying the value to the host, this went up to 371.656189 (ms).
:(
Don’t know if async would take less time, since I’m using Vista (CUDA 2.1).
So I have a new question. Is it possible to copy a value to the host from within a kernel? That way I could solve my problem.
Something is fishy with those benchmark numbers. In all tests I’ve done, cudaMemcpying a single 4-byte value only takes ~10-20 microseconds, roughly the same as the empty kernel launch overhead.
I think the results are correct (I’m using vista 64 bits). Here are the results when I don’t copy the float from the device to the host (loop is done 1000 times):
device...................: 0
Name.....................: GeForce GTX 280 [1296Mhz - supports CUDA 1.3]
Multi-Processors.........: 30
Global mem...............: 1073741824
Const mem................: 65536
Shared mem per block.....: 16384
Regs per block...........: 16384
Max threads per block....: 512
Max threads dim..........: (512,512,64)
Max grid size............: (65535,65535,1)
Warp size................: 32
Mem pitch................: 262144
Texture Alignment........: 256
Device overlap...........: 0
kernel Timeout Enabled...: 0
Size of floating type....: 4
Processing time: 98.540314 (ms)
And now when I include the copy inside the loop (it’s done 1000 times):
Indeed, but I didn’t think it was that massive. 280 us on Vista vs ~10 us on Linux? I’ll benchmark this independently on my Vista box tomorrow to confirm.
You don’t actually include the timing calls. Are you calling cudaThreadSynchronize() before every wall clock time measurement? If you are not, then your processing time without the cudaMemcpy is probably missing out on the time of several hundred queued kernel calls. The implicit sync inside cudaMemcpy would be acting as the cudaThreadSynchronize() and thus causing the time of all those queued kernels to be measured.
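To illustrate the point about synchronization, a sketch of a correct timing harness (the kernel and the `wallClock()` timer are placeholders; any host-side wall-clock timer will do):

```cuda
// Sketch: kernel launches are asynchronous, so the host timer must be
// bracketed by cudaThreadSynchronize() calls (the CUDA 2.x API) or it
// only measures launch overhead, not kernel execution.
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaThreadSynchronize();           // drain any previously queued work
    // double t0 = wallClock();        // placeholder wall-clock timer

    for (int i = 0; i < 1000; ++i) {
        // myKernel<<<grid, block>>>(...);   // returns immediately
    }

    cudaThreadSynchronize();           // wait for all 1000 kernels to finish
    // double t1 = wallClock();
    // printf("Processing time: %f (ms)\n", (t1 - t0) * 1000.0);
    return 0;
}
```

Without the second synchronize, the 1000 launches are merely queued when the timer stops, which would explain a suspiciously low 95 ms figure.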
One way to reduce the overhead would be to only check for convergence every 5-10 iterations or so. You could even set up a stream and have an async cudaMemcpy copy the value back and then queue up the next 5 kernels. So your condition would be lagging behind by 5 iterations, but the memcpy would be completely overlapped with kernel executions and its time hidden.
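A sketch of that lagged check, assuming placeholder names (`d_result`, `myKernel`, `CHECK_EVERY`) and a pinned host buffer, which cudaMemcpyAsync requires to be truly asynchronous:

```cuda
// Sketch: batch CHECK_EVERY kernels per convergence check, queue an
// async copy of the single float behind the batch, and only then wait.
// The value read is CHECK_EVERY iterations stale, as described above.
#include <cuda_runtime.h>

#define CHECK_EVERY 5

int main()
{
    float *d_result, *h_result;
    cudaMalloc((void **)&d_result, sizeof(float));
    cudaMallocHost((void **)&h_result, sizeof(float)); // pinned memory
    *h_result = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (*h_result > 1e-6f) {        // placeholder exit condition
        for (int i = 0; i < CHECK_EVERY; ++i) {
            // myKernel<<<grid, block, 0, stream>>>(..., d_result);
        }
        // Queue the copy behind the batch; it overlaps with any kernels
        // queued after it, so its cost is hidden.
        cudaMemcpyAsync(h_result, d_result, sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        // (Queue the NEXT batch of kernels here before synchronizing
        //  to get the full overlap described above.)
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_result);
    cudaFree(d_result);
    return 0;
}
```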
Regardless of which method you use, zero-copy will indeed make life easier for this kind of check.
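A minimal sketch of the zero-copy variant (CUDA 2.2+, needs a device that supports mapped host memory; the kernel and variable names are placeholders):

```cuda
// Sketch: with zero-copy, the kernel writes the convergence value
// straight into mapped, pinned host memory, so no explicit memcpy
// is needed -- only a synchronize before reading it on the host.
#include <cuda_runtime.h>

__global__ void writeResult(float *result)   // placeholder kernel
{
    *result = 0.5f;                          // e.g. store the residual
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede first CUDA call

    float *h_result, *d_result;
    cudaHostAlloc((void **)&h_result, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_result, h_result, 0);

    writeResult<<<1, 1>>>(d_result);
    cudaThreadSynchronize();                 // make the write visible
    // *h_result now holds the value written by the kernel

    cudaFreeHost(h_result);
    return 0;
}
```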
Async is available for Vista in 2.2, right? I am on Vista also, and I have only a 1.1 device. :'(