The 8800 GTX does support async calls, and I believe the async demo contains timing examples. cudaEventRecord is used for measuring GPU time; the CUT_TIMER_* macros (as in the bandwidthTest example) measure CPU time.
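For reference, event-based GPU timing looks roughly like this (a minimal sketch; `myKernel`, `grid`, `block`, and `d_data` are placeholders, not names from the SDK samples):

```cuda
// Time a kernel with CUDA events (measures GPU time, not CPU time).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);           // enqueue start marker on stream 0
myKernel<<<grid, block>>>(d_data);   // work to be timed (placeholder kernel)
cudaEventRecord(stop, 0);            // enqueue stop marker
cudaEventSynchronize(stop);          // block CPU until stop has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the GPU itself, this timing is unaffected by how long the asynchronous launch call takes to return on the CPU.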
What the 8800 GTX does not support is transferring page-locked memory to/from the device while the device is executing a kernel.
Indeed. In the case of non-page-locked memory, CUDA uses a pre-allocated DMA buffer: it copies your data there, initiates a DMA transfer, synchronizes, then copies the next chunk, and so on. This is much more involved than a single fire-and-forget DMA operation, which is what happens when the memory is page-locked.
Both the 8800 and later cards support this kind of asynchronous operation; the only thing that changed with later cards is that they can overlap DMA and kernel execution.
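A minimal sketch of the page-locked case (buffer size and names are illustrative): allocating with cudaMallocHost gives pinned host memory, so cudaMemcpyAsync can issue one DMA transfer instead of staging through the driver's internal pinned buffer in chunks.

```cuda
// Pinned host memory enables a single fire-and-forget DMA transfer.
float *h_data, *d_data;
size_t bytes = 1 << 20;                  // 1 MB, illustrative

cudaMallocHost((void**)&h_data, bytes);  // page-locked (pinned) host buffer
cudaMalloc((void**)&d_data, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

// With pinned memory this is one DMA operation; with pageable memory
// the driver would copy chunk-by-chunk through its own staging buffer.
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);

cudaStreamSynchronize(stream);           // wait for the transfer to finish

cudaFreeHost(h_data);
cudaFree(d_data);
cudaStreamDestroy(stream);
```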
I’m having a strange experience with asynchronous kernel launches on the 8800 GTX. When I make two different kernel launches consecutively on a stream, only the last launch is asynchronous. For example, say I launch two kernels in a stream, taking 1 s and 2 s respectively to finish, and afterwards I do some CPU computation that takes 5 s. What I expect is that the whole thing takes 5 s (the 3 s of GPU computation running in parallel with the CPU computation). However, it actually takes 6 s: only the second kernel launch (2 s) runs concurrently with the CPU computation.
This only happens when the two kernels are different. If I use the same kernel (changing only the grid/block size), it works as expected. Any explanation?
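For clarity, the pattern described above looks roughly like this (a sketch; `kernelA`, `kernelB`, `doCpuWork`, and the launch parameters are placeholders for the poster's actual code):

```cuda
// Two kernels queued on one stream, then CPU work. Ideally both launch
// calls return immediately, so the GPU (~3 s total) runs concurrently
// with the CPU (~5 s) and the whole sequence takes ~5 s.
kernelA<<<gridA, blockA, 0, stream>>>(d_data);  // ~1 s on the GPU
kernelB<<<gridB, blockB, 0, stream>>>(d_data);  // ~2 s on the GPU
doCpuWork();                                    // ~5 s on the CPU
cudaStreamSynchronize(stream);                  // expected total: ~5 s
```

The observed 6 s would mean the first launch blocked until its kernel completed, so only the second kernel overlapped with the CPU work.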
The return value is 1. That means it does SUPPORT concurrently copying memory between host and device while executing a kernel (I’m using a GeForce 8800 GTX). So, does the cuDeviceGetAttribute function give the wrong result?
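For context, this is presumably the driver-API query being described (a minimal sketch using device 0; the runtime-API equivalent is the deviceOverlap field of cudaDeviceProp):

```cuda
// Query whether the device can overlap host<->device copies with
// kernel execution (the "GPU overlap" attribute).
int overlap = 0;
CUdevice dev;

cuInit(0);
cuDeviceGet(&dev, 0);
cuDeviceGetAttribute(&overlap, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev);

printf("GPU overlap supported: %d\n", overlap);  // 1 = supported
```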