Asynchronous data transfer

CUDA 1.1 supports asynchronous data transfers.
But don't I also need a device with compute capability 1.1?
Otherwise, do I get any advantage from asynchronous data transfers?
The simpleStreams example suggests not …

Any GPU can benefit from CPU/GPU concurrency. If you call the *Async variants of Memcpy, the call will return before the memcpy has been performed. You have to synchronize CPU/GPU with streams or events to make sure the CPU and GPU don’t operate on the same data at the same time.
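A minimal runtime-API sketch of that CPU/GPU concurrency pattern (names and sizes are illustrative, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// Sketch: the cudaMemcpyAsync call returns immediately, leaving the CPU
// free to do other work until the recorded event is synchronized. The host
// buffer must be page-locked (cudaMallocHost) for the copy to actually be
// asynchronous.
void async_copy_example(float* d_data, size_t bytes)
{
    float* h_data;
    cudaMallocHost((void**)&h_data, bytes);   // page-locked host buffer

    cudaEvent_t done;
    cudaEventCreate(&done);

    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(done, 0);

    // ... CPU work that does not touch h_data goes here ...

    cudaEventSynchronize(done);               // now safe to reuse h_data
    cudaEventDestroy(done);
    cudaFreeHost(h_data);
}
```

This is the synchronization the answer describes: the event guarantees the CPU does not reuse h_data while the GPU is still copying from it.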

GPUs with 1.1 compute capability can memcpy and process kernels concurrently. This is a separate and complementary capability that is accessed similarly with the async memcpy functions and streams/events for synchronization.
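A sketch of that second capability, using two streams to overlap chunked copies with kernel launches (the kernel and sizes are hypothetical; this only pays off on devices reporting CU_DEVICE_ATTRIBUTE_GPU_OVERLAP == 1):

```cuda
#include <cuda_runtime.h>

__global__ void process(float* data, int n);  // hypothetical kernel

// Sketch: while one stream copies chunk i+1 to the device, the other stream
// runs the kernel on chunk i. Requires page-locked host memory and a
// compute 1.1+ device; on compute 1.0 the copies and kernels serialize.
void overlap_example(float* h_data, float* d_data, int n, int nChunks)
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    int chunk = n / nChunks;
    for (int i = 0; i < nChunks; ++i) {
        cudaStream_t s = stream[i % 2];
        cudaMemcpyAsync(d_data + i * chunk, h_data + i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<chunk / 256, 256, 0, s>>>(d_data + i * chunk, chunk);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```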

simpleStreams is intended to highlight memcpy/kernel processing concurrency more than CPU/GPU concurrency, although it does get some benefit because the CPU work of queuing up the memcpys and kernel launches is overlapped with the GPU performing the commands.

Per Amdahl’s Law, the benefits of concurrency depend on how evenly divided the work is between the CPU and GPU (or memcpy/kernel processing). For CPU/GPU concurrency, the maximum performance benefit of 2x is achieved if the two devices spend equal time processing; for memcpy/processing concurrency, the maximum benefit of 2x is achieved if the app spends equal amounts of time transferring and processing data.

If the time spent is not evenly divided, the maximum performance benefit drops off from 2x.


I’m running on an 8800 GTS 320, which has compute capability 1.0, but the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP attribute still returns 1, which contradicts what you are saying above.

I.e., the following code:

   cuDeviceGetName(name, 256, dev);
   printf("device     = %s\n", name);

   int major, minor;
   cuDeviceComputeCapability(&major, &minor, dev);
   printf("capability = %d.%d\n", major, minor);

   int cap;
   cuDeviceGetAttribute(&cap, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev);
   printf("overlap    = %d\n", cap);


device = GeForce 8800 GTS

capability = 1.0

overlap = 1

Am I able to run kernels concurrently with data transfers or not?




[quote name=‘AndreiB’ date=‘Dec 31 2007, 12:50 AM’]
Am I able to run kernels concurrently with data transfers or not?
[/quote]

Thanks Andrei, but then what does it mean that the call:

cuDeviceGetAttribute(&cap, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev);

returns 1 in cap? The CUDA 1.1 manual (section E.2.6) clearly states that this attribute returns “1 if the device can concurrently copy memory between host and device while executing a kernel, or 0 if not.”

Is that an API/documentation bug, or am I simply misunderstanding something? Does anyone else get this behavior?


I cannot try this right now, but this seems like a driver bug…

The DeviceQuery Project from the 2.0 SDK prints:

There is 1 device supporting CUDA

Device 0: "GeForce 8800 GTX"

  Major revision number:                         1

  Minor revision number:                         0

  Total amount of global memory:                 804978688 bytes

  Number of multiprocessors:                     16

  Number of cores:                               128

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 8192

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          262144 bytes

  Texture alignment:                             256 bytes

  Clock rate:                                    1.35 GHz

  Concurrent copy and execution:                 No


The simpleStream project from the 2.0 SDK prints:

GeForce 8800 GTX does not have compute capability 1.1 or later

memcopy:	30.44

kernel:  44.83

non-streamed:	75.23 (75.27 expected)

8 streams:	75.94 (48.64 expected with compute capability 1.1 or later)



Clearly, an overlap of kernel execution and memcpy is not supported as the latter program does not show any speed-up.

I am also using an 8800 GTS, and I get the same result when calling cuDeviceGetAttribute from the Driver API: compute capability 1.0, yet 1 for the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP property. This seems very strange, since the card clearly does not support overlap, and there is no speedup in the simpleStreams SDK example. I am using CUDA 1.1, though.

There was a bug in 1.1 for the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP property.