Two (newbie?) questions: asynchronous host->device memcpy + events


I posted this in the wrong forum; it should be in CUDA Programming and Development. As there is no delete button, could the moderators please move this post where it belongs?

  1. I would like to know if there’s a list of devices which support CU_DEVICE_ATTRIBUTE_GPU_OVERLAP (copies between page-locked host memory and device memory concurrently with kernel execution). In particular, does the 8800 Ultra support it?

  2. I do not fully understand how cudaEventRecord works. The CUDA programming guide has a short passage which reads:
    “Two operations from different streams cannot run concurrently if either page-locked host memory allocation, a device memory allocation, a device memory set, a device-device memory copy, or an event recording occurs in-between them.”

Does it mean that when I have more than one stream and I record an event in any single stream, execution is blocked for all streams until the event is recorded?

When is an event recorded? Let’s say that I have two streams, A and B. I launch a kernel on A and record an event on that stream afterwards. I then launch a kernel on B whose computation lasts longer than the kernel running on A. Will the event be recorded only after B completes its operation, or will it be interrupted?
What overhead does event recording have?

How costly is cudaEventCreate and cudaEventDestroy?

Thanks in advance,

The answer to 1 is no, the G80 cards don’t support this. The other is a good question which I cannot answer either.

That’s strange. I wrote a simple program to test whether a card supports it, and it answered YES for both the 8800 GTX and the 8600M.

#include <cuda.h>
#include <iostream>

int main() {
    // Initialize the CUDA driver API.
    if (cuInit(0) != CUDA_SUCCESS) return 1;
    std::cout << "OK" << std::endl;

    // Get a handle to the first device.
    CUdevice device;
    if (cuDeviceGet(&device, 0) != CUDA_SUCCESS) return 1;
    std::cout << "OK" << std::endl;

    // Ask whether the device reports copy/kernel overlap support.
    int a = 0;
    if (cuDeviceGetAttribute(&a, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, device) != CUDA_SUCCESS) return 1;
    std::cout << "OK" << std::endl;

    std::cout << (a == 1 ? "YES" : "NO") << std::endl;
    return 0;
}

The 8600M is Compute 1.1 capable, so it also supports copies between page-locked host memory and device memory concurrently with kernel execution; the 8800 GTX is Compute 1.0 capable and does not support concurrent copies. I will try your code on the 8800 GTX I have at work when I am back on the 2nd of January, if somebody else doesn’t beat me to it.

Hi All

We are using an 8800 GTX and wanted to know whether this device supports asynchronous calls. We ran the above code with some modifications, and the results showed that it does, but we can’t get correct results from code employing asynchronous calls.

Can anybody clarify the true picture?



Sounds like a driver bug. As a workaround, you can key off the major/minor compute versions (GPUs that report 1.1 support the functionality, ones reporting 1.0 do not).
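A minimal sketch of that workaround, assuming the rule stated above (1.0 does not overlap, 1.1 does) and further assuming that later compute capabilities also do. The helper name `overlap_supported` is hypothetical and not part of any CUDA API; the major/minor values would come from the device query you already use, for example the `major` and `minor` fields of `cudaDeviceProp` filled in by `cudaGetDeviceProperties`.

```cpp
#include <cassert>

// Hypothetical helper, not a CUDA API call.
// Decides copy/kernel overlap support from the compute capability
// instead of trusting CU_DEVICE_ATTRIBUTE_GPU_OVERLAP.
// Assumption: every device at compute capability 1.1 or later can overlap,
// while 1.0 (G80) devices cannot.
bool overlap_supported(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 1);
}
```

With this check an 8800 GTX (1.0) is treated as non-overlapping regardless of what the attribute query reports, while an 8600M (1.1) is treated as overlapping.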

If you specify a stream to your cu(da)EventRecord call, it should not disrupt concurrency. Most of the function calls named in the programming guide (e.g. device-device memcpy) don’t give the app the opportunity to specify a stream, which does inhibit concurrency.

As a related question, are there different versions of the G80 at compute capability 1.0 and 1.1? The programming guide appendix A has a table of 1.0 vs 1.1 capable devices. Where does the G84 fit into this?

The simpleStreams example would indicate that running multiple streams on a < 1.1 device doesn’t yield more performance, just flexibility in scheduling processing. Is this correct?

Also I’m curious why the mobile (8800M GTX) is 1.1 while the 8800 GTX and 8800 Ultra are not: is it just the product development cycle?

It is just a naming issue: the 8800M is a G92 core, whereas the 8800 GTX and Ultra are G80 cores (as far as I understood from reading about the 8800M).

Devices based on G84 (which was developed later than G80 but is essentially just a scaled-down G80) are 1.1. So if you see that a device is capable of 1.1, then it is certainly based on G84 or later (G92 for the 8800 GT and 8800 GTS 512 MB).

According to information in

Devices with Compute Capability 1.1 can overlap a kernel and a memcopy as long as they are issued in different streams. Kernels are serialized. Thus, if n pairs are launched, streamed approach can reduce the memcopy cost to the (1/n)th of a single copy of the entire data set.

So, you shouldn’t expect any performance improvement for 1.0 devices.
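The cost model in the quoted passage can be sketched as plain arithmetic. This is only an estimate under the assumptions stated there: kernels are serialized, and every copy except the first chunk's is hidden behind a kernel. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers for illustration only.

// Without streams: the full copy runs first, then the kernel.
double expected_serial_ms(double memcopy_ms, double kernel_ms) {
    return memcopy_ms + kernel_ms;
}

// With n streams on a device that overlaps copy and kernel:
// kernels still run back to back, but only about 1/n of the copy
// stays exposed. Assumes nstreams >= 1.
double expected_streamed_ms(double memcopy_ms, double kernel_ms, int nstreams) {
    return kernel_ms + memcopy_ms / nstreams;
}
```

Plugging in the simpleStreams timings reported for a 9600 GT further down this thread (memcopy 33.51, kernel 40.80) gives 74.31 for the serial case and about 44.99 for 8 streams, which match the "expected" figures that sample prints.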

The 8800M GTX is very different from the 8800 GTX. I guess there is a marketing strategy behind this. BTW, there is an even more disappointing example: the 8800 GTS. There are two ‘flavours’ of this board, one based on G80 with 320 or 640 MB of memory, 96 stream processors and a 320-bit memory bus, and another one based on G92 with 512 MB of memory, 128 stream processors and a 256-bit memory bus, clocked at a higher rate. Funny, isn’t it? :)

EDIT: typos.

Right, I see the 8800 GT is a G92. The programming guide indicates that the 8800 GTS is compute 1.0, yet there are multiple references in the press stating the 8800 GTS is a G92 GPU. So different 8800 GTS versions??

Odd and confusing. Thanks…

I have an 8800 GTX, and it reports being able to overlap. The simpleStreams example shows no improvement in performance.

The 8800 GTX (and any other G80 based card) will not overlap I/O and kernel execution.

I opened a bug: cuDeviceGetAttribute should report 0 for CU_DEVICE_ATTRIBUTE_GPU_OVERLAP on G80 hardware.

Does anybody have any reports on this second one? As it is 1.1 capable and has a higher clock rate than an 8800 GTX, its performance might be higher in practice (the slower memory clock might not be an issue with overlap of data transfer and execution if you calculate enough in your kernels).

What matters is the memory bandwidth your app makes use of. 90% of my application is global memory bandwidth bound (70 GiB/s in real simulations), so I would fully expect that the new 1.1 boards would be significantly slower in my app. If your application has a high enough ratio of FLOPs to memory ops that you aren’t memory bound then the new 1.1 cards will probably be faster for you.

I too have been facing a problem with cudaEventRecord on individual streams.
I am interested in capturing the events happening in a single CUDA stream. As per the documentation, I could do this by using cudaEventRecord with the stream itself as the second argument. However, the result looks as if it doesn’t care about the stream. The following code is expected to get the elapsed time for stream 0, which should be almost 0 as no operations are carried out in streams[0] between the start and stop events. But I get an elapsed time of about 300 msec, which is actually the computation time of streams[1]. Is there any property I need to set that I’m missing?

cudaEventRecord(start, streams[0]);
// asynchronously launch nstreams kernels, each operating on its own portion of data
//init_array<<<blocks, threads, 0, streams[0]>>>(d_a + 0 * n / nstreams, d_c, niterations);
init_array<<<blocks, threads, 0, streams[1]>>>(d_a + 1 * n / nstreams, d_c, niterations);
cudaEventRecord(stop, streams[0]);

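// wait until the stop event has actually been recorded before reading the timer
cudaEventSynchronize(stop);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time, start, stop) );
printf("elapsed time:%.2f\tstream:%d\n", elapsed_time, 0);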

Thanks in advance!

You should start a new thread. Your question isn’t really related to the OP, and resurrecting year-old threads in general is a bad idea because CUDA changes a lot over the course of one year (especially in this sort of functionality).

Thanks for the direction. I already have a new thread in the CUDA development discussion forum. Unfortunately, I haven’t gotten any response there, so I tried my luck here.

What do you mean by I/O? Memory operations or other I/O operations?

And are you sure the hardware (G80-based cards) doesn’t support overlap?

Do you know whether the GeForce 9600 GT supports running kernel execution and memory copy in parallel?

My GPU is a 9600 GT; the core is G94. Below are the simpleStreams results:

memcopy: 33.51

kernel: 40.80

non-streamed: 74.86 (74.31 expected)

8 streams: 75.12 (44.99 expected with compute capability 1.1 or later)


Does it not support overlapping?

It does support overlapping. I’m not getting the expected result on GTX260 either. Has this bug been filed? I forget.