Is a kernel in stream 0 asynchronous?

Hello everyone,
I’m new to the CUDA world. Can anyone tell me whether kernel execution in stream 0 is asynchronous or not? If it is, why can the subsequent cudaMemcpy read the result from the device correctly without calling cudaThreadSynchronize()? If not, why does all the documentation say that it’s asynchronous?

cudaMemcpy() has an implicit synchronization, unlike cudaMemcpyAsync(): it blocks the host until the preceding work in stream 0, including your kernel, has finished.
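To make the difference concrete, here is a minimal sketch (not the original poster's code; the add kernel and the variable names are assumed for illustration). The first copy uses blocking cudaMemcpy(), so no explicit synchronization is needed. The second uses cudaMemcpyAsync() into pinned host memory, so the host must synchronize before reading the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed one-thread kernel, used only for illustration.
__global__ void add(int a, int b, int *c) { *c = a + b; }

int main() {
    int *dev_c = nullptr;
    cudaMalloc((void**)&dev_c, sizeof(int));

    // Case 1: blocking copy. The launch returns to the host right away,
    // but cudaMemcpy() waits for the kernel before copying.
    int c = 0;
    add<<<1, 1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("blocking copy: 2 + 7 = %d\n", c);

    // Case 2: asynchronous copy into pinned (page-locked) host memory.
    // The call may return before the copy has completed, so the host
    // has to synchronize before it reads *pinned_c.
    int *pinned_c = nullptr;
    cudaMallocHost((void**)&pinned_c, sizeof(int));
    add<<<1, 1>>>(3, 4, dev_c);
    cudaMemcpyAsync(pinned_c, dev_c, sizeof(int), cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);  // wait for the kernel and the copy in stream 0
    printf("async copy:    3 + 4 = %d\n", *pinned_c);

    cudaFreeHost(pinned_c);
    cudaFree(dev_c);
    return 0;
}
```

In both cases the copy is ordered after the kernel because they are issued to the same stream; the only difference is whether the host thread waits inside the copy call.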

Thank you. Please have a look at the following code. I found that the longer the CPU sleeps, the longer the application takes to run. Does that imply that the kernel and the CPU thread run serially?

#include <stdio.h>
#include <cuda_runtime.h>

// The kernel was not included in the post; this one-line version is assumed.
__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    // capture the start time
    cudaEvent_t     start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );

    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );

    add<<<1,1>>>( 2, 7, dev_c );

    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    // get stop time, and display the timing results
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    float   elapsedTime;
    cudaEventElapsedTime( &elapsedTime, start, stop );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );

    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );

    return 0;
}


CUDA calls are not asynchronous on some OSes, so you will not see any benefit from running the CPU and GPU in parallel.

What are the units for the sleep function? Microseconds? The kernel probably finished in 50 microseconds, so your runtime is dominated by the sleep function.

Thanks for your patience. Yes, the units for the sleep function are microseconds. Actually, when I call Sleep(0), the result is 551.4ms. When sleeping 50ms, the result is 601.4ms. And when sleeping 450ms, the result is 1001.4ms… Does my computer not support overlapping, or is it something else? I’m going crazy… My graphics card is a GT 310 with compute capability 1.2.

Is there any official documentation about that? My OS is Windows 7…

In order to overlap anything, you must allow the kernel to do some significant work that can be overlapped with something on the host. If your kernel execution time is fully determined by launch overhead, it makes no difference whether the kernel launch is synchronous or asynchronous.
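A rough way to check this is to give the kernel enough work that its run time dwarfs the launch overhead, then do something on the host while it runs. The sketch below is only an illustration: the busy kernel, the iteration count, and the 50 ms host sleep are all assumptions, and the iteration count may need tuning for a particular GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Assumed kernel that just burns time with a dependent chain of arithmetic.
__global__ void busy(float *out, int iters) {
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i) {
        x = x * 1.0000001f + 0.5f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the loop from being optimized away
}

// Milliseconds elapsed on the host since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int blocks = 64, threads = 256;
    float *d_out = nullptr;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

    // Warm-up launch so one-time initialization is not included in the timing.
    busy<<<blocks, threads>>>(d_out, 1);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    busy<<<blocks, threads>>>(d_out, 1 << 22);   // iteration count is an arbitrary choice
    double launch_ms = ms_since(t0);             // time until the launch call returned

    // Stand-in for useful host work.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));

    cudaDeviceSynchronize();                     // wait for the kernel to finish
    double total_ms = ms_since(t0);

    printf("launch returned after %.3f ms\n", launch_ms);
    printf("kernel + 50 ms host work took %.3f ms in total\n", total_ms);

    cudaFree(d_out);
    return 0;
}
```

If the two overlap, the total should be close to the larger of the kernel time and the 50 ms of host work; if they run back to back, it should be close to their sum. With a kernel as small as add<<<1,1>>> the two cases are hard to tell apart.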

Having said that, kernel launches on Windows may appear as if they were synchronous because the driver batches kernel launches there. Place a cudaStreamQuery(0) directly after the kernel launch in order to immediately send it to the GPU.
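As a sketch of that suggestion (the add kernel and variable names are assumed, as in the earlier posts), the query goes immediately after the launch. cudaStreamQuery() itself does not block; it only reports whether the work in the stream has completed. The flushing of batched launches is a side effect of the Windows driver as described above, not something the call guarantees.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed one-thread kernel, used only for illustration.
__global__ void add(int a, int b, int *c) { *c = a + b; }

int main() {
    int c = 0;
    int *dev_c = nullptr;
    cudaMalloc((void**)&dev_c, sizeof(int));

    add<<<1, 1>>>(2, 7, dev_c);

    // Non-blocking status check on stream 0. On Windows this is the nudge
    // that should push a batched launch to the GPU right away.
    cudaError_t status = cudaStreamQuery(0);
    if (status == cudaSuccess) {
        printf("kernel already finished\n");
    } else if (status == cudaErrorNotReady) {
        printf("kernel submitted, still queued or running\n");
    } else {
        printf("unexpected error: %s\n", cudaGetErrorString(status));
    }

    // ... host work that should overlap with the kernel goes here ...

    // Blocking copy: waits for the kernel, then reads the result.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_c);
    return 0;
}
```

Either status is a normal outcome here; cudaErrorNotReady just means the kernel has not completed yet.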

wow!! I got it. Your answer is exactly to the point! Thank you very much!

Units for the Sleep function are milliseconds. And it used to be that Sleep could not reliably provide granularity below 10 ms (for example, the call Sleep(1) would make your application sleep for anywhere between zero and 10 ms). Not sure how it works on Win7 now.

Oh, and if you are using Windows 7, my statement about the minimum time for a kernel is wrong. Kernel launch on Windows with the display driver takes a lot longer than on Linux.