Is a kernel in stream 0 asynchronous?

guanlelennon · April 21, 2011, 2:50pm

Hello everyone,
I'm new to CUDA. Can anyone tell me whether kernel execution in stream 0 is asynchronous or not? If it is, why can the cudaMemcpy that follows read the result from the device correctly without calling cudaThreadSynchronize()? If not, why does all the documentation say that it is asynchronous?

seibert · April 21, 2011, 2:54pm

cudaMemcpy() has an implicit synchronization, unlike cudaMemcpyAsync().
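A minimal sketch of that behaviour, assuming a trivial `add` kernel like the one used later in the thread (its definition is not shown there, so the one below is a guess) and a hypothetical helper named `add_on_device`. The launch returns to the host right away; the blocking cudaMemcpy() in the default stream then waits for the kernel before copying, which is why the result is correct without an explicit synchronize.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: writes a + b to *c.
__global__ void add(int a, int b, int *c) { *c = a + b; }

// Launches the kernel in stream 0 and reads the result back.
// Returns the value read, or -1 if any CUDA call fails.
int add_on_device(int a, int b) {
    int *dev_c = NULL;
    int c = -1;
    if (cudaMalloc((void**)&dev_c, sizeof(int)) != cudaSuccess) return -1;

    // The launch is queued; control returns to the host without waiting.
    add<<<1, 1>>>(a, b, dev_c);

    // Blocking copy in the default stream: it does not start until the
    // preceding kernel has finished, and it returns only when c is valid.
    cudaError_t err = cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    return err == cudaSuccess ? c : -1;
}
```

With cudaMemcpyAsync() in a non-default stream, the same code would need an explicit synchronization before reading `c` on the host.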

guanlelennon · April 21, 2011, 3:05pm

Thank you. Please have a look at the following code. I found that the longer the CPU sleeps, the longer the application takes to run. Does that imply that the kernel and the CPU thread execute serially?

#include <stdio.h>
#include <windows.h>   // Sleep()

// add() was not shown in the post; assumed to be a trivial kernel like this
__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    // capture the start time
    cudaEvent_t     start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );

    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );

    add<<<1,1>>>( 2, 7, dev_c );
    Sleep(1000);
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    // get stop time, and display the timing results
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    float   elapsedTime;
    cudaEventElapsedTime( &elapsedTime, start, stop );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );

    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );
    getchar();
    return 0;
}

Lev · April 21, 2011, 3:40pm

CUDA calls are not asynchronous on some OSes, so you will not see any benefit from running the CPU and GPU in parallel.

seibert · April 21, 2011, 8:21pm

What are the units for the sleep function? Microseconds? The kernel probably finished in 50 microseconds, so your runtime is dominated by the sleep function.

guanlelennon · April 22, 2011, 1:09am

Thanks for your patience. Yes, the units for the sleep function are microseconds. Actually, when I call Sleep(0), the result is 551.4 ms. When sleeping 50 ms, the result is 601.4 ms, and when sleeping 450 ms, the result is 1001.4 ms… Does my computer not support overlapping, or is it something else? I'm going crazy… My graphics card is a GT 310 with compute capability 1.2.

guanlelennon · April 22, 2011, 1:26am

Is there any official documentation about that? My OS is Windows 7…

tera · April 22, 2011, 2:02am

In order to overlap anything, you must allow the kernel to do some significant work that can be overlapped with something on the host. If your kernel execution time is fully determined by launch overhead, it makes no difference whether the kernel launch is synchronous or asynchronous.

Having said that, kernel launches on Windows may appear as if they were synchronous because the driver batches kernel launches there. Place a cudaStreamQuery(0) directly after the kernel launch in order to immediately send it to the GPU.
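A rough sketch of both points, under some assumptions: `busy_kernel` and `launch_and_overlap` are hypothetical names, and the loop is only a stand-in for a kernel with enough work to be worth overlapping. cudaStreamQuery(0) does not block; it returns cudaSuccess or cudaErrorNotReady, and calling it right after the launch is the suggestion above for getting the driver to submit the queued kernel immediately.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a single thread counts up to iters in float.
__global__ void busy_kernel(float *out, int iters) {
    float x = 0.0f;
    for (int i = 0; i < iters; ++i) x += 1.0f;
    *out = x;
}

// Launches the kernel, nudges the driver, then copies the result back.
// Returns the kernel's result, or -1.0f if any CUDA call fails.
float launch_and_overlap(int iters) {
    float *dev_out = NULL;
    float out = -1.0f;
    if (cudaMalloc((void**)&dev_out, sizeof(float)) != cudaSuccess) return -1.0f;

    busy_kernel<<<1, 1>>>(dev_out, iters);

    // Non-blocking status query on stream 0; the return value is ignored
    // here because the call is made only to flush the batched launch.
    cudaStreamQuery(0);

    // ... independent host work would go here and could overlap the kernel ...

    // Blocking copy: waits for the kernel, then transfers the result.
    cudaError_t err = cudaMemcpy(&out, dev_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    return err == cudaSuccess ? out : -1.0f;
}
```

Whether any overlap is actually observed depends on the kernel running long enough relative to the launch overhead, as noted above.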

guanlelennon · April 22, 2011, 2:31am

wow!! I got it. Your answer is exactly to the point! Thank you very much!

hamster143 · April 22, 2011, 10:00pm

Units for the Sleep function are milliseconds. And it used to be that Sleep could not reliably provide granularity below 10 ms (for example, the call Sleep(1) would make your application sleep for anywhere between zero and 10 ms). Not sure how it works on Win7 now.

seibert · April 23, 2011, 6:34pm

Oh, and if you are using Windows 7, my statement about the minimum time for a kernel is wrong. Kernel launch on Windows with the display driver takes a lot longer than on Linux.
