Some CUDA questions

I just have some questions I'd like to verify.

  1. Is a CUDA event a timer that runs on the GPU, giving better results and performance compared to a CPU timer?

  2. If I have 2 kernels one after another with no data dependency, such as:

kernel2 <<< grid, block, 0 >>>(din, o1);
kernel3 <<< grid, block, 0 >>>(din, o2);
cudaMemcpy(h1, o1, size, cudaMemcpyDeviceToHost);
cudaMemcpy(h2, o2, size, cudaMemcpyDeviceToHost);

do these run concurrently or serially if the default stream is 0? If serially, do I just declare 2 streams so they can run concurrently? And is this synchronous or asynchronous?

Never use the default stream if you want operations from different streams to execute concurrently. Also, never use cudaMemcpy() in that case; use cudaMemcpyAsync(). For most use cases, a high-resolution CPU timer is fully adequate for measurements down to one-microsecond resolution.
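For example, here is a minimal host-timer sketch (the kernel name, sizes, and launch configuration are made up for illustration). Note the cudaDeviceSynchronize() before stopping the clock, since a kernel launch returns before the kernel finishes:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *d)   // hypothetical kernel, for illustration only
{
    if (threadIdx.x == 0) d[blockIdx.x] = 1.0f;
}

int main()
{
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    cudaDeviceSynchronize();   // drain any pending work so it doesn't pollute the measurement
    auto t0 = std::chrono::high_resolution_clock::now();

    mykernel<<<4, 256>>>(d);

    cudaDeviceSynchronize();   // launch is asynchronous; wait for completion before stopping the clock
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("kernel time: %.1f us\n", us);

    cudaFree(d);
    return 0;
}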

CUDA events are managed by the CUDA runtime. The CUDA runtime is something like an operating system that manages the GPUs. CUDA events inherently understand CUDA operations, whereas host-based timing methods do not. This means if you launch a CUDA event into a stream, for example, it will obey stream semantics (e.g. execution order). CUDA event-based timing may give a better result and performance depending on which CPU/host timing method you are using, and other factors as well.
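As an illustration, here is a minimal event-based timing sketch (reusing the hypothetical mykernel and d from the host-timer example above). Because the events obey stream semantics, the elapsed time covers exactly the work enqueued between them:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);              // enqueued into the default stream
mykernel<<<4, 256>>>(d);             // the work being timed
cudaEventRecord(stop);

cudaEventSynchronize(stop);          // block the host until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time, in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);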

The code shown will run “serially” with respect to CUDA/the GPU. All CUDA operations launched into the same stream are executed in order. This is a somewhat separate concept from synchronous vs. asynchronous. Note that the 3rd parameter in the kernel launch configuration <<<grid,block,0>>> is the dynamic shared memory size, not the stream ID; the stream is passed as an optional 4th parameter.

Yes. You can cause the 2 kernels shown to have the opportunity to run concurrently by launching them into separate streams.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
kernel2 <<< grid, block, 0, s0 >>>(din, o1);
kernel3 <<< grid, block, 0, s1 >>>(din, o2);

Note that this does not guarantee concurrent execution; concurrency may not be witnessed if the first kernel launched is large enough to fill the machine.

Kernel calls are asynchronous: the host thread continues on to the next line of code even before the kernel actually starts running. (If you think about this carefully, you will realize this is a necessary condition for kernel2 and kernel3 above to run concurrently.) cudaMemcpy() is a synchronous call, but there is an asynchronous version, cudaMemcpyAsync().
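Putting these pieces together, here is a sketch of a fully asynchronous version of the code from your question. It assumes kernel2, kernel3, din, o1, o2, grid, block, and a transfer size N are declared as in the question; note that cudaMemcpyAsync() needs pinned (page-locked) host memory in order to actually overlap with other work:

float *h1, *h2;
cudaHostAlloc(&h1, N * sizeof(float), cudaHostAllocDefault);  // pinned host buffers
cudaHostAlloc(&h2, N * sizeof(float), cudaHostAllocDefault);

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

kernel2 <<< grid, block, 0, s0 >>>(din, o1);
kernel3 <<< grid, block, 0, s1 >>>(din, o2);

// Each copy is issued into the same stream as its producing kernel, so
// stream ordering guarantees the kernel finishes before its copy starts.
cudaMemcpyAsync(h1, o1, N * sizeof(float), cudaMemcpyDeviceToHost, s0);
cudaMemcpyAsync(h2, o2, N * sizeof(float), cudaMemcpyDeviceToHost, s1);

// The host thread is free to do other work here; block only when results are needed.
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);

cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
cudaFreeHost(h1);
cudaFreeHost(h2);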

Relevant doc section:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution

Relevant sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simplestreams

Alright, thanks.
So cudaMemcpy() blocks the host until the kernel finishes before copying back, whereas cudaMemcpyAsync() is non-blocking and copies the data back once the preceding work in its stream has finished?

What exactly are you planning to time using CUDA events? Greg Smith has pointed out on multiple occasions that cudaEventRecord() is not recommended for timing CPU-side activities. See for example his post here:

gpgpu - Strategies for timing CUDA Kernels: Pros and Cons? - Stack Overflow

I’m using CUDA events for kernels. I read in some docs (more like people’s comments on other forums) that a host timer and CUDA events can achieve the same thing, but that CUDA events are better for kernel timing.

The best way to look at kernel execution time, together with excellent visualization that helps with examining concurrency issues, is the CUDA Visual Profiler.