Some CUDA questions

I just have some questions I'd like to verify.

  1. Is a CUDA event a timer that runs on the GPU, giving better results and performance compared to a CPU timer?

  2. If I have 2 kernels one after another with no data dependency, such as:

kernel2 <<< grid, block, 0 >>>(din, o1);
kernel3 <<< grid, block, 0 >>>(din, o2);
cudaMemcpy(h1, o1, size, cudaMemcpyDeviceToHost);
cudaMemcpy(h2, o2, size, cudaMemcpyDeviceToHost);

do these run concurrently or serially if the default stream is 0? If serially, do I just declare 2 streams so they can run concurrently? And is this synchronous or asynchronous?

Never use the default stream if you want operations from different streams to execute concurrently. Also, never use cudaMemcpy() in that case; use cudaMemcpyAsync(). For most use cases, a high-resolution CPU timer is fully adequate for measurements down to one-microsecond resolution.
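For example, here is a minimal host-timer sketch (the kernel name, sizes, and launch configuration are made up for illustration). Note the cudaDeviceSynchronize() before stopping the clock, since a kernel launch returns before the kernel finishes:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *d)   // hypothetical kernel, for illustration only
{
    if (threadIdx.x == 0) d[blockIdx.x] = 1.0f;
}

int main()
{
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    cudaDeviceSynchronize();   // drain any pending work so it doesn't pollute the measurement
    auto t0 = std::chrono::high_resolution_clock::now();

    mykernel<<<4, 256>>>(d);

    cudaDeviceSynchronize();   // launch is asynchronous; wait for completion before stopping the clock
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("kernel time: %.1f us\n", us);

    cudaFree(d);
    return 0;
}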

CUDA events are managed by the CUDA runtime. The CUDA runtime is something like an operating system that manages the GPUs. CUDA events inherently understand CUDA operations, whereas host-based timing methods do not. This means if you launch a CUDA event into a stream, for example, it will obey stream semantics (e.g. execution order). CUDA event-based timing may give a better result and performance depending on which CPU/host timing method you are using, and other factors as well.
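As an illustration, here is a minimal event-based timing sketch (reusing the hypothetical mykernel and d from the host-timer example above). Because the events obey stream semantics, the elapsed time covers exactly the work enqueued between them:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);              // enqueued into the default stream
mykernel<<<4, 256>>>(d);             // the work being timed
cudaEventRecord(stop);

cudaEventSynchronize(stop);          // block the host until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time, in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);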

The code shown will run “serially” with respect to CUDA/the GPU. All CUDA operations launched into the same stream are executed in order. This is a somewhat separate concept from synchronous vs. asynchronous. Note that the 3rd parameter in the kernel launch configuration <<<grid,block,0>>> is the dynamic shared memory size, not the stream ID; the stream is passed as an optional 4th parameter.

Yes. You can cause the 2 kernels shown to have the opportunity to run concurrently by launching them into separate streams.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
kernel2 <<< grid, block, 0, s0 >>>(din, o1);
kernel3 <<< grid, block, 0, s1 >>>(din, o2);

Note that this does not guarantee concurrent execution; concurrency may not be witnessed if the first kernel launched is large enough to fill the machine.

Kernel calls are asynchronous: the host thread continues on to the next line of code even before the kernel actually starts running. (If you think about this carefully, you will realize this is a necessary condition for kernel2 and kernel3 above to run concurrently.) cudaMemcpy() is a synchronous call, but there is an asynchronous version, cudaMemcpyAsync().
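Putting these pieces together, here is a sketch of a fully asynchronous version of the code from your question. It assumes kernel2, kernel3, din, o1, o2, grid, block, and a transfer size N are declared as in the question; note that cudaMemcpyAsync() needs pinned (page-locked) host memory in order to actually overlap with other work:

float *h1, *h2;
cudaHostAlloc(&h1, N * sizeof(float), cudaHostAllocDefault);  // pinned host buffers
cudaHostAlloc(&h2, N * sizeof(float), cudaHostAllocDefault);

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

kernel2 <<< grid, block, 0, s0 >>>(din, o1);
kernel3 <<< grid, block, 0, s1 >>>(din, o2);

// Each copy is issued into the same stream as its producing kernel, so
// stream ordering guarantees the kernel finishes before its copy starts.
cudaMemcpyAsync(h1, o1, N * sizeof(float), cudaMemcpyDeviceToHost, s0);
cudaMemcpyAsync(h2, o2, N * sizeof(float), cudaMemcpyDeviceToHost, s1);

// The host thread is free to do other work here; block only when results are needed.
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);

cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
cudaFreeHost(h1);
cudaFreeHost(h2);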

Relevant doc section:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution

Relevant sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simplestreams

Alright, thanks.
So cudaMemcpy() blocks the host until the kernel finishes before copying back, whereas cudaMemcpyAsync() is non-blocking and copies the data back once the preceding work in its stream has finished?

What exactly are you planning to time using CUDA events? Greg Smith has pointed out on multiple occasions that cudaEventRecord() is not recommended for timing CPU-side activities. See for example his post here:

gpgpu - Strategies for timing CUDA Kernels: Pros and Cons? - Stack Overflow

I’m using CUDA events for kernels. I read in some docs (more like people’s comments on other forums) that a host timer and CUDA events can achieve the same thing, but that CUDA events are better for kernel timing.

The best way to look at kernel execution time, together with excellent visualization that helps with examining concurrency issues, is the CUDA Visual Profiler.