Are CUDA events a timer run on the GPU, and do they give better results and performance compared to using a CPU timer?
If I have 2 kernels one after another without a data dependency, such as:
kernel2<<<grid, block, 0>>>(din, o1);
kernel3<<<grid, block, 0>>>(din, o2);
are these running concurrently or serially if the default stream is 0? If serially, do I just declare 2 streams so they can run concurrently? Is this synchronous or asynchronous?
Never use the default stream if you want operations from different streams to execute concurrently. Also, never use cudaMemcpy() in that case; use cudaMemcpyAsync(). For most use cases, a high-resolution CPU timer is fully adequate for measurements down to one-microsecond resolution.
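For illustration, a minimal sketch of an async copy issued into a non-default stream (buffer names and `N` are hypothetical; note that the host buffer must be pinned, e.g. via cudaMallocHost, for the copy to actually overlap with other work):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

float *h_buf, *d_buf;
cudaMallocHost(&h_buf, N * sizeof(float));  // pinned host memory - required for a truly async copy
cudaMalloc(&d_buf, N * sizeof(float));

// returns to the host immediately; the copy is ordered within 'stream', not the default stream
cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, stream);

cudaStreamSynchronize(stream);  // wait only for work queued in this stream
cudaStreamDestroy(stream);
```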
CUDA events are managed by the CUDA runtime, which is something like an operating system that manages the GPUs. CUDA events inherently understand CUDA operations, whereas host-based timing methods do not. This means that if you record a CUDA event into a stream, for example, it will obey stream semantics (e.g. execution order). CUDA event-based timing may give better results and performance depending on which CPU/host timing method you are using, and on other factors as well.
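As a concrete sketch, timing a kernel with events looks like this (using the kernel launch from the question; the events are recorded into the same stream as the kernel, so they bracket its execution):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);              // enqueued into the default stream, before the kernel
kernel2<<<grid, block>>>(din, o1);   // the timed region
cudaEventRecord(stop);               // enqueued after the kernel

cudaEventSynchronize(stop);          // block the host until 'stop' has been reached on the GPU
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```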
The code shown will run serially with respect to CUDA/the GPU: all CUDA operations launched into the same stream are executed in order. This is a somewhat separate question from synchronous vs. asynchronous. Note that the 3rd parameter in the launch configuration <<<grid, block, 0>>> is the dynamic shared memory size per block, not the stream ID; the stream is the 4th parameter.
Yes. You can give the 2 kernels shown the opportunity to run concurrently by launching them into separate (non-default) streams. Note that this does not guarantee concurrent execution: concurrency may not be witnessed if the first kernel launched is large enough to fill the machine.
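A two-stream version of the code in the question could look like this (a sketch; the stream handle names are my own):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// the 4th launch parameter is the stream; these two launches are now free to overlap
kernel2<<<grid, block, 0, s1>>>(din, o1);
kernel3<<<grid, block, 0, s2>>>(din, o2);

cudaDeviceSynchronize();   // wait for all work in both streams to finish
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```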
Kernel launches are asynchronous: the host thread continues on with the next line of code even before the kernel actually starts running. (If you think about this carefully, you will realize this is a necessary condition for kernel2 and kernel3 above to run concurrently.) cudaMemcpy() is a synchronous call, but there is an asynchronous version, cudaMemcpyAsync().
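You can see this host-side asynchrony directly (a sketch; doCpuWork() is a hypothetical placeholder for independent host code):

```cuda
kernel2<<<grid, block>>>(din, o1);  // returns to the host immediately
doCpuWork();                        // hypothetical CPU work, overlapping with the kernel
cudaDeviceSynchronize();            // block until all previously launched GPU work is done
```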
What exactly are you planning to time using CUDA events? Greg Smith has pointed out on multiple occasions that cudaEventRecord() is not recommended for timing CPU-side activities. See, for example, his post here: