I will include cudaEvents as soon as I can, excellent idea! It's just that the SDK projects use the cutil timer, so I used it for timing in my code as well.
About the kernel launches: if they ran asynchronously with respect to each other, I would not get the same results as the sequential algorithm on the CPU. But I found this in the programming guide, and I find it really confusing:
[codebox]4.5.1.5 Asynchronous Concurrent Execution
In order to facilitate concurrent execution between host and device, some runtime
functions are asynchronous: Control is returned to the application before the device
has completed the requested task. These are:
• Kernel launches through __global__ functions or cuLaunchGrid() and
cuLaunchGridAsync();
…
Applications manage concurrency through streams. A stream is a sequence of
operations that execute in order. Different streams, on the other hand, may execute
their operations out of order with respect to one another or concurrently.
Any kernel launch, memory set, or memory copy function without a stream
parameter or with a zero stream parameter begins only after all preceding operations
are done, including operations that are part of streams, and no subsequent operation
may begin until it is done. Kernel launches for which no stream parameter is
provided and memory copies without an Async suffix are assigned to the default
zero stream.[/codebox]
This is what I understand: the application gets control back before my kernel (which is a __global__ function) has completed, and it will then stall at the next kernel launch, memory set, or memory copy without a stream parameter until the previous kernel is done, because I launch my kernels with the zero stream parameter. Is that right?
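To make the question concrete, here is a minimal sketch of the situation as I read the quoted passage. The kernels `addOne` and `timesTwo` and the helper `runAndCountMismatches` are hypothetical names made up for this illustration, not my real code. Both kernels are launched without a stream parameter, so they should go to the default zero stream and run one after the other, and the blocking `cudaMemcpy` should only start once both are done.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels, only for illustration.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void timesTwo(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Runs both kernels in the default stream and compares the result
// against the same two steps computed sequentially on the CPU.
int runAndCountMismatches() {
    const int n = 1024;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // No stream parameter: both launches are assigned to stream 0.
    // Control returns to the host right away, but the second kernel
    // should not begin on the device until the first one has finished.
    addOne<<<n / 256, 256>>>(dev, n);
    timesTwo<<<n / 256, 256>>>(dev, n);

    // Memory copy without the Async suffix: also stream 0, and it
    // blocks the host until the data is back.
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != (i + 1) * 2) ++mismatches;  // CPU reference
    }
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", runAndCountMismatches());
    return 0;
}
```

If my reading is correct, the ordering comes from the zero stream itself, which would explain why my results match the CPU version even though the launches return immediately.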
Now, does that include cudaEvents? And what about cutil timers?
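For reference, this is roughly the cudaEvent timing I plan to try. It is a minimal sketch, not my real code: the kernel `scale` and the helper `timeKernelMs` are hypothetical names, and I am assuming that events recorded with stream 0 are ordered with the kernel launch in the same way as the other zero-stream operations described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for my real one.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the elapsed GPU time in milliseconds for one launch,
// measured with two events recorded in stream 0.
float timeKernelMs(int n) {
    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemset(dev, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                    // recorded before the kernel
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    cudaEventRecord(stop, 0);                     // recorded after the kernel
    cudaEventSynchronize(stop);                   // host waits for the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}

int main() {
    printf("kernel time: %f ms\n", timeKernelMs(1 << 20));
    return 0;
}
```

The `cudaEventSynchronize(stop)` call is there because the launch returns before the kernel finishes, so the host has to wait for the stop event before reading the elapsed time.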