cudaEventRecord functionality

I am using CUDA 1.1 in a Linux environment (Ubuntu 7.04) with an 8800GT card.

According to the 1.1 Programming Guide, passing a stream parameter to cudaEventRecord should cause the event to be recorded after all preceding operations in that stream have completed. I have noticed that this is not strictly true. If a kernel is launched in one stream and cudaEventRecord is then called on a different stream, the event is not actually recorded until the kernel has finished executing, even though the kernel is not in the event's stream. cudaEventSynchronize shows the same behavior: it does not return until the kernel has finished.
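A minimal sketch of how this can be reproduced. The kernel name spinKernel, the iteration count, and the launch configuration are placeholders chosen only so that the kernel is still running when the host queries the event; they are not taken from my actual application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy kernel: loops long enough that it is still running
// when the host queries the event. The iteration count is arbitrary.
__global__ void spinKernel(int *out, int iterations)
{
    int acc = 0;
    for (int i = 0; i < iterations; ++i)
        acc += i % 7;
    // Store the result so the compiler cannot remove the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    int *d_out = 0;
    cudaMalloc((void **)&d_out, 64 * sizeof(int));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    cudaEvent_t ev;
    cudaEventCreate(&ev);

    // Kernel goes into stream A, the event into stream B.
    spinKernel<<<1, 64, 0, streamA>>>(d_out, 1 << 26);
    cudaEventRecord(ev, streamB);

    // Non-blocking check: cudaSuccess means the event has already been
    // recorded, cudaErrorNotReady means it is still pending.
    cudaError_t status = cudaEventQuery(ev);
    printf("event right after launch: %s\n", cudaGetErrorString(status));

    // Block until the event is recorded, then check whether the kernel
    // in stream A is also finished at that point.
    cudaEventSynchronize(ev);
    status = cudaStreamQuery(streamA);
    printf("stream A after event sync: %s\n", cudaGetErrorString(status));

    // Clean up.
    cudaStreamSynchronize(streamA);
    cudaEventDestroy(ev);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(d_out);
    return 0;
}
```

If the event depended only on stream B, the first query could succeed while the kernel is still running. What I observe corresponds to the event staying pending until the kernel in stream A is done.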

Is this the intended behavior for the two functions?

On a related note, there is an error in the Programming Guide, on page 104 of the PDF. The second parameter of cudaEventRecord should be of type cudaStream_t, not CUstream, since this is a Runtime API function.

I have a similar question. I am a bit confused about how to measure time when using streams. The Programming Guide says that an event in stream 0 is recorded only after all preceding operations from all streams have been completed by the device. I understand that stream 0 is the default stream, used when no stream parameter is given. Does this mean that, no matter how many streams you run, you have to do the following to benchmark the total running time of all the streams?

cudaEventRecord(start, 0);

// ... all the streams do their work ...

cudaEventRecord(stop, 0);
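Spelled out, the pattern I am asking about would look roughly like the sketch below. The kernel scaleKernel, the stream count, and the buffer size are made up for illustration. It assumes the streams are created with plain cudaStreamCreate and that the guide's statement about stream 0 holds, so that the stop event is recorded only after the work in every stream has finished.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: scales each element of a buffer in place.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int kNumStreams = 4;   // arbitrary number of streams
    const int n = 1 << 20;       // arbitrary elements per stream

    cudaStream_t streams[kNumStreams];
    float *d_data[kNumStreams];
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&d_data[s], n * sizeof(float));
        cudaMemset(d_data[s], 0, n * sizeof(float));
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Start marker in stream 0, before any stream work is issued.
    cudaEventRecord(start, 0);

    // All the streams do their work.
    for (int s = 0; s < kNumStreams; ++s)
        scaleKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n, 2.0f);

    // Stop marker in stream 0, issued after the work in all streams.
    cudaEventRecord(stop, 0);

    // Wait for the stop event before reading the elapsed time.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed time across all streams: %f ms\n", ms);

    // Clean up.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int s = 0; s < kNumStreams; ++s) {
        cudaFree(d_data[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

The call to cudaEventSynchronize(stop) matters here: cudaEventRecord only queues the event, so cudaEventElapsedTime is not meaningful until the stop event has actually been recorded.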