I am using CUDA 1.1 in a Linux environment (Ubuntu 7.04) with an 8800GT card.
According to the 1.1 Programming Guide, passing a stream parameter to cudaEventRecord should cause the event to be recorded once all preceding operations in that stream have completed. In practice, the event seems to wait on work in other streams as well. If a kernel is launched in one stream and cudaEventRecord is then called on a different stream, the event is not recorded until the kernel has finished executing. cudaEventSynchronize shows the same behavior: it does not return until the kernel has finished.
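For anyone who wants to try this, here is a minimal sketch of the kind of test that shows the behavior described above. It is not my exact code: the kernel spin, the stream names, and the launch sizes and iteration count are made up for illustration, and error checking is left out for brevity. It only uses Runtime API calls (cudaStreamCreate, cudaEventRecord, cudaEventQuery, cudaStreamQuery).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-work kernel: loops long enough for the host to observe it running.
__global__ void spin(float *out, int iters)
{
    float x = 0.0f;
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 1.0f;
    // Store the result so the compiler cannot drop the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int blocks = 64, threads = 256;   // illustrative sizes only
    cudaStream_t kernelStream, eventStream;
    cudaEvent_t ev;
    float *d_out = 0;

    cudaStreamCreate(&kernelStream);
    cudaStreamCreate(&eventStream);
    cudaEventCreate(&ev);
    cudaMalloc((void **)&d_out, blocks * threads * sizeof(float));

    // Launch the kernel in one stream, then record the event in a different stream.
    spin<<<blocks, threads, 0, kernelStream>>>(d_out, 1 << 22);
    cudaEventRecord(ev, eventStream);

    // Poll until the event has been recorded.
    while (cudaEventQuery(ev) == cudaErrorNotReady)
        ;

    // Immediately check whether the kernel's stream is still busy.
    cudaError_t kernelState = cudaStreamQuery(kernelStream);
    if (kernelState == cudaErrorNotReady)
        printf("event recorded while the kernel was still running\n");
    else
        printf("kernel stream already idle when the event was recorded (%s)\n",
               cudaGetErrorString(kernelState));

    cudaStreamSynchronize(kernelStream);
    cudaFree(d_out);
    cudaEventDestroy(ev);
    cudaStreamDestroy(kernelStream);
    cudaStreamDestroy(eventStream);
    return 0;
}
```

If the second message prints consistently, that suggests the event waited for the kernel in the other stream. A single run is not conclusive, since the kernel could simply finish between the two queries, so the iteration count may need to be raised (while staying under the display watchdog limit) to make the difference clear.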
Is this the intended behavior for the two functions?
On a related note, there is an error in the Programming Guide, on page 104 of the PDF. The second parameter of cudaEventRecord should be of type cudaStream_t, not CUstream, since this is a Runtime API function.