Concurrent Kernel Execution Synchronization


I was using cudaStreamSynchronize( streamid ) to synchronize my compute kernels, but I thought it might be better to synchronize with events instead. After switching, I see a decrease in performance (I need 8 events to define the dependencies).

So in which situations does cudaStreamSynchronize work better, and in which situations does event-based synchronization work better?
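To make the comparison concrete, here is a minimal sketch of the two patterns I mean. The kernels `produce` and `consume` and the helper `run_pipeline` are hypothetical placeholders, not my real code, and the sketch uses a single event rather than the 8 I actually need. As far as I understand, cudaStreamSynchronize blocks the host thread until the stream has drained, while cudaEventRecord plus cudaStreamWaitEvent only makes the second stream wait on the device.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real computation.
__global__ void produce(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

__global__ void consume(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// Runs produce on stream s0 and consume on stream s1.
// use_events selects how the dependency between them is expressed.
// Returns true if every out[i] equals i + 1 and no CUDA error occurred.
bool run_pipeline(bool use_events) {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(in); return false; }

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    produce<<<blocks, threads, 0, s0>>>(in, n);

    if (use_events) {
        // Device-side dependency: s1 waits for the event, the host keeps going.
        cudaEventRecord(done, s0);
        cudaStreamWaitEvent(s1, done, 0);
    } else {
        // Host-side barrier: the CPU thread blocks until s0 has finished.
        cudaStreamSynchronize(s0);
    }

    consume<<<blocks, threads, 0, s1>>>(in, out, n);
    cudaStreamSynchronize(s1);  // wait for the final result in both variants

    std::vector<int> host(n);
    cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == i + 1);

    cudaEventDestroy(done);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main() {
    std::printf("event-based:  %s\n", run_pipeline(true) ? "ok" : "FAILED");
    std::printf("stream sync:  %s\n", run_pipeline(false) ? "ok" : "FAILED");
    return 0;
}
```

Both variants should produce the same result; the difference is only in whether the host thread or the second stream does the waiting.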

I have another question, about forcing some kernels to run in parallel but slowly, without modifying the internals of the kernels or rewriting them. Is there a way to limit the resources (SMs or blocks) available to each kernel, so that the kernels are forced to run in parallel, each more slowly?