Zero-copy access sync question

Hi, in CUDA, kernel launches are asynchronous.
I am using the zero-copy access feature in the CUDA 2.2 beta.
How can I make sure the device kernel has finished processing all of the data, so that the host can safely use it in the following steps, without relying on a memory copy to synchronize with the kernel?