cudaThreadSynchronize

Is it necessary to call cudaThreadSynchronize() after a kernel launch and before copying device memory back to host memory?

Thanks
Miki

No. Host<->device memcpys synchronize implicitly, i.e. a cudaMemcpy() called after a kernel launch blocks until the kernel has finished and only then copies.
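
A minimal sketch of that pattern, in case it helps. The kernel name addOne and the sizes are made up for illustration, and it assumes everything runs in the default stream. Note also that cudaThreadSynchronize() is deprecated in current toolkits; cudaDeviceSynchronize() is its replacement if you ever do need an explicit wait.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // The launch itself returns to the host right away.
    addOne<<<(n + 255) / 256, 256>>>(dev, n);

    // No explicit synchronize here: the blocking cudaMemcpy waits for the
    // kernel (same default stream) to finish before it copies.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host[0] = %d, host[%d] = %d\n", host[0], n - 1, host[n - 1]);
    cudaFree(dev);
    return 0;
}
```

Checking the return value of the cudaMemcpy is worth doing anyway, since errors from the kernel execution can surface there.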

The exception is when you actually want asynchronous h<->d transfers. Then you use streams and cudaMemcpyAsync(), and you have to synchronize yourself before reading the result on the host.
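
A rough sketch of the asynchronous variant, under some assumptions: the kernel scale and the buffer size are hypothetical, and the host buffer is allocated with cudaMallocHost() because the copies only overlap with host work when the host memory is page-locked (pinned).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies every element by factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory, needed for truly asynchronous copies.
    float *host = nullptr;
    cudaMallocHost(&host, bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are queued on the same stream, so they run in
    // order relative to each other, but none of them blocks the host.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the CPU is free to do unrelated work here ...

    // Explicit wait: host[] is only safe to read after this returns.
    cudaStreamSynchronize(stream);
    printf("host[0] = %f\n", host[0]);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```

Here the explicit cudaStreamSynchronize() plays the role that the blocking cudaMemcpy() plays in the synchronous case.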