I have some code that copies data to a GPU buffer and then runs a model. It looks like this:
auto state = cudaMemcpyAsync(buffers_[images], img_blob_.get(), buffer_sizes_[images],
It runs well, but is it possible that, in some circumstances, execution might start before all of the input data has been copied across? If so, should I use cudaMemcpy instead, or call cudaStreamSynchronize(stream_)? Would there be any meaningful difference?
I assume that “runs a model” means executing a kernel on the GPU. From the snippet it is not clear how that kernel is invoked. In other words, I do not recognize context_->executeV2 as a built-in CUDA mechanism and assume it is an abstraction layer you have built around kernel launches.
Operations issued to the same CUDA stream are executed by the hardware in the order in which the software enqueued them. If the cudaMemcpyAsync and the kernel launch are both issued to the same CUDA stream, in that order, the data transfer is guaranteed to complete before the kernel starts operating on that data.
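To make the same-stream ordering concrete, here is a minimal sketch. It is not your code: the kernel scale_by_two and the function copy_then_run_same_stream are hypothetical stand-ins for "the model" and your wrapper, and I am assuming the kernel is launched into the same stream as the copy. Error checking is abbreviated.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical kernel standing in for "the model": doubles every element.
__global__ void scale_by_two(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns true if the kernel saw the fully copied input.
bool copy_then_run_same_stream(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h_buf = nullptr;
    float* d_buf = nullptr;
    cudaStream_t stream = nullptr;

    // Pinned host memory is needed for the copy to be truly asynchronous.
    if (cudaMallocHost(reinterpret_cast<void**>(&h_buf), bytes) != cudaSuccess) return false;
    if (cudaMalloc(reinterpret_cast<void**>(&d_buf), bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return false;
    }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return false;
    }

    for (int i = 0; i < n; ++i) h_buf[i] = static_cast<float>(i);

    // All three operations go into the same stream, so the GPU runs them
    // in this order: copy in, kernel, copy out.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_by_two<<<blocks, threads, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // The host is not ordered with the stream, so it must wait here
    // before it reads the results.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h_buf[i] == 2.0f * static_cast<float>(i));

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return ok;
}
```

Note that no synchronization sits between the copy and the kernel launch; the stream order alone is what keeps the kernel from starting early. The cudaStreamSynchronize call is only there so that the host does not read the output buffer too soon. The function can be called from main and built with nvcc.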
While simple CUDA programs may use synchronous cudaMemcpy calls, high-performance CUDA-accelerated applications very commonly use multiple CUDA streams and cudaMemcpyAsync throughout. This often also requires some form of inter-stream synchronization, typically done with CUDA events (cudaEventRecord and cudaStreamWaitEvent); see the CUDA documentation for details.
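As a rough illustration of inter-stream synchronization, the sketch below uses one stream for the upload and a second stream for the kernel, with an event tying them together. The names add_one, copy_stream, compute_stream and copy_then_run_two_streams are hypothetical and chosen only for this example; whether two streams help at all depends on your application. Error checking is abbreviated.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: upload on one stream, compute on another,
// ordered by an event. Returns true if the results are correct.
bool copy_then_run_two_streams(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h_buf = nullptr;
    float* d_buf = nullptr;
    cudaStream_t copy_stream = nullptr;
    cudaStream_t compute_stream = nullptr;
    cudaEvent_t upload_done = nullptr;

    if (cudaMallocHost(reinterpret_cast<void**>(&h_buf), bytes) != cudaSuccess) return false;
    if (cudaMalloc(reinterpret_cast<void**>(&d_buf), bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return false;
    }
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);
    // Timing is not needed here, so disable it to keep the event cheap.
    cudaEventCreateWithFlags(&upload_done, cudaEventDisableTiming);

    for (int i = 0; i < n; ++i) h_buf[i] = static_cast<float>(i);

    // Upload on copy_stream, then mark the point where the upload ends.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(upload_done, copy_stream);

    // Work enqueued to compute_stream after this call will not start
    // until upload_done has completed. The host is not blocked.
    cudaStreamWaitEvent(compute_stream, upload_done, 0);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads, 0, compute_stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, compute_stream);

    // Wait on the host before reading the results.
    bool ok = (cudaStreamSynchronize(compute_stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h_buf[i] == static_cast<float>(i) + 1.0f);

    cudaEventDestroy(upload_done);
    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return ok;
}
```

Without the cudaStreamWaitEvent call, nothing would order the kernel in compute_stream after the copy in copy_stream, and the kernel could read incomplete input.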