Performance issue with memcpy

In my case, copying data takes 40% of the total time. If I comment out the lines that do the copies, I get a significant speedup. Is that expected?

before = GetTick();

CHECK(cudaMemcpyAsync(buffers[inputIndex], input, INPSIZE, cudaMemcpyHostToDevice, stream));   // <<< here
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], OUTSIZE, cudaMemcpyDeviceToHost, stream)); // <<< and here

after = GetTick();

printTimeInterval(before, after);


In general, copying memory between the device and host can be a bottleneck in GPU-accelerated computation, so for good performance it pays to keep transfers to a minimum, batch small transfers together, and use pinned host memory where you can. See this guide for memory optimizations on CUDA best practices:
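To see how much of the time really goes to the copy, it can help to time the transfer on its own with CUDA events. A host timer around asynchronous calls is only meaningful if the stream has been synchronized before the second reading. Below is a minimal sketch, not taken from the original code, that compares a host-to-device copy from pageable memory against one from pinned memory. The `timeH2D` helper, the local `CHECK` macro, and the 64 MiB buffer size are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Local stand-in for the CHECK macro used in the question (assumed behavior:
// abort with a message if a CUDA runtime call fails).
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: time one host-to-device copy in milliseconds,
// using events recorded on the same stream as the copy.
float timeH2D(void* dst, const void* src, size_t bytes, cudaStream_t stream) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, stream));
    CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));  // wait until the copy has finished
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size

    void* pageable = std::malloc(bytes);  // ordinary host memory
    if (pageable == nullptr) return EXIT_FAILURE;
    void* pinned = nullptr;               // page-locked host memory
    void* device = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    CHECK(cudaMalloc(&device, bytes));
    std::memset(pageable, 1, bytes);
    std::memset(pinned, 1, bytes);

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    timeH2D(device, pinned, bytes, stream);  // warm-up, result ignored
    float msPageable = timeH2D(device, pageable, bytes, stream);
    float msPinned = timeH2D(device, pinned, bytes, stream);
    std::printf("pageable: %.3f ms, pinned: %.3f ms\n", msPageable, msPinned);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(device));
    CHECK(cudaFreeHost(pinned));
    std::free(pageable);
    return 0;
}
```

If the pinned copy is clearly faster on your machine, allocating `input` and `output` with `cudaMallocHost` is a cheap first thing to try. Pinned memory is also what allows `cudaMemcpyAsync` to be truly asynchronous with respect to the host.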

Can the GPU keep working while another process is copying its data to the device? If yes, then some kind of pipeline could be established, right?

Yes, that’s essentially what an asynchronous memcpy gives you: the copy can overlap with other work on the GPU. See this section:

There’s also a bit of an explanation in this SO post:
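As a rough illustration of the pipeline idea, here is a minimal sketch that splits the work into chunks and alternates them between two streams, so the copy for one chunk can overlap with the kernel for another. It assumes pinned host buffers and a device that supports copy/compute overlap. The `scale` kernel is a hypothetical stand-in for the real work, and `runPipeline`, the chunk count, and the chunk size are made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Local stand-in for the CHECK macro used in the question.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the real computation (e.g. an inference call).
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Runs the chunked pipeline and returns the number of wrong output elements.
int runPipeline() {
    const int kStreams = 2;           // chunks alternate between two streams
    const int kChunks = 8;
    const int kChunkElems = 1 << 20;  // elements per chunk, arbitrary
    const size_t chunkBytes = kChunkElems * sizeof(float);
    const size_t total = static_cast<size_t>(kChunks) * kChunkElems;

    // Pinned host buffers, so the async copies can overlap with kernels.
    float* hIn = nullptr;
    float* hOut = nullptr;
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&hIn), total * sizeof(float)));
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&hOut), total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) hIn[i] = static_cast<float>(i % 1000);

    // Each stream gets its own device buffers so chunks do not collide.
    cudaStream_t streams[kStreams];
    float* dIn[kStreams];
    float* dOut[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc(reinterpret_cast<void**>(&dIn[s]), chunkBytes));
        CHECK(cudaMalloc(reinterpret_cast<void**>(&dOut[s]), chunkBytes));
    }

    // Work within one stream runs in order; work in different streams may overlap.
    for (int c = 0; c < kChunks; ++c) {
        const int s = c % kStreams;
        const size_t off = static_cast<size_t>(c) * kChunkElems;
        CHECK(cudaMemcpyAsync(dIn[s], hIn + off, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, streams[s]>>>(
            dIn[s], dOut[s], kChunkElems);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hOut + off, dOut[s], chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for every stream to drain

    int errors = 0;
    for (size_t i = 0; i < total; ++i)
        if (hOut[i] != 2.0f * hIn[i]) ++errors;

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(dIn[s]));
        CHECK(cudaFree(dOut[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return errors;
}

int main() {
    const int errors = runPipeline();
    std::printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

In the code from the question, the kernel launch would be replaced by the `context.enqueue(...)` call on the matching stream, with a separate set of `buffers` per stream. I believe each stream that runs concurrently would also need its own execution context, but please check that against the TensorRT documentation for your version.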