cudaMemcpyAsync and malloc

Hello

I need to use cudaMemcpyAsync to copy a buffer from the device to the host. However, the host buffer I am provided with was allocated with malloc, not cudaHostAlloc. Will my code still produce the correct result? From my tests so far it appears to, but I want to double-check that this is guaranteed across platforms. I realize that since I am using malloc the copy operation will be serialized, but at this point I am only interested in correctness.

Thank you

It will work, but it will have synchronizing behavior.

If you don’t use a pinned allocation, then cudaMemcpyAsync “falls back” to behaving like cudaMemcpy.

So stream and concurrency behavior may not be exactly what you expect.

I can’t guarantee anything about the behavior or correctness of your application. I can only say what I’ve said above. The data will still be copied, as if you used cudaMemcpy.
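
To make the "synchronizing behavior" concrete, here is a minimal sketch that times how long the host spends inside the cudaMemcpyAsync call itself, once with a malloc buffer and once with a cudaHostAlloc buffer. The kernel `spin`, the helper `timed_copy`, the buffer sizes and the iteration count are all made up for illustration; only the CUDA runtime calls are real. I would expect the pageable call to block for roughly the duration of the kernel plus the transfer and the pinned call to return almost immediately, but the sketch does not assert any particular numbers.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel whose only purpose is to keep the stream busy for a while.
__global__ void spin(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Hypothetical helper: host-side milliseconds spent inside the cudaMemcpyAsync call.
static double timed_copy(float *h_dst, const float *d_src, size_t bytes, cudaStream_t s) {
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(h_dst, d_src, bytes, cudaMemcpyDeviceToHost, s);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 22;                 // arbitrary size for illustration
    const size_t bytes = n * sizeof(float);
    const int blocks = (n + 255) / 256;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemset(d_buf, 0, bytes);

    float *h_pageable = (float *)malloc(bytes);                      // pageable
    float *h_pinned = nullptr;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);  // pinned

    // Queue work, then issue the copy into the pageable buffer.
    spin<<<blocks, 256, 0, stream>>>(d_buf, n, 2000);
    double ms_pageable = timed_copy(h_pageable, d_buf, bytes, stream);
    cudaStreamSynchronize(stream);

    // Same sequence, but copying into the pinned buffer.
    spin<<<blocks, 256, 0, stream>>>(d_buf, n, 2000);
    double ms_pinned = timed_copy(h_pinned, d_buf, bytes, stream);
    cudaStreamSynchronize(stream);

    printf("host time inside cudaMemcpyAsync, pageable: %.3f ms\n", ms_pageable);
    printf("host time inside cudaMemcpyAsync, pinned:   %.3f ms\n", ms_pinned);

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```

Error checking is left out to keep the sketch short; real code should check the return value of every runtime call.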

Let's say that cudaMemcpyAsync is assigned to a stream (streams[0]).
If I call cudaStreamSynchronize(streams[0]) right after the cudaMemcpyAsync call that uses the malloc buffer, am I still guaranteed that the device data will have been copied to the host buffer once cudaStreamSynchronize(streams[0]) returns? Or could there be synchronization problems because I used a buffer allocated with malloc?
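
For reference, the sequence being asked about can be sketched roughly as follows. The kernel `fill`, the buffer names and the sizes are placeholders for illustration, not taken from any real application; the check at the end only compares the host buffer against what the kernel wrote.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: writes each element's index into the buffer.
__global__ void fill(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

int main() {
    const int n = 1 << 20;                 // arbitrary size for illustration
    const size_t bytes = n * sizeof(int);

    cudaStream_t streams[1];
    CHECK(cudaStreamCreate(&streams[0]));

    int *d_buf = nullptr;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    // Pageable host buffer from malloc, standing in for the buffer I am given.
    int *h_buf = (int *)malloc(bytes);
    if (h_buf == nullptr) return EXIT_FAILURE;

    // Produce data on the device, copy it back, then wait on the same stream.
    fill<<<(n + 255) / 256, 256, 0, streams[0]>>>(d_buf, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, streams[0]));
    CHECK(cudaStreamSynchronize(streams[0]));

    // After the synchronize returns, the host buffer is read here.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_buf[i] != i) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    free(h_buf);
    CHECK(cudaFree(d_buf));
    CHECK(cudaStreamDestroy(streams[0]));
    return 0;
}
```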

Thanks

When cudaMemcpyAsync() is passed a pointer to a non-pinned host buffer, it automagically turns into a fully synchronous cudaMemcpy(). This makes for a safe fallback, but it often has negative performance implications which may not be immediately obvious.
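
If the allocation itself cannot be changed, one option that may be worth trying is to page-lock the existing malloc buffer with cudaHostRegister before the copy and release it with cudaHostUnregister afterwards. The following is only a sketch under that assumption: the buffer names and sizes are invented, registration can fail or be costly on some platforms, and whether it pays off depends on how often the buffer is reused. If registration fails, the sketch simply proceeds with the pageable buffer and the fallback behavior described above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);  // arbitrary size for illustration

    // Stand-in for a buffer that some other code allocated with malloc.
    float *h_buf = (float *)malloc(bytes);
    if (h_buf == nullptr) return EXIT_FAILURE;

    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemset(d_buf, 0, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Try to page-lock the existing allocation in place.
    cudaError_t reg = cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);
    bool pinned = (reg == cudaSuccess);
    if (!pinned) {
        fprintf(stderr, "cudaHostRegister failed (%s); using pageable buffer\n",
                cudaGetErrorString(reg));
        cudaGetLastError();  // clear the error before continuing
    }

    // The copy is issued the same way in both cases.
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // The registration must be released before the buffer is freed.
    if (pinned) cudaHostUnregister(h_buf);

    free(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```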