I need to use cudaMemcpyAsync to copy a buffer from the device to the host. However, the host buffer I am given was allocated with malloc, not cudaHostAlloc. Will my code still produce the correct result? In my tests so far it does, but I want to double-check that this is guaranteed across platforms. I realize that since I am using malloc the copy operation will be serialized, but at this point I am only interested in correctness.
It will work, but it will have synchronizing behavior.
If you don’t use a pinned allocation, then cudaMemcpyAsync “falls back” to behaving like cudaMemcpy.
So stream and concurrency behavior may not be exactly what you expect.
I can’t guarantee anything about the behavior or correctness of your application. I can only say what I’ve said above. The data will still be copied, as if you used cudaMemcpy.
Let's say that the cudaMemcpyAsync call is assigned to a stream (streams[0]).
If I call cudaStreamSynchronize(streams[0]) right after the cudaMemcpyAsync call that uses the malloc buffer, am I still guaranteed that the device data will have been copied to the host buffer once cudaStreamSynchronize(streams[0]) returns? Or could there be synchronization problems because I used a buffer allocated with malloc?
When cudaMemcpyAsync() is passed a pointer to a non-pinned host buffer, it automagically turns into a fully synchronous cudaMemcpy(). This makes for a safe fallback, but it often has negative performance implications that may not be immediately obvious.
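For anyone who wants to try the scenario from this thread, here is a minimal sketch: a device-to-host cudaMemcpyAsync into a malloc'd (pageable) buffer, followed by cudaStreamSynchronize on the same stream, then a check of the host data. The kernel name fill, the helper copy_to_pageable_host, and the buffer size are all made up for illustration, and n is assumed to be positive. It only checks the result on whatever machine you run it on; it is not a proof of anything across platforms.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: writes data[i] = i so the host can verify the copy.
__global__ void fill(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical helper (not from the thread): copies n ints (n > 0) from the
// device into a malloc'd host buffer with cudaMemcpyAsync, synchronizes the
// stream, and returns true if every element arrived intact.
bool copy_to_pageable_host(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *h_buf = static_cast<int *>(malloc(bytes));  // pageable, NOT pinned
    int *d_buf = nullptr;
    cudaStream_t stream = nullptr;
    bool have_stream = false;

    bool ok = (h_buf != nullptr);
    ok = ok && cudaMalloc(&d_buf, bytes) == cudaSuccess;
    if (ok) {
        have_stream = (cudaStreamCreate(&stream) == cudaSuccess);
        ok = have_stream;
    }
    if (ok) {
        // Kernel and copy are issued to the same stream, so the copy is
        // ordered after the kernel.
        fill<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        ok = cudaMemcpyAsync(h_buf, d_buf, bytes,
                             cudaMemcpyDeviceToHost, stream) == cudaSuccess;
    }
    // Wait for all work in the stream before reading h_buf on the host.
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;

    // Verify the contents of the pageable host buffer.
    for (int i = 0; ok && i < n; ++i) ok = (h_buf[i] == i);

    if (have_stream) cudaStreamDestroy(stream);
    cudaFree(d_buf);  // no-op if d_buf is still nullptr
    free(h_buf);
    return ok;
}

int main() {
    bool ok = copy_to_pageable_host(1 << 20);
    printf("%s\n", ok ? "copy verified" : "copy FAILED");
    return ok ? 0 : 1;
}
```

Keeping the cudaStreamSynchronize call in place costs nothing here and means the code stays correct if the buffer is later switched to a pinned allocation, where the copy really does become asynchronous.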