What are the ways I can copy data from host to device more efficiently?
cudaMemcpyAsync() from pinned (page-locked) host memory is certainly the most efficient way to copy data from host memory to device memory once the transfer size exceeds some threshold, perhaps a few kilobytes. If you have a smaller amount of data that the kernel needs access to, then putting it in pinned/zero-copy (mapped) host memory and passing that pointer to the kernel may be more efficient, depending on the usage pattern. Finally, if you have on the order of 100 bytes or less, it may be “most efficient” to pass the data as a kernel parameter, e.g. a struct.
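
To make the three options concrete, here is a minimal sketch, not a tuned implementation. The names scaleKernel, ScaleParams, runDemo and the CHECK macro are made up for illustration, and the buffer sizes are arbitrary. It assumes a 64-bit system with unified virtual addressing, so mapped host memory works without calling cudaSetDeviceFlags(cudaDeviceMapHost) first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter block (8 bytes), small enough to pass by value.
struct ScaleParams { float scale; float offset; };

// out[i] = in[i] * p.scale + p.offset; p arrives as a kernel parameter.
__global__ void scaleKernel(const float* in, float* out, int n, ScaleParams p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * p.scale + p.offset;
}

// Bail out of the enclosing function on any CUDA error (no cleanup; sketch only).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            return 1;                                                \
        }                                                            \
    } while (0)

// Returns 0 if both transfer paths produce the expected values.
int runDemo() {
    const ScaleParams p = {2.0f, 1.0f};
    int mismatches = 0;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // 1) Large buffer: pinned host memory + cudaMemcpyAsync on a stream.
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaMallocHost((void**)&hIn, bytes));   // page-locked host memory
    CHECK(cudaMallocHost((void**)&hOut, bytes));
    CHECK(cudaMalloc((void**)&dIn, bytes));
    CHECK(cudaMalloc((void**)&dOut, bytes));
    for (int i = 0; i < n; ++i) hIn[i] = float(i % 1000);

    CHECK(cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, stream));
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(dIn, dOut, n, p);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIn[i] * p.scale + p.offset) ++mismatches;

    // 2) Small buffer: zero-copy (mapped) host memory, no explicit memcpy.
    const int m = 64;
    float *hSmallIn = nullptr, *hSmallOut = nullptr;
    float *dSmallIn = nullptr, *dSmallOut = nullptr;
    CHECK(cudaHostAlloc((void**)&hSmallIn, m * sizeof(float), cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&hSmallOut, m * sizeof(float), cudaHostAllocMapped));
    for (int i = 0; i < m; ++i) hSmallIn[i] = float(i);
    // Device-side aliases of the mapped host allocations.
    CHECK(cudaHostGetDevicePointer((void**)&dSmallIn, hSmallIn, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dSmallOut, hSmallOut, 0));

    scaleKernel<<<1, m, 0, stream>>>(dSmallIn, dSmallOut, m, p);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(stream));  // kernel writes must finish before the host reads
    for (int i = 0; i < m; ++i)
        if (hSmallOut[i] != hSmallIn[i] * p.scale + p.offset) ++mismatches;

    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    CHECK(cudaFreeHost(hSmallIn));
    CHECK(cudaFreeHost(hSmallOut));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    CHECK(cudaStreamDestroy(stream));
    return mismatches == 0 ? 0 : 1;
}

int main() {
    return runDemo();
}
```

The same kernel is used for both paths; only the pointers differ. With zero-copy, every access goes over the bus, so it tends to pay off only when the data is small or read once. Where the crossover sizes fall on your system is something to measure, not assume.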