Hi everyone! I wrote code that reads several images in parallel using threads and copies the image data to the GPU with cudaMemcpyAsync, as in the following pseudocode:
// for each band i: read the band, then queue its host-to-device copy on that band's own stream
band_threads.emplace_back([&, i]() {
    device_bands[i] = read_device_band();
    HANDLE_ERROR(cudaMemcpyAsync(device_bands[i], host_bands[i], band_bytes, cudaMemcpyHostToDevice, streams[i]));
});
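// assuming band_threads holds std::thread objects: they start running on construction, so only a join is needed
for (auto& thread : band_threads) {
    thread.join();
}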
Is there any benefit to using cudaMemcpyAsync here, given that each copy is already issued from its own thread? Or is it effectively the same as using cudaMemcpy?