The Async xfer type is available in gdsio, and the profiler results show that different streams are indeed launched. However, the stack trace shows that gdsio still calls cuFileWrite. Is there a trick here?
By the way, I found that BatchIOSubmit and Sync each take a long time to execute (2 ms each) for 4 control blocks. During submission, DtoH/HtoD copies are performed. Does this point to any potential issues?
Can you please share your use case and what you are trying to accomplish with Async mode?
Async mode uses polling logic with cuFileWrite between the user library and the kernel driver (submit + poll). However, this mode is not asynchronous with respect to the user-facing APIs.
This mode is for testing poll-mode functionality in NVMe for smaller IO sizes (4 KB to 8 KB). We have not seen a benefit with larger IOs, as the library has to constantly poll for IO completion.
Async stream-based APIs are not yet available in the CUDA 12.1 release.
cuFileBatchIOSubmit → synchronous submission
cuFileBatchIOGetStatus → asynchronous completion.
Depending on the batch size and the file system, IO submission can take time.
How big is the batch size?
What is the typical cost of submitting a single direct IO request to the file system?
Batch Submission time = (submission cost of single IO) * number of batch entries.
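As a rough illustration of the linear model above, here is a minimal Python sketch. The function names are hypothetical and not part of cuFile or gdsio. It assumes submission cost scales linearly with the number of batch entries, and it uses the 2 ms for 4 control blocks reported earlier in this thread to back out an implied per-entry cost.

```python
def batch_submission_time_ms(single_io_cost_ms: float, num_entries: int) -> float:
    """Linear model from the post: per-entry submission cost * number of entries."""
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    return single_io_cost_ms * num_entries


def implied_single_io_cost_ms(batch_time_ms: float, num_entries: int) -> float:
    """Invert the model: per-entry cost implied by a measured batch submission time."""
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    return batch_time_ms / num_entries


if __name__ == "__main__":
    # Figures reported in this thread: 2 ms to submit 4 control blocks.
    per_entry = implied_single_io_cost_ms(2.0, 4)
    print(f"implied per-entry submission cost: {per_entry} ms")
    # Projection for a hypothetical larger batch, assuming the model stays linear.
    print(f"projected cost for 16 entries: {batch_submission_time_ms(per_entry, 16)} ms")
```

If the per-entry cost implied this way is much higher than the cost of a single direct IO submission measured on the same file system, the overhead is probably coming from somewhere other than plain submission.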
The batch size is 1 MB, and there are 4 batches in the case I mentioned. My use case is offloading tensors during computation.
The original implementation, which copies the tensor to RAM and then writes it to NVMe, is 1.5x faster than GDS batch IO.