Where are the async functions?

ExtremeViscent · April 30, 2023, 10:11pm

gdsio offers an Async xfer type, and the profiler output shows that different streams are indeed launched. However, the stack trace shows that gdsio still calls cuFileWrite. Is there a trick here?

By the way, I found that BatchIOSubmit and Sync each take a long time to execute (2 ms each) for 4 control blocks. During submission, DtoH/HtoD copies are performed. Does this reflect any potential issues?

kmodukuri · May 2, 2023, 7:08pm

ExtremeViscent,

Can you please share your use case and what you are trying to accomplish with Async Mode?

Async Mode uses polling logic with cuFileWrite between the user library and the kernel driver: submit + poll. However, this mode is not asynchronous with respect to the user-facing APIs.
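To illustrate why submit + poll still looks synchronous to the caller, here is a minimal sketch. It does not use cuFile at all: `FakeDevice` and `poll_mode_write` are hypothetical names, and the device latency is simulated with a timer. The point is only that the calling thread stays busy until the IO completes.

```python
import threading
import time


class FakeDevice:
    """Hypothetical stand-in for the kernel driver; not a cuFile API."""

    def submit(self, nbytes: int, latency_s: float = 0.01) -> threading.Event:
        # Pretend the device finishes the IO after latency_s seconds.
        done = threading.Event()
        threading.Timer(latency_s, done.set).start()
        return done


def poll_mode_write(device: FakeDevice, nbytes: int) -> int:
    """Submit one IO, then poll until it completes.

    Returns the number of polls. The caller is blocked for the whole
    duration, so the call is synchronous from the user's point of view.
    """
    done = device.submit(nbytes)
    polls = 0
    while not done.is_set():
        polls += 1
        time.sleep(0)  # yield so the timer thread can run
    return polls


if __name__ == "__main__":
    start = time.monotonic()
    polls = poll_mode_write(FakeDevice(), 4096)
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"returned after {elapsed_ms:.1f} ms and {polls} polls")
```

The polling loop burns CPU for the full IO latency, which is consistent with the mode only being of interest for small, low-latency IOs.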

This mode is for testing poll-mode functionality in NVMe for smaller IO sizes (4 KB to 8 KB). We have not seen a benefit with larger IOs, as the library has to constantly poll for IO completion.

Async stream-based APIs are not yet available in the CUDA 12.1 release.

cuFileBatchIOSubmit → synchronous submission
cuFileBatchIOGetStatus → asynchronous completion.

Depending on the batch size and file system, the IO submission can take time.

How big is the batch size?
What is the typical cost of submitting a single direct IO request to the file system?

Batch Submission time = (submission cost of single IO) * number of batch entries.
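As a rough worked example of this estimate, the sketch below plugs in the figures reported earlier in the thread (about 2 ms to submit 4 control blocks). The function names and the purely linear model are assumptions for illustration; they are not part of cuFile, and real submission cost depends on the file system.

```python
def batch_submit_time_ms(per_io_submit_ms: float, num_entries: int) -> float:
    """Linear estimate: (submission cost of a single IO) * number of batch entries."""
    if num_entries < 0:
        raise ValueError("num_entries must be non-negative")
    return per_io_submit_ms * num_entries


def implied_per_io_cost_ms(batch_submit_ms: float, num_entries: int) -> float:
    """Invert the estimate to get the per-IO submission cost from a measured batch."""
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    return batch_submit_ms / num_entries


if __name__ == "__main__":
    # Figures from the thread: ~2 ms to submit a batch of 4 control blocks.
    per_io = implied_per_io_cost_ms(2.0, 4)
    print(f"implied per-IO submission cost: {per_io} ms")
    # Under the same linear assumption, a hypothetical 16-entry batch:
    print(f"estimated 16-entry submission: {batch_submit_time_ms(per_io, 16)} ms")
```

If the measured per-IO cost is close to that of a single direct IO submission on the same file system, the batch time is simply the sum of its entries rather than a sign of a separate problem.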

ExtremeViscent · May 3, 2023, 10:04am

Hi Kmodukuri,
The batch size is 1MB, and there are 4 batches in the mentioned case. My use case is offloading tensors during computation.

In the original implementation, which copies the tensor to RAM and then writes it to NVMe, the speed is 1.5x faster than GDS batch IO.
